class: center, middle, inverse, title-slide

# Model Comparison

### Thierry Warin, PhD

### quantum simulations
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Navigation tips

- Tile view: just press O (the letter O, for Overview) at any point in the slideshow and the tile view appears. Click on a slide to jump to it, or press O again to exit tile view.
- Draw: click on the pen icon (top right of the slides) to start drawing.
- Search: click on the magnifying-glass icon (bottom left of the slides) to start searching.

You can also press h at any moment for more navigation tips.

---
class: inverse, center, middle

# Outline

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

### Outline

1. Comparing models

---
class: inverse, center, middle

# Comparing models

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

Let's consider two models **to predict the mean selling price of homes** in the real estate data set:

Model A: `$$\mu = \beta_0 + \beta_1(beds)$$`

Model B: `$$\mu = \beta_0 + \beta_1(beds) + \beta_2(baths) + \beta_3(area)$$`

> What is the main difference in the interpretation of `\(\beta_1\)` in each of these models?

<details>
<summary>Click here for answer</summary>
In Model B, it is the effect of beds on mean price controlling for the number of baths and the living area; in Model A, it is uncontrolled.
</details>

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

```r
summary(modelA <- lm(Price ~ bedrooms, data = re))
```

```
## 
## Call:
## lm(formula = Price ~ bedrooms, data = re)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -450481 -210012  -77595  158854 1030556 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   338975      44117   7.684 4.07e-14 ***
## bedrooms       40234      11499   3.499  0.00049 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286700 on 892 degrees of freedom
## Multiple R-squared:  0.01354, Adjusted R-squared:  0.01243 
## F-statistic: 12.24 on 1 and 892 DF,  p-value: 0.00049
```

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

```r
summary(modelB <- lm(Price ~ bedrooms + bathrooms + Living.area, data = re))
```

```
## 
## Call:
## lm(formula = Price ~ bedrooms + bathrooms + Living.area, data = re)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -602343 -188383  -60367  144332 1133878 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 289913.73   40344.96   7.186 1.41e-12 ***
## bedrooms    -67709.57   13120.94  -5.160 3.04e-07 ***
## bathrooms    84762.80   13175.46   6.433 2.04e-10 ***
## Living.area     89.69      13.26   6.764 2.43e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 261000 on 890 degrees of freedom
## Multiple R-squared:  0.1842, Adjusted R-squared:  0.1815 
## F-statistic:    67 on 3 and 890 DF,  p-value: < 2.2e-16
```

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

Notice that the coefficient for bedrooms switched signs, and it is significant in both models.

> What does this tell us?

<details>
<summary>Click here for answer</summary>
The number of bedrooms is correlated with the other independent variables. This is called multicollinearity (a quick check follows on the next slide).
</details>

> So which model is 'correct'?

<details>
<summary>Click here for answer</summary>
Both! There is really no such thing as a single correct model, only useful models. And both models are useful.

- Model A suggests that adding a bedroom to a house is associated with a higher selling price.
- Model B suggests that adding a bedroom without adding bathrooms or increasing the living area is associated with a lower selling price.

Both models provide insight into how these variables relate.
</details>
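---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

A minimal sketch of that multicollinearity check, assuming the `re` data frame and the column names used in the models above:

```r
# Pairwise correlations among the three predictors; rows with missing
# values are dropped so the matrix is computed on complete cases only
cor(re[, c("bedrooms", "bathrooms", "Living.area")], use = "complete.obs")
```

A strong positive correlation between bedrooms and the other two predictors would explain why the bedrooms coefficient flips sign once they are controlled for.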
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

A measure of strength: `\(R^2\)` is a useful measure for this purpose.

Recall: `\(R^2\)` measures the proportion of variability in the dependent variable accounted for by the independent variables.

But `\(R^2\)` has limitations:

- It is a good measure to describe the sample of data at hand, but it underestimates the error in predicting future observations.
- Adding a variable, even a useless one, can never decrease `\(R^2\)`.

Other measures are more useful for comparing models.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

The **F-test** is used to compare **nested models**.

> Two models are nested if all of the variables in the smaller model are also in the larger model.

To formally compare two nested models, an F-test can be used to determine whether the added variables provide extra predictive power.

The **F-test** is essentially testing whether the improvement in `\(R^2\)` is significant, i.e., larger than would be expected by chance alone if the added variables were truly unrelated to the dependent variable.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

In the multiple linear regression setup, the F-test measures the potential relationship between the dependent variable and ALL the independent variables:

`$$H_0: \beta_1=\beta_2=...=\beta_p=0$$`

`$$H_1:~at~least~one~\beta_j~is~nonzero.$$`

This hypothesis test is performed by computing the `\(F\)`-statistic:

`$$F=\frac{(TSS-RSS)/p}{RSS/(n-p-1)}$$`

where `\(TSS=\sum(y_i-\bar{y})^2\)`, `\(RSS=\sum(y_i-\hat{y}_i)^2\)`, `\(p\)` is the number of independent variables, and `\(n\)` is the sample size.

It is, in fact, an analysis of variance. Well, guess what? There is a name for this procedure.

<details>
<summary>Click here for answer</summary>
The F-test is based on something called the analysis of variance (ANOVA) of the residuals in regression.
</details>

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

If the p-value is small, then there is evidence to suggest that the extra variables are important.

Comparing Model A and Model B from before in R:

```r
anova(modelA, modelB)
```

```
## Analysis of Variance Table
## 
## Model 1: Price ~ bedrooms
## Model 2: Price ~ bedrooms + bathrooms + Living.area
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)    
## 1    892 7.3341e+13                                  
## 2    890 6.0650e+13  2 1.269e+13 93.112 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

> What do we conclude?

<details>
<summary>Click here for answer</summary>
The F-test has a very small p-value (certainly less than 0.0001). We can conclude that at least one of the two added variables, bathrooms and Living.area, provides added predictive ability.
</details>
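---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

For nested models, the same statistic applies with the smaller model's `\(RSS\)` playing the role of `\(TSS\)`. As a sanity check, here is a minimal sketch, assuming `modelA` and `modelB` from the earlier slides, that reproduces the F statistic reported by `anova()`:

```r
# Residual sums of squares of the reduced and full models
rss_A <- sum(residuals(modelA)^2)  # ~7.3341e+13
rss_B <- sum(residuals(modelB)^2)  # ~6.0650e+13

# q = number of added variables (892 - 890 = 2 here)
q <- df.residual(modelA) - df.residual(modelB)

# F = ((RSS_A - RSS_B) / q) / (RSS_B / residual df of the full model)
((rss_A - rss_B) / q) / (rss_B / df.residual(modelB))  # ~93.11, as in the table
```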
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

If two models are **not nested**, then Akaike's Information Criterion (AIC) is often used to determine which model is the better predictive one.

The basic idea is that AIC penalizes a model's lack of fit (related to `\(1 - R^2\)`) by a factor based on the number of independent variables in the model. The lower the AIC for a model, the better.

Let's consider a third model to predict the mean selling price of homes:

Model C: `$$\mu = \beta_0 + \beta_1(area) + \beta_2(area^2) + \beta_3(bathrooms) + \beta_4(bathrooms^2)$$`

Model C is not nested with either Model A (using bedrooms as the only predictor) or Model B (using bedrooms, bathrooms, and area as independent variables).

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

```r
knitr::include_graphics("./images/summary1.png")
```

<img src="./images/summary1.png" width="80%" style="display: block; margin: auto;" />

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

Here is the R output for calculating AIC for these 3 models:

```r
AIC(modelA, modelB, modelC)
```

```
##        df      AIC
## modelA  3 25009.67
## modelB  5 24843.81
## modelC  6 24845.16
```

> Which model should we choose?

<details>
<summary>Click here for answer</summary>
Model B has the lowest AIC, Model A has the highest AIC, and Model C is in between. Model B would be chosen as the best predictive model.
</details>

Note: 'df' here is the number of estimated parameters, i.e., the number of independent variables + 2 (the regression coefficients including the intercept, plus the error variance).

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

We use AIC because it is a useful measure of how predictive a model is. AIC is not perfect; it is an approximation grounded in a deep mathematical theory of information. There are many other measures that sometimes get used in its place for **out-of-sample predictive** accuracy, and each has its own merits (BIC, Mallows's `\(C_p\)`, etc.).
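---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

For reference, here is a minimal sketch of how Model C might be fit from its equation above, assuming the `re` data frame from the earlier slides (the `I()` wrapper makes `^2` arithmetic rather than formula syntax), together with a by-hand AIC that matches what `AIC()` reports:

```r
# Model C: quadratic terms in living area and bathrooms
modelC <- lm(Price ~ Living.area + I(Living.area^2) +
               bathrooms + I(bathrooms^2), data = re)

# AIC = -2 * log-likelihood + 2 * (number of estimated parameters);
# the "df" attribute of logLik() is the parameter count shown by AIC()
ll <- logLik(modelC)
-2 * as.numeric(ll) + 2 * attr(ll, "df")  # same value as AIC(modelC)
```

Note that the parameter count here is 6 (five coefficients plus the error variance), consistent with the 'df' column in the AIC table above.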