class: center, middle, inverse, title-slide

# Model Comparison

### Thierry Warin, PhD

### quantum simulations
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Navigation tips

- Tile view: just press O (the letter O, for Overview) at any point in the slideshow and the tile view appears. Click on a slide to jump to it, or press O again to exit tile view.
- Draw: click on the pen icon (top right of the slides) to start drawing.
- Search: click on the magnifying-glass icon (bottom left of the slides) to start searching.

You can also press h at any moment for more navigation tips.

---
class: inverse, center, middle

# Outline

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

### Outline

1. Comparing models

---
class: inverse, center, middle

# Comparing models

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

Let's consider two models **to predict the mean selling price of homes** in the real estate data set:

Model A: `$$\mu = \beta_0 + \beta_1(beds)$$`

Model B: `$$\mu = \beta_0 + \beta_1(beds) + \beta_2(baths) + \beta_3(area)$$`

> What is the main difference in the interpretation of `\(\beta_1\)` in each of these models?

<details>
<summary>Click here for answer</summary>
In Model B, it is the effect of beds on mean price controlling for the number of baths and the living area; in Model A, it is uncontrolled.
</details>

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

```r
summary(modelA <- lm(Price ~ bedrooms, data = re))
```

```
## 
## Call:
## lm(formula = Price ~ bedrooms, data = re)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -450481 -210012  -77595  158854 1030556 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   338975      44117   7.684 4.07e-14 ***
## bedrooms       40234      11499   3.499  0.00049 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286700 on 892 degrees of freedom
## Multiple R-squared:  0.01354, Adjusted R-squared:  0.01243 
## F-statistic: 12.24 on 1 and 892 DF,  p-value: 0.00049
```

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

```r
summary(modelB <- lm(Price ~ bedrooms + bathrooms + Living.area, data = re))
```

```
## 
## Call:
## lm(formula = Price ~ bedrooms + bathrooms + Living.area, data = re)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -602343 -188383  -60367  144332 1133878 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 289913.73   40344.96   7.186 1.41e-12 ***
## bedrooms    -67709.57   13120.94  -5.160 3.04e-07 ***
## bathrooms    84762.80   13175.46   6.433 2.04e-10 ***
## Living.area     89.69      13.26   6.764 2.43e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 261000 on 890 degrees of freedom
## Multiple R-squared:  0.1842, Adjusted R-squared:  0.1815 
## F-statistic:    67 on 3 and 890 DF,  p-value: < 2.2e-16
```

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

Notice that the coefficient for bedrooms switched signs, and it is significant in both models.

> What does this tell us?

<details>
<summary>Click here for answer</summary>
The number of bedrooms is correlated with the other independent variables. This is called multicollinearity (a quick check follows on the next slide).
</details>

> So which model is 'correct'?

<details>
<summary>Click here for answer</summary>
Both! There is really no such thing as a single correct model, only useful models. And both models are useful.

- Model A suggests that adding a bedroom to a house is associated with a higher selling price.
- Model B suggests that adding a bedroom without adding bathrooms or increasing the living area is associated with a lower selling price.

Both models provide insight into how these variables relate.
</details>
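---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

A minimal sketch of that multicollinearity check, assuming the `re` data frame and the column names used in the models above:

```r
# Pairwise correlations among the three predictors; rows with missing
# values are dropped so the matrix is computed on complete cases only
cor(re[, c("bedrooms", "bathrooms", "Living.area")], use = "complete.obs")
```

A strong positive correlation between bedrooms and the other two predictors would explain why the bedrooms coefficient flips sign once they are controlled for.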
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

A measure of strength: `\(R^2\)` is a useful measure for this purpose.

Recall: `\(R^2\)` measures the proportion of variability in the dependent variable accounted for by the independent variables.

But `\(R^2\)` has limitations:

- It is a good measure to describe the sample of data at hand, but it underestimates the error in predicting future observations.
- Adding a variable, even a useless one, can never decrease `\(R^2\)`.

Other measures are more useful for comparing models.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

The **F-test** is used to compare **nested models**.

> Two models are nested if all of the variables in the smaller model are also in the larger model.

To formally compare two nested models, an F-test can be used to determine whether the added variables provide extra predictive power.

The **F-test** is essentially testing whether the improvement in `\(R^2\)` is significant, i.e., larger than would be expected by chance alone if the added variables were truly unrelated to the dependent variable.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

In the multiple linear regression setup, the F-test measures the potential relationship between the dependent variable and ALL the independent variables:

`$$H_0: \beta_1=\beta_2=...=\beta_p=0$$`

`$$H_1:~at~least~one~\beta_j~is~nonzero.$$`

This hypothesis test is performed by computing the `\(F\)`-statistic:

`$$F=\frac{(TSS-RSS)/p}{RSS/(n-p-1)}$$`

where `\(TSS=\sum(y_i-\bar{y})^2\)`, `\(RSS=\sum(y_i-\hat{y}_i)^2\)`, `\(p\)` is the number of independent variables, and `\(n\)` is the sample size.

It is, in fact, an analysis of variance. Well, guess what? There is a name for this procedure.

<details>
<summary>Click here for answer</summary>
The F-test is based on something called the analysis of variance (ANOVA) of the residuals in regression.
</details>

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

If the p-value is small, then there is evidence to suggest that the extra variables are important.

Comparing Model A and Model B from before in R:

```r
anova(modelA, modelB)
```

```
## Analysis of Variance Table
## 
## Model 1: Price ~ bedrooms
## Model 2: Price ~ bedrooms + bathrooms + Living.area
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)    
## 1    892 7.3341e+13                                  
## 2    890 6.0650e+13  2 1.269e+13 93.112 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

> What do we conclude?

<details>
<summary>Click here for answer</summary>
The F-test has a very small p-value (certainly less than 0.0001). We can conclude that at least one of the two added variables, bathrooms and Living.area, provides added predictive ability.
</details>
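---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

For nested models, the same statistic applies with the smaller model's `\(RSS\)` playing the role of `\(TSS\)`. As a sanity check, here is a minimal sketch, assuming `modelA` and `modelB` from the earlier slides, that reproduces the F statistic reported by `anova()`:

```r
# Residual sums of squares of the reduced and full models
rss_A <- sum(residuals(modelA)^2)  # ~7.3341e+13
rss_B <- sum(residuals(modelB)^2)  # ~6.0650e+13

# q = number of added variables (892 - 890 = 2 here)
q <- df.residual(modelA) - df.residual(modelB)

# F = ((RSS_A - RSS_B) / q) / (RSS_B / residual df of the full model)
((rss_A - rss_B) / q) / (rss_B / df.residual(modelB))  # ~93.11, as in the table
```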
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

If two models are **not nested**, then Akaike's Information Criterion (AIC) is often used to determine which model is the better predictive one.

The basic idea is that AIC penalizes a model's lack of fit (related to `\(1 - R^2\)`) by a factor based on the number of independent variables in the model. The lower the AIC for a model, the better.

Let's consider a third model to predict the mean selling price of homes:

Model C: `$$\mu = \beta_0 + \beta_1(area) + \beta_2(area^2) + \beta_3(bathrooms) + \beta_4(bathrooms^2)$$`

Model C is not nested with either Model A (using bedrooms as the only predictor) or Model B (using bedrooms, bathrooms, and area as independent variables).

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

```r
knitr::include_graphics("./images/summary1.png")
```

<img src="./images/summary1.png" width="80%" style="display: block; margin: auto;" />

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

Here is the R output for calculating AIC for these 3 models:

```r
AIC(modelA, modelB, modelC)
```

```
##        df      AIC
## modelA  3 25009.67
## modelB  5 24843.81
## modelC  6 24845.16
```

> Which model should we choose?

<details>
<summary>Click here for answer</summary>
Model B has the lowest AIC, Model A has the highest AIC, and Model C is in between. Model B would be chosen as the best predictive model.
</details>

Note: 'df' here is the number of estimated parameters, i.e., the number of independent variables + 2 (the regression coefficients including the intercept, plus the error variance).

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

We use AIC because it is a useful measure of how predictive a model is. AIC is not perfect; it is an approximation grounded in a deep mathematical theory of information. There are many other measures that sometimes get used in its place for **out-of-sample predictive** accuracy, and each has its own merits (BIC, Mallows's `\(C_p\)`, etc.).
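---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Comparing models

For reference, here is a minimal sketch of how Model C might be fit from its equation above, assuming the `re` data frame from the earlier slides (the `I()` wrapper makes `^2` arithmetic rather than formula syntax), together with a by-hand AIC that matches what `AIC()` reports:

```r
# Model C: quadratic terms in living area and bathrooms
modelC <- lm(Price ~ Living.area + I(Living.area^2) +
               bathrooms + I(bathrooms^2), data = re)

# AIC = -2 * log-likelihood + 2 * (number of estimated parameters);
# the "df" attribute of logLik() is the parameter count shown by AIC()
ll <- logLik(modelC)
-2 * as.numeric(ll) + 2 * attr(ll, "df")  # same value as AIC(modelC)
```

Note that the parameter count here is 6 (five coefficients plus the error variance), consistent with the 'df' column in the AIC table above.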