Linear Regressions Interpretation

class: center, middle, inverse, title-slide

# Linear Regressions Interpretation
### Thierry Warin, PhD
### quantum simulations<a style="color:#6f97d0">*</a>

---

background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Navigation tips

- Tile view: Just press O (the letter O for Overview) at any point in your slideshow and the tile view appears. Click on a slide to jump to the slide, or press O to exit tile view.

- Draw: Click on the pen icon (top right of the slides) to start drawing.

- Search: Click on the magnifying glass icon (bottom left of the slides) to start searching.

You can also press h at any moment to see more navigation tips.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Introduction

The goals are to learn to quantify the relationship between two numerical variables, and to model a **numerical dependent** variable using a **numerical or categorical independent** variable.

---
class: inverse, center, middle

# Outline

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

### Outline

1. Estimating the coefficients

2. Assessing the accuracy

3. Assessing the validity

---
class: inverse, center, middle

# Estimating the coefficients

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Estimating the coefficients

```r
# Read the tab-separated poverty dataset (course repository: https://github.com/warint/qmibr)
poverty <- readr::read_tsv("https://warin.ca/datalake/courses_data/qmibr/session4/poverty.txt")

head(poverty,3)
```

```
## # A tibble: 3 x 6
##   State  `Metropolitan Resid… White Graduates Poverty PercentFemaleHouseholderN…
##   <chr>                 <dbl> <dbl>     <dbl>   <dbl>                      <dbl>
## 1 Alaba…                 55.4  71.3      79.9    14.6                       14.2
## 2 Alaska                 65.6  70.8      90.6     8.3                       10.8
## 3 Arizo…                 88.2  87.7      83.8    13.3                       11.1
```

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Estimating the coefficients

```r
summary(poverty)
```

```
##     State           Metropolitan Residence     White         Graduates    
##  Length:51          Min.   : 38.20         Min.   :25.90   Min.   :77.20  
##  Class :character   1st Qu.: 60.80         1st Qu.:76.80   1st Qu.:83.30  
##  Mode  :character   Median : 71.60         Median :85.40   Median :86.90  
##                     Mean   : 72.25         Mean   :81.72   Mean   :86.01  
##                     3rd Qu.: 86.80         3rd Qu.:90.25   3rd Qu.:88.70  
##                     Max.   :100.00         Max.   :97.10   Max.   :92.10  
##     Poverty      PercentFemaleHouseholderNoHusbandPresent
##  Min.   : 5.60   Min.   : 7.80                           
##  1st Qu.: 9.25   1st Qu.: 9.55                           
##  Median :10.60   Median :11.80                           
##  Mean   :11.35   Mean   :11.63                           
##  3rd Qu.:13.40   3rd Qu.:12.65                           
##  Max.   :18.00   Max.   :18.90
```

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Estimating the coefficients

`$$Poverty = \beta_{0} + \beta_{1} \times Graduates + \epsilon$$`

`$$\hat{y} = b_0 + b_1 x$$`

- `$\hat{y}$`: Predicted value of the dependent variable, `$y$`

- `$\beta_0$`: Intercept, parameter

- `$b_0$`: Intercept, point estimate

- `$\beta_1$`: Slope, parameter

- `$b_1$`: Slope, point estimate

- `$x$`: independent variable

```r
model1 <- lm(poverty$Poverty ~ poverty$Graduates)
```
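
The same kind of model can also be written with bare column names and a `data` argument, which gives shorter coefficient labels and lets `predict()` take new data. The sketch below assumes a small illustrative data frame, `poverty_subset` (a hypothetical name), built from six (Graduates, Poverty) pairs that appear in these slides, so its estimates will differ from those of `model1`.

```r
# Six (Graduates, Poverty) pairs copied from these slides.
# This is NOT the full 51-row dataset; it only illustrates the syntax.
poverty_subset <- data.frame(
  Graduates = c(79.9, 90.6, 83.8, 85.1, 81.0, 86.0),
  Poverty   = c(14.6,  8.3, 13.3, 12.1, 10.3, 16.8)
)

# Formula with bare column names; the data frame is passed separately
model_subset <- lm(Poverty ~ Graduates, data = poverty_subset)

# Point estimates b0 and b1, labelled "(Intercept)" and "Graduates"
coef(model_subset)

# Because the formula uses column names, predict() accepts new data
predict(model_subset, newdata = data.frame(Graduates = 85.1))
```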

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Estimating the coefficients

```r
summary(model1)
```

```
## 
## Call:
## lm(formula = poverty$Poverty ~ poverty$Graduates)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1624 -1.2593 -0.2184  0.9611  5.4437 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       64.78097    6.80260   9.523 9.94e-13 ***
## poverty$Graduates -0.62122    0.07902  -7.862 3.11e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.082 on 49 degrees of freedom
## Multiple R-squared:  0.5578,	Adjusted R-squared:  0.5488 
## F-statistic: 61.81 on 1 and 49 DF,  p-value: 3.109e-10
```

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Estimating the coefficients
The linear model for predicting poverty from high school graduation rate in the US is:

`$$\widehat{poverty} = 64.78 - 0.62 \times HSgrad$$`

The hat over `$\widehat{poverty}$` signifies that this is an estimate, not an observed value.

```r
poverty[poverty$State == "Georgia",]
```

```
## # A tibble: 1 x 6
##   State  `Metropolitan Resid… White Graduates Poverty PercentFemaleHouseholderN…
##   <chr>                 <dbl> <dbl>     <dbl>   <dbl>                      <dbl>
## 1 Georg…                 71.6  67.5      85.1    12.1                       14.5
```

The high school graduation rate in Georgia is 85.1%.

> What poverty level does the model predict for this state?

`$$64.78 - 0.62 \times 85.1 = 12.018$$`
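
The same prediction can be sketched in R with the coefficients exactly as printed by `summary(model1)`. The variable names (`b0`, `b1`, `georgia_grad`) are illustrative. With the unrounded coefficients the prediction comes out near 11.9, slightly below the value obtained from the two-decimal coefficients.

```r
# Point estimates copied from the summary(model1) output
b0 <- 64.78097   # intercept
b1 <- -0.62122   # slope for Graduates

georgia_grad <- 85.1   # high school graduation rate in Georgia (%)

# Predicted poverty rate: b0 + b1 * x
predicted_poverty <- b0 + b1 * georgia_grad
round(predicted_poverty, 2)
```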

---
class: inverse, center, middle

# Assessing the accuracy

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Assessing the accuracy

We can assess the accuracy of

- the coefficient estimates [individually]

- the model [the whole]

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Assessing the accuracy

.pull-left[

### Residual Standard Error

Residuals are the leftovers from the model fit: Data = Fit + Residual

]

.pull-right[

![](concept4_files/figure-html/unnamed-chunk-7-1.png)

]

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Assessing the accuracy

.pull-left[

A residual is the difference between the observed `$y_i$` and predicted `$\hat{y}_i$`.

`$$\epsilon_i = y_i - \hat{y}_i$$`

- The percentage living in poverty in DC is 5.44 percentage points higher than predicted.

- The percentage living in poverty in RI is 4.16 percentage points lower than predicted.

]

.pull-right[

![](concept4_files/figure-html/unnamed-chunk-8-1.png)

]

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Assessing the accuracy

We want to quantify the *extent to which the model fits the data*.

For that, we use:

- the *residual standard error*, and

- the `$R^2$` statistic.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Assessing the accuracy

The RSE is an estimate of the standard deviation of `$\epsilon$`.

Roughly speaking, it is the average amount by which the response deviates from the true regression line:

`$$RSE = \sqrt{\frac{RSS}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{n-2}}$$`

`$$RSS= \sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$`
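
As a minimal sketch, the RSE formula can be checked against R's own value. The six data points below are copied from these slides and are only a subset of the full dataset, so the result will not match the 2.082 reported for `model1`; the object names are illustrative.

```r
# Six (Graduates, Poverty) pairs taken from these slides (subset only)
Graduates <- c(79.9, 90.6, 83.8, 85.1, 81.0, 86.0)
Poverty   <- c(14.6,  8.3, 13.3, 12.1, 10.3, 16.8)

fit <- lm(Poverty ~ Graduates)

# Residual sum of squares: sum of squared (observed - predicted)
rss <- sum(residuals(fit)^2)

# Residual standard error with n - 2 degrees of freedom
n   <- length(Poverty)
rse <- sqrt(rss / (n - 2))

# sigma() returns the residual standard error stored by lm()
all.equal(rse, sigma(fit))
```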

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Assessing the accuracy

> Definition: The `$R^2$` statistic is the proportion of variance explained.

To calculate `$R^2$`, we use the formula:

`$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$$`

where `$TSS = \sum_{i=1}^{n}(y_i - \bar{y})^2$`. TSS is the **total sum of squares**.

- TSS measures the total variability in the response `$y$`, and RSS measures the variability left unexplained after fitting the model.

- `$R^2$` tells us what percent of variability in the dependent variable is explained by the model.

- The remainder of the variability is due to variables not included in the model or to inherent randomness in the data.
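
A short sketch of the same computation in R, again on six data points copied from these slides (a subset, so the value will differ from the 0.5578 reported for `model1`). With a single predictor, `$R^2$` also equals the squared correlation between the two variables.

```r
# Six (Graduates, Poverty) pairs taken from these slides (subset only)
Graduates <- c(79.9, 90.6, 83.8, 85.1, 81.0, 86.0)
Poverty   <- c(14.6,  8.3, 13.3, 12.1, 10.3, 16.8)

fit <- lm(Poverty ~ Graduates)

tss <- sum((Poverty - mean(Poverty))^2)  # total sum of squares
rss <- sum(residuals(fit)^2)             # residual sum of squares

r2 <- 1 - rss / tss

# Same value as reported by summary(), and as the squared correlation
all.equal(r2, summary(fit)$r.squared)
all.equal(r2, cor(Graduates, Poverty)^2)
```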

---
class: inverse, center, middle

# Assessing the validity

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Assessing the validity

- (1) Linearity

- (2) Nearly normal residuals

- (3) Constant variability

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Assessing the validity, (1) Linearity

- The relationship between the explanatory and the dependent variable should be linear.

> Check using a scatterplot of the data, or a residuals plot.

.center[

![](concept4_files/figure-html/unnamed-chunk-9-1.png)

]

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Assessing the validity, (1) Linearity

### Anatomy of a residuals plot

.pull-left[

`$$\%~HSgrad = 81\%~\&~\%~in~poverty = 10.3$$`

`$$\widehat{\%~in~poverty} = 64.781 - 0.6212 \times 81 \approx 14.46$$`

`$$\epsilon = {\%~in~poverty} - \widehat{\%~in~poverty}$$`

`$$\epsilon = 10.3 - 14.46 = -4.16$$`

`$$\%~HSgrad = 86\%~\&~\%~in~poverty = 16.8$$`

`$$\widehat{\%~in~poverty} = 64.781 - 0.6212 \times 86 \approx 11.36$$`

`$$\epsilon = {\%~in~poverty} - \widehat{\%~in~poverty}$$`

`$$\epsilon = 16.8 - 11.36 = 5.44$$`

]

.pull-right[

![](concept4_files/figure-html/unnamed-chunk-11-1.png)

]
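
The two residuals on this slide can be reproduced in a few lines of R, assuming the coefficients printed by `summary(model1)`. The vector names are illustrative.

```r
# Point estimates copied from the summary(model1) output
b0 <- 64.78097
b1 <- -0.62122

# The two observations used on this slide
hs_grad    <- c(81, 86)       # % high school graduates
in_poverty <- c(10.3, 16.8)   # observed % in poverty

predicted <- b0 + b1 * hs_grad
residual  <- in_poverty - predicted   # observed minus predicted

round(residual, 2)
```

The rounded results agree with the smallest and largest residuals listed in the model summary.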

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
## Assessing the validity, (2) Nearly normal residuals

.pull-left[

- The residuals should be nearly normal.

- This condition may not be satisfied when there are unusual observations that don't follow the trend of the rest of the data.

> Check using a histogram of the residuals.

]

.pull-right[

![](concept4_files/figure-html/unnamed-chunk-12-1.png)

]
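
A minimal sketch of this check in R. It uses six data points copied from these slides, which are far too few to judge normality; the aim is only to show the calls. A normal Q-Q plot is added as a common complement to the histogram.

```r
# Six (Graduates, Poverty) pairs taken from these slides (subset only)
Graduates <- c(79.9, 90.6, 83.8, 85.1, 81.0, 86.0)
Poverty   <- c(14.6,  8.3, 13.3, 12.1, 10.3, 16.8)

fit <- lm(Poverty ~ Graduates)
res <- residuals(fit)

# Histogram of residuals: look for a roughly symmetric, bell-like shape
hist(res, main = "Residuals", xlab = "Residual")

# Normal Q-Q plot: points close to the line suggest near-normal residuals
qqnorm(res)
qqline(res)
```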

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
## Assessing the validity, (3) Constant variability

.pull-left[

- The variability of points around the least squares line should be roughly constant.

- This implies that the variability of residuals around the 0 line should be roughly constant as well.

- Also called homoskedasticity.

> Check using a residuals plot.

]

.pull-right[

![](concept4_files/figure-html/unnamed-chunk-13-1.png)

]
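
A residuals plot of this kind can be sketched as follows, again on six data points copied from these slides rather than the full dataset. A fan or funnel shape around the dashed zero line would suggest non-constant variability.

```r
# Six (Graduates, Poverty) pairs taken from these slides (subset only)
Graduates <- c(79.9, 90.6, 83.8, 85.1, 81.0, 86.0)
Poverty   <- c(14.6,  8.3, 13.3, 12.1, 10.3, 16.8)

fit <- lm(Poverty ~ Graduates)

# Residuals against fitted values
plot(fitted(fit), residuals(fit),
     xlab = "Fitted % in poverty", ylab = "Residual")

# Reference line at zero
abline(h = 0, lty = 2)
```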