class: center, middle, inverse, title-slide # Linear Regressions Interpretation ### Thierry Warin, PhD ### quantum simulations
*
--- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Navigation tips - Tile view: Just press O (the letter O for Overview) at any point in your slideshow and the tile view appears. Click on a slide to jump to the slide, or press O to exit tile view. - Draw: Click on the pen icon (top right of the slides) to start drawing. - Search: click on the loop icon (bottom left of the slides) to start searching. You can also click on h at any moments to have more navigations tips. --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Introduction The goals are to learn to o quantify the relationship between two numerical variables, as well as modeling **numerical dependent** variables using a **numerical or categorical independent** variable. --- class: inverse, center, middle # Outline --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% ### outline 1. Estimating the coefficients 2. Assessing the accuracy 3. Assessing the validity --- class: inverse, center, middle # Estimating the coefficients --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Estimating the coefficients ```r # I am reading the dataset collected from Github: https://github.com/warint/qmibr poverty <- readr::read_tsv("https://warin.ca/datalake/courses_data/qmibr/session4/poverty.txt") head(poverty,3) ``` ``` ## # A tibble: 3 x 6 ## State `Metropolitan Resid… White Graduates Poverty PercentFemaleHouseholderN… ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Alaba… 55.4 71.3 79.9 14.6 14.2 ## 2 Alaska 65.6 70.8 90.6 8.3 10.8 ## 3 Arizo… 88.2 87.7 83.8 13.3 11.1 ``` --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Estimating the coefficients ```r summary(poverty) ``` ``` ## State Metropolitan Residence White Graduates ## Length:51 Min. : 38.20 Min. :25.90 Min. :77.20 ## Class :character 1st Qu.: 60.80 1st Qu.:76.80 1st Qu.:83.30 ## Mode :character Median : 71.60 Median :85.40 Median :86.90 ## Mean : 72.25 Mean :81.72 Mean :86.01 ## 3rd Qu.: 86.80 3rd Qu.:90.25 3rd Qu.:88.70 ## Max. :100.00 Max. :97.10 Max. :92.10 ## Poverty PercentFemaleHouseholderNoHusbandPresent ## Min. : 5.60 Min. : 7.80 ## 1st Qu.: 9.25 1st Qu.: 9.55 ## Median :10.60 Median :11.80 ## Mean :11.35 Mean :11.63 ## 3rd Qu.:13.40 3rd Qu.:12.65 ## Max. :18.00 Max. :18.90 ``` --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Estimating the coefficients `$$Poverty = \beta_{0} + \beta_{1} \times Graduates + \epsilon$$` `$$\hat{y} = \beta_0 + \beta_1 x$$` - `\(\hat{y}\)`: Predicted value of the dependent variable, `\(y\)` - `\(\beta_0\)`: Intercept, parameter - `\(b_0\)`: Intercept, point estimate - `\(\beta_1\)`: Slope, parameter - `\(b_1\)`: Slope, point estimate - `\(x\)`: independent variable ```r model1 <- lm(poverty$Poverty ~ poverty$Graduates) ``` --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Estimating the coefficients ```r summary(model1) ``` ``` ## ## Call: ## lm(formula = poverty$Poverty ~ poverty$Graduates) ## ## Residuals: ## Min 1Q Median 3Q Max ## -4.1624 -1.2593 -0.2184 0.9611 5.4437 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 64.78097 6.80260 9.523 9.94e-13 *** ## poverty$Graduates -0.62122 0.07902 -7.862 3.11e-10 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.082 on 49 degrees of freedom ## Multiple R-squared: 0.5578, Adjusted R-squared: 0.5488 ## F-statistic: 61.81 on 1 and 49 DF, p-value: 3.109e-10 ``` --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Estimating the coefficients The linear model for predicting poverty from high school graduation rate in the US is: `$$\widehat{poverty} = 64.78 - 0.62 * HSgrad$$` The `\(\widehat{hat}\)` is used to signify that this is an estimate. ```r poverty[poverty$State == "Georgia",] ``` ``` ## # A tibble: 1 x 6 ## State `Metropolitan Resid… White Graduates Poverty PercentFemaleHouseholderN… ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Georg… 71.6 67.5 85.1 12.1 14.5 ``` The high school graduate rate in Georgia is 85.1%. > What poverty level does the model predict for this state? `$$64.78 - 0.62 * 85.1 = 12.018$$` --- class: inverse, center, middle # Assessing the accuracy --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Assessing the accuracy We can assess the accuracy of - the coefficient estimates [individually] - the model [the whole] --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Assessing the accuracy .pull-left[ ### Residual Standard Error Residuals are the leftovers from the model fit: Data = Fit + Residual ] .pull-right[ ![](concept4_files/figure-html/unnamed-chunk-7-1.png)<!-- --> ] --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Assessing the accuracy .pull-left[ A residual is the difference between the observed `\(y_i\)` and predicted `\(\hat{y}_i\)`. `$$\epsilon_i = y_i - \hat{y}_i$$` - living in poverty in DC is 5.44% more than predicted. - living in poverty in RI is 4.16% less than predicted. ] .pull-right[ ![](concept4_files/figure-html/unnamed-chunk-8-1.png)<!-- --> ] --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Assessing the accuracy We want to quantify the *extent to which the model fits the data*. For that, we use: - the *residual standard error*, and - the `\(R^2\)` statistic. --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Assessing the accuracy The RSE is an estimate of the standard deviation of `\(\epsilon\)`. It is the average amount that the response will deviate from the true regression line: `$$RSE = \sqrt{\frac{RSS}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{n-2}}$$` `$$RSS= \sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$` --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Assessing the accuracy > Definition: The `\(R^2\)` statistic is the proportion of variance explained. To calculate `\(R^2\)`, we use the formula: `$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$$` where `\(TSS = \sum_{i=1}^{n}(y_i - \bar{y})^2\)`. TSS is the **total sum of squares**. - TSS measures the total variance in the response Y. - It tells us what percent of variability in the dependent variable is explained by the model. - The remainder of the variability is explained by variables not included in the model or by inherent randomness in the data. --- class: inverse, center, middle # Assessing the validity --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Assessing the validity - (1) Linearity - (2) Nearly normal residuals - (3) Constant variability --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Assessing the validity, (1) Linearity - The relationship between the explanatory and the dependent variable should be linear. > Check using a scatterplot of the data, or a residuals plot. .center[ ![](concept4_files/figure-html/unnamed-chunk-9-1.png)<!-- --> ] --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Assessing the validity, (1) Linearity ### Anatomy of a residuals plot .pull-left[ `$$\%~HSgrad = 81\%~\&~\%~in~poverty = 10.3$$` `$$\widehat{\%~in~poverty} = 64.68 - 0.62 \times 81 = 14.46$$` `$$\epsilon = {\%~in~poverty} - \widehat{\%~in~poverty}$$` `$$\epsilon = 10.3 - 14.46 = -4.16$$` `$$\%~HSgrad = 86\%~\&~\%~in~poverty = 16.8$$` `$$\widehat{\%~in~poverty} = 64.68 - 0.62 \times 86 = 11.36$$` `$$\epsilon = {\%~in~poverty} - \widehat{\%~in~poverty}$$` `$$\epsilon = 16.8 - 11.36 = 5.44$$` ] .pull-right[ ![](concept4_files/figure-html/unnamed-chunk-11-1.png)<!-- --> ] --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% ## Assessing the validity, (2) Nearly normal residuals .pull-left[ - The residuals should be nearly normal. - This condition may not be satisfied when there are unusual observations that don't follow the trend of the rest of the data. > Check using a histogram. ] .pull-right[ ![](concept4_files/figure-html/unnamed-chunk-12-1.png)<!-- --> ] --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% ## Assessing the validity, (3) Constant variability .pull-left[ - The variability of points around the least squares line should be roughly constant. - This implies that the variability of residuals around the 0 line should be roughly constant as well. - Also called homoskedasticity. > Check using a residuals plot. ] .pull-right[ ![](concept4_files/figure-html/unnamed-chunk-13-1.png)<!-- --> ]