class: center, middle, inverse, title-slide

# Logistic Regression

### Thierry Warin, PhD

### quantum simulations
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Navigation tips

- Tile view: press O (the letter O, for Overview) at any point in the slideshow to open the tile view. Click on a slide to jump to it, or press O again to exit.
- Draw: click on the pen icon (top right of the slides) to start drawing.
- Search: click on the search (magnifying glass) icon (bottom left of the slides) to start searching.

You can also press h at any moment for more navigation tips.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

As we saw in the previous concepts, linear regression allows us to model the relationship between a quantitative dependent variable Y and a set of independent variables.

> In this concept we extend our modeling capabilities and learn how to model the relationship between a **binary dependent variable Y** and a set of independent variables.

---
class: inverse, center, middle

# Outline

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

### Outline

1. Logistic regression
2. Inference and goodness of fit

---
class: inverse, center, middle

# Logistic regression

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Logistic regression

Recall that a binary dependent variable can take on only one of two values, e.g. `\(Y = 0\)` or `\(Y = 1\)`. Conventionally, `\(1\)` represents a "success" and `\(0\)` a "failure", regardless of whether the outcome is desirable. Examples of this type of variable in practice are:

- `\(Y = 1\)` if defaulted on loan, 0 otherwise.
- `\(Y = 1\)` if made online purchase, 0 otherwise.
- `\(Y = 1\)` if renewed car lease, 0 otherwise.
- `\(Y = 1\)` if employee promoted, 0 otherwise.
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Logistic regression

We showed in the previous segment that when a binary dependent variable is modeled using linear regression, the underlying model being estimated is

`$$\pi = P(Y=1) = \beta_0 + \beta_1 x$$`

As we saw, issues can arise when making predictions, since this model does not prevent predicted probabilities from being less than 0 or greater than 1.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Logistic regression

Logistic regression is a way to model a binary dependent variable.

- The logistic regression model states that:

`$$\pi = P(Y=1) = \frac{e^{(\beta_0+\beta_1x)}}{1+e^{(\beta_0+\beta_1x)}}$$`

`$$\pi = P(Y=1) = \frac{1}{1+e^{-(\beta_0+\beta_1x)}}$$`

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Logistic regression

.pull-left[
This function is called the logistic function, and its purpose is to ensure that predictions are always between 0 and 1. Its graph has a characteristic S shape.
]

.pull-right[
<img src="./images/logistic1.png" width="120%" style="display: block; margin: auto;" />
]

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Logistic regression

The logistic regression model can be estimated by the method of **maximum likelihood estimation**, a common approach for estimating parameters in statistical models.

- The estimated probability that `\(Y = 1\)` given `\(x\)` is:

`$$p = \frac{1}{1 + e^{-(b_0 + b_1 x)}}$$`

- `\(b_0\)` is the estimate of `\(\beta_0\)`.
- `\(b_1\)` is the estimate of `\(\beta_1\)`.
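As a quick numerical check, the logistic transform is available in base R as `plogis()`. A minimal sketch verifying that it matches the formula above and always returns values strictly between 0 and 1 (the coefficient values here are illustrative, not estimates from any data):

```r
# illustrative coefficients (not estimated from any data)
b0 <- 6.72
b1 <- -0.012

# the logistic function written out explicitly
logistic <- function(x) 1 / (1 + exp(-(b0 + b1 * x)))

x <- seq(-1000, 2000, by = 100)    # deliberately extreme inputs
p <- logistic(x)

all(p > 0 & p < 1)                 # probabilities stay strictly between 0 and 1
all.equal(p, plogis(b0 + b1 * x))  # matches base R's built-in logistic CDF
```

`plogis()` is the CDF of the standard logistic distribution, which is exactly the transform used by logistic regression.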
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Logistic regression

For our loan default data, the R output from estimating a logistic regression is as follows:

.panelset[
.panel[.panel-name[R Code]

```r
loans = readr::read_csv("https://www.warin.ca/datalake/courses_data/qmibr/session7/loans.csv")

# rename the dependent variable for ease of use
loans$default = loans$not.fully.paid

m2 <- glm(default ~ fico, data = loans, family = "binomial")
summary(m2)
```
]

.panel[.panel-name[output]
<img src="./images/summarym2.PNG" width="400px" style="display: block; margin: auto;" />
]
]

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Logistic regression

Using the R output (with rounded coefficients), the estimated equation for the probability of a loan default is:

`$$p = \frac{1}{1 + e^{-(6.72 - 0.012 x)}}$$`

With this equation we can predict the probability of loan default for any FICO score `\(x\)`.

- Using the estimated equation, if someone's FICO score is 600, the probability of default is:

`$$p = \frac{1}{1 + e^{-(6.72 - 0.012 \times 600)}} = 0.38$$`

- If someone's FICO score is 850, the probability of default is:

`$$p = \frac{1}{1 + e^{-(6.72 - 0.012 \times 850)}} = 0.03$$`

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Logistic regression

<img src="concept7_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Inference and goodness of fit

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

We learned how to estimate the coefficients of a simple logistic regression and obtain predictions. We now focus on:

- **Hypothesis testing** for the coefficient of a predictor in logistic regression.
- **Confidence intervals** for probabilities.
- An **analogous measure** to `\(R^2\)` in linear regression.
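As a quick check before diving in, the two point predictions computed a few slides back can be reproduced by hand. A sketch using the rounded coefficients 6.72 and -0.012 from the estimated equation:

```r
# rounded coefficients from the estimated equation
b0 <- 6.72
b1 <- -0.012

p_hat <- function(fico) 1 / (1 + exp(-(b0 + b1 * fico)))

round(p_hat(600), 2)  # 0.38: a low FICO score implies a high default probability
round(p_hat(850), 2)  # 0.03: a top FICO score implies a low default probability
```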
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

Again, the simple logistic regression model is given by:

`$$\pi = P(Y=1) = \frac{1}{1+e^{-(\beta_0+\beta_1x)}}$$`

- We usually consider both confidence intervals and hypothesis tests for `\(\beta_1\)` (the population parameter, not its estimate `\(b_1\)`).
- Because the value of `\(\beta_1\)` is difficult to interpret, we will focus mainly on testing whether `\(\beta_1 = 0\)`.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

```
## 
## Call:
## glm(formula = default ~ fico, family = "binomial", data = loans)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9544  -0.6518  -0.5538  -0.4185   2.4863  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  6.7188776  0.5768348   11.65   <2e-16 ***
## fico        -0.0118775  0.0008231  -14.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8424.0  on 9577  degrees of freedom
## Residual deviance: 8196.1  on 9576  degrees of freedom
## AIC: 8200.1
## 
## Number of Fisher Scoring iterations: 4
```

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

The estimate `\(b_1 = -0.0119\)` is difficult to interpret exactly. But an important question is whether the data provide evidence that `\(\beta_1 \neq 0\)`, which would imply that FICO scores are related to the probability of a loan default.

We could construct a confidence interval for `\(\beta_1\)` using the same approach as in linear regression:

`$$b_1 \pm 1.96 \times SE(b_1)$$`

- The exact range of values is usually difficult to interpret.
- However, a confidence interval does tell us whether `\(\beta_1\)` is positive or negative.
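A sketch of this interval computation, plugging in the estimate and standard error reported in the summary output:

```r
# estimate and standard error for the fico coefficient (from summary(m2))
b1 <- -0.0118775
se <- 0.0008231

ci <- b1 + c(-1.96, 1.96) * se
round(ci, 4)  # (-0.0135, -0.0103): the entire interval is below zero
```

Since the whole interval is negative, we can conclude that higher FICO scores are associated with a lower probability of default, even without interpreting the magnitude of the coefficient.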
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

.pull-left[

```r
# load the data set and summarize the included variables
loans = readr::read_csv("https://www.warin.ca/datalake/courses_data/qmibr/session7/loans.csv")

# rename the dependent variable for ease of use
loans$default = loans$not.fully.paid

m2 <- glm(default ~ fico, data = loans, family = "binomial")
summary(m2)
```

```
## 
## Call:
## glm(formula = default ~ fico, family = "binomial", data = loans)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9544  -0.6518  -0.5538  -0.4185   2.4863  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  6.7188776  0.5768348   11.65   <2e-16 ***
## fico        -0.0118775  0.0008231  -14.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8424.0  on 9577  degrees of freedom
## Residual deviance: 8196.1  on 9576  degrees of freedom
## AIC: 8200.1
## 
## Number of Fisher Scoring iterations: 4
```
]

.pull-right[
- 95% confidence interval for `\(\beta_1\)`:

`$$-0.0119 \pm 1.96 \times (0.000823)$$`

`$$= (-0.0119 - 0.001613,\ -0.0119 + 0.001613)$$`

`$$= (-0.0135, -0.0103)$$`
]

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

We are usually interested in the following test:

`$$H_0: \beta_1 = 0$$`

`$$H_1: \beta_1 \neq 0$$`

- If we assume a significance level of `\(\alpha = 0.05\)`, then because the `\(p\)`-value `\(< 0.05\)` we can reject the null hypothesis that `\(\beta_1 = 0\)`. Therefore the FICO score is a statistically significant predictor of the probability of a loan default.
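The z value reported by `summary()` is simply the estimate divided by its standard error. A sketch of the test computed by hand from the values in the output:

```r
# estimate and standard error for the fico coefficient (from summary(m2))
b1 <- -0.0118775
se <- 0.0008231

z <- b1 / se                   # Wald z statistic
p_value <- 2 * pnorm(-abs(z))  # two-sided p-value under the standard normal

round(z, 2)     # -14.43, matching the z value in the output
p_value < 0.05  # TRUE: reject H0 that beta1 = 0
```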
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

Estimating `\(\pi\)`

> A natural type of question to ask is: What is the probability of a loan default for a FICO score of (say) 700?

- As we saw in the last segment, we can answer this with the formula for the **point estimate**, where `\(b_0\)` and `\(b_1\)` are the estimates of `\(\beta_0\)` and `\(\beta_1\)`, and `\(x = 700\)`:

`$$p = \frac{1}{1+e^{-(b_0+b_1x)}}$$`

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

Instead of just the point estimate `\(p\)` of `\(\pi\)`, we can also compute a 95% confidence interval:

`$$p \pm 1.96 \times SE(p)$$`

The formula for `\(SE(p)\)` is complex, so we let R do the work:

```r
p = predict(m2, newdata = data.frame(fico = 700), type = "response", se.fit = T)

# fit, lower, upper for 95% CI
out = c(p$fit, p$fit - 1.96 * p$se.fit, p$fit + 1.96 * p$se.fit)
names(out) = c("Fit", "Lower", "Upper")
round(out, 3)
```

```
##   Fit Lower Upper 
## 0.169 0.161 0.176
```

The 95% confidence interval for `\(\pi\)` is (0.161, 0.176).

> We are 95% confident that the probability that a borrower with a FICO score of 700 would default is between 0.161 and 0.176.

For reference, `confint(m2)` gives 95% confidence intervals for the model coefficients themselves:

```
##                   2.5 %      97.5 %
## (Intercept)  5.59461730  7.85621388
## fico        -0.01350184 -0.01027486
```

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

<img src="concept7_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

Model accuracy

- There is no agreed-upon method for summarizing goodness of fit for logistic regression.
- The difficulty is that the notion of "percent variation explained" by the predictor is not as obvious as with linear regression.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

Various measures have been proposed; we will use the pseudo-`\(R^2\)` measure due to McFadden.

- Logistic regression models are fitted by maximum likelihood, i.e. the parameter estimates are the values that maximize the likelihood of the observed data. McFadden's `\(R^2\)` measure is defined as:

`$$R^{2}_{\text{McFadden}} = 1 - \frac{\log(L_c)}{\log(L_{\text{null}})}$$`

- where `\(L_c\)` denotes the (maximized) likelihood value from the current fitted model, and `\(L_{\text{null}}\)` denotes the corresponding value for the null model, i.e. the model with only an intercept and no covariates.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

This calculation looks similar to that of `\(R^2\)` for linear regression.

- Analysis of goodness of fit for the loan default logistic regression:

```r
m2 <- glm(default ~ fico, data = loans, family = "binomial")
nullm2 <- glm(default ~ 1, data = loans, family = "binomial")

pseudoR2 <- 1 - logLik(m2) / logLik(nullm2)
round(pseudoR2, 4)
```

```
## 'log Lik.' 0.0271 (df=2)
```

The value is computed to be 0.0271.

- A very low pseudo-`\(R^2\)`: roughly 2.7% of the "variation" is explained by the FICO score.
- We generally do not expect high pseudo-`\(R^2\)` values.
- Values as high as 0.4 or 0.5 already indicate a very good fit for a logistic regression.
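For ungrouped 0/1 data the saturated model's log-likelihood is zero, so the deviance equals `\(-2\log L\)`. McFadden's measure can therefore also be recovered directly from the null and residual deviances reported by `summary(m2)`:

```r
# null and residual deviance from the summary output
null_dev  <- 8424.0
resid_dev <- 8196.1

# deviance = -2 * logLik here, so the factor -2 cancels in the ratio
pseudoR2 <- 1 - resid_dev / null_dev
round(pseudoR2, 4)  # 0.0271, matching the logLik() computation above
```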