class: center, middle, inverse, title-slide

# Logistic Regression

### Thierry Warin, PhD

### quantum simulations
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Navigation tips

- Tile view: press O (the letter O, for Overview) at any point in the slideshow to open the tile view. Click on a slide to jump to it, or press O again to exit.
- Draw: click on the pen icon (top right of the slides) to start drawing.
- Search: click on the search (magnifying glass) icon (bottom left of the slides) to start searching.

You can also press h at any moment for more navigation tips.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

As we saw in the previous concepts, linear regression allows us to model the relationship between a quantitative dependent variable Y and a set of independent variables.

> In this concept we extend our modeling capabilities and learn how to model the relationship between a **binary dependent variable Y** and a set of independent variables.

---
class: inverse, center, middle

# Outline

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

### Outline

1. Logistic regression
2. Inference and goodness of fit

---
class: inverse, center, middle

# Logistic regression

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Logistic regression

Recall that a binary dependent variable can take on only one of two values, e.g. `\(Y = 0\)` or `\(Y = 1\)`. Conventionally, `\(1\)` represents a "success" and `\(0\)` a "failure", regardless of whether the outcome is desirable. Examples of this type of variable in practice are:

- `\(Y = 1\)` if defaulted on loan, 0 otherwise.
- `\(Y = 1\)` if made online purchase, 0 otherwise.
- `\(Y = 1\)` if renewed car lease, 0 otherwise.
- `\(Y = 1\)` if employee promoted, 0 otherwise.
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Logistic regression

We showed in the previous segment that when a binary dependent variable is modeled using linear regression, the underlying model being estimated is

`$$\pi = P(Y=1) = \beta_0 + \beta_1 x$$`

As we saw, issues can arise when making predictions, since this model does not prevent predicted probabilities from being less than 0 or greater than 1.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Logistic regression

Logistic regression is a way to model a binary dependent variable.

- The logistic regression model states that:

`$$\pi = P(Y=1) = \frac{e^{(\beta_0+\beta_1x)}}{1+e^{(\beta_0+\beta_1x)}}$$`

`$$\pi = P(Y=1) = \frac{1}{1+e^{-(\beta_0+\beta_1x)}}$$`

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Logistic regression

.pull-left[
This function is called the logistic function, and its purpose is to ensure that predictions are always between 0 and 1. Its graph has a characteristic S shape.
]

.pull-right[
<img src="./images/logistic1.png" width="120%" style="display: block; margin: auto;" />
]

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Logistic regression

The logistic regression model can be estimated by the method of **maximum likelihood estimation**, a common approach for estimating parameters in statistical models.

- The estimated probability that `\(Y = 1\)` given `\(x\)` is:

`$$p = \frac{1}{1 + e^{-(b_0 + b_1 x)}}$$`

- `\(b_0\)` is the estimate of `\(\beta_0\)`.
- `\(b_1\)` is the estimate of `\(\beta_1\)`.
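As a quick numerical check, the logistic transform is available in base R as `plogis()`. A minimal sketch verifying that it matches the formula above and always returns values strictly between 0 and 1 (the coefficient values here are illustrative, not estimates from any data):

```r
# illustrative coefficients (not estimated from any data)
b0 <- 6.72
b1 <- -0.012

# the logistic function written out explicitly
logistic <- function(x) 1 / (1 + exp(-(b0 + b1 * x)))

x <- seq(-1000, 2000, by = 100)    # deliberately extreme inputs
p <- logistic(x)

all(p > 0 & p < 1)                 # probabilities stay strictly between 0 and 1
all.equal(p, plogis(b0 + b1 * x))  # matches base R's built-in logistic CDF
```

`plogis()` is the CDF of the standard logistic distribution, which is exactly the transform used by logistic regression.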
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Logistic regression

For our loan default data, the R output from estimating a logistic regression is as follows:

.panelset[
.panel[.panel-name[R Code]

```r
loans = readr::read_csv("https://www.warin.ca/datalake/courses_data/qmibr/session7/loans.csv")

# rename the dependent variable for ease of use
loans$default = loans$not.fully.paid

m2 <- glm(default ~ fico, data = loans, family = "binomial")
summary(m2)
```
]

.panel[.panel-name[output]
<img src="./images/summarym2.PNG" width="400px" style="display: block; margin: auto;" />
]
]

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Logistic regression

Using the R output (with rounded coefficients), the estimated equation for the probability of a loan default is:

`$$p = \frac{1}{1 + e^{-(6.72 - 0.012 x)}}$$`

With this equation we can predict the probability of loan default for any FICO score `\(x\)`.

- Using the estimated equation, if someone's FICO score is 600, the probability of default is:

`$$p = \frac{1}{1 + e^{-(6.72 - 0.012 \times 600)}} = 0.38$$`

- If someone's FICO score is 850, the probability of default is:

`$$p = \frac{1}{1 + e^{-(6.72 - 0.012 \times 850)}} = 0.03$$`

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Logistic regression

<img src="concept7_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Inference and goodness of fit

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

We learned how to estimate the coefficients of a simple logistic regression and obtain predictions. We now focus on:

- **Hypothesis testing** for the coefficient of a predictor in logistic regression.
- **Confidence intervals** for probabilities.
- An **analogous measure** to `\(R^2\)` in linear regression.
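As a quick check before diving in, the two point predictions computed a few slides back can be reproduced by hand. A sketch using the rounded coefficients 6.72 and -0.012 from the estimated equation:

```r
# rounded coefficients from the estimated equation
b0 <- 6.72
b1 <- -0.012

p_hat <- function(fico) 1 / (1 + exp(-(b0 + b1 * fico)))

round(p_hat(600), 2)  # 0.38: a low FICO score implies a high default probability
round(p_hat(850), 2)  # 0.03: a top FICO score implies a low default probability
```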
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

Again, the simple logistic regression model is given by:

`$$\pi = P(Y=1) = \frac{1}{1+e^{-(\beta_0+\beta_1x)}}$$`

- We usually consider both confidence intervals and hypothesis tests for `\(\beta_1\)` (the population parameter, not its estimate `\(b_1\)`).
- Because the value of `\(\beta_1\)` is difficult to interpret, we will focus mainly on testing whether `\(\beta_1 = 0\)`.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

```
## 
## Call:
## glm(formula = default ~ fico, family = "binomial", data = loans)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9544  -0.6518  -0.5538  -0.4185   2.4863  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  6.7188776  0.5768348   11.65   <2e-16 ***
## fico        -0.0118775  0.0008231  -14.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8424.0  on 9577  degrees of freedom
## Residual deviance: 8196.1  on 9576  degrees of freedom
## AIC: 8200.1
## 
## Number of Fisher Scoring iterations: 4
```

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

The estimate `\(b_1 = -0.0119\)` is difficult to interpret exactly. But an important question is whether the data provide evidence that `\(\beta_1 \neq 0\)`, which would imply that FICO scores are related to the probability of a loan default.

We could construct a confidence interval for `\(\beta_1\)` using the same approach as in linear regression:

`$$b_1 \pm 1.96 \times SE(b_1)$$`

- The exact range of values is usually difficult to interpret.
- However, a confidence interval does tell us whether `\(\beta_1\)` is positive or negative.
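A sketch of this interval computation, plugging in the estimate and standard error reported in the summary output:

```r
# estimate and standard error for the fico coefficient (from summary(m2))
b1 <- -0.0118775
se <- 0.0008231

ci <- b1 + c(-1.96, 1.96) * se
round(ci, 4)  # (-0.0135, -0.0103): the entire interval is below zero
```

Since the whole interval is negative, we can conclude that higher FICO scores are associated with a lower probability of default, even without interpreting the magnitude of the coefficient.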
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

.pull-left[

```r
# load the data set and summarize the included variables
loans = readr::read_csv("https://www.warin.ca/datalake/courses_data/qmibr/session7/loans.csv")

# rename the dependent variable for ease of use
loans$default = loans$not.fully.paid

m2 <- glm(default ~ fico, data = loans, family = "binomial")
summary(m2)
```

```
## 
## Call:
## glm(formula = default ~ fico, family = "binomial", data = loans)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9544  -0.6518  -0.5538  -0.4185   2.4863  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  6.7188776  0.5768348   11.65   <2e-16 ***
## fico        -0.0118775  0.0008231  -14.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8424.0  on 9577  degrees of freedom
## Residual deviance: 8196.1  on 9576  degrees of freedom
## AIC: 8200.1
## 
## Number of Fisher Scoring iterations: 4
```
]

.pull-right[
- 95% confidence interval for `\(\beta_1\)`:

`$$-0.0119 \pm 1.96 \times (0.000823)$$`

`$$= (-0.0119 - 0.001613,\ -0.0119 + 0.001613)$$`

`$$= (-0.0135, -0.0103)$$`
]

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

We are usually interested in the following test:

`$$H_0: \beta_1 = 0$$`

`$$H_1: \beta_1 \neq 0$$`

- If we assume a significance level of `\(\alpha = 0.05\)`, then because the `\(p\)`-value `\(< 0.05\)` we can reject the null hypothesis that `\(\beta_1 = 0\)`. Therefore the FICO score is a statistically significant predictor of the probability of a loan default.
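The z value reported by `summary()` is simply the estimate divided by its standard error. A sketch of the test computed by hand from the values in the output:

```r
# estimate and standard error for the fico coefficient (from summary(m2))
b1 <- -0.0118775
se <- 0.0008231

z <- b1 / se                   # Wald z statistic
p_value <- 2 * pnorm(-abs(z))  # two-sided p-value under the standard normal

round(z, 2)     # -14.43, matching the z value in the output
p_value < 0.05  # TRUE: reject H0 that beta1 = 0
```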
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

Estimating `\(\pi\)`

> A natural type of question to ask is: What is the probability of a loan default for a FICO score of (say) 700?

- As we saw in the last segment, we can answer this with the formula for the **point estimate**, where `\(b_0\)` and `\(b_1\)` are the estimates of `\(\beta_0\)` and `\(\beta_1\)`, and `\(x = 700\)`:

`$$p = \frac{1}{1+e^{-(b_0+b_1x)}}$$`

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

Instead of just the point estimate `\(p\)` of `\(\pi\)`, we can also compute a 95% confidence interval:

`$$p \pm 1.96 \times SE(p)$$`

The formula for `\(SE(p)\)` is complex, so we let R do the work:

```r
p = predict(m2, newdata = data.frame(fico = 700), type = "response", se.fit = T)

# fit, lower, upper for 95% CI
out = c(p$fit, p$fit - 1.96 * p$se.fit, p$fit + 1.96 * p$se.fit)
names(out) = c("Fit", "Lower", "Upper")
round(out, 3)
```

```
##   Fit Lower Upper 
## 0.169 0.161 0.176
```

The 95% confidence interval for `\(\pi\)` is (0.161, 0.176).

> We are 95% confident that the probability that a borrower with a FICO score of 700 would default is between 0.161 and 0.176.

For reference, `confint(m2)` gives 95% confidence intervals for the model coefficients themselves:

```
##                   2.5 %      97.5 %
## (Intercept)  5.59461730  7.85621388
## fico        -0.01350184 -0.01027486
```

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

<img src="concept7_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

Model accuracy

- There is no agreed-upon method for summarizing goodness of fit for logistic regression.
- The difficulty is that the notion of "percent variation explained" by the predictor is not as obvious as with linear regression.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

Various measures have been proposed; we will use the pseudo-`\(R^2\)` measure due to McFadden.

- Logistic regression models are fitted by maximum likelihood, i.e. the parameter estimates are the values that maximize the likelihood of the observed data. McFadden's `\(R^2\)` measure is defined as:

`$$R^{2}_{\text{McFadden}} = 1 - \frac{\log(L_c)}{\log(L_{\text{null}})}$$`

- where `\(L_c\)` denotes the (maximized) likelihood value from the current fitted model, and `\(L_{\text{null}}\)` denotes the corresponding value for the null model, i.e. the model with only an intercept and no covariates.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Inference and goodness of fit

This calculation looks similar to that of `\(R^2\)` for linear regression.

- Analysis of goodness of fit for the loan default logistic regression:

```r
m2 <- glm(default ~ fico, data = loans, family = "binomial")
nullm2 <- glm(default ~ 1, data = loans, family = "binomial")

pseudoR2 <- 1 - logLik(m2) / logLik(nullm2)
round(pseudoR2, 4)
```

```
## 'log Lik.' 0.0271 (df=2)
```

The value is computed to be 0.0271.

- A very low pseudo-`\(R^2\)`: roughly 2.7% of the "variation" is explained by the FICO score.
- We generally do not expect high pseudo-`\(R^2\)` values.
- Values as high as 0.4 or 0.5 already indicate a very good fit for a logistic regression.
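For ungrouped 0/1 data the saturated model's log-likelihood is zero, so the deviance equals `\(-2\log L\)`. McFadden's measure can therefore also be recovered directly from the null and residual deviances reported by `summary(m2)`:

```r
# null and residual deviance from the summary output
null_dev  <- 8424.0
resid_dev <- 8196.1

# deviance = -2 * logLik here, so the factor -2 cancels in the ratio
pseudoR2 <- 1 - resid_dev / null_dev
round(pseudoR2, 4)  # 0.0271, matching the logLik() computation above
```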