Distribution Analysis

class: center, middle, inverse, title-slide

# Distribution Analysis
### Thierry Warin, PhD
### quantum simulations<a style="color:#6f97d0">*</a>

---

background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Navigation tips

- Tile view: Just press O (the letter O for Overview) at any point in your slideshow and the tile view appears. Click on a slide to jump to the slide, or press O to exit tile view.

- Draw: Click on the pen icon (top right of the slides) to start drawing.

- Search: click on the loop icon (bottom left of the slides) to start searching.

You can also click on h at any moments to have more navigations tips.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Introduction

Three Basic Features

- Center: middle or general tendency
- Spread: small means tightly clustered, large means highly variable
- Shape: symmetry versus skewness, kurtosis

---
class: inverse, center, middle

# Outline

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

### outline

1. Measuring the center

2. Measuring the spread

3. Measuring the shape

---
class: inverse, center, middle

# Measuring the center

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Measuring the center

## Mean

Your data will be easier to work with in R if it follows the three tidy conditions:

.pull-left[

```r
mydata <- data.frame(stack.loss)
head(mydata,10)
```

```
##    stack.loss
## 1          42
## 2          37
## 3          37
## 4          28
## 5          18
## 6          18
## 7          19
## 8          20
## 9          15
## 10         14
```

]

.pull-right[

```r
mean(mydata$stack.loss)
```

```
## [1] 17.52381
```

]

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Measuring the center

.panelset[
.panel[.panel-name[sample median]

How to find it

1. sort the data into an increasing sequence of n numbers

2. `$\tilde{x}$` lies in position `$(n + 1)/2$`

- Good: resistant to extreme values, easy to describe

- Bad: not as mathematically tractable, need to sort the data to calculate

```r
median(mydata$stack.loss)
```

```
## [1] 15
```
]
.panel[.panel-name[trimmed mean]

How to find it

1. "trim" a proportion of data from both ends of the ordered list

2. find the sample mean of what's left

- Good: also resistant to extreme values, has good properties, too

- Bad: still need to sort data to get rid of outliers

```r
mean(mydata$stack.loss, trim = 0.05)
```

```
## [1] 16.78947
```
]

.panel[.panel-name[Sample quantile]

- Quantile: approximately `$100 \times p\%$` of the data fall below the value `$\tilde{q}_{p}$`

```r
quantile(mydata$stack.loss, probs = 0.75)
```

```
## 75% 
##  19
```

```r
quantile(mydata$stack.loss, probs = c(0, 0.25, 0.80))
```

```
##  0% 25% 80% 
##   7  11  20
```

```r
quantile(mydata$stack.loss)
```

```
##   0%  25%  50%  75% 100% 
##    7   11   15   19   42
```
]
]

---
class: inverse, center, middle

# Measuring the spread

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Measuring the spread

.panelset[
.panel[.panel-name[The sample variance]

The sample variance `$s^{2}$`

`$$s^{2} = \frac{1}{n - 1}\sum_{i = 1}^{n}(x_{i} - \overline{x})^{2}$$`

The sample standard deviation is `$(s = \sqrt{s^{2}})$`

- Good: tractable, nice mathematical/statistical properties

- Bad: sensitive to extreme values

```r
var(mydata$stack.loss)
```

```
## [1] 103.4619
```
]
.panel[.panel-name[std deviation]

```r
sd(stack.loss)
```

```
## [1] 10.17162
```
]

.panel[.panel-name[IQR]

The Interquartile Range `$$IQR = \tilde{q}_{0.75} - \tilde{q}_{0.25}$$`

- Good: resistant to outliers (see session 3)

- Bad: only considers middle `$50\%$` of the data

```r
quantile(mydata$stack.loss, probs = c(0.25, 0.75))
```

```
## 25% 75% 
##  11  19
```

```r
IQR(mydata$stack.loss)
```

```
## [1] 8
```
]
]

---
class: inverse, center, middle

# Measuring the shape

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Measuring the shape

- Symmetric or skewed

- right (positive) and left (negative) skewness

- Kurtosis

- leptokurtic - steep peak, heavy tails

- platykurtic - flatter, thin tails

- mesokurtic - right in the middle
  
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Measuring the shape

.center[

![](./images/skew_kurto.jpg)
]

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Measuring the shape

## sample skewness

The sample skewness `$g_{1}$`:

`$$g_{1} = \frac{1}{n}\frac{\sum_{i = 1}^{n}(x_{i} - \overline{x})^{3}}{s^{3}}$$`

Things to notice:

- `$-\infty < g_{1} < \infty$`

- sign of `$g_{1}$` indicates direction of skewness `$\pm)$`

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Measuring the shape

## sample skewness

```r
library(e1071)
skewness(mydata$stack.loss)
```

```
## [1] 1.156401
```

If `$\vert g_{1} \vert > 2\sqrt{6/n}$`, then the data distribution is substantially skewed in the direction of the sign of `$g_{1}$`

```r
skewness(mydata$stack.loss)
```

```
## [1] 1.156401
```

```r
2*sqrt(6/length(mydata$stack.loss))
```

```
## [1] 1.069045
```

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Measuring the shape

## sample excess kurtosis

The sample excess kurtosis `$g_{2}$`:

`$$g_{2} = \frac{1}{n}\frac{\sum_{i = 1}^{n}(x_{i} - \overline{x})^{4}}{s^{4}} - 3$$`

Things to note:

- `$-2 \leq g_{2} < \infty$`

- `$g_{2} > 0$` indicates leptokurtosis, `$g_{2} < 0$` indicates platykurtosis

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
# Measuring the shape

## sample excess kurtosis

```r
library(e1071)
kurtosis(mydata$stack.loss)
```

```
## [1] 0.1343524
```

If `$\vert g_{2} \vert > 4\sqrt{6/n}$`, then the data distribution is substantially kurtic.

```r
kurtosis(mydata$stack.loss)
```

```
## [1] 0.1343524
```

```r
4*sqrt(6/length(mydata$stack.loss))
```

```
## [1] 2.13809
```

---
class: inverse, center, middle

# Conclusion

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%
## Conclusion

- What you have just seen should allow you already to do some **descriptive statistics**-based analyses of your data.

- This is what is called **Exploratory Data Analysis** in Data Science.

- In just a couple of slides, we have covered the basic mathematics and the R code: mean, variance, standard-deviation, median, and shape.