class: center, middle, inverse, title-slide # Distribution Analysis ### Thierry Warin, PhD ### quantum simulations
*
--- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Navigation tips - Tile view: Just press O (the letter O for Overview) at any point in your slideshow and the tile view appears. Click on a slide to jump to the slide, or press O to exit tile view. - Draw: Click on the pen icon (top right of the slides) to start drawing. - Search: click on the loop icon (bottom left of the slides) to start searching. You can also click on h at any moments to have more navigations tips. --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Introduction Three Basic Features - Center: middle or general tendency - Spread: small means tightly clustered, large means highly variable - Shape: symmetry versus skewness, kurtosis --- class: inverse, center, middle # Outline --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% ### outline 1. Measuring the center 2. Measuring the spread 3. Measuring the shape --- class: inverse, center, middle # Measuring the center --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Measuring the center ## Mean Your data will be easier to work with in R if it follows the three tidy conditions: .pull-left[ ```r mydata <- data.frame(stack.loss) head(mydata,10) ``` ``` ## stack.loss ## 1 42 ## 2 37 ## 3 37 ## 4 28 ## 5 18 ## 6 18 ## 7 19 ## 8 20 ## 9 15 ## 10 14 ``` ] .pull-right[ ```r mean(mydata$stack.loss) ``` ``` ## [1] 17.52381 ``` ] --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Measuring the center .panelset[ .panel[.panel-name[sample median] How to find it 1. sort the data into an increasing sequence of n numbers 2. `\(\tilde{x}\)` lies in position `\((n + 1)/2\)` - Good: resistant to extreme values, easy to describe - Bad: not as mathematically tractable, need to sort the data to calculate ```r median(mydata$stack.loss) ``` ``` ## [1] 15 ``` ] .panel[.panel-name[trimmed mean] How to find it 1. "trim" a proportion of data from both ends of the ordered list 2. find the sample mean of what's left - Good: also resistant to extreme values, has good properties, too - Bad: still need to sort data to get rid of outliers ```r mean(mydata$stack.loss, trim = 0.05) ``` ``` ## [1] 16.78947 ``` ] .panel[.panel-name[Sample quantile] - Quantile: approximately `\(100 \times p\%\)` of the data fall below the value `\(\tilde{q}_{p}\)` ```r quantile(mydata$stack.loss, probs = 0.75) ``` ``` ## 75% ## 19 ``` ```r quantile(mydata$stack.loss, probs = c(0, 0.25, 0.80)) ``` ``` ## 0% 25% 80% ## 7 11 20 ``` ```r quantile(mydata$stack.loss) ``` ``` ## 0% 25% 50% 75% 100% ## 7 11 15 19 42 ``` ] ] --- class: inverse, center, middle # Measuring the spread --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Measuring the spread .panelset[ .panel[.panel-name[The sample variance] The sample variance `\(s^{2}\)` `$$s^{2} = \frac{1}{n - 1}\sum_{i = 1}^{n}(x_{i} - \overline{x})^{2}$$` The sample standard deviation is `\((s = \sqrt{s^{2}})\)` - Good: tractable, nice mathematical/statistical properties - Bad: sensitive to extreme values ```r var(mydata$stack.loss) ``` ``` ## [1] 103.4619 ``` ] .panel[.panel-name[std deviation] ```r sd(stack.loss) ``` ``` ## [1] 10.17162 ``` ] .panel[.panel-name[IQR] The Interquartile Range `$$IQR = \tilde{q}_{0.75} - \tilde{q}_{0.25}$$` - Good: resistant to outliers (see session 3) - Bad: only considers middle `\(50\%\)` of the data ```r quantile(mydata$stack.loss, probs = c(0.25, 0.75)) ``` ``` ## 25% 75% ## 11 19 ``` ```r IQR(mydata$stack.loss) ``` ``` ## [1] 8 ``` ] ] --- class: inverse, center, middle # Measuring the shape --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Measuring the shape - Symmetric or skewed - right (positive) and left (negative) skewness - Kurtosis - leptokurtic - steep peak, heavy tails - platykurtic - flatter, thin tails - mesokurtic - right in the middle --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Measuring the shape .center[ ![](./images/skew_kurto.jpg) ] --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Measuring the shape ## sample skewness The sample skewness `\(g_{1}\)`: `$$g_{1} = \frac{1}{n}\frac{\sum_{i = 1}^{n}(x_{i} - \overline{x})^{3}}{s^{3}}$$` Things to notice: - `\(-\infty < g_{1} < \infty\)` - sign of `\(g_{1}\)` indicates direction of skewness `\(\pm)\)` --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Measuring the shape ## sample skewness ```r library(e1071) skewness(mydata$stack.loss) ``` ``` ## [1] 1.156401 ``` If `\(\vert g_{1} \vert > 2\sqrt{6/n}\)`, then the data distribution is substantially skewed in the direction of the sign of `\(g_{1}\)` ```r skewness(mydata$stack.loss) ``` ``` ## [1] 1.156401 ``` ```r 2*sqrt(6/length(mydata$stack.loss)) ``` ``` ## [1] 1.069045 ``` --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Measuring the shape ## sample excess kurtosis The sample excess kurtosis `\(g_{2}\)`: `$$g_{2} = \frac{1}{n}\frac{\sum_{i = 1}^{n}(x_{i} - \overline{x})^{4}}{s^{4}} - 3$$` Things to note: - `\(-2 \leq g_{2} < \infty\)` - `\(g_{2} > 0\)` indicates leptokurtosis, `\(g_{2} < 0\)` indicates platykurtosis --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% # Measuring the shape ## sample excess kurtosis ```r library(e1071) kurtosis(mydata$stack.loss) ``` ``` ## [1] 0.1343524 ``` If `\(\vert g_{2} \vert > 4\sqrt{6/n}\)`, then the data distribution is substantially kurtic. ```r kurtosis(mydata$stack.loss) ``` ``` ## [1] 0.1343524 ``` ```r 4*sqrt(6/length(mydata$stack.loss)) ``` ``` ## [1] 2.13809 ``` --- class: inverse, center, middle # Conclusion --- background-image: url(./images/qslogo.PNG) background-size: 100px background-position: 90% 8% ## Conclusion - What you have just seen should allow you already to do some **descriptive statistics**-based analyses of your data. - This is what is called **Exploratory Data Analysis** in Data Science. - In just a couple of slides, we have covered the basic mathematics and the R code: mean, variance, standard-deviation, median, and shape.