class: center, middle, inverse, title-slide

# Data Types & Structures
### Thierry Warin, PhD
### quantum simulations

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Navigation tips

- Tile view: Press O (the letter O, for Overview) at any point in the slideshow to open the tile view. Click a slide to jump to it, or press O again to exit the tile view.

- Draw: Click the pen icon (top right of the slides) to start drawing.

- Search: Click the magnifying-glass icon (bottom left of the slides) to start searching.

You can also press h at any moment for more navigation tips.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# A tidyverse approach

"The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures." (https://www.tidyverse.org/)

.center[![](images/hex-tidyverse.png)]

---
class: inverse, center, middle

# Outline

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

### Outline

1. Tidy Data
2. Data Types
3. Data Structures

---
class: inverse, center, middle

# Tidy Data

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Tidy Data

Your data will be easier to work with in R if it meets the three tidy-data conditions:

.panelset[
.panel[.panel-name[Variable]

Each variable is saved in its own column.

<img src="./images/variables.png" width="200px" />
]

.panel[.panel-name[Observation]

Each observation is saved in its own row.

<img src="./images/observations.png" width="200px" />
]

.panel[.panel-name[Value]

Each value is placed in its own cell.

<img src="./images/values.png" width="200px" />
]
]

---
class: inverse, center, middle

# Data Types

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Data Types

R has 5 basic data types:

- character
- numeric
- integer
- logical
- complex

*A sixth type, raw, also exists, but we will not discuss it here.*

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Data Types

.panelset[
.panel[.panel-name[character]

.pull-left[
```r
x <- "hello"
z <- c("hello", "sir")
```
]

.pull-right[
```r
typeof(x)
```

```
## [1] "character"
```

```r
typeof(z)
```

```
## [1] "character"
```

```r
length(x)
```

```
## [1] 1
```

```r
length(z)
```

```
## [1] 2
```
]
]

.panel[.panel-name[numeric]

.pull-left[
```r
y <- c(1.2, 4.6, 3.5)
```
]

.pull-right[
```r
typeof(y)
```

```
## [1] "double"
```

```r
str(y)
```

```
##  num [1:3] 1.2 4.6 3.5
```
]
]

.panel[.panel-name[integer]

.pull-left[
```r
y <- 1:10
```
]

.pull-right[
```r
typeof(y)
```

```
## [1] "integer"
```

```r
str(y)
```

```
##  int [1:10] 1 2 3 4 5 6 7 8 9 10
```
]
]

.panel[.panel-name[logical]

.pull-left[
```r
answer <- c(TRUE, FALSE, NA)
```
]

.pull-right[
```r
typeof(answer)
```

```
## [1] "logical"
```
]
]

.panel[.panel-name[complex]

.pull-left[
```r
z <- 1 + 2i
```
]

.pull-right[
```r
typeof(z)
```

```
## [1] "complex"
```

```r
class(z)
```

```
## [1] "complex"
```
]
]
]

---
class: inverse, center, middle

# Data Structures

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Data Structures

R has many data structures, including:

- atomic vector
- list
- matrix
- data frame
- factor

The next slides cover the atomic vector and the data frame.

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Data Structures

.panelset[
.panel[.panel-name[Vector]

A vector is a collection of elements of the same type, most commonly character, logical, integer or numeric. You can create a vector by specifying its content directly.

.pull-left[
```r
x <- c(3.14, 1.5, 2.66)
x1 <- c(1L, 2L, 3L)
y <- c(TRUE, FALSE, FALSE, TRUE)
z <- c("tech", "simulation", "quantum")
```
]

.pull-right[
```r
class(x)
```

```
## [1] "numeric"
```

```r
length(x1)
```

```
## [1] 3
```

```r
str(y)
```

```
##  logi [1:4] TRUE FALSE FALSE TRUE
```
]
]

.panel[.panel-name[Data frame]

The data frame is one of the most important data structures in R. It is the de facto structure for tabular data and the one we use for statistics.

.pull-left[
Useful functions:

- head() - see the first 6 rows
- tail() - see the last 6 rows
- dim() - see the dimensions
- nrow() - number of rows
- ncol() - number of columns
- str() - structure of each column
- names() - get the column names
]

.pull-right[
```r
str(iris)
```

```
## 'data.frame': 150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```

```r
head(iris, 4)
```

```
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
```
]
]
]

---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Data structures for analysis

.panelset[
.panel[.panel-name[Time series data]

A collection of observations (behaviors) for a single subject (entity) at different points in time, generally equally spaced.

Example: Max Temperature, Humidity and Wind (all three behaviors) in New York City (single entity), collected on the first day of every year (multiple intervals of time).

<img src="./images/time.png" width="500px" />
]

.panel[.panel-name[Cross-sectional data]

A collection of observations (behaviors) for multiple subjects (entities) at a single point in time.

Example: Max Temperature, Humidity and Wind (all three behaviors) in New York City, San Francisco, Boston and Chicago (multiple entities) on 1/1/2015 (single instance).

<img src="./images/cross.png" width="500px" />
]

.panel[.panel-name[Panel Data (Longitudinal Data)]

Also called cross-sectional time-series data, because it combines the two previous types: a collection of observations for multiple subjects at multiple points in time.

Example: Max Temperature, Humidity and Wind (all three behaviors) in New York City, San Francisco, Boston and Chicago (multiple entities), collected on the first day of every year (multiple intervals of time).

<img src="./images/panel.png" width="500px" />
]
]
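
---
background-image: url(./images/qslogo.PNG)
background-size: 100px
background-position: 90% 8%

# Data structures for analysis

The three layouts on the previous slide can be sketched with a single tidy data frame. This is only an illustration: the object names (`panel`, `time_series`, `cross_section`), the column names and every temperature value are made up for the example, and base R subsetting is used so the code runs without extra packages.

```r
# Hypothetical readings: all values below are invented for illustration only.
# Panel data: several entities (cities) observed at several points in time.
panel <- data.frame(
  city     = rep(c("New York City", "Boston", "Chicago"), each = 3),
  year     = rep(2013:2015, times = 3),
  max_temp = c(5, 2, 4, 1, -2, 0, -3, -9, -1),
  stringsAsFactors = FALSE
)

# Time series: one entity, several points in time
time_series <- panel[panel$city == "New York City", ]

# Cross-section: several entities, one point in time
cross_section <- panel[panel$year == 2015, ]

dim(panel)           # 9 rows, 3 columns
nrow(time_series)    # 3
nrow(cross_section)  # 3
```

Each row is one observation (a city in a given year) and each column is one variable, so the same tidy table yields a time series or a cross-section simply by filtering on `city` or on `year`.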