class: center, middle, inverse, title-slide

# Introduction to R
## Seminar 2
### Jonas Björnerstedt
### 2021-10-03

---
class: inverse, center, middle

# Datasets (data frames)

---

## Vectors

- A _vector_ contains several values of the same type
- Created with the `c()` function

```r
v = c(2,4,6)
v
```

```
## [1] 2 4 6
```

```r
v[2]
```

```
## [1] 4
```

---

## Lists

- A list is a collection of named values
- Values can be of different types

```r
countrydata = list(country = "Belgium", population = 13.5)
countrydata$country
```

```
## [1] "Belgium"
```

---

## A dataset is a kind of list

- Example: put a vector of names in the first element and a vector of numbers in the second

```r
countrydata = list(country = c("Belgium", "Sweden"),
                   population = c(13.5, 10.1))
countrydata
```

```
## $country
## [1] "Belgium" "Sweden"
##
## $population
## [1] 13.5 10.1
```

- Can access columns and elements

```r
countrydata$country
```

```
## [1] "Belgium" "Sweden"
```

```r
countrydata$population[2]
```

```
## [1] 10.1
```

---

## Dataset

- Dataset - `data.frame`
    - Table with named columns
    - Each column has many observations (rows) of the same type
        - integer / floating point / text string / date ...
    - Columns can include complicated data structures
        - Ex: geographical shapes
- Open data frames
    - Double click on file in __Files__ window

---

## data.frame and _objects_

- Convert our list into a `data.frame` object

```r
df = as.data.frame(countrydata)
df
```

```
##   country population
## 1 Belgium       13.5
## 2  Sweden       10.1
```

- Different types of objects in R
    - Can be simple (strings) or complicated (lists or data.frames)
- Can put an object in a variable
    - Note that `df` is a variable in the Global Environment

---

## data.frame and _objects_

- Objects have types
    - Ex.
      `data.frame` and `list` objects
- A `data.frame` is a kind of `list`
    - It behaves like a list
    - But it has more features
- Ex: a cat is a type of mammal
    - Has the properties of all mammals (heart, eyes)
    - But has more "features" (whiskers)
- Different types of data frames exist

---

## Example datasets

- Packages can include datasets

1. Install a package in the __Packages__ pane
    - Click on Install and enter the package name (here _gapminder_) to install the package
1. Load the package with `library()`
2. Load the dataset from the package with the function `data()`

```r
library(gapminder)
data("gapminder")
```

- Can be viewed in the __Environment__ pane

---
class: inverse, center, middle

# Visualizing data

---

## Plot data with ggplot

- Based on the "Grammar of graphics"
- Many different plot types
- Use Esquisse ggplot builder
    - Under __Addins__ in toolbar

---

## ggplot grammar

```r
library(ggplot2)
ggplot(data = mpg) +
  aes(x = displ, y = hwy) +
  geom_point()
```

![](intro2_files/figure-html/unnamed-chunk-9-1.png)

```r
ggplot(mtcars) +
  aes(x = mpg, y = hp, colour = cyl, size = carb) +
  geom_point() +
  scale_color_gradient() +
  labs(x = "X", y = "Y", title = "Hej", subtitle = "HÅ", caption = "Test", color = "Cylinder", size = "Carb") +
  theme_gray()
```

---

## ggplot components

A ggplot is built from components. Three are required:

1. data - dataset with variables
2. aesthetic - mapping of variables to various properties
3. geometry - type of plot: points, line, etc...

* There can be several of each, overlays etc
* There are additional things: themes, multiple plots, ...

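
As a minimal sketch of the first bullet, several geometries can be layered on the same data and aesthetic. It reuses the `mpg` dataset that ships with ggplot2; the choice of `geom_smooth()` as the second layer is an assumption made for illustration.

```r
library(ggplot2)

# data: mpg; aesthetic: displ on the x axis, hwy on the y axis
p = ggplot(data = mpg) +
  aes(x = displ, y = hwy) +
  geom_point() +                 # first geometry: one point per car
  geom_smooth(method = "lm")     # second geometry: fitted straight line on top

p
```

Both layers inherit the data and the aesthetic mapping given at the top of the plot.
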
```
ggplot(data = <DATA>) +
  aes(<MAPPINGS>) +
  <GEOM_FUNCTION>(<OPTIONS>)
```

- [There is a chapter on plotting](http://r4ds.had.co.nz/data-visualisation.html#the-layered-grammar-of-graphics) in Wickham's textbook

---

## ggplot themes

- Themes can be used to change general appearance
    - [Package ggthemes](https://jrnold.github.io/ggthemes/) includes several themes
- One or more theme components can be added
    - Can be used to change what is included
- To remove the explanatory _legend_ you can add:

```r
ggplot() + ...
* + theme(legend.position = "none")
```

---

## Multiple lines

- A plot can be built from parts

```r
nolegend = theme(legend.position = "none")

ggplot(gapminder) +
  aes(year, lifeExp, color = country) +
  geom_line() +
  nolegend
```

![](intro2_files/figure-html/unnamed-chunk-12-1.png)

---
class: inverse, center, middle

# Data manipulation

---

## Variable names

Variable names can consist of letters, numbers, the underscore character `_` and the period `.`

```r
Örsköldsvik = 2
MyCamelCaseVariable = 3
my_underscore_variable = 4
```

Names that contain spaces or other special characters must be enclosed in backticks:

```r
`My neat variable` = 4
```

---

## dplyr package in tidyverse

The dplyr package is for data management. The most important functions for restructuring data are:

- `filter` - Select rows
- `select` - Select columns
- `mutate` - Create or modify variables (columns)
- `summarise` - Create a smaller dataset (sum, average,...)
- `group_by` - Create groups - to summarise by group

There is a good _cheat sheet_. See the menu: `Help > Cheatsheets > Data transformations with dplyr`

---

## Data management in R with dplyr

The functions in `dplyr` create a new dataset `df2` from dataset `df1`

- `df2 = filter(df1, conditions)`
    - Choose rows based on condition
- `df2 = mutate(df1, formulas)`
    - Create or modify variables
- `df2 = select(df1, -column)`
    - Remove column from dataset

---

## Assignment and equality

* Programming languages have two types of equality
    1. Assignment - set equal `x = 2` `x <- 2`
    2. Comparison - are they equal?
       (true/false) `x == 2`
* Assignment is __not the same__ as equality in math. We can write `x = x + 2`
    * Set the value of `x` equal to its previous value plus 2

---

## Comparison - logical operators

* Comparison is an operator just like `+`, `-`, `*` and `/`
* Returns `TRUE` or `FALSE`

```r
2 < 3
```

```
## [1] TRUE
```

```r
1 == 0
```

```
## [1] FALSE
```

```r
"Stockholm" == "S"
```

```
## [1] FALSE
```

```r
x = (2 < 3)
x
```

```
## [1] TRUE
```

---

## Filtering and conditions

```r
library(dplyr)  # filter() here is dplyr's, not the base stats::filter()
afg = filter(gapminder, country == "Afghanistan", year < 1970)
afg
```

```
## # A tibble: 4 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
```

---

## Filtering and plot

```r
afg = filter(gapminder, country == "Afghanistan")
germany = filter(gapminder, country == "Germany")
ggplot(afg) +
  aes(year, gdpPercap) +
  geom_line() +
  geom_line(data = germany)
```

![](intro2_files/figure-html/unnamed-chunk-18-1.png)

---

## Afghanistan's GDP

```r
ggplot(afg) +
  aes(year, gdpPercap) +
  geom_line() +
  geom_vline(xintercept = 1988, color="blue") +
  geom_vline(xintercept = 2002, color="green")
```

![](intro2_files/figure-html/unnamed-chunk-19-1.png)

---

## Import data

* Use a Comma Separated Values file (CSV file)
    * The separator can also be a semicolon
* Open with _Environment > Import Dataset_ or select the file in the __Files__ pane

.pull-left[

```r
indata = "
kommun, år, pris
Solna, 2010, 123
Solna, 2011, 126
Solna, 2012, 128
Stockholm, 2010, 133
Stockholm, 2011, 143
Stockholm, 2012, 163"
```

]

.pull-right[

```r
library(readr)  # provides read_csv()
library(knitr)  # provides kable()
df = read_csv(indata)
kable(df)
```

|kommun    |   år| pris|
|:---------|----:|----:|
|Solna     | 2010|  123|
|Solna     | 2011|  126|
|Solna     | 2012|  128|
|Stockholm | 2010|  133|
|Stockholm | 2011|  143|
|Stockholm | 2012|  163|

]

---

## Missing values

- The price is missing for 1999
- A missing value is shown as `NA` (Not Available)


.pull-left[

```r
indata = "år, pris
1998,1
1999,
2000,3
2001,4"
```

]

.pull-right[

```r
df = read_csv(indata)
kable(df)
```

|   år| pris|
|----:|----:|
| 1998|    1|
| 1999|   NA|
| 2000|    3|
| 2001|    4|

]
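
---

## Working with missing values

A minimal sketch of how the `NA` in the table above could be handled, using base R only. The data frame is rebuilt by hand, and the English column names `year` and `price` are stand-ins for `år` and `pris`, assumed for this example.

```r
# Hypothetical copy of the table above, with English column names
prices = data.frame(year  = c(1998, 1999, 2000, 2001),
                    price = c(1, NA, 3, 4))

# is.na() returns TRUE where a value is missing
missing = is.na(prices$price)
missing

# A calculation that includes NA gives NA ...
mean(prices$price)

# ... unless missing values are explicitly removed
avg = mean(prices$price, na.rm = TRUE)
avg

# Keep only the rows where the price is observed
complete = prices[!missing, ]
complete
```

Note that `NA == NA` is itself `NA`, so use `is.na()` rather than `==` to test for missing values.
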