class: center, middle, inverse, title-slide

# Introduction to R
## Seminar
### Jonas Björnerstedt
### 2021-09-21

---
class: inverse, center, middle

# Introduction

---

## Overview

- Presentation of R and RStudio
- Data management
  - Create a dataset
  - Transform data
- Data analysis
  - Exploratory data analysis
  - Regressions

#### Prerequisites

- We start from the beginning
- Starting point: Excel spreadsheet

---

## Zoom

* Switching from Zoom to RStudio quickly
  * Use Alt-Tab (Cmd-Tab on Mac)
  * Switches to the previous program
  * Easier to hop back and forth

---
class: inverse, center, middle

# Intro to RStudio

---

## Why R?

- Open source
  - R is free
    - Old argument
    - More and more common that programs are free
    - Google: Android and Google Docs etc.
  - R is open source
    - Supported by the industry: [R Consortium](https://www.r-consortium.org/members)
    - [Google Colab](https://colab.research.google.com/#create=true&language=r) can be used for R
- Best environment for data management and analysis
  - Can be used for this only
- Popular, a lot on the net: [The Popularity of Data Science Software](http://r4stats.com/articles/popularity/)

---

## Why learn R?!
- R is best for data management and data analysis
  - You can use R for this and another program for regressions
- Data management takes a lot of time
  - Important to do it efficiently
  - Leaves more time for _Exploratory Data Analysis_
- Better than Excel
  - Excel is error prone
- Tools are more 'compatible' now
  - Use the best tool for the task

---
class: inverse, center, middle

# R in practice

---

## Login to computer lab

### http://89.45.234.71:8787

<!-- Comment ---------------------
https://sites.google.com/view/sh-econometrics
------------------------------ -->

.pull-left[
* Go to site
* Click on Datalab link
* User name: your full email address provided in _Contact info_
  * Carl.Lund@gmail.com has user name __Carl.Lund@gmail.com__
* Password:

### You provide this in the _Contact info_
]

.pull-right[
![](figures/login.png)
]

---

## RStudio overview

- RStudio is an Integrated Development Environment (IDE) for the statistical language R
- Used by programmers, analysts and statisticians
- You can do _a lot_ of different things in RStudio
  - Ignore menus, windows and symbols that you are not acquainted with
- Three windows (_Panes_) are visible in the environment:
  - Size can be changed, and they can be minimized

---

## Windows

Common windows:

- Console - execute code
- Environment - see defined variables
- Files - manage files
- Help - help text
- Tutorial - step by step instructions

---

## Variables

- Numbers: `a = 2`
- Text string: `k = "Hej"`
- _Environment_ window shows defined variables
- With assignment `=` no result is displayed
- Without assignment the content is displayed
  - The output is short and a little cryptic

```r
a = 2
a
```

```
## [1] 2
```

---

## Expressions and rows

- R continues to parse code until a complete expression is found

```r
a = 2 +
  2
```

- Used when expressions get long, for example when plotting
- Everything that follows a `#` character is a comment

```r
2+2
```

```
## [1] 4
```

```r
# This is a comment, not executed
```

---

## Functions

- Provide input arguments
- Often results in an output value
- Functions can return a value: `round(3.21)` is 3
- The value can be put in a new variable

```r
x = 3.21
y = round(x)
y
```

```
## [1] 3
```

- Functions can also be used to _do something_
  - Such functions are called for their _side effects_ rather than for a return value
  - For example, saving a dataset or plotting a figure

---

## Function syntax

- R is a _functional_ language
  - Everything is done with functions
- Functions sometimes take several arguments
  - In which order should they be provided?
  - Can use _named arguments_

* Ex: round to first decimal

```r
round(x, digits = 1)
```

```
## [1] 3.2
```

- The help file, _code completion_ and hover text boxes help

---

## Console and script

- Console
  - Execute line by line
  - Up arrow recalls previous commands
- Script
  - Ctrl-Enter to execute line
  - Ctrl-Enter on selected text executes selected code
  - Output in console
- To create a script
  - See menu: File > New File
  - The first alternative:
    - R Script

---

## Vectors

- A _vector_ contains several different values of the same type
- Created with the `c()` function

```r
v = c(2,4,6)
v
```

```
## [1] 2 4 6
```

```r
v[2]
```

```
## [1] 4
```

---

## Packages

- A package defines a set of functions
- By loading a package the set of functions that can be used is changed
  - R is a language that you can modify
- Overview of packages in __Packages__ window
- You have to _load_ a package before using a function defined in it

```r
library(ggplot2)
```

---

## Help

- R is the statistics program with by far the most help information on the net
- Help menu, cheat sheets
- Help files in the _Help_ pane
  - Look at the examples at the end
- Videos on the net
- Courses in data analysis on the net
- Best help tool: Google
  - Questions on forums like _stackexchange.com_
  - Many people searching implies good matches in Google searches

---
class: inverse, center, middle

# Dataset (data frames)

---

## Dataset

- Dataset - data.frame
  - Table with named columns
  - Each column has many observations of the same type (rows)
    - integer / floating point / text string / date ...
  - Columns can include complicated data structures
    - Ex: geographical shapes
- Open data frames
  - Double click on file in __Files__ window

---

## Example datasets

- Packages can include datasets
  1. Load a package with `library()`
  2. Load the dataset in the package with the function `data()`

```r
library(gapminder)
data("gapminder")
```

- Can be viewed in the __Environment__ pane

---
class: inverse, center, middle

# Visualizing data

---

## Plot data with ggplot

- Based on the "Grammar of graphics"
- Many different plot types
- Use Esquisse ggplot builder
  - Under __Addins__ in toolbar

---

## ggplot grammar

```r
library(ggplot2)
ggplot(data = mpg) +
  aes(x = displ, y = hwy) +
  geom_point()
```

![](introShort_files/figure-html/unnamed-chunk-11-1.png)<!-- -->

```r
ggplot(mtcars) +
  aes(x = mpg, y = hp, colour = cyl, size = carb) +
  geom_point() +
  scale_color_gradient() +
  labs(x = "X", y = "Y", title = "Hej", subtitle = "HÅ", caption = "Test", color = "Cylinder", size = "Carb") +
  theme_gray()
```

![](introShort_files/figure-html/unnamed-chunk-12-1.png)<!-- -->

- [There is a chapter on plotting](http://r4ds.had.co.nz/data-visualisation.html#the-layered-grammar-of-graphics) in Wickham's textbook

---

## Multiple lines

- A plot can be built from parts

```r
nolegend = theme(legend.position = "none")
ggplot(gapminder) +
  aes(year, lifeExp, color = country) +
  geom_line() +
  nolegend
```

![](introShort_files/figure-html/unnamed-chunk-13-1.png)<!-- -->

---

# Data manipulation

---

## Variable names

Variable names can consist of letters, numbers, the underscore character _ and the period .

```r
Örsköldsvik = 2
MyCamelCaseVariable = 3
my_underscore_variable = 4
```

---

## dplyr package in tidyverse

The dplyr package is for data management. The most important functions for restructuring data are:

- `filter` - Select rows
- `mutate` - Create or modify variables (columns)
- `summarise` - Create a smaller dataset (sum, average,...)
- `group_by` - Create groups, to summarise by group
- `arrange` - Sort

There is a good _cheat sheet_. See menu: `Help > Cheat Sheets > Data Transformation with dplyr`

---

## Conditions

```r
2 < 3
```

```
## [1] TRUE
```

```r
1 == 0
```

```
## [1] FALSE
```

```r
"Stockholm" == "S"
```

```
## [1] FALSE
```

```r
x = (2 < 3)
x
```

```
## [1] TRUE
```

---

## Filtering and conditions

```r
library(dplyr)  # provides filter()
afg = filter(gapminder, country == "Afghanistan")
afg
```

```
## # A tibble: 12 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## 11 Afghanistan Asia       2002    42.1 25268405      727.
## 12 Afghanistan Asia       2007    43.8 31889923      975.
```

---

## Afghanistan's GDP

```r
ggplot(afg) +
  aes(year, gdpPercap) +
  geom_line() +
  geom_vline(xintercept = 1988, color = "blue") +
  geom_vline(xintercept = 2002, color = "green")
```

![](introShort_files/figure-html/unnamed-chunk-18-1.png)<!-- -->

---

## Create and change variables

* Population in billions: redefine the variable `pop` in `df`

```r
df = mutate(afg, pop = pop/10^9)
ggplot(df) + aes(year, pop) + geom_line()
```

![](introShort_files/figure-html/unnamed-chunk-19-1.png)<!-- -->

---

## Import data

* Use a Comma Separated Values file (CSV file)
  * Separator can be a semicolon
* Open with _Environment > Import Dataset_ or select in the _Files_ pane

.pull-left[
```r
indata = "
kommun, år, pris
Solna, 2010, 123
Solna, 2011, 126
Solna, 2012, 128
Stockholm, 2010, 133
Stockholm, 2011, 143
Stockholm, 2012, 163"
```
]

.pull-right[
```r
library(readr)  # provides read_csv()
library(knitr)  # provides kable()
df = read_csv(indata)
kable(df)
```

|kommun    |   år| pris|
|:---------|----:|----:|
|Solna     | 2010|  123|
|Solna     | 2011|  126|
|Solna     | 2012|  128|
|Stockholm | 2010|  133|
|Stockholm | 2011|  143|
|Stockholm | 2012|  163|
]

---

## Compile report

To generate a Word file with your analysis, choose File > Compile Report... or click on the Notebook icon in the toolbar at the top of the pane
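
As a minimal sketch of a script that could be compiled this way, the following uses the dplyr functions from the earlier list that were not demonstrated (`group_by`, `summarise`, `arrange`). It assumes the dplyr and gapminder packages are installed; the names `latest`, `by_continent`, `mean_lifeExp` and `total_pop` are only illustrative.

```r
library(dplyr)
library(gapminder)

# Keep only the rows for the most recent year in the dataset
latest = filter(gapminder, year == 2007)

# Group by continent, then create one summary row per group.
# pop is stored as integer; converting to numeric avoids integer
# overflow when summing large populations.
by_continent = summarise(
  group_by(latest, continent),
  mean_lifeExp = mean(lifeExp),
  total_pop = sum(as.numeric(pop))
)

# Sort with the highest average life expectancy first
arrange(by_continent, desc(mean_lifeExp))
```

The nested call `summarise(group_by(...), ...)` follows the same function-call style as the `filter` and `mutate` examples.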