class: center, middle, inverse, title-slide

# Econometrics B/C - Lecture 5
## Sampling and uncertainty
### Jonas Björnerstedt
### 2021-10-26

---

## Lecture Content

Chapter 6 in text

1. Sampling and uncertainty
    1. Mean - coin toss
    2. Regression
    3. Heteroskedasticity
2. Causation
3. Data pipes

---
class: inverse, center, middle

# Sampling and uncertainty

---

## Coin toss example

- A _fair coin_ has equal probability of heads and tails
- How do we determine if a coin is fair?

--

1. Toss the coin many times
2. Assign 1 if the outcome is heads and -1 if tails
3. Take the average
4. Check if the average is close to zero

--

- But how can we be sure it is fair if, for example, the average is 0.1 instead of zero?
- How sure are we?

---

## Distribution of mean

- What is the relationship between the expected value in the population and sample means taken from the population?
- Can we say how close the average height of 100 randomly sampled people is to the mean value in the population (the expected value)?
- Two approaches:
    - Calculate the distribution of the mean
    - The Central Limit Theorem (CLT)

---

## Design of a test

- Select a _null hypothesis_ `\(H_0\)`
- Create a statistic with a known distribution under `\(H_0\)`
- Calculate how likely the outcome is given the data
- Reject `\(H_0\)` if the probability of the actual outcome occurring is below a chosen critical level

---

## Exact distribution

- If we know the distribution of the random variable `\(X\)`, we can calculate the distribution of `\(\bar X\)`
    - Binomial distribution of a fair coin
    - The mean has a distribution that we can calculate
- Given this calculated distribution, we can see how likely the mean in our sample is
- Reject `\(H_0\)` if it is _very_ unlikely (given some choice of threshold)

---

## [Coin toss - distribution](http://rstudio.sh.se/content/statistics02-figs#section-distribution)

.pull-left[

Outcome | Probability |
---------: | -------------: |
1 | 1/2
-1 | 1/2

* Expected value:
`$$E(Y) = -1*0.5 + 1*0.5 = 0$$`
* Variance:
`$$Var(Y) = (-1)^2*0.5 + (1)^2*0.5 = 1$$`

]

.pull-right[
![](lecture05_files/figure-html/unnamed-chunk-1-1.png)<!-- -->
]

---

## [Distribution of mean `\(\bar Y\)` - 2 obs](http://rstudio.sh.se/content/statistics02-figs#section-distribution)

.pull-left[

First | Second | Mean |
---------: | -------------: | ------:|
1 | 1 | 1
1 | -1 | 0
-1 | 1 | 0
-1 | -1 | -1

* Expected value:
`$$E(\bar Y) = -1*\frac{1}{4} + 1*\frac{1}{4} + 0*\frac{1}{2} = 0$$`

]

.pull-right[
![](lecture05_files/figure-html/unnamed-chunk-2-1.png)<!-- -->
]

* Variance:
`$$Var(\bar Y) = (-1-0)^2 \frac{1}{4} + (1-0)^2*\frac{1}{4} + (0-0)^2\frac{1}{2} = \frac{1}{2}$$`

---

## Law of large numbers

- Taking means with more draws gives more certainty

---

## [Almost normal distribution](http://rstudio.sh.se/content/statistics02-figs#section-almost-normal)

![](lecture05_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

## Central Limit Theorem (CLT)

- Properties of the mean `\(\bar Y\)` of random samples of `\(Y\)`
- Central Limit Theorem (CLT)
    - Take the mean of independent random draws of `\(Y\)`
    - Draws can have (almost) any distribution

1. A large sample implies that the mean is almost normally distributed
    - The normal distribution is characterized by its mean and variance
2. The mean will be close to the expected value
3. The mean will vary due to sampling
    - _Standard error_ of `\(\bar Y\)`
    - Variance proportional to the variance of `\(Y\)`
    - Variance inversely proportional to the sample size `\(n\)`

`$$Var(\bar Y) = \frac{Var(Y)}{n}$$`
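---

## CLT by simulation

A minimal simulation sketch (added for illustration, not part of the original slides): with `\(n = 100\)` fair coin tosses per sample, the CLT says the sample means should center on 0 with standard error close to `\(1/\sqrt{100} = 0.1\)`.

```r
# Sampling distribution of the mean of n coin tosses
set.seed(42)                  # for reproducibility
n = 100                       # tosses per sample
means = replicate(10000, mean(sample(c(-1, 1), n, replace = TRUE)))
mean(means)                   # close to E(Y) = 0
sd(means)                     # close to SD(Y)/sqrt(n) = 0.1
```

A histogram of `means` is approximately normal even though each toss only takes the values -1 and 1.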
---

## Sampling uncertainty

Example: estimating the average height of the population

- Mean of sample - our estimate of the population mean
- Can observe how heights differ around the estimated mean
    - Estimate of the variance of the population height in the sample
- Can calculate how the mean will vary due to sampling
- Can estimate how much sampling affects the estimate
    - Uncertainty of the estimated mean

---

## Standard deviation and standard error

- Can calculate the _standard error_
    - Square root of `\(Var(\bar Y)\)`

`$$SE(\bar Y) = \sqrt{Var(\bar Y)} = \sqrt{\frac{Var(Y)}{n} } = \frac{SD(Y)}{\sqrt{n} }$$`

* As `\(\bar Y\)` has an almost normal distribution
    * our estimate will vary due to sampling
    * but 95% of the draws will be within `\(2SE(\bar Y)\)` of the expected value
* Confidence interval!

---

## Estimating the population average

```r
library(statar)
sum_up(select(lengths, length)) # Summary statistics of lengths
```

```
 Variable │ Obs Missing    Mean  StdDev Min Max 
──────────┼────────────────────────────────────
   length │  23       0 172.435 12.8056 140 192 
```

- The simplest regression. The formula `length ~ 1` estimates __only__ the intercept

```r
lm_robust(length ~ 1, data = lengths)
```

```
            Estimate Std. Error  t value     Pr(>|t|) CI Lower CI Upper DF
(Intercept) 172.4348   2.670159 64.57848 1.400958e-26 166.8972 177.9724 22
```

* Same coefficient 172.4348
* Standard error of the estimate `\(2.6701587 = \frac{SD(lengths)}{\sqrt{n}} = \frac{12.8056312}{\sqrt{23}}\)`
* We can calculate a __confidence interval__!

---

## Population equation

- How does `\(Y\)` depend on `\(X\)`?
- Cannot hope to fully describe the relationship
- Focus on the *conditional expectation*:
    - How does the *expected value* of `\(Y\)` depend on `\(X\)`?
- To do this, we want a function `\(f(X)\)` such that
`$$E(Y|X)=f(X)$$`
- In linear models `\(f\)` is given by the _population regression_ line
`$$E(Y|X) = \beta_{0}+\beta_{1} X$$`
- `\(\beta_{0}\)` and `\(\beta_{1}\)` are _parameters_ in the model

---

## Linear regression model

- Given a _population regression_ function
`$$E(Y|X) = \beta_{0}+\beta_{1} X$$`
- A sample will consist of `\(n\)` observations `\(X_{i}\)` and `\(Y_{i}\)`
- Each pair is independently and identically distributed with
`$$Y_{i} = \beta_{0} + \beta_{1} X_{i} + u_{i}$$`
- `\(u_{i}\)` is the _error term_ with
`$$E(u_{i}|X_{i}) = 0$$`
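---

## Simulating the regression model

A minimal sketch (added for illustration, with arbitrary parameter values `\(\beta_0 = 2\)` and `\(\beta_1 = 3\)`): draw one sample from the population equation and check that OLS recovers estimates close to the true parameters.

```r
# One sample from the population regression E(Y|X) = 2 + 3X
set.seed(1)
n = 50
X = rnorm(n)
u = rnorm(n)          # error term with E(u|X) = 0
Y = 2 + 3 * X + u
coef(lm(Y ~ X))       # estimates close to beta0 = 2 and beta1 = 3
```

Rerunning with a different seed gives slightly different estimates; this is exactly the sampling variation described above.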
---

## Units and size

- Regression coefficients depend on the scale of variables
- Define `m_length` as length in meters (instead of centimeters)

```r
lengths = mutate(lengths,
    m_length = length/100,
    female = as.numeric(gender == "Female") # Make numeric var
    )
sum_up(lengths)
```

```
 Variable │ Obs Missing    Mean  StdDev Min Max 
──────────┼────────────────────────────────────
   female │  23       0  0.3913 0.49901   0   1 
   length │  23       0 172.435 12.8056 140 192 
 m_length │  23       0 1.72435 0.12806 1.4 1.92 
   weight │  23       0  68.087 16.8224  30  95 
```

---

## Rescaling regression

```r
r1 = lm(weight ~ length, data = lengths)
r2 = lm(weight ~ m_length, data = lengths)
```

* The increase in weight from a change in length in meters is 100 times bigger

```r
huxreg(r1, r2)
```

|             |          (1) |          (2) |
|:------------|-------------:|-------------:|
| (Intercept) | -119.761 *** | -119.761 *** |
|             |     (27.697) |     (27.697) |
| length      |    1.089 *** |              |
|             |      (0.160) |              |
| m_length    |              |  108.939 *** |
|             |              |     (16.020) |
| N           |           23 |           23 |
| R2          |        0.688 |        0.688 |
| logLik      |      -83.664 |      -83.664 |
| AIC         |      173.327 |      173.327 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
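---

## Checking the rescaling

A quick check (added for illustration) that the two regressions carry the same information: since `m_length = length / 100`, the slope on `m_length` is exactly 100 times the slope on `length`.

```r
coef(r2)["m_length"] / coef(r1)["length"]   # equals 100
```

The fit is unchanged: N, R2, logLik and AIC are identical in both columns; only the units of the slope differ.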
---

## Standardizing variables

- Subtract the mean to get a zero mean variable (demeaning)
- Let `\(W = Y - \mu_Y\)`

$$ E[W] = E[Y - \mu_Y] = E[Y] - E[\mu_Y] = \mu_Y - \mu_Y = 0$$

- Divide by the standard deviation to get a variable with unit variance
- Let `\(U = \frac{Y}{\sigma_Y}\)`. As `\(Var[aY] = a^2 Var[Y]\)` for any number `\(a\)`, we have:

$$ Var[U] = Var[ \frac{Y}{\sigma_Y}] = Var[ \frac{1}{\sigma_Y}Y] = \frac{1}{\sigma^2_Y} Var[ Y] = 1$$

- Thus for any random variable `\(Y\)`, the _standardized_ variable
`$$\frac{Y-\mu_Y}{\sigma_Y}$$`
has mean 0 and variance 1.

---

## Standardized regression

* Dividing variables by their standard deviation creates variables with unit variance

```r
lengths2 = mutate(lengths,
    s_length = length / sd(length),
    s_weight = weight / sd(weight)
    )
sum_up(lengths2)
```

```
 Variable │ Obs Missing    Mean  StdDev     Min     Max 
──────────┼─────────────────────────────────────────────
   female │  23       0  0.3913 0.49901       0       1 
   length │  23       0 172.435 12.8056     140     192 
 m_length │  23       0 1.72435 0.12806     1.4    1.92 
 s_length │  23       0 13.4655       1 10.9327 14.9934 
 s_weight │  23       0 4.04741       1 1.78334 5.64724 
   weight │  23       0  68.087 16.8224      30      95 
```

---

## Standardizing and the correlation coefficient

- How variability in `\(X\)` relates to variability in `\(Y\)`

```r
rg = lm(s_weight ~ s_length, data = lengths2)
kable(tidy(rg))
```

|term        |   estimate| std.error| statistic|   p.value|
|:-----------|----------:|---------:|---------:|---------:|
|(Intercept) | -7.1191700| 1.6464561| -4.323936| 0.0002995|
|s_length    |  0.8292704| 0.1219506|  6.800054| 0.0000010|

```r
kable(correlate(select(lengths, length, weight)))
```

|term   |    length|    weight|
|:------|---------:|---------:|
|length |        NA| 0.8292704|
|weight | 0.8292704|        NA|

- The slope on `s_length` equals the correlation coefficient between length and weight

---
class: inverse, center, middle

# Causation

---

## Correlation is not causation

Correlation between `\(X_i\)` and `\(Y_i\)` can be due to

1. `\(X_i\)` causes `\(Y_i\)`
2. `\(Y_i\)` causes `\(X_i\)`
3. `\(X_i\)` causes `\(Y_i\)` and `\(Y_i\)` causes `\(X_i\)`
    - *Self reinforcing system* / *simultaneity*
4. `\(W_i\)` causes both `\(X_i\)` and `\(Y_i\)`
    - *Spurious relationship*
    - `\(W_i\)` is a *confounding factor* / *lurking variable*
    - `\(W_i\)` is often time
5. `\(X_i\)` and `\(Y_i\)` are independent
    - *Coincidence in data*
    - If you look long enough you will find patterns

---

## Spurious correlation (Pearson)

- Normalizing `\(X_i\)` and `\(Y_i\)` by a common variable `\(Z_i\)` can make the ratios correlated even when `\(X_i\)` and `\(Y_i\)` are independent
    - Example: dividing by population to get per capita data
- *Spurious correlation* is sometimes used more generally
- "Spurious correlations": <http://www.tylervigen.com/>

---

## [Coincidence in data](http://rstudio.sh.se/content/intro09-figs/)

.pull-left[

* Estimate a relationship when there is none
* With a 95% confidence level, we get a significant coefficient 1/20 of the time
* Regardless of sample size!
* But with a larger sample, the size of the estimated coefficient will be smaller (see the simulation on the next slide)

]

.pull-right[
![](lecture05_files/figure-html/unnamed-chunk-13-1.png)<!-- -->
]
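---

## Simulating coincidence

A minimal sketch (added for illustration): regress pure noise on pure noise many times and count how often the slope is significant at the 5% level - about 1 in 20, whatever the sample size.

```r
# Share of significant slopes when X and Y are independent
set.seed(7)
pvals = replicate(2000, {
    n = 50
    summary(lm(rnorm(n) ~ rnorm(n)))$coefficients[2, 4]  # p-value of slope
})
mean(pvals < 0.05)   # close to 0.05
```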
---

## Statistical and economic significance

* Statistical significance does not imply that the effect is important
    * Discuss the economic significance of a coefficient
* If you look long enough you will find a pattern
* Insignificant results can be just as interesting
    * Saying that something can have at most a small effect, if any

---

## Economic significance

* A judgment of whether the coefficient captures an important effect
* An effect can be statistically significant but still very small
* What is important depends on the setting, for example a reasonable policy change in the explanatory variable
    * Ex: How much of the difference in test scores is due to differences in class size?
* Can require a bit of calculation based on estimated coefficients
* Summary statistics are essential for understanding!

---

## [Tradeoff bias and precision](http://rstudio.sh.se/content/statistics05-figs#section-omitted)

```
===================================================
X                       1.037***          1.038***  
                        (0.031)           (0.030)   
W                      -0.008                       
                       (0.032)                      
Constant                1.024***          1.025***  
                        (0.032)           (0.031)   
---------------------------------------------------
Observations              50                50      
R2                       0.961             0.961    
Adjusted R2              0.960             0.961    
Residual Std. Error  0.224 (df = 47)  0.221 (df = 48)
===================================================
Note:                *p<0.1; **p<0.05; ***p<0.01
```

---

## Omitted variable - Correlation

Inclusion/omission of `\(W\)` depends on correlation and on whether it is in the population equation.

Correlation `\(X_i\)` and `\(W_i\)` | `\(\beta_W\)` | Included | Omitted
------------ | ---| ----------- | --------------
Uncorrelated | `\(\beta_W = 0\)` | |
Correlated | `\(\beta_W = 0\)` | More uncertain |
Uncorrelated | `\(\beta_W \neq 0\)` | | More uncertain
Correlated | `\(\beta_W \neq 0\)` | | __Biased and Inconsistent__

---
class: inverse, center, middle

# Pipes

![](figures/MagrittePipe.jpg)

---

## _Pipes_ in the tidyverse

* Not really needed, but makes code simpler - used in documentation
* Often we want to take a dataset and perform several steps in order
* The pipe operator ` %>% ` facilitates this

```r
select(lengths, length, weight)
```

can be written as

```r
lengths %>% select(length, weight)
```

* The pipe means: put the left hand side as the first argument of the function on the right hand side
* With several steps, piped code is much easier to read

```r
select(filter(lengths, gender == "Female"), length, weight)
```

can be written as

```r
lengths %>% 
    filter(gender == "Female") %>% 
    select(length, weight)
```

---

## Selecting, correlating and formatting with pipe

```r
len2 = select(lengths, length, weight)
len3 = correlate(len2)
kable(len3)
```

* Same code - hard to read

```r
kable(correlate(select(lengths, length, weight)))
```

* Same code - with pipes: `%>%`

```r
lengths %>% 
    select(length, weight) %>% 
    correlate() %>% 
    kable()
```
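---

## Pipes with grouped summaries

A minimal sketch (added for illustration, assuming `dplyr` is loaded) that combines pipes with the sampling ideas from this lecture: the mean length and its standard error, by gender.

```r
lengths %>% 
    group_by(gender) %>% 
    summarize(mean_length = mean(length),
              se = sd(length) / sqrt(n()))   # SE(mean) = SD/sqrt(n)
```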