class: center, middle, inverse, title-slide

# Econometrics B/C - Lecture 5
## Sampling and uncertainty
### Jonas Björnerstedt
### 2021-10-26

---

## Lecture Content

Chapter 6 in text

1. Sampling and uncertainty
    1. Mean - coin toss
    2. Regression
    3. Heteroskedasticity
2. Causation
3. Data pipes

---
class: inverse, center, middle

# Sampling and uncertainty

---

## Coin toss example

- A _fair coin_ has equal probability of heads and tails
- How do we determine if a coin is fair?

--

1. Toss the coin many times
2. Assign 1 if the outcome is heads and -1 if tails
3. Take the average
4. Check if the average is close to zero

--

- But how can we be sure it is fair if, for example, the average is 0.1 instead of zero?
- How sure are we?

---

## Distribution of mean

- What is the relationship between the expected value in the population and sample means taken from the population?
- Can we say how close the average height of 100 randomly sampled people is to the mean value in the population (the expected value)?
- Two approaches:
    - Calculate the distribution of the mean
    - The Central Limit Theorem (CLT)

---

## Design of a test

- Select a _null hypothesis_ `\(H_0\)`
- Create a statistic with a known distribution under `\(H_0\)`
- Calculate how likely the outcome is given the data
- Reject `\(H_0\)` if the probability of the actual outcome occurring is below a chosen critical level

---

## Exact distribution

- If we know the distribution of the random variable `\(X\)`, we can calculate the distribution of `\(\bar X\)`
    - Binomial distribution of a fair coin
    - The mean has a distribution that we can calculate
- Given this calculated distribution, we can see how likely the mean in our sample is
- Reject `\(H_0\)` if it is _very_ unlikely (given some choice of threshold)

---

## [Coin toss - distribution](http://rstudio.sh.se/content/statistics02-figs#section-distribution)

.pull-left[

Outcome | Probability |
---------: | -------------: |
1 | 1/2
-1 | 1/2

* Expected value:
`$$E(Y) = -1*0.5 + 1*0.5 = 0$$`
* Variance:
`$$Var(Y) = (-1)^2*0.5 + (1)^2*0.5 = 1$$`

]

.pull-right[
![](lecture05_files/figure-html/unnamed-chunk-1-1.png)<!-- -->
]

---

## [Distribution of mean `\(\bar Y\)` - 2 obs](http://rstudio.sh.se/content/statistics02-figs#section-distribution)

.pull-left[

First | Second | Mean |
---------: | -------------: | ------:|
1 | 1 | 1
1 | -1 | 0
-1 | 1 | 0
-1 | -1 | -1

* Expected value:
`$$E(\bar Y) = -1*\frac{1}{4} + 1*\frac{1}{4} + 0*\frac{1}{2} = 0$$`

]

.pull-right[
![](lecture05_files/figure-html/unnamed-chunk-2-1.png)<!-- -->
]

* Variance:
`$$Var(\bar Y) = (-1-0)^2 \frac{1}{4} + (1-0)^2*\frac{1}{4} + (0-0)^2\frac{1}{2} = \frac{1}{2}$$`

---

## Law of large numbers

- Taking means with more draws gives more certainty

---

## [Almost normal distribution](http://rstudio.sh.se/content/statistics02-figs#section-almost-normal)

![](lecture05_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

## Central Limit Theorem (CLT)

- Properties of the mean `\(\bar Y\)` of random samples of `\(Y\)`
- Central Limit Theorem (CLT)
    - Take the mean of independent random draws of `\(Y\)`
    - Draws can have (almost) any distribution

1. A large sample implies that the mean is almost normally distributed
    - The normal distribution is characterized by its mean and variance
2. The mean will be close to the expected value
3. The mean will vary due to sampling
    - _Standard error_ of `\(\bar Y\)`
    - Variance proportional to the variance of `\(Y\)`
    - Variance inversely proportional to the sample size `\(n\)`

`$$Var(\bar Y) = \frac{Var(Y)}{n}$$`
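---

## CLT by simulation

A minimal simulation sketch (added for illustration, not part of the original slides): with `\(n = 100\)` fair coin tosses per sample, the CLT says the sample means should center on 0 with standard error close to `\(1/\sqrt{100} = 0.1\)`.

```r
# Sampling distribution of the mean of n coin tosses
set.seed(42)                  # for reproducibility
n = 100                       # tosses per sample
means = replicate(10000, mean(sample(c(-1, 1), n, replace = TRUE)))
mean(means)                   # close to E(Y) = 0
sd(means)                     # close to SD(Y)/sqrt(n) = 0.1
```

A histogram of `means` is approximately normal even though each toss only takes the values -1 and 1.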
---

## Sampling uncertainty

Example: estimating the average height of the population

- Mean of sample - our estimate of the population mean
- Can observe how heights differ around the estimated mean
    - Estimate of the variance of the population height in the sample
- Can calculate how the mean will vary due to sampling
- Can estimate how much sampling affects the estimate
    - Uncertainty of the estimated mean

---

## Standard deviation and standard error

- Can calculate the _standard error_
    - Square root of `\(Var(\bar Y)\)`

`$$SE(\bar Y) = \sqrt{Var(\bar Y)} = \sqrt{\frac{Var(Y)}{n} } = \frac{SD(Y)}{\sqrt{n} }$$`

* As `\(\bar Y\)` has an almost normal distribution
    * our estimate will vary due to sampling
    * but 95% of the draws will be within `\(2SE(\bar Y)\)` of the expected value
* Confidence interval!

---

## Estimating the population average

```r
library(statar)
sum_up(select(lengths, length)) # Summary statistics of lengths
```

```
 Variable │ Obs Missing    Mean  StdDev Min Max 
──────────┼────────────────────────────────────
   length │  23       0 172.435 12.8056 140 192 
```

- The simplest regression. The formula `length ~ 1` estimates __only__ the intercept

```r
lm_robust(length ~ 1, data = lengths)
```

```
            Estimate Std. Error  t value     Pr(>|t|) CI Lower CI Upper DF
(Intercept) 172.4348   2.670159 64.57848 1.400958e-26 166.8972 177.9724 22
```

* Same coefficient 172.4348
* Standard error of the estimate `\(2.6701587 = \frac{SD(lengths)}{\sqrt{n}} = \frac{12.8056312}{\sqrt{23}}\)`
* We can calculate a __confidence interval__!

---

## Population equation

- How does `\(Y\)` depend on `\(X\)`?
- Cannot hope to fully describe the relationship
- Focus on the *conditional expectation*:
    - How does the *expected value* of `\(Y\)` depend on `\(X\)`?
- To do this, we want a function `\(f(X)\)` such that
`$$E(Y|X)=f(X)$$`
- In linear models `\(f\)` is given by the _population regression_ line
`$$E(Y|X) = \beta_{0}+\beta_{1} X$$`
- `\(\beta_{0}\)` and `\(\beta_{1}\)` are _parameters_ in the model

---

## Linear regression model

- Given a _population regression_ function
`$$E(Y|X) = \beta_{0}+\beta_{1} X$$`
- A sample will consist of `\(n\)` observations `\(X_{i}\)` and `\(Y_{i}\)`
- Each pair is independently and identically distributed with
`$$Y_{i} = \beta_{0} + \beta_{1} X_{i} + u_{i}$$`
- `\(u_{i}\)` is the _error term_ with
`$$E(u_{i}|X_{i}) = 0$$`
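---

## Simulating the regression model

A minimal sketch (added for illustration, with arbitrary parameter values `\(\beta_0 = 2\)` and `\(\beta_1 = 3\)`): draw one sample from the population equation and check that OLS recovers estimates close to the true parameters.

```r
# One sample from the population regression E(Y|X) = 2 + 3X
set.seed(1)
n = 50
X = rnorm(n)
u = rnorm(n)          # error term with E(u|X) = 0
Y = 2 + 3 * X + u
coef(lm(Y ~ X))       # estimates close to beta0 = 2 and beta1 = 3
```

Rerunning with a different seed gives slightly different estimates; this is exactly the sampling variation described above.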
---

## Units and size

- Regression coefficients depend on the scale of variables
- Define `m_length` as length in meters (instead of centimeters)

```r
lengths = mutate(lengths,
    m_length = length/100,
    female = as.numeric(gender == "Female") # Make numeric var
    )
sum_up(lengths)
```

```
 Variable │ Obs Missing    Mean  StdDev Min Max 
──────────┼────────────────────────────────────
   female │  23       0  0.3913 0.49901   0   1 
   length │  23       0 172.435 12.8056 140 192 
 m_length │  23       0 1.72435 0.12806 1.4 1.92 
   weight │  23       0  68.087 16.8224  30  95 
```

---

## Rescaling regression

```r
r1 = lm(weight ~ length, data = lengths)
r2 = lm(weight ~ m_length, data = lengths)
```

* The increase in weight from a change in length in meters is 100 times bigger

```r
huxreg(r1, r2)
```

|             |          (1) |          (2) |
|:------------|-------------:|-------------:|
| (Intercept) | -119.761 *** | -119.761 *** |
|             |     (27.697) |     (27.697) |
| length      |    1.089 *** |              |
|             |      (0.160) |              |
| m_length    |              |  108.939 *** |
|             |              |     (16.020) |
| N           |           23 |           23 |
| R2          |        0.688 |        0.688 |
| logLik      |      -83.664 |      -83.664 |
| AIC         |      173.327 |      173.327 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
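---

## Checking the rescaling

A quick check (added for illustration) that the two regressions carry the same information: since `m_length = length / 100`, the slope on `m_length` is exactly 100 times the slope on `length`.

```r
coef(r2)["m_length"] / coef(r1)["length"]   # equals 100
```

The fit is unchanged: N, R2, logLik and AIC are identical in both columns; only the units of the slope differ.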
---

## Standardizing variables

- Subtract the mean to get a zero mean variable (demeaning)
- Let `\(W = Y - \mu_Y\)`

$$ E[W] = E[Y - \mu_Y] = E[Y] - E[\mu_Y] = \mu_Y - \mu_Y = 0$$

- Divide by the standard deviation to get a variable with unit variance
- Let `\(U = \frac{Y}{\sigma_Y}\)`. As `\(Var[aY] = a^2 Var[Y]\)` for any number `\(a\)`, we have:

$$ Var[U] = Var[ \frac{Y}{\sigma_Y}] = Var[ \frac{1}{\sigma_Y}Y] = \frac{1}{\sigma^2_Y} Var[ Y] = 1$$

- Thus for any random variable `\(Y\)`, the _standardized_ variable
`$$\frac{Y-\mu_Y}{\sigma_Y}$$`
has mean 0 and variance 1.

---

## Standardized regression

* Dividing variables by their standard deviation creates variables with unit variance

```r
lengths2 = mutate(lengths,
    s_length = length / sd(length),
    s_weight = weight / sd(weight)
    )
sum_up(lengths2)
```

```
 Variable │ Obs Missing    Mean  StdDev     Min     Max 
──────────┼─────────────────────────────────────────────
   female │  23       0  0.3913 0.49901       0       1 
   length │  23       0 172.435 12.8056     140     192 
 m_length │  23       0 1.72435 0.12806     1.4    1.92 
 s_length │  23       0 13.4655       1 10.9327 14.9934 
 s_weight │  23       0 4.04741       1 1.78334 5.64724 
   weight │  23       0  68.087 16.8224      30      95 
```

---

## Standardizing and the correlation coefficient

- How variability in `\(X\)` relates to variability in `\(Y\)`

```r
rg = lm(s_weight ~ s_length, data = lengths2)
kable(tidy(rg))
```

|term        |   estimate| std.error| statistic|   p.value|
|:-----------|----------:|---------:|---------:|---------:|
|(Intercept) | -7.1191700| 1.6464561| -4.323936| 0.0002995|
|s_length    |  0.8292704| 0.1219506|  6.800054| 0.0000010|

```r
kable(correlate(select(lengths, length, weight)))
```

|term   |    length|    weight|
|:------|---------:|---------:|
|length |        NA| 0.8292704|
|weight | 0.8292704|        NA|

- The slope on `s_length` equals the correlation coefficient between length and weight

---
class: inverse, center, middle

# Causation

---

## Correlation is not causation

Correlation between `\(X_i\)` and `\(Y_i\)` can be due to

1. `\(X_i\)` causes `\(Y_i\)`
2. `\(Y_i\)` causes `\(X_i\)`
3. `\(X_i\)` causes `\(Y_i\)` and `\(Y_i\)` causes `\(X_i\)`
    - *Self reinforcing system* / *simultaneity*
4. `\(W_i\)` causes both `\(X_i\)` and `\(Y_i\)`
    - *Spurious relationship*
    - `\(W_i\)` is a *confounding factor* / *lurking variable*
    - `\(W_i\)` is often time
5. `\(X_i\)` and `\(Y_i\)` are independent
    - *Coincidence in data*
    - If you look long enough you will find patterns

---

## Spurious correlation (Pearson)

- Normalizing `\(X_i\)` and `\(Y_i\)` by a common variable `\(Z_i\)` can make the ratios correlated even when `\(X_i\)` and `\(Y_i\)` are independent
    - Example: dividing by population to get per capita data
- *Spurious correlation* is sometimes used more generally
- "Spurious correlations": <http://www.tylervigen.com/>

---

## [Coincidence in data](http://rstudio.sh.se/content/intro09-figs/)

.pull-left[

* Estimate a relationship when there is none
* With a 95% confidence level, we get a significant coefficient 1/20 of the time
* Regardless of sample size!
* But with a larger sample, the size of the estimated coefficient will be smaller (see the simulation on the next slide)

]

.pull-right[
![](lecture05_files/figure-html/unnamed-chunk-13-1.png)<!-- -->
]
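---

## Simulating coincidence

A minimal sketch (added for illustration): regress pure noise on pure noise many times and count how often the slope is significant at the 5% level - about 1 in 20, whatever the sample size.

```r
# Share of significant slopes when X and Y are independent
set.seed(7)
pvals = replicate(2000, {
    n = 50
    summary(lm(rnorm(n) ~ rnorm(n)))$coefficients[2, 4]  # p-value of slope
})
mean(pvals < 0.05)   # close to 0.05
```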
---

## Statistical and economic significance

* Statistical significance does not imply that the effect is important
    * Discuss the economic significance of a coefficient
* If you look long enough you will find a pattern
* Insignificant results can be just as interesting
    * Saying that something can have at most a small effect, if any

---

## Economic significance

* A judgment of whether the coefficient captures an important effect
* An effect can be statistically significant but still very small
* What is important depends on the setting, for example a reasonable policy change in the explanatory variable
    * Ex: How much of the difference in test scores is due to differences in class size?
* Can require a bit of calculation based on estimated coefficients
* Summary statistics are essential for understanding!

---

## [Tradeoff bias and precision](http://rstudio.sh.se/content/statistics05-figs#section-omitted)

```
===================================================
X                       1.037***          1.038***  
                        (0.031)           (0.030)   
W                      -0.008                       
                       (0.032)                      
Constant                1.024***          1.025***  
                        (0.032)           (0.031)   
---------------------------------------------------
Observations              50                50      
R2                       0.961             0.961    
Adjusted R2              0.960             0.961    
Residual Std. Error  0.224 (df = 47)  0.221 (df = 48)
===================================================
Note:                *p<0.1; **p<0.05; ***p<0.01
```

---

## Omitted variable - Correlation

Inclusion/omission of `\(W\)` depends on correlation and on whether it is in the population equation.

Correlation `\(X_i\)` and `\(W_i\)` | `\(\beta_W\)` | Included | Omitted
------------ | ---| ----------- | --------------
Uncorrelated | `\(\beta_W = 0\)` | |
Correlated | `\(\beta_W = 0\)` | More uncertain |
Uncorrelated | `\(\beta_W \neq 0\)` | | More uncertain
Correlated | `\(\beta_W \neq 0\)` | | __Biased and Inconsistent__

---
class: inverse, center, middle

# Pipes

![](figures/MagrittePipe.jpg)

---

## _Pipes_ in the tidyverse

* Not really needed, but makes code simpler - used in documentation
* Often we want to take a dataset and perform several steps in order
* The pipe operator ` %>% ` facilitates this

```r
select(lengths, length, weight)
```

can be written as

```r
lengths %>% select(length, weight)
```

* The pipe means: put the left hand side as the first argument of the function on the right hand side
* With several steps, piped code is much easier to read

```r
select(filter(lengths, gender == "Female"), length, weight)
```

can be written as

```r
lengths %>% 
    filter(gender == "Female") %>% 
    select(length, weight)
```

---

## Selecting, correlating and formatting with pipe

```r
len2 = select(lengths, length, weight)
len3 = correlate(len2)
kable(len3)
```

* Same code - hard to read

```r
kable(correlate(select(lengths, length, weight)))
```

* Same code - with pipes: `%>%`

```r
lengths %>% 
    select(length, weight) %>% 
    correlate() %>% 
    kable()
```
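---

## Pipes with grouped summaries

A minimal sketch (added for illustration, assuming `dplyr` is loaded) that combines pipes with the sampling ideas from this lecture: the mean length and its standard error, by gender.

```r
lengths %>% 
    group_by(gender) %>% 
    summarize(mean_length = mean(length),
              se = sd(length) / sqrt(n()))   # SE(mean) = SD/sqrt(n)
```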