class: center, middle, inverse, title-slide

.title[
# Econometrics - Lecture 4
]
.subtitle[
## Ordinary Least Squares
]
.author[
### Jonas Björnerstedt
]
.date[
### 2024-11-11
]

---

## Lecture Content

1. OLS derivations
2. Dummy variables
3. Heteroscedasticity

---

## [Transform a random variable<sup> 🔗 </sup>](http://rstudio.sh.se/content/statistics01-figs.Rmd#section-correlation)

- From a random variable `\(Y\)` we can create new random variables:
    * `\(2Y\)` stretches
    * `\(Y + 1\)` moves

---

## Properties of expectations

`$$E(aY) = aE(Y)$$`

- To show this for a discrete random variable, use the definition

`$$E(aY) = \sum_{i=1}^k (aY_i) p_i = a \sum_{i=1}^k Y_i p_i = aE(Y)$$`

- Similarly, one can show that

`$$E(X + Y) = E(X) + E(Y)$$`

---

## Properties of variance

`$$Var(aX) = a^2 Var(X)$$`

- We can show this using the definitions:

`$$Var(aX) = E[(aX-E(aX))^2] = E[(aX-aE(X))^2] = E[a^2(X-E(X))^2]$$`

- Thus

`$$Var(aX) = E[a^2(X-E(X))^2] = a^2E[(X-E(X))^2] = a^2 Var(X)$$`

---

## Variance of `\(X + Y\)`

- For independent `\(X\)` and `\(Y\)`:

`$$Var(X + Y) = Var(X) + Var(Y)$$`

- For simplicity, assume that `\(X\)` and `\(Y\)` have zero mean, i.e. `\(\mu_X = \mu_Y = 0\)`
    - The calculations become slightly messier otherwise

`$$Var(X + Y) = E[(X + Y - \mu_X - \mu_Y)^2] = E[(X + Y)^2]$$`

- We know that `\((X + Y)^2 = (X + Y)(X + Y) = X^2 + 2XY + Y^2\)`. Thus

`$$E[(X + Y)^2] = E[X^2 + 2XY + Y^2] = E[X^2] + 2E[XY] + E[Y^2]$$`

- If `\(X\)` and `\(Y\)` are independent, they are uncorrelated, so `\(E[XY] = 0\)`. Thus

`$$Var(X + Y) = E[X^2] + E[Y^2] = Var(X) + Var(Y)$$`

---

## Expected value and variance of mean

- What is the variance of the mean of two observations?

`$$\bar Y = \frac{Y_1+Y_2}{2}$$`

`$$E[\bar Y] = E\left[ \frac{Y_1+Y_2}{2}\right] = \frac{1}{2}E[Y_1+Y_2] = \frac{1}{2}(E[Y_1]+E[Y_2]) = E[Y]$$`

`$$Var\left(\frac{Y_1+Y_2}{2}\right) = \frac{1}{4} Var(Y_1+Y_2) = \frac{1}{4} \left( Var(Y_1)+Var(Y_2)\right)$$`

* Thus with independent sampling we have:

`$$Var(\bar Y) = Var\left(\frac{Y_1+Y_2}{2}\right) = \frac{1}{2} Var(Y_i)$$`

* With `\(n\)` observations we have:

`$$Var(\bar Y) = Var \left( \frac{Y_1+Y_2+ \ldots +Y_n}{n} \right) = \frac{1}{n} Var(Y_i)$$`
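A short simulation (a sketch, not part of the derivation; the sample size and distribution are made up for illustration) showing the `\(1/n\)` scaling:

```r
# Sketch: the variance of the sample mean is approximately Var(Y)/n
set.seed(1)
n <- 25
ybar <- replicate(10000, mean(rnorm(n, mean = 0, sd = 2)))
var(ybar)  # close to 4/25 = 0.16
```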
---

## Units and size

- [Download lengthdata dataset](https://rstudio.sh.se/ts/lengthdata.rds)
- Regression coefficients depend on the scale of variables

```r
len = readRDS("lengthdata.rds")
len$m_length = len$length/100   # Length in meters
len$g_weight = len$weight*1000  # Weight in grams
```

| length| weight|gender | female| m_length| g_weight|
|------:|------:|:------|------:|--------:|--------:|
|    186|     82|Male   |      0|     1.86|    82000|
|    173|     54|Female |      1|     1.73|    54000|
|    168|     52|Female |      1|     1.68|    52000|
|    175|     79|Male   |      0|     1.75|    79000|
|    192|     85|Male   |      0|     1.92|    85000|
|    174|     75|Female |      1|     1.74|    75000|

---

## Rescaling regression

```r
r = lm(weight ~ length, data = len)
```

|term        | estimate| std.error| statistic| p.value|
|:-----------|--------:|---------:|---------:|-------:|
|(Intercept) |   -97.78|     19.37|     -5.05|       0|
|length      |     0.97|      0.11|      8.56|       0|

```r
r = lm(g_weight ~ m_length, data = len)
```

|term        |  estimate| std.error| statistic| p.value|
|:-----------|---------:|---------:|---------:|-------:|
|(Intercept) | -97775.98|  19368.85|     -5.05|       0|
|m_length    |  96506.20|  11269.98|      8.56|       0|

* The estimated increase in weight (in grams) for a one-unit change in length (in meters) is __100 000 times larger__

---

## Summary statistics

* Summary statistics are essential in order to understand data
* The `vtable` package has a summary statistics function

```r
library(vtable)
st(len)
```

|Variable   |N  |Mean |Std. Dev. |Min |Pctl. 25 |Pctl. 75 |Max |
|:----------|:--|:----|:---------|:---|:--------|:--------|:---|
|length     |43 |171  |11        |150 |163      |178      |192 |
|weight     |43 |68   |14        |47  |55       |78       |95  |
|gender     |43 |     |          |    |         |         |    |
|... Female |21 |49%  |          |    |         |         |    |
|... Male   |22 |51%  |          |    |         |         |    |

---

## Standardizing variables

- Subtract the mean to get a zero-mean variable (demeaning)
- Let `\(W = Y - \mu_Y\)`

`$$E[W] = E[Y - \mu_Y] = E[Y] - E[\mu_Y] = \mu_Y - \mu_Y = 0$$`

- Divide by the standard deviation to get a variable with unit variance
- Let `\(U = \frac{Y}{\sigma_Y}\)`. As `\(Var[aY] = a^2 Var[Y]\)` for any number `\(a\)`, we have:

`$$Var[U] = Var\left[ \frac{Y}{\sigma_Y}\right] = Var\left[ \frac{1}{\sigma_Y}Y\right] = \frac{1}{\sigma^2_Y} Var[Y] = 1$$`

- Thus for any random variable `\(Y\)`, the _standardized_ variable

`$$\frac{Y-\mu_Y}{\sigma_Y}$$`

has mean 0 and variance 1.
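Base R's `scale()` performs the same standardization; a minimal check on the length data (a sketch, assuming `len` from above):

```r
# Sketch: standardize by hand and with scale(); both give mean 0 and sd 1
z <- (len$length - mean(len$length)) / sd(len$length)
all.equal(z, as.numeric(scale(len$length)))  # TRUE
c(mean = mean(z), sd = sd(z))                # approximately 0, exactly 1
```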
---

## Standardized regression

```r
len$s_length = (len$length - mean(len$length))/sd(len$length)
len$s_weight = (len$weight - mean(len$weight))/sd(len$weight)
sr = lm(s_weight ~ s_length, data = len)
library(broom)
tidy(sr)
```

|term        |  estimate| std.error| statistic|  p.value|
|:-----------|---------:|---------:|---------:|--------:|
|(Intercept) | -3.57e-16|    0.0924| -3.86e-15|        1|
|s_length    |     0.801|    0.0935|      8.56| 1.13e-10|
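Two features of this output are worth flagging: the intercept is zero up to floating-point rounding (both variables are demeaned), and the slope equals the sample correlation, which the next slides develop. A one-line check (sketch):

```r
# Sketch: the standardized slope equals the sample correlation
cor(len$weight, len$length)  # 0.801, matching the s_length estimate
```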
---

## Standardizing and the correlation coefficient

- How variability in `\(X\)` relates to variability in `\(Y\)`
|term        |  estimate| std.error| statistic|  p.value|
|:-----------|---------:|---------:|---------:|--------:|
|(Intercept) | -3.57e-16|    0.0924| -3.86e-15|        1|
|s_length    |     0.801|    0.0935|      8.56| 1.13e-10|
```r
library(corrr)
library(dplyr)  # for select()
df = select(len, length, weight)
correlate(df)
```
|term   | length| weight|
|:------|------:|------:|
|length |       |  0.801|
|weight |  0.801|       |
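The same number can be computed from the definition of the correlation coefficient used on the slides below; a sketch:

```r
# Sketch: rho_XY = cov(X, Y) / (sd(X) * sd(Y))
with(len, cov(length, weight) / (sd(length) * sd(weight)))  # 0.801
```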
---
class: inverse, center, middle

# OLS

---

## The least squares assumptions

- Assumption 1: The conditional distribution of `\(u\)` has mean zero
    - The relationship between `\(X_i\)` and `\(u_i\)` has to be specified
- Assumption 2: Observations are IID
    - `\(X_i\)` and `\(Y_i\)` are independent of `\(X_j\)` and `\(Y_j\)`
- Assumption 3: Large outliers are unlikely
    - Ensures that the variance can be estimated
- Use of the OLS assumptions
    - Estimation of coefficients and their variance
    - Unbiased and consistent estimates

---

## Properties of estimators

1. _Unbiased_
    - Corresponds to the population parameter in expectation
    - Both `\(\bar Y\)` and `\(Y_1\)` are unbiased estimates of `\(\mu_Y\)`
2. _Consistent_
    - Converges in probability to `\(\mu_Y\)` as the sample size increases to infinity
    - `\(\bar Y\)` is consistent but `\(Y_1\)` is not
        - Law of large numbers
    - Note that a consistent estimator can be biased
3. _Efficient_
    - Uncertainty (variance) of the estimate is lower than for alternatives
    - `\(\bar Y\)` is efficient but `\(Y_1\)` is not

---

## Best Linear Unbiased Estimator (BLUE)

- Unbiased: `\(E\left(\hat\beta_{0}\right)=\beta_{0}\)` and `\(E\left(\hat\beta_{1}\right)=\beta_{1}\)`
- Consistency (no asymptotic bias)
    - Convergence (in probability) to the true value as sample size `\(N\rightarrow\infty\)`
- Efficiency / best linear estimator
    - No other unbiased linear estimator has lower variance
- Unbiased variance estimate: `\(E\left(s^{2}\right)=\sigma^{2}\)`
    - Unbiased standard errors

---

## Zero conditional mean

- To ensure that we have a linear model, we assume that `\(E(u_{i}|X_{i}) = 0\)`
    - The expected value of `\(u_i\)` does not depend on `\(X_i\)`
    - But other aspects of the distribution of `\(u\)` could depend on `\(X\)`
- Horizontal variation `\(X_i\)` and vertical variation `\(u_i\)` are not too related

---

## OLS calculations

- Population equation: `\(Y_i = \beta_0 + \beta_1 X_i + u_i\)`
- We have `\(E(Y_i) = E(\beta_0 +\beta_1 X_i + u_i) = \beta_0 +\beta_1 E(X_i) + E(u_i)\)`
- We can assume that `\(E(u_i) = 0\)`, as `\(\beta_0\)` captures the constant part of the conditional expectation
- We thus have `\(E(Y_i) = \beta_0 +\beta_1 E(X_i)\)`
- OLS assumption: `\(X_i\)` is uncorrelated with `\(u_i\)`, i.e. `\(E(X_iu_i) = 0\)`
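A small simulation (a sketch; the coefficients 2 and 3 and the sample size are made up for illustration) showing that OLS recovers the population parameters when these assumptions hold:

```r
# Sketch: Y = 2 + 3X + u with u independent of X
set.seed(2)
x <- rnorm(500)
y <- 2 + 3 * x + rnorm(500)
coef(lm(y ~ x))     # estimates close to (2, 3)
cov(x, y) / var(x)  # the slope as cov/var, derived on the next slides
```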
---

## Simplest regression

- We now consider variables with `\(E(X_i) = 0\)` and `\(E(Y_i) = 0\)`
    - This simplifies the calculations
- As `\(E(Y_i) = \beta_0 +\beta_1 E(X_i)\)`, we get `\(0 = \beta_0 +\beta_1 \cdot 0 = \beta_0\)`
- In this case the population equation has only one parameter, `\(\beta_1\)`:

`$$Y_i = \beta_1X_i + u_i$$`

---

## Estimator

- Assume that `\(X_i\)` and `\(u_i\)` are uncorrelated. Then

`$$E[X_i u_i] = 0 = E[X_i(Y_i - \beta_1 X_i)] = E[X_iY_i] - \beta_1 E[X_i X_i]$$`

- Solving for `\(\beta_1\)`:

`$$\beta_1 = \frac{E[X_i Y_i]}{E[X_i X_i]} = \frac{\sigma_{XY}}{\sigma^2_{X}}$$`

- In the sample we can derive the corresponding equation, with zero correlation in the sample:

`$$\hat\beta_1 = \frac{s_{XY}}{s^2_{X}} = \frac{\sum_i^n X_i Y_i }{\sum_i^n (X_i)^2}$$`

* It can be shown that `\(\hat\beta_1 \rightarrow \beta_1\)` as the sample size increases

---

## Regression and correlation

- The correlation coefficient is just a rescaling of `\(\beta_1\)`
- Consider `\(V = \frac{Y}{\sigma_{Y}}\)` and `\(Z = \frac{X}{\sigma_{X}}\)`
    - A simple rescaling of `\(X\)` and `\(Y\)` to variables with unit variance

Rewrite `\(Y = \beta_0 + \beta_1 X + u\)` as:

`$$\frac{Y}{\sigma_{Y}} = \frac{\beta_0}{\sigma_{Y}} + \beta_1 \frac{\sigma_{X}}{\sigma_{Y}} \frac{X}{\sigma_{X}} + \frac{u}{\sigma_{Y}}$$`

In terms of `\(V\)` and `\(Z\)` this means that:

`$$V = \frac{\beta_0}{\sigma_{Y}} + \beta_1 \frac{\sigma_{X}}{\sigma_{Y}} Z + \frac{u}{\sigma_{Y}}$$`

Renaming the coefficients and error term, the relationship is:

`$$V = \alpha_0 + \alpha_1 Z + v$$`

---

## Correlation as a linear relationship

We have

`$$V = \alpha_0 + \alpha_1 Z + v$$`

with `\(\alpha_1 = \beta_1 \frac{\sigma_{X}}{\sigma_{Y}}\)`. But `\(\beta_1 = \frac{\sigma_{XY}}{\sigma^2_{X}}\)` and thus

`$$\alpha_1 = \beta_1 \frac{\sigma_{X}}{\sigma_{Y}} = \frac{\sigma_{XY}}{\sigma^2_{X}}\frac{\sigma_{X}}{\sigma_{Y}} = \frac{\sigma_{XY}}{\sigma_{X}\sigma_{Y}} = \rho_{XY}$$`

Thus in terms of the rescaled variables, the population relationship is:

`$$V = \alpha_0 + \rho_{XY} Z + v$$`

- The coefficient `\(\beta_1\)` is a rescaling of the correlation coefficient `\(\rho_{XY}\)`
- `\(\beta_1 = 0\)` if and only if `\(X\)` and `\(Y\)` are uncorrelated, i.e. `\(\rho_{XY} = 0\)`

---

## Dummy variable regression

- [Employment data](https://rstudio.sh.se/ts/employment_06_07.rds): Line from the mean for `\(female = 0\)` to the mean for `\(female = 1\)`
- The slope corresponds to the lower average wage for women

![](statistics04_files/figure-html/unnamed-chunk-12-1.png)<!-- -->

---

## Dummy variable regression

- Same plot using _jitter_ (moving points slightly horizontally)
- Data points overlap less when displayed slightly moved

![](statistics04_files/figure-html/unnamed-chunk-13-1.png)<!-- -->

---

## Empirical exercise

- Open [employment_06_07.rds](https://rstudio.sh.se/ts/employment_06_07.rds)
- Compare the average earnings for the whole sample and for women

```r
library(tidyverse)  # for read_rds() and filter()
employment = read_rds("../data/employment_06_07.rds")
lm(earnwke ~ 1, data = employment)
summary(employment$earnwke)
empf = filter(employment, female == 1)
lm(earnwke ~ female, data = employment)
lm(earnwke ~ female, data = empf)
summary(empf$earnwke)
```

---
class: inverse, center, middle

# Heteroscedasticity

---

## Heteroscedasticity

- Variability is often higher at higher values
    - Same percentage variability
- Does not affect the estimate
- Confidence intervals change
    - The usual variance estimate is usually too low
- Often increased variability for `\(X_i\)` far from the mean `\(\bar X\)`
    - Variability at the extremes results in more uncertainty than variability at the mean

---

## Heteroscedasticity plots

- Same data in both examples
    - First and second half switch places
- Notice the larger uncertainty (gray area) in the second figure
- Observations with `\(X_{i}\)` far from the mean `\(\bar X\)` are more influential
    - Variability far from the mean increases uncertainty more

---

## Variability at the center

![](figures/hetcenter.png)

---

## Variability at the edges

![](figures/hetedges.png)

---

## Breusch-Pagan specification test

- Estimate the model
    - The squared residuals `\(\hat u_{i}^2\)` are generated
- Regress to see if the errors are linearly dependent on the variables

`$$\hat u_{i}^{2}=\alpha_{0}+\alpha_{1}X_{i}+v_{i}$$`

- Test whether `\(\hat \alpha_{1}=0\)`
    - The null hypothesis is homoscedasticity

---

## Dealing with heteroscedasticity

Estimate with _robust_ standard errors

- Tends to give larger standard errors
    - Better to be cautious...
- Unfortunately robust standard errors are _not the default in R_
- The `estimatr` package can be used
- We will also use the `huxtable` package for regression tables
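As an illustration (a sketch on the length data; the slides do not show this call), `estimatr` reports robust standard errors by default:

```r
# Sketch: robust standard errors via estimatr (HC2 by default)
library(estimatr)
lm_robust(weight ~ length, data = len)
```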
---

## Next lecture

Chapter 6 - multivariate regression