class: center, middle, inverse, title-slide

# Econometrics - Lecture 4
## Hypothesis Tests and Confidence Intervals
### Jonas Björnerstedt
### 2022-03-02

---

## Lecture Content

1. Dummy variables
2. Heteroscedasticity
3. Testing
    - Chapter 3 on testing
    - Chapter 5: Hypothesis tests and confidence intervals
- Not included in course:
    - 5.6 Using the t-statistic

---

## [Transform a random variable<sup> 🔗 </sup>](http://rstudio.sh.se/content/statistics01-figs.Rmd#section-correlation)

- From a random variable `\(Y\)` we can create new random variables:
    * `\(2Y\)` stretches the distribution
    * `\(Y + 1\)` shifts it

---

## Units and size

- Load data
- Regression coefficients depend on the scale of the variables

```r
len = readRDS("lengths.rds")
len = mutate(len,
    m_length = length/100 ,  # Length in meters
    g_weight = weight*100    # Weight rescaled by a factor of 100
)
```

|length | weight|gender | m_length| g_weight|
|:------|------:|:------|--------:|--------:|
|186    |     82|Male   |     1.86|     8200|
|173    |     54|Female |     1.73|     5400|
|168    |     52|Female |     1.68|     5200|
|175    |     79|Male   |     1.75|     7900|
|192    |     85|Male   |     1.92|     8500|
|174    |     75|Female |     1.74|     7500|

---

## Rescaling regression

```r
lm(weight ~ length, data = len) %>% tidy()
```
|term        | estimate| std.error| statistic|  p.value|
|:-----------|--------:|---------:|---------:|--------:|
|(Intercept) |    -97.8|      19.4|     -5.05| 9.61e-06|
|length      |    0.965|     0.113|      8.56| 1.13e-10|
```r
lm(g_weight ~ m_length, data = len) %>% tidy()
```
|term        |  estimate| std.error| statistic|  p.value|
|:-----------|---------:|---------:|---------:|--------:|
|(Intercept) | -9.78e+03|  1.94e+03|     -5.05| 9.61e-06|
|m_length    |  9.65e+03|  1.13e+03|      8.56| 1.13e-10|
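A quick check of the scale factor (a sketch, reusing the `len` data from above):

```r
# Ratio of the rescaled slope to the original slope: 100 * 100 = 10000
b_raw = coef(lm(weight ~ length, data = len))["length"]
b_new = coef(lm(g_weight ~ m_length, data = len))["m_length"]
unname(b_new / b_raw)  # 10000
```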
* The estimated effect on rescaled weight of a change in length in meters is 10000 (= 100 × 100) times the original coefficient

---

## Standardized regression

```r
len = readRDS("lengths.rds") %>% select(-time) %>%
    mutate(
        s_length = (length - mean(length))/sd(length) ,
        s_weight = (weight - mean(weight))/sd(weight)
    )
lm(s_weight ~ s_length, data = len) %>% tidy()
```
|term        |  estimate| std.error| statistic|  p.value|
|:-----------|---------:|---------:|---------:|--------:|
|(Intercept) | -3.52e-16|    0.0924| -3.81e-15|        1|
|s_length    |     0.801|    0.0935|      8.56| 1.13e-10|
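The same standardization can be done inline with `scale()`, which centers a variable and divides by its standard deviation (a sketch):

```r
# Equivalent standardized regression using scale()
lm(scale(weight) ~ scale(length), data = len) %>% tidy()
```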
---

## Standardizing and the correlation coefficient

- How variability in `\(X\)` relates to variability in `\(Y\)`
|term        |  estimate| std.error| statistic|  p.value|
|:-----------|---------:|---------:|---------:|--------:|
|(Intercept) | -3.52e-16|    0.0924| -3.81e-15|        1|
|s_length    |     0.801|    0.0935|      8.56| 1.13e-10|
```r
library(corrr)
len %>% select(length, weight) %>% correlate(diagonal = 1)
```
|term   | length| weight|
|:------|------:|------:|
|length |      1|  0.801|
|weight |  0.801|      1|
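The slope in the standardized regression is exactly the sample correlation; a minimal check in base R:

```r
# Standardized slope equals the correlation coefficient
coef(lm(s_weight ~ s_length, data = len))["s_length"]
cor(len$length, len$weight)
# both are approximately 0.801
```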
---

class: inverse, center, middle

# OLS

---

## The least squares assumptions

- Assumption 1: The conditional distribution of `\(u\)` given `\(X\)` has mean zero
    - The relationship between `\(X_i\)` and `\(u_i\)` has to be specified
- Assumption 2: Observations are IID
    - `\(X_i\)` and `\(Y_i\)` are independent of `\(X_j\)` and `\(Y_j\)`
- Assumption 3: Large outliers are unlikely
    - To ensure that the variance can be estimated
- Use of the OLS assumptions
    - Estimation of coefficients and their variance
    - Unbiased and consistent estimates

---

## Properties of estimators

1. _Unbiased_
    - Corresponds to the population parameter in expectation
    - Both `\(\bar Y\)` and `\(Y_1\)` are unbiased estimates of `\(\mu_Y\)`
2. _Consistent_
    - Converges in probability to `\(\mu_Y\)` as the sample size increases to infinity
    - `\(\bar Y\)` is consistent but `\(Y_1\)` is not
    - Law of large numbers
    - Note that a consistent estimator can be biased
3. _Efficient_
    - Uncertainty (variance) in the estimate is lower than for alternatives
    - `\(\bar Y\)` is efficient but `\(Y_1\)` is not

---

## Best Linear Unbiased Estimator (BLUE)

- Unbiased: `\(E\left(\hat\beta_{0}\right)=\beta_{0}\)` and `\(E\left(\hat\beta_{1}\right)=\beta_{1}\)`
- Consistency (no asymptotic bias)
    - Convergence (in probability) to the true value as the sample size `\(N\rightarrow\infty\)`
- Efficiency / best linear estimator
    - No other unbiased linear estimator has lower variance
- Unbiased variance estimate: `\(E\left(s^{2}\right)=\sigma^{2}\)`
    - Unbiased standard errors

---

## Zero conditional mean

- To ensure that we have a linear model, we assume that `\(E(u_{i}|X_{i}) = 0\)`
- The expected value of `\(u_i\)` does not depend on `\(X_i\)`
    - But other aspects of the distribution of `\(u\)` could depend on `\(X\)`
- Horizontal variation `\(X_i\)` and vertical variation `\(u_i\)` are not too related

---

## [Dummy variable regression<sup> 🔗 </sup>](https://rstudio.sh.se/content/statistics04-figs.Rmd#section-regression)

- Line from the mean for `\(female = 0\)` to the mean for `\(female = 1\)`
- The slope corresponds to the lower average wage for women

![](statistics04_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

---

## Dummy variable regression

- Same plot using _jitter_ (moving points slightly horizontally)
- Data points overlap less when displayed slightly moved

![](statistics04_files/figure-html/unnamed-chunk-9-1.png)<!-- -->

---

## Empirical exercise

- Open `employment_06_07.rds`:
    - Compare the average for the population and for women

```r
employment = read_rds("shared/employment_06_07.rds")
lm(earnwke ~ 1, data = employment)   # intercept = population average
mean(employment$earnwke)
empf = filter(employment, female == 1)
lm(earnwke ~ female, data = employment)
lm(earnwke ~ 1, data = empf)         # intercept = average for women
mean(empf$earnwke)
```

---

class: inverse, center, middle

# Heteroscedasticity

---

## Heteroscedasticity

- Variability is often higher at higher values
    - Same percentage variability
- Does not affect the coefficient estimates
- Confidence intervals change
    - The estimated variance-covariance is usually too low
- Often increased variability for `\(X_i\)` far from the mean `\(\bar X\)`
    - Variability at the extremes results in more uncertainty than variability at the mean

---

## Heteroscedasticity plots

- Same data in both examples
    - First and second halves switch places
- Notice the larger uncertainty (gray area) in the second figure
- Observations with `\(X_{i}\)` far from the mean `\(\bar X\)` are more influential
    - Variability far from the mean increases uncertainty more

---

## Variability at the center

![](figures/hetcenter.png)

---

## Variability at the edges

![](figures/hetedges.png)

---

## Breusch-Pagan specification test

- Estimate the model
- Generate the squared residuals `\(\hat u_{i}^2\)`
- Regress to see if the errors depend linearly on the variables
`$$\hat u_{i}^{2}=\alpha_{0}+\alpha_{1}X_{i}+v_{i}$$`
- Test whether `\(\hat \alpha_{1}=0\)`
- The null hypothesis is homoscedasticity
---

## Dealing with heteroscedasticity

Estimate with _robust_ standard errors

- Tends to give larger standard errors
    - Better to be cautious...
- Unfortunately robust standard errors are _not the default in R_
    - The `estimatr` package can be used (its `lm_robust()` function)
    - Will also use the `huxtable` package for regression tables

---

## Next lecture

Chapter 6 - multivariate regression