class: center, middle, inverse, title-slide

# Econometrics B/C - Lecture 2
## Linear Regression with One Regressor
### Jonas Björnerstedt
### 2021-10-11

---

## Lecture Content

* Chapter 3. Linear regression

1. Linear Regression with One Regressor
    - Generalize the concept of _mean_
    - _Conditional mean_ `\(E(Y|X)\)` of `\(Y\)`
        - Given `\(X\)`, what is the mean of `\(Y\)`?
2. Dummy variables
3. Heteroskedasticity

---

## Variance and correlation

- Relationship between different random variables
- Variance: `\(Var(Y)=E[(Y - E(Y))^2]\)`
    - Expected squared distance from the mean
    - Average squared distance in a sample
- Standard deviation `\(SD(Y) = \sqrt{Var(Y)}\)`
- Covariance:
`$$Cov(X, Y) = E[(X - E(X))(Y- E(Y))]$$`
    - Expected product of deviations from the mean for `\(X\)` and `\(Y\)`
    - Average product of deviations in a sample

---

## Correlation coefficient

- Normalize the covariance by the standard deviations of `\(X\)` and `\(Y\)`
`$$\rho_{XY} = \frac{Cov(X, Y)}{SD(X) SD(Y)}$$`
- We then have
`$$-1 \le \rho_{XY} \le 1$$`

---

## [Correlation - linear relationship](http://rstudio.sh.se/content/statistics03-figs#section-correlation)

![](lecture02_files/figure-html/unnamed-chunk-1-1.png)<!-- -->

---

## [Correlation and regression](http://rstudio.sh.se/content/statistics03-figs#section-correlation-and-regression)

![](lecture02_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

## Population equation

- How does `\(Y\)` depend on `\(X\)`?
    - Cannot hope to fully describe the relationship
- Focus on the *conditional expectation*:
    - How does the *expected value* of `\(Y\)` depend on `\(X\)`?
- To do this, we want a function `\(f(X)\)` such that
`$$E(Y|X)=f(X)$$`
- In linear models `\(f\)` is given by the _population regression_ line
`$$E(Y|X) = \beta_{0}+\beta_{1} X$$`
- `\(\beta_{0}\)` and `\(\beta_{1}\)` are _parameters_ in the model

---

## Linear regression model

- Given a _population regression_ function
`$$E(Y|X) = \beta_{0}+\beta_{1} X$$`
- A sample will consist of `\(n\)` observations `\(X_{i}\)` and `\(Y_{i}\)`
- Each pair independently and identically distributed with
`$$Y_{i} = \beta_{0} + \beta_{1} X_{i} + u_{i}$$`
- `\(u_{i}\)` is the _error term_ with
`$$E(u_{i}|X_{i}) = 0$$`

---

## Ordinary Least Squares (OLS) estimation

- Given observations `\(X_{i}\)` and `\(Y_{i}\)`
- Find `\(\hat\beta_{0}\)` and `\(\hat\beta_{1}\)` minimizing
`$$\sum_{i=1}^n \hat u_{i}^2$$`
- where the residuals `\(\hat u_{i}\)` are defined as:
`$$\hat u_{i} = Y_{i} - \hat\beta_{0} - \hat\beta_{1} X_{i}$$`
- Defines the line with the minimum sum of squared distances to the observations (R sketch on the next slide)
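
---

## OLS estimation in R

A minimal sketch of the estimation above on simulated data; the parameter values, sample size, and variable names are illustrative, not from the course data.

```r
# Simulate data from a linear population model: Y = beta0 + beta1*X + u
# (beta0 = 1, beta1 = 2 and n = 200 are illustrative choices)
set.seed(1)
n <- 200
X <- runif(n, 0, 10)
u <- rnorm(n)                # error term with E(u | X) = 0
Y <- 1 + 2 * X + u

cor(X, Y)                    # sample correlation coefficient
fit <- lm(Y ~ X)             # OLS: minimizes the sum of squared residuals
coef(fit)                    # estimates of beta0 and beta1
head(resid(fit))             # residuals u-hat
```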

---

## [Line with minimum square distance](http://rstudio.sh.se/content/statistics03-figs#section-least-squares)

![](lecture02_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

## [Linear regression](http://rstudio.sh.se/content/statistics03-figs#section-linear-regression)

![](lecture02_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

## Galton's regression

- 'Regression to the mean'
    - tall parents tend to have shorter children
    - short parents tend to have taller children
- Can regress in either direction
    - tall children tend to have shorter parents
- A regression is a means of expressing correlation
    - The regressors do not **cause** the dependent variable to change!
    - No causation even if the relationship is strong

---

## Predicted values

- The parameters `\(\hat\beta_{0}\)`, `\(\hat\beta_{1}\)` and `\(\hat u_i\)` fit the data.
For all `\(i\)` we have
`$$Y_{i} = \hat\beta_{0} + \hat\beta_{1} X_{i} + \hat u_i$$`
- The predicted value `\(\hat Y_{i}\)` of the linear model is given by
`$$\hat Y_{i} = \hat\beta_{0} + \hat\beta_{1} X_{i}$$`
- An out-of-sample prediction of `\(Y\)` can be obtained by inserting a value of `\(X\)` not in the sample:
`$$Y = \hat\beta_{0} + \hat\beta_{1} X$$`

---

## Measures of fit

- The `\(R^2\)` statistic is a measure of how much of the variation in `\(Y\)` is explained by `\(X\)`
- The total variation in `\(Y_{i}\)` is given by the total sum of squares
`$$TSS=\sum_{i=1}^{n} (Y_{i} - \bar Y)^2$$`
- The variation in the predicted values `\(\hat Y_{i}\)` is given by the explained sum of squares
`$$ESS=\sum_{i=1}^{n} (\hat Y_{i} - \bar Y)^2$$`
- The `\(R^2\)` is the ratio of the two
`$$R^2 = \frac{ESS}{TSS}$$`

---

## The `\(R^{2}\)` and Test scores data

- Relatively low `\(R^{2}\)`
- Other factors affect test scores
    - Student characteristics
    - Randomness in exam results

---

## [Data and regressions](http://rstudio.sh.se/content/statistics03-figs.Rmd#section-uncertainty)

- Regression with `\(\beta=\left(0,1\right)\)`, `\(\sigma_u^{2}=1\)` and `\(0<x<10\)`
- Grey area shows the possible linear relationships within the 95% confidence interval

![](figures/ldisp.jpg)

---

## Large variance increases uncertainty

- Regression with `\(\sigma_u^{2}=4\)`

![](figures/lvar.jpg)

---

## Small variability in regressors

- Same `\(\beta\)` and `\(\sigma_u^{2}\)`, but `\(4<x<6\)`
- Note that a high estimated slope implies a small estimated intercept

![](figures/sdisp.jpg)

---

## Distribution of the error `\(u\)` and of `\(\hat\beta\)`

- Error term takes only the values `\(u_i=-1\)` and `\(u_i=1\)`
- With a large sample `\(\hat\beta\)` will be close to normally distributed
- Knowledge of the distribution can increase efficiency
    - Here we can see the *exact* relationship

![](figures/nonnormal.jpg)

---

## Non-normal residuals

- Residuals are gathered around `\(\hat u_i=-1\)` and `\(\hat u_i=1\)`
- Not normally distributed!

![](figures/nnresid.jpg)

---

class: inverse, center, middle

# [Dummy variables](http://rstudio.sh.se/content/statistics04-figs.Rmd)

---

## Dummy variable regression

- Line from the mean for `\(female = 0\)` to the mean for `\(female = 1\)`
- The slope corresponds to the lower average wage for women

![](lecture02_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

## Dummy variable regression

- Same plot using _jitter_ (moving points slightly horizontally)
- Data points overlap less when displayed slightly moved

![](lecture02_files/figure-html/unnamed-chunk-6-1.png)<!-- -->

---

class: inverse, center, middle

# Heteroskedasticity

---

## Heteroskedasticity

- Variability in `\(Y\)` is often higher at higher values of `\(X\)`
    - For example, the same *percentage* variability implies larger absolute variability
- Does not affect the estimates
    - Confidence intervals change
    - The usual (homoskedasticity-only) variance-covariance estimate is usually too low
- Often increased variability for `\(X_i\)` far from the mean `\(\bar X\)`
    - Variability at the extremes results in more uncertainty than variability at the mean

---

## Heteroskedasticity plots

- Same data in both examples
    - First and second half switch places
- Notice the larger uncertainty (grey area) in the second figure
    - Observations with `\(X_{i}\)` far from the mean `\(\bar X\)` are more influential
    - Variability far from the mean increases uncertainty more

---

## Variability at the center

![](figures/hetcenter.png)

---

## Variability at the edges

![](figures/hetedges.png)

---

## Dealing with heteroskedasticity

Estimate with _robust_ standard errors (a short R sketch follows on the next slide)

- Tends to give larger standard errors
    - Better to be cautious...
- Unfortunately robust is _not the default in Stata_
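
---

## Robust standard errors in R

A minimal sketch, assuming the `sandwich` and `lmtest` packages are installed; the simulated data and parameter values are illustrative only.

```r
# Simulate data where the error variance grows with X (heteroskedasticity)
# (beta0 = 1, beta1 = 2 and n = 200 are illustrative choices)
set.seed(1)
n <- 200
X <- runif(n, 0, 10)
u <- rnorm(n, sd = 0.5 * X)      # Var(u | X) increases with X
Y <- 1 + 2 * X + u

fit <- lm(Y ~ X)

# Classical (homoskedasticity-only) standard errors
summary(fit)$coefficients

# Heteroskedasticity-robust standard errors
# (HC1 corresponds to Stata's 'robust' option)
library(sandwich)
library(lmtest)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
```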