class: center, middle, inverse, title-slide

# Econometrics B/C - Lecture 2
## Linear Regression with One Regressor
### Jonas Björnerstedt
### 2021-10-11

---

## Lecture Content

* Chapter 3. Linear regression

1. Linear Regression with One Regressor
    - Generalize the concept of _mean_
    - _Conditional mean_ `\(E(Y|X)\)` of `\(Y\)`
        - Given `\(X\)`, what is the mean of `\(Y\)`?
2. Dummy variables
3. Heteroskedasticity

---

## Variance and correlation

- Relationship between different random variables
- Variance: `\(Var(Y)=E[(Y - E(Y))^2]\)`
    - Expected squared distance from the mean
    - Average squared distance in a sample
- Standard deviation `\(SD(Y) = \sqrt{Var(Y)}\)`
- Covariance:
`$$Cov(X, Y) = E[(X - E(X))(Y- E(Y))]$$`
    - Expected product of deviations from the mean for `\(X\)` and `\(Y\)`
    - Average product of deviations in a sample

---

## Correlation coefficient

- Normalize the covariance by the standard deviations of `\(X\)` and `\(Y\)`
`$$\rho_{XY} = \frac{Cov(X, Y)}{SD(X) SD(Y)}$$`
- We then have
`$$-1 \le \rho_{XY} \le 1$$`

---

## [Correlation - linear relationship](http://rstudio.sh.se/content/statistics03-figs#section-correlation)

![](lecture02_files/figure-html/unnamed-chunk-1-1.png)<!-- -->

---

## [Correlation and regression](http://rstudio.sh.se/content/statistics03-figs#section-correlation-and-regression)

![](lecture02_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

## Population equation

- How does `\(Y\)` depend on `\(X\)`?
    - Cannot hope to fully describe the relationship
- Focus on the *conditional expectation*:
    - How does the *expected value* of `\(Y\)` depend on `\(X\)`?
- To do this, we want a function `\(f(X)\)` such that
`$$E(Y|X)=f(X)$$`
- In linear models `\(f\)` is given by the _population regression_ line
`$$E(Y|X) = \beta_{0}+\beta_{1} X$$`
- `\(\beta_{0}\)` and `\(\beta_{1}\)` are _parameters_ in the model

---

## Linear regression model

- Given a _population regression_ function
`$$E(Y|X) = \beta_{0}+\beta_{1} X$$`
- A sample will consist of `\(n\)` observations `\(X_{i}\)` and `\(Y_{i}\)`
- Each pair independently and identically distributed with
`$$Y_{i} = \beta_{0} + \beta_{1} X_{i} + u_{i}$$`
- `\(u_{i}\)` is the _error term_ with
`$$E(u_{i}|X_{i}) = 0$$`

---

## Ordinary Least Squares (OLS) estimation

- Given observations `\(X_{i}\)` and `\(Y_{i}\)`
- Find `\(\hat\beta_{0}\)` and `\(\hat\beta_{1}\)` minimizing
`$$\sum_{i=1}^n \hat u_{i}^2$$`
- where the residuals `\(\hat u_{i}\)` are defined as:
`$$\hat u_{i} = Y_{i} - \hat\beta_{0} - \hat\beta_{1} X_{i}$$`
- Defines the line with the minimum sum of squared distances to the observations (R sketch on the next slide)
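
---

## OLS estimation in R

A minimal sketch of the estimation above on simulated data; the parameter values, sample size, and variable names are illustrative, not from the course data.

```r
# Simulate data from a linear population model: Y = beta0 + beta1*X + u
# (beta0 = 1, beta1 = 2 and n = 200 are illustrative choices)
set.seed(1)
n <- 200
X <- runif(n, 0, 10)
u <- rnorm(n)                # error term with E(u | X) = 0
Y <- 1 + 2 * X + u

cor(X, Y)                    # sample correlation coefficient
fit <- lm(Y ~ X)             # OLS: minimizes the sum of squared residuals
coef(fit)                    # estimates of beta0 and beta1
head(resid(fit))             # residuals u-hat
```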

---

## [Line with minimum square distance](http://rstudio.sh.se/content/statistics03-figs#section-least-squares)

![](lecture02_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

## [Linear regression](http://rstudio.sh.se/content/statistics03-figs#section-linear-regression)

![](lecture02_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

## Galton's regression

- 'Regression to the mean'
    - tall parents tend to have shorter children
    - short parents tend to have taller children
- Can regress in either direction
    - tall children tend to have shorter parents
- A regression is a means of expressing correlation
    - The regressors do not **cause** the dependent variable to change!
    - No causation even if the relationship is strong

---

## Predicted values

- The parameters `\(\hat\beta_{0}\)`, `\(\hat\beta_{1}\)` and `\(\hat u_i\)` fit the data.
For all `\(i\)` we have
`$$Y_{i} = \hat\beta_{0} + \hat\beta_{1} X_{i} + \hat u_i$$`
- The predicted value `\(\hat Y_{i}\)` of the linear model is given by
`$$\hat Y_{i} = \hat\beta_{0} + \hat\beta_{1} X_{i}$$`
- An out-of-sample prediction of `\(Y\)` can be obtained by inserting a value of `\(X\)` not in the sample:
`$$Y = \hat\beta_{0} + \hat\beta_{1} X$$`

---

## Measures of fit

- The `\(R^2\)` statistic is a measure of how much of the variation in `\(Y\)` is explained by `\(X\)`
- The total variation in `\(Y_{i}\)` is given by the total sum of squares
`$$TSS=\sum_{i=1}^{n} (Y_{i} - \bar Y)^2$$`
- The variation in the predicted values `\(\hat Y_{i}\)` is given by the explained sum of squares
`$$ESS=\sum_{i=1}^{n} (\hat Y_{i} - \bar Y)^2$$`
- The `\(R^2\)` is the ratio of the two
`$$R^2 = \frac{ESS}{TSS}$$`

---

## The `\(R^{2}\)` and Test scores data

- Relatively low `\(R^{2}\)`
- Other factors affect test scores
    - Student characteristics
    - Randomness in exam results

---

## [Data and regressions](http://rstudio.sh.se/content/statistics03-figs.Rmd#section-uncertainty)

- Regression with `\(\beta=\left(0,1\right)\)`, `\(\sigma_u^{2}=1\)` and `\(0<x<10\)`
- Grey area shows the possible linear relationships within the 95% confidence interval

![](figures/ldisp.jpg)

---

## Large variance increases uncertainty

- Regression with `\(\sigma_u^{2}=4\)`

![](figures/lvar.jpg)

---

## Small variability in regressors

- Same `\(\beta\)` and `\(\sigma_u^{2}\)`, but `\(4<x<6\)`
- Note that a high estimated slope implies a small estimated intercept

![](figures/sdisp.jpg)

---

## Distribution of the error `\(u\)` and of `\(\hat\beta\)`

- Error term takes only the values `\(u_i=-1\)` and `\(u_i=1\)`
- With a large sample `\(\hat\beta\)` will be close to normally distributed
- Knowledge of the distribution can increase efficiency
    - Here we can see the *exact* relationship

![](figures/nonnormal.jpg)

---

## Non-normal residuals

- Residuals are gathered around `\(\hat u_i=-1\)` and `\(\hat u_i=1\)`
- Not normally distributed!

![](figures/nnresid.jpg)

---

class: inverse, center, middle

# [Dummy variables](http://rstudio.sh.se/content/statistics04-figs.Rmd)

---

## Dummy variable regression

- Line from the mean for `\(female = 0\)` to the mean for `\(female = 1\)`
- The slope corresponds to the lower average wage for women

![](lecture02_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

## Dummy variable regression

- Same plot using _jitter_ (moving points slightly horizontally)
- Data points overlap less when displayed slightly moved

![](lecture02_files/figure-html/unnamed-chunk-6-1.png)<!-- -->

---

class: inverse, center, middle

# Heteroskedasticity

---

## Heteroskedasticity

- Variability in `\(Y\)` is often higher at higher values of `\(X\)`
    - For example, the same *percentage* variability implies larger absolute variability
- Does not affect the estimates
    - Confidence intervals change
    - The usual (homoskedasticity-only) variance-covariance estimate is usually too low
- Often increased variability for `\(X_i\)` far from the mean `\(\bar X\)`
    - Variability at the extremes results in more uncertainty than variability at the mean

---

## Heteroskedasticity plots

- Same data in both examples
    - First and second half switch places
- Notice the larger uncertainty (grey area) in the second figure
    - Observations with `\(X_{i}\)` far from the mean `\(\bar X\)` are more influential
    - Variability far from the mean increases uncertainty more

---

## Variability at the center

![](figures/hetcenter.png)

---

## Variability at the edges

![](figures/hetedges.png)

---

## Dealing with heteroskedasticity

Estimate with _robust_ standard errors (a short R sketch follows on the next slide)

- Tends to give larger standard errors
    - Better to be cautious...
- Unfortunately robust is _not the default in Stata_
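
---

## Robust standard errors in R

A minimal sketch, assuming the `sandwich` and `lmtest` packages are installed; the simulated data and parameter values are illustrative only.

```r
# Simulate data where the error variance grows with X (heteroskedasticity)
# (beta0 = 1, beta1 = 2 and n = 200 are illustrative choices)
set.seed(1)
n <- 200
X <- runif(n, 0, 10)
u <- rnorm(n, sd = 0.5 * X)      # Var(u | X) increases with X
Y <- 1 + 2 * X + u

fit <- lm(Y ~ X)

# Classical (homoskedasticity-only) standard errors
summary(fit)$coefficients

# Heteroskedasticity-robust standard errors
# (HC1 corresponds to Stata's 'robust' option)
library(sandwich)
library(lmtest)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
```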