class: center, middle, inverse, title-slide

# B/C Econometrics - Lecture 4
## Dummy variables and Panel data
### Jonas Björnerstedt
### 2021-10-25

---

## Lecture Content

1. Dummy variables
1. Panel data
    - Endogeneity - unobserved individual effects
    - Least Square Dummy Variables (LSDV) estimation
    - Fixed Effects estimation
1. Robust standard errors
1. Some R - pipes

---
class: inverse, center, middle

# [Dummy variables](https://rstudio.sh.se/content/statistics04-figs/)

---

# Regression and dummy variables

* [Dummy variable exercise](https://rstudio.sh.se/content/statistics04-figs/) <sup> 🔗 </sup>

---

## Relationship differs by gender

```r
ggplot(lengths) + aes(length, weight, color = gender) +
  geom_point() + geom_smooth(method = "lm")
```

```
## `geom_smooth()` using formula 'y ~ x'
```

![](lecture04_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

## Regression with dummy

* Only the intercept differs by gender

```r
reg = lm(weight ~ length + gender, data = lengths)
ggplot(lengths) + aes(length, weight, color = gender) + geom_point() +
  geom_line(aes(y = predict(reg)))
```

![](lecture04_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---
class: inverse, center, middle

# Panel data and Fixed effects

---

## Panel data = multiple observations

- **Cross section** - observations on `\(n\)` individuals in one time period
- **Time series** - observations on one individual over `\(T\)` time periods
- **Panel data** - observations on `\(n\)` individuals over `\(T\)` time periods
    - Also called *longitudinal* data
- Can be
    - **balanced**: exactly `\(T\)` observations per individual `\(i\)`
    - **unbalanced**: `\(t \le T\)` observations per individual `\(i\)`
    - Both are handled the same way
    - Think about endogeneity: why is the data unbalanced?
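---

## Panel structure in code

The balanced/unbalanced distinction can be illustrated with a minimal base R sketch (toy data, not the course datasets):

```r
# Hypothetical panel with n = 2 individuals and T = 3 time periods
balanced   <- data.frame(id = c(1, 1, 1, 2, 2, 2), t = c(1, 2, 3, 1, 2, 3))
unbalanced <- data.frame(id = c(1, 1, 1, 2, 2),    t = c(1, 2, 3, 1, 2))
table(balanced$id)    # every individual is observed exactly T = 3 times
table(unbalanced$id)  # individual 2 is observed only twice
```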
---

## Panel data - Advantages

Many observations of the same individual `\(i\)` over time `\(t\)` give:

- More data
- Can handle unobserved individual characteristics
- Can handle autocorrelation and heteroskedasticity
    - not discussed in this course

---

## Crime dataset

* In the `wooldridge` package
* Relationship between law enforcement and crime
* `prbarr` - Probability of Arrest
* `crmrte` - Crime Rate
* Focus on 4 counties

```r
library(wooldridge)
data(crime4)
css = filter(crime4, county %in% c(1, 3, 145, 23)) # subset to 4 counties
```

---

## Crime plot

```r
ggplot(css, aes(x = prbarr, y = crmrte)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_xaringan() +
  labs(x = 'prbarr - Probability of Arrest', y = 'crmrte - Crime Rate')
```

![](lecture04_files/figure-html/crime1-1.png)<!-- -->

---

## Effect of change in variable

* How much higher do we expect crime to be if the probability of arrest goes from 0.2 to 0.3 (in other words, from 20% to 30%)?

```r
xsection = lm_robust(crmrte ~ prbarr, data = css)
xsection_p = predict(xsection, newdata = data.frame(prbarr = c(0.2, 0.3)))
kable(xsection_p)
```

|         x|
|---------:|
| 0.0214952|
| 0.0279753|

* `predict()` is used to obtain predictions for actual data (fitted values) or for hypothetical values

---

## Panel data

* Different areas have different crime rates

![](lecture04_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

---

## District relationships

* Looks like they all have similar slopes

![](lecture04_files/figure-html/unnamed-chunk-9-1.png)<!-- -->

---

## Only different intercept - _Fixed Effect_

* With the same slope, only the intercepts differ

![](lecture04_files/figure-html/dummy-1.png)<!-- -->

---

## Calculating the slope

* We will estimate the relationship with three different methods:

1) Use dummy variables
2) Subtract off the mean (demean)
3) Fixed effects estimator

* In practice we use the last method

---

### Dummy Variable Regression

```r
library(broom) # pretty print regression results
dvreg = lm(crmrte ~ prbarr + factor(county) + 0, css)
tidy(dvreg)
```
| term              | estimate | std.error | statistic |  p.value |
|:------------------|---------:|----------:|----------:|---------:|
| prbarr            |  -0.0284 |    0.0136 |     -2.08 |   0.0486 |
| factor(county)1   |   0.0449 |   0.00456 |      9.87 | 9.85e-10 |
| factor(county)3   |   0.0199 |   0.00265 |      7.54 | 1.18e-07 |
| factor(county)23  |   0.0364 |     0.004 |      9.10 | 4.37e-09 |
| factor(county)145 |   0.0384 |    0.0049 |      7.85 | 5.98e-08 |
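---

### Reading the dummy regression

Each `factor(county)` coefficient is that county's intercept, and `prbarr` is the common slope. A quick check with the rounded estimates from the table, at a hypothetical arrest probability of 0.3:

```r
# Predicted crime rate in county 1 at prbarr = 0.3, using the rounded
# coefficients from the table above (an illustration, not new estimation)
b_prbarr  <- -0.0284  # common slope
a_county1 <-  0.0449  # intercept for county 1
a_county1 + b_prbarr * 0.3  # about 0.036
```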
---

## Demeaning

* Subtract off the mean in each county from the observations in that county

```r
css2 = group_by(css, county)
cdata = mutate(css2,
    crmrte = crmrte - mean(crmrte),
    prbarr = prbarr - mean(prbarr))
```

Estimation using the demeaned variables:

```r
demeanreg = lm_robust(crmrte ~ prbarr + 0, data = cdata)
tidy(demeanreg) # pretty print regression results
```
| term   | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|:-------|---------:|----------:|----------:|--------:|---------:|----------:|---:|:--------|
| prbarr |  -0.0284 |    0.0177 |     -1.61 |    0.12 |  -0.0646 |   0.00785 | 27 | crmrte  |
* Negative relationship

---

## Demeaning illustration

![Animation of a fixed effects panel data estimator: we remove *between group* variation and concentrate on *within group* variation only](lecture04_files/figure-html/anim-1.gif)

---

### Using a package

* Different packages are available for Fixed Effects estimation. Here we use `lm_robust`:

```r
fe_reg = lm_robust(crmrte ~ prbarr, data = css, fixed_effects = county)
tidy(fe_reg)
```
| term   | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|:-------|---------:|----------:|----------:|--------:|---------:|----------:|---:|:--------|
| prbarr |  -0.0284 |    0.0202 |     -1.41 |   0.173 |  -0.0701 |    0.0134 | 23 | crmrte  |
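---

### Dummy and within slopes coincide

The dummy-variable and demeaned (within) estimates are not equal by accident. A minimal base R sketch with simulated data (not the `crime4` dataset) showing that the two slopes are identical:

```r
set.seed(1)
d <- data.frame(id = rep(1:4, each = 7), x = runif(28))
d$y <- 0.04 - 0.03 * d$x + 0.01 * d$id + rnorm(28, sd = 0.005)
# 1) dummy variables: one intercept per individual
b_dummy <- coef(lm(y ~ x + factor(id) + 0, d))["x"]
# 2) demeaning: subtract group means with ave(), then regress without intercept
dm <- transform(d, x = x - ave(x, id), y = y - ave(y, id))
b_within <- coef(lm(y ~ x + 0, dm))["x"]
all.equal(unname(b_dummy), unname(b_within))  # TRUE
```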
---

## Comparing results

* Same estimated coefficient on `prbarr` in all three methods
* `huxreg` with a title for each regression

```r
huxreg('Dummy' = dvreg, 'Demeaned' = demeanreg, 'FE' = fe_reg)
```
|                   | Dummy     | Demeaned | FE      |
|:------------------|:----------|:---------|:--------|
| prbarr            | -0.028 *  | -0.028   | -0.028  |
|                   | (0.014)   | (0.018)  | (0.020) |
| factor(county)1   | 0.045 *** |          |         |
|                   | (0.005)   |          |         |
| factor(county)3   | 0.020 *** |          |         |
|                   | (0.003)   |          |         |
| factor(county)23  | 0.036 *** |          |         |
|                   | (0.004)   |          |         |
| factor(county)145 | 0.038 *** |          |         |
|                   | (0.005)   |          |         |
| N                 | 28        | 28       | 28      |
| R2                | 0.991     | 0.159    | 0.893   |
| logLik            | 126.516   |          |         |
| AIC               | -241.032  |          |         |

*** p < 0.001; ** p < 0.01; * p < 0.05.
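---

### Robust standard errors by hand

The `lm_robust()` calls above report heteroskedasticity-robust standard errors. As a sketch of what happens underneath, the HC0 "sandwich" variance can be computed by hand in base R (simulated data; `estimatr` defaults to HC2, so its numbers differ slightly):

```r
set.seed(2)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100, sd = abs(x))  # error variance grows with |x|
m <- lm(y ~ x)
X <- model.matrix(m)
u <- resid(m)
XtXinv <- solve(crossprod(X))
# sandwich formula: (X'X)^-1 X' diag(u^2) X (X'X)^-1
V_hc0 <- XtXinv %*% crossprod(X * u) %*% XtXinv
se_robust    <- sqrt(diag(V_hc0))
se_classical <- sqrt(diag(vcov(m)))  # assumes constant error variance
```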
---
class: inverse, center, middle

# Heteroskedasticity

---

## What is Heteroskedasticity?

- Variability is often higher at higher values
    - Same percentage variability
- Does not affect the estimate
- Confidence intervals change
    - The estimated variance-covariance matrix is usually too low
- Often increased variability for `\(X_i\)` far from the mean `\(\bar X\)`
    - Variability at the extremes results in more uncertainty than variability at the mean

---

## Heteroskedasticity plots

- Same data in both examples
    - First and second half switch places
- Notice the larger uncertainty (gray area) in the second figure
- Observations with `\(X_{i}\)` far from the mean `\(\bar X\)` are more influential
    - Variability far from the mean increases uncertainty more

---

## Variability at the center

![](figures/hetcenter.png)

---

## Variability at the edges

![](figures/hetedges.png)

---

## Dealing with heteroskedasticity

Estimate with _robust_ standard errors

- Tends to give larger standard errors
    - Better to be cautious...
- Unfortunately `lm()` does not calculate robust SE
- Use `lm_robust()` in the `estimatr` package

---
class: inverse, center, middle

# Pipes

---

## _Pipes_ in the tidyverse

* Not really needed, but makes code simpler - used in documentation
* Often we want to take a dataset and perform several steps in order
* The pipe operator ` %>% ` facilitates this

```r
select(lengths, length, weight)
```

can be written as

```r
lengths %>% select(length, weight)
```

* A pipe means: put the left hand side as the first argument of the function on the right hand side
* With several steps, the code is much easier to read:

```r
lengths %>%
    filter(gender == "Female") %>%
    select(length, weight)
```

---

## Selecting, correlating and formatting with pipe

```r
len2 = select(lengths, length, weight)
len3 = correlate(len2)
kable(len3)
```

* Same code - hard to read:

```r
kable(correlate(select(lengths, length, weight)))
```

* Same code - with pipes: `%>%`

```r
lengths %>%
    select(length, weight) %>%
    correlate() %>%
    kable()
```
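---

## Aside: the base R pipe

Since R 4.1, base R also has a native pipe `|>`, which works the same way for simple first-argument piping. A small sketch with toy data (not the course's `lengths` dataset):

```r
d <- data.frame(length = c(10, 12, 14, 16), weight = c(1.1, 1.5, 2.0, 2.6))
# cor() of the two selected columns, written as a pipeline
d |> subset(select = c(length, weight)) |> cor()
```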