class: center, middle, inverse, title-slide

# Microeconometrics - Lecture 8
## Panel data
### Jonas Björnerstedt
### 2022-03-15

---

## Lecture Content

- Chapter 10. Panel data
    1. Fixed Effects
        - Endogeneity - unobserved individual effects
        - Least Squares Dummy Variable (LSDV) estimation
        - Fixed Effects estimation
        - Robust standard errors
    2. Random Effects
        - Efficiency: observations on each individual are correlated
    3. Long panels

---
class: inverse, center, middle

# Panel data

---

## Panel data = multiple observations

- **Cross section** - observations over `\(n\)` individuals in one time period
- **Time series** - observations of one individual over `\(T\)` time periods
- **Panel data** - observations of `\(n\)` individuals over `\(T\)` time periods
    - Also called *longitudinal* data
    - Observations have two subscripts: `\(X_{it}\)` and `\(Y_{it}\)`
- Can be
    - **balanced**: exactly `\(T\)` observations per `\(i\)`
    - **unbalanced**: `\(T_i \le T\)` observations for individual `\(i\)`
- Both are handled the same way.
    - Think about endogeneity: Why is the data unbalanced?!
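
Whether a panel is balanced can be checked by counting observations per individual. A minimal sketch in base R, assuming a made-up toy panel (the `panel` data frame, its column names and its values are hypothetical and not taken from the crime data):

```r
# Hypothetical toy panel: 3 individuals, at most 3 periods each
panel = data.frame(
  id   = c(1, 1, 1, 2, 2, 2, 3, 3),
  year = c(1, 2, 3, 1, 2, 3, 1, 2)
)

# Number of observations per individual
obs_per_id = table(panel$id)

# Balanced means every individual has the same number of observations
is_balanced = length(unique(as.vector(obs_per_id))) == 1

print(obs_per_id)
print(is_balanced)   # FALSE here: individual 3 is observed only twice
```

If the check fails, it is worth asking why some individuals drop out before deciding how to treat the missing periods.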
---

## Panel data - Advantages

Many observations of the same individual `\(i\)` over time `\(t\)` give:

- More data
- Can estimate individual coefficients
- Can handle unobserved individual characteristics
- Can handle autocorrelation and heteroskedasticity

---

## Short and long panels

- Short panel
    - Extension of cross section analysis
    - Independently sampled individuals
    - Arbitrary heteroskedasticity and autocorrelation structures
    - Properties derived as `\(n \rightarrow \infty\)`
- Long (or dynamic) panel
    - Time series structure becomes important
    - Extension of time series analysis
    - Properties derived as `\(T \rightarrow \infty\)`

---
class: inverse, center, middle

# Fixed effects

---

## Crime dataset

* In `wooldridge` package
* Relationship between law enforcement and crime
    * `prbarr` - Probability of Arrest
    * `crmrte` - Crime Rate
* Focus on 4 counties

```r
library(wooldridge)  # provides the crime4 data
library(dplyr)       # filter()
data(crime4)
css = filter(crime4, county %in% c(1, 3, 145, 23))  # subset to 4 counties
```

---

## Crime plot

```r
library(ggplot2)
library(xaringanthemer)  # theme_xaringan()
ggplot(css, aes(x = prbarr, y = crmrte)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_xaringan() +
  labs(x = 'prbarr - Probability of Arrest', y = 'crmrte - Crime Rate')
```

![](me08_files/figure-html/crime1-1.png)<!-- -->

---

## Effect of change in variable

* How much higher do we expect crime to be if the probability of arrest goes from 0.2 to 0.3?
  (or 20% to 30% in other words)

```r
library(estimatr)  # lm_robust()
library(knitr)     # kable()
xsection = lm_robust(crmrte ~ prbarr, data = css)
xsection_p = predict(xsection, newdata = data.frame(prbarr = c(0.2, 0.3)))
kable(xsection_p)
```

|         x|
|---------:|
| 0.0214952|
| 0.0279753|

* `predict` is used to obtain predictions on actual data (fitted values) or on hypothetical values

---

## Panel data

* Different areas have different crime rates

![](me08_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

## District relationships

* Looks like they all have similar slopes

![](me08_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

## Only different intercept - _Fixed Effect_

* With the same slope, only different intercepts

![](me08_files/figure-html/dummy-1.png)<!-- -->

---

## Calculating the slope

* Now we will use three different methods for estimating the relationship
    1) Use dummy variables
    2) Subtract off mean (demean)
    3) Fixed effects estimator
* In practice we use the last method

---

### Dummy Variable Regression

```r
library(broom) # pretty print regression results
mod = list()
dvreg = lm(crmrte ~ prbarr + factor(county) + 0, css)
tidy(dvreg) # pretty print regression results
```
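
| term              | estimate | std.error | statistic |  p.value |
|:------------------|---------:|----------:|----------:|---------:|
| prbarr            |  -0.0284 |    0.0136 |     -2.08 |   0.0486 |
| factor(county)1   |   0.0449 |   0.00456 |      9.87 | 9.85e-10 |
| factor(county)3   |   0.0199 |   0.00265 |      7.54 | 1.18e-07 |
| factor(county)23  |   0.0364 |     0.004 |       9.1 | 4.37e-09 |
| factor(county)145 |   0.0384 |    0.0049 |      7.85 | 5.98e-08 |
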
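
To compare this with the pooled prediction from the earlier slide, the implied change in the crime rate when `prbarr` goes from 0.2 to 0.3 can be computed from the reported numbers. A small arithmetic sketch (the variable names are illustrative; the values are copied from the slide output):

```r
# Pooled regression: predicted crime rates at prbarr = 0.2 and 0.3
pooled_pred = c(0.0214952, 0.0279753)
pooled_change = diff(pooled_pred)          # 0.0064801, crime rate rises

# Dummy variable regression: slope on prbarr from the table
lsdv_slope = -0.0284
lsdv_change = lsdv_slope * (0.3 - 0.2)     # -0.00284, crime rate falls

print(c(pooled = pooled_change, lsdv = lsdv_change))
```

The sign of the predicted change flips once each county is allowed its own intercept.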
---

## Demeaning

* Subtract off the mean in each county from the observations in that county

```r
css2 = group_by(css, county)
cdata = mutate(css2,
               crmrte = crmrte - mean(crmrte),
               prbarr = prbarr - mean(prbarr))
```

Estimation using demeaned variables:

```r
demeanreg = lm_robust(crmrte ~ prbarr + 0, data = cdata)
tidy(demeanreg) # pretty print regression results
```
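
| term   | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|:-------|---------:|----------:|----------:|--------:|---------:|----------:|---:|:--------|
| prbarr |  -0.0284 |    0.0177 |     -1.61 |    0.12 |  -0.0646 |   0.00785 | 27 | crmrte  |
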
* Negative relationship

---

## Demeaning illustration

![Animation of a fixed effects panel data estimator: we remove *between group* variation and concentrate on *within group* variation only](me08_files/figure-html/anim-1.gif)

---

### Using a package

* Different packages are available for Fixed Effects estimation. Here we use `lm_robust`

```r
fe_reg = lm_robust(crmrte ~ prbarr, data = css, fixed_effects = county)
tidy(fe_reg)
```
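
| term   | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|:-------|---------:|----------:|----------:|--------:|---------:|----------:|---:|:--------|
| prbarr |  -0.0284 |    0.0202 |     -1.41 |   0.173 |  -0.0701 |    0.0134 | 23 | crmrte  |
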
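
The `df` column differs between the demeaned regression (27) and the fixed effects regression (23). The difference matches the number of county means: the fixed effects estimator counts them as estimated parameters, while the regression on demeaned data does not know they were estimated. A short arithmetic sketch using the sample size reported on the slides (variable names are illustrative):

```r
n_obs = 28        # observations in the 4-county subset
n_counties = 4    # county means removed by demeaning
n_slopes = 1      # one regressor: prbarr

df_demeaned = n_obs - n_slopes                 # 27
df_fe = n_obs - n_counties - n_slopes          # 23

print(c(demeaned = df_demeaned, fe = df_fe))
```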
---

## Comparing results

* Same estimated coefficient
* `huxreg` with titles for each regression

```r
library(huxtable)  # huxreg()
huxreg('Dummy' = dvreg, 'Demeaned' = demeanreg, 'FE' = fe_reg)
```
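
|                   | Dummy        | Demeaned | FE      |
|:------------------|-------------:|---------:|--------:|
| prbarr            | -0.028 \*    | -0.028   | -0.028  |
|                   | (0.014)      | (0.018)  | (0.020) |
| factor(county)1   | 0.045 \*\*\* |          |         |
|                   | (0.005)      |          |         |
| factor(county)3   | 0.020 \*\*\* |          |         |
|                   | (0.003)      |          |         |
| factor(county)23  | 0.036 \*\*\* |          |         |
|                   | (0.004)      |          |         |
| factor(county)145 | 0.038 \*\*\* |          |         |
|                   | (0.005)      |          |         |
| N                 | 28           | 28       | 28      |
| R2                | 0.991        | 0.159    | 0.893   |
| logLik            | 126.516      |          |         |
| AIC               | -241.032     |          |         |

\*\*\* p < 0.001; \*\* p < 0.01; \* p < 0.05.
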
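
The equality of the slope estimates is not specific to the crime data. The sketch below checks it in base R on simulated data (all names and numbers are hypothetical). It uses plain `lm` with classical standard errors for both regressions, so it does not reproduce the `lm_robust` standard errors in the table; it only illustrates that the slopes coincide and that the classical standard errors differ by a degrees-of-freedom factor.

```r
set.seed(1)                       # reproducible simulated data
n_id = 4                          # individuals
n_t = 7                           # periods per individual (balanced panel)
id = rep(1:n_id, each = n_t)
alpha = c(0.5, -1, 2, 0)[id]      # made-up individual effects
x = alpha + rnorm(n_id * n_t)     # regressor correlated with alpha
y = 1 - 0.5 * x + 3 * alpha + rnorm(n_id * n_t)

# 1) LSDV: one dummy per individual, no common intercept
lsdv = lm(y ~ x + factor(id) + 0)

# 2) Demeaning: ave() returns the group mean for each observation
y_dm = y - ave(y, id)
x_dm = x - ave(x, id)
dm_reg = lm(y_dm ~ x_dm + 0)

b_lsdv = coef(lsdv)[["x"]]
b_dm = coef(dm_reg)[["x_dm"]]

# Classical standard errors; the demeaned one ignores the estimated means
se_lsdv = summary(lsdv)$coefficients["x", "Std. Error"]
se_dm = summary(dm_reg)$coefficients["x_dm", "Std. Error"]
se_dm_adj = se_dm * sqrt(df.residual(dm_reg) / df.residual(lsdv))

print(c(lsdv = b_lsdv, demeaned = b_dm))
print(c(lsdv = se_lsdv, demeaned_adjusted = se_dm_adj))
```

Both printed pairs should agree up to numerical precision, which is why a dedicated fixed effects estimator is preferred over running a regression on manually demeaned data.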
---

## Individual intercept - LSDV estimate

- Assume that the model has only one regressor `\(X\)`

`$$Y_{it} = \beta_{0} + \beta_{1}X_{it} + \alpha_{i} + u_{it}$$`

- Then each individual has their own intercept

`$$Y_{it} = \left(\beta_{0} + \alpha_{i}\right) + \beta_{1}X_{it} + u_{it}$$`

- Individual dummies catch everything that is constant within an individual
- Dummies must be included if `\(\alpha_{i}\)` is correlated with the regressors
    - Otherwise `\(\alpha_{i}\)` is an omitted variable: `\(W_{it} = \alpha_i\)`
    - The estimate without dummies will then be biased and inconsistent
- Compare with the Pooled estimate
- Denoted Least Squares Dummy Variable (LSDV) estimate

---

## Bias of pooled OLS and panel estimate

- Bias depends on whether the individual effect `\(\alpha_{i}\)` is correlated with some `\(X_{it}\)`

![](figures/randomeffects.png)

---

## Bias of pooled OLS and panel estimate

- Bias depends on whether the individual effect `\(\alpha_{i}\)` is correlated with some `\(X_{it}\)`

![](figures/fixedeffects.png)

---

## Fixed Effects (FE) Model

- For each individual, take the difference from that individual's average over time
- Assume that the model has only one regressor `\(X\)`:

`$$Y_{it} = \beta_{0} + \beta_{1}X_{it} + \alpha_{i} + u_{it}$$`

- Let `\(\bar{X}_{i}\)` be the mean value of `\(X\)` for individual `\(i\)` over time `\(t\)`

`$$\tilde{X}_{it} = X_{it} - \bar{X}_{i}$$`

- Demeaning all variables removes `\(\beta_{0}\)` and `\(\alpha_{i}\)`, since both are constant over time

`$$\tilde{Y}_{it} = \beta_{1}\tilde{X}_{it} + \tilde{u}_{it}$$`

- Problems with LSDV
    - Output is not concise: one coefficient per individual (we don’t care about individuals)
    - Computational limitations: huge variance-covariance matrix
- LSDV is statistically identical to FE

---

## Serial correlation, Heteroskedasticity and data

- In time series data, the structure of autocorrelation is limited by data
    - For sample size `\(T\)`, an unrestricted correlation structure has on the order of `\(T^{2}\)` parameters
    - More parameters than observations!
- With many individuals compared to time periods, `\(n \gg T\)`
    - Arbitrary serial correlation can be accounted for
    - Short panel allows estimation of any form of autocorrelation and heteroskedasticity
- Clustered standard errors
    - Also called HAC standard errors
    - Heteroskedasticity and Autocorrelation Consistent

---
class: inverse, center, middle

# Random effects

---

## Correlated errors

- Errors for the same individual are correlated over time, since they share `\(\alpha_{i}\)`

`$$v_{it} = u_{it} + \alpha_{i}$$`

- But not correlated with any regressor

![](figures/charity.png)

---

## Random effects (RE)

- Also called the Error Components model

`$$v_{it} = u_{it} + \alpha_{i}$$`

- A Generalized Least Squares (GLS) model
    - Use knowledge of the error structure to increase efficiency of the estimate
- Strong assumptions required to treat this as a correlation problem
    - The error `\(v_{it}\)`, including the individual unobservables `\(\alpha_{i}\)`, is uncorrelated with `\(X_{it}\)`

---

## Random effects (RE)

- If assumptions hold, RE is more efficient than
    - the pooled estimate if unobserved heterogeneity exists
    - the FE estimate if the unobserved heterogeneity is uncorrelated with the observed regressors
- Hausman test
    - Is the FE estimate significantly different from the RE estimate?
    - Assumes homoskedastic error

---

## Comparison of Fixed and Random effects

- RE is more efficient than FE
    - Variation between individuals is also used
    - Between variation assumed to be uncorrelated with regressors
- Strong assumptions required for RE to be consistent
    - Individual unobservables `\(W_{it}\)` are uncorrelated with observables `\(X_{it}\)`
- Major point of FE is to solve the problem of unobservables
    - Increased efficiency is often a lesser issue

---

## Next lecture

- Chapter 8. Nonlinear models