class: center, middle, inverse, title-slide

# Probability and statistics
## Statistics
### Jonas Björnerstedt
### 2022-02-22

---

## Lecture Content

- Ch 3. Estimation of the population mean of `\(Y\)`
    - Today's lecture - the basic intuition

### In econometrics course

Estimation is presented in textbook chapters of increasing complexity:

- Ch 4. Estimation of the relationship between `\(X\)` and `\(Y\)`
    - Univariate regression - in econometrics course
- Ch 6. Estimation of the relationship between many variables `\(X, U, V, W, ...\)` and `\(Y\)`
    - Multivariate regression

---

## Statistics

- _Probability theory_
    - Given probability distributions, what can we say about a sample?
- _Statistics_
    - Given a sample, what can we infer about the population distribution of `\(Y\)`?
- _Estimator_ - a function of the data sample
    - a random variable!!
    - an estimator is a _statistic_
- Often intended to describe a property of the population
    - Ex: expected value `\(E(Y)\)` or variance `\(Var(Y)\)`

---

## Estimating the expected value

- Estimate the expected value of a random variable `\(Y\)`
- Assume that we can take a sample of size `\(n\)` of `\(Y\)`
`$$Y_1,Y_2,Y_3,...,Y_n$$`
- Independent and Identically Distributed (IID) draws
- Consider two estimators of the population mean `\(\mu_Y\)`

1. Sample mean: `\(\bar Y=\frac{1}{n}\sum_{i=1}^n Y_i\)`
1. First observation of the sample: `\(Y_1\)`

---

class: inverse, center, middle

# Tests

---

## Coin toss example

- A _fair coin_ has equal probability of heads and tails
- How do we determine if a coin is fair?

--

1. Toss the coin many times
1. Assign 1 if the outcome is heads and -1 if tails
1. Take the average
1. Check if the average is close to zero

--

- But how can we be sure it is fair if, for example, the average is 0.1 instead of zero?
- How sure are we?

---

## Expected value and average

- What is the relationship between the expected value in the population and sample means taken from the population?
- Can we say how close the average height of 100 randomly sampled people is to the mean value in the population (the expected value)?
- Two approaches:
    - Calculate the distribution of the mean
    - The Central Limit Theorem (CLT)

---

## Design of a test

- Select a _null hypothesis_ `\(H_0\)`
- Create a statistic with a known distribution under `\(H_0\)`
- Calculate the value of this statistic for the data
- Determine the probability of the actual outcome (or worse) occurring, assuming `\(H_0\)`
- Reject `\(H_0\)` if the probability of the actual outcome occurring is below a chosen critical level

---

## Exact distribution

- If we know the distribution of the random variable `\(X\)`, we can calculate the distribution of `\(\bar X\)`
    - Binomial distribution of a fair coin
    - Normal distribution
- Given this calculated distribution, we can see how likely the mean in our sample is
- Reject `\(H_0\)` if it is _very_ unlikely (given some choice of threshold)

---

## [Coin toss average - distribution <sup> 🔗 </sup>](http://rstudio.sh.se/content/statistics02-figs.Rmd#section-distribution)

![](statistics02_files/figure-html/unnamed-chunk-1-1.png)<!-- -->

---

## [Almost normal distribution <sup> 🔗 </sup>](http://rstudio.sh.se/content/statistics02-figs.Rmd#section-almost-normal)

![](statistics02_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

## Central Limit Theorem (CLT)

- Properties of large random samples of a random variable `\(Y\)`
    - Take the _mean_ `\(\bar Y\)` of independent random draws
    - `\(Y\)` can have (almost) any distribution
- Central Limit Theorem (CLT)
    - In large samples, the mean `\(\bar Y\)` has an almost normal distribution
- The expected value of `\(\bar Y\)` is `\(\mu_Y\)`
- The variance of `\(\bar Y\)` is
`$$Var(\bar Y)= \frac{Var(Y)}{n}$$`

---

## Tests

- Two types of tests

1. Hypothesis tests
    - test restrictions on the parameters
2. Specification tests
    - test that the model is correctly specified

---

## Errors

Two types of errors are possible in hypothesis testing.
- _Type 1 error_: Reject `\(H_0\)` when it is true
- _Type 2 error_: Do not reject `\(H_0\)` when it is false

---

## Fair coin plot

_p-value_ given by the green areas

![](statistics02_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

## Is a coin fair?

- Assign 1 to heads and -1 to tails
- Under the null hypothesis `\(H_0\)` the coin is fair: `\(E(Y)=\mu_Y=0\)` (blue line)
- The variance is known: `\(Var(Y)=1\)`
- Under the alternative hypothesis `\(H_1\)` we have `\(E(Y)<0\)` or `\(E(Y)>0\)`
- What is the probability that a sample average `\(\bar Y\)` is as far away as `\(\hat Y\)` or more?
    - Here the estimate is `\(\hat Y=0.4\)` (red line)
    - If `\(\hat Y=0.4\)` is possible, then `\(\hat Y=-0.4\)` should be as well
    - Called the _p-value_ (green areas)

---

## [Coin toss average - find distribution <sup> 🔗 </sup>](http://rstudio.sh.se/content/statistics02-figs.Rmd#section-Find-distribution)

![](statistics02_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

## Central limit theorem

- Assume that `\(Y\)` has mean `\(\mu_Y\)` and variance `\(\sigma^2_Y\)`
- Then for large `\(n\)` the sample average `\(\bar Y\)` has an almost normal distribution
`$$N(\mu_Y, \sigma_Y^2/n)$$`
- As the sample variance `\(s_Y^2\)` converges to the population variance `\(\sigma_Y^2\)`, we know approximately the distribution of the sample mean for large `\(n\)`.
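
---

## CLT by simulation

The CLT above can be checked numerically. A minimal simulation sketch (in Python, though the slide figures are produced in R): draw many samples of coin tosses, and verify that the sample means have expected value `\(\mu_Y = 0\)` and variance close to `\(Var(Y)/n\)`.

```python
import random
import statistics

random.seed(1)

n = 1000     # tosses per sample
reps = 2000  # number of samples

# Each toss is +1 (heads) or -1 (tails); Var(Y) = 1 for a fair coin
means = [
    sum(random.choice((1, -1)) for _ in range(n)) / n
    for _ in range(reps)
]

# CLT: the sample means are approximately N(0, 1/n)
print(statistics.mean(means))          # close to 0
print(statistics.variance(means) * n)  # close to Var(Y) = 1
```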
---

## Important distributions

If `\(X\)` and `\(Y\)` are independent standard normal variables

- `\(e^X\)` has the *lognormal distribution*
- `\(X^2\)` has the `\(\chi^2\)` *distribution* with 1 degree of freedom
- `\(X^2+Y^2\)` has the `\(\chi^2\)` *distribution* with 2 degrees of freedom
- `\(\frac{X}{\sqrt{Y^2}}\)` has the *t distribution* with 1 degree of freedom
- `\(\frac{X^2}{Y^2}\)` has the *F distribution* with (1,1) degrees of freedom
- `\(\frac{X}{Y}\)` has the *Cauchy distribution*

---

## [t distribution <sup> 🔗 </sup>](http://rstudio.sh.se/content/statistics02-figs.Rmd#section-t-distribution)

* The t distribution (black) has fatter tails than the normal distribution (red)

![](statistics02_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

## [Chi-squared distribution <sup> 🔗 </sup>](http://rstudio.sh.se/content/statistics02-figs.Rmd#section-chi-distribution)

![](statistics02_files/figure-html/unnamed-chunk-6-1.png)<!-- -->
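
---

## Computing a p-value

A sketch of the two-sided p-value from the coin example (in Python rather than the slides' R; the function name `p_value` and the sample size of 30 are illustrative assumptions, not from the slides). Under `\(H_0\)` the mean of `\(n\)` tosses is approximately `\(N(0, 1/n)\)`, so the p-value is the probability of a draw at least as far from zero as the observed mean.

```python
import math

def p_value(y_hat, n, sigma=1.0):
    """Two-sided p-value for H0: E(Y) = 0, using the normal approximation."""
    z = abs(y_hat) / (sigma / math.sqrt(n))  # standardised sample mean
    return math.erfc(z / math.sqrt(2))       # P(|Z| >= z) for Z ~ N(0, 1)

# Observed mean 0.4 from a hypothetical 30 tosses; Var(Y) = 1 under H0
print(p_value(0.4, 30))  # about 0.028: reject H0 at the 5% level
```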