class: center, middle, inverse, title-slide

# Probability and statistics
## Statistics
### Jonas Björnerstedt
### 2022-02-22

---

## Lecture Content

- Ch 3. Estimation of the population mean of `\(Y\)`
    - Today's lecture - the basic intuition

### In econometrics course

Estimation is presented in textbook chapters of increasing complexity:

- Ch 4. Estimation of the relationship between `\(X\)` and `\(Y\)`
    - Univariate regression - in econometrics course
- Ch 6. Estimation of the relationship between many variables `\(X, U, V, W, ...\)` and `\(Y\)`
    - Multivariate regression

---

## Statistics

- _Probability theory_
    - Given probability distributions, what can we say about a sample?
- _Statistics_
    - Given a sample, what can we infer about the population distribution of `\(Y\)`?
- _Estimator_ - a function of the data sample
    - a random variable!!
    - an estimator is a _statistic_
- Often intended to describe a property of the population
    - Ex: expected value `\(E(Y)\)` or variance `\(Var(Y)\)`

---

## Estimating the expected value

- Estimate the expected value of a random variable `\(Y\)`
- Assume that we can take a sample of size `\(n\)` of `\(Y\)`
`$$Y_1,Y_2,Y_3,...,Y_n$$`
- Independent and Identically Distributed (IID) draws
- Consider two estimators of the population mean `\(\mu_Y\)`

1. Sample mean: `\(\bar Y=\frac{1}{n}\sum_{i=1}^n Y_i\)`
1. First observation of the sample: `\(Y_1\)`

---

class: inverse, center, middle

# Tests

---

## Coin toss example

- A _fair coin_ has equal probability of heads and tails
- How do we determine if a coin is fair?

--

1. Toss the coin many times
1. Assign 1 if the outcome is heads and -1 if tails
1. Take the average
1. Check if the average is close to zero

--

- But how can we be sure it is fair if, for example, the average is 0.1 instead of zero?
- How sure are we?

---

## Expected value and average

- What is the relationship between the expected value in the population and sample means taken from the population?
- Can we say how close the average height of 100 randomly sampled people is to the mean value in the population (the expected value)?
- Two approaches:
    - Calculate the distribution of the mean
    - The Central Limit Theorem (CLT)

---

## Design of a test

- Select a _null hypothesis_ `\(H_0\)`
- Create a statistic with a known distribution under `\(H_0\)`
- Calculate the value of this statistic for the data
- Determine the probability of the actual outcome (or worse) occurring, assuming `\(H_0\)`
- Reject `\(H_0\)` if the probability of the actual outcome occurring is below a chosen critical level

---

## Exact distribution

- If we know the distribution of the random variable `\(X\)`, we can calculate the distribution of `\(\bar X\)`
    - Binomial distribution of a fair coin
    - Normal distribution
- Given this calculated distribution, we can see how likely the mean in our sample is
- Reject `\(H_0\)` if it is _very_ unlikely (given some choice of threshold)

---

## [Coin toss average - distribution <sup> 🔗 </sup>](http://rstudio.sh.se/content/statistics02-figs.Rmd#section-distribution)

![](statistics02_files/figure-html/unnamed-chunk-1-1.png)<!-- -->

---

## [Almost normal distribution <sup> 🔗 </sup>](http://rstudio.sh.se/content/statistics02-figs.Rmd#section-almost-normal)

![](statistics02_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

## Central Limit Theorem (CLT)

- Properties of large random samples of a random variable `\(Y\)`
    - Take the _mean_ `\(\bar Y\)` of independent random draws
    - `\(Y\)` can have (almost) any distribution
- Central Limit Theorem (CLT)
    - In large samples, the mean `\(\bar Y\)` has an almost normal distribution
- The expected value of `\(\bar Y\)` is `\(\mu_Y\)`
- The variance of `\(\bar Y\)` is
`$$Var(\bar Y)= \frac{Var(Y)}{n}$$`

---

## Tests

- Two types of tests

1. Hypothesis tests
    - test restrictions on the parameters
2. Specification tests
    - test that the model is correctly specified

---

## Errors

Two types of errors are possible in hypothesis testing.
- _Type 1 error_: Reject `\(H_0\)` when it is true
- _Type 2 error_: Do not reject `\(H_0\)` when it is false

---

## Fair coin plot

_p-value_ given by the green areas

![](statistics02_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

## Is a coin fair?

- Assign 1 to heads and -1 to tails
- Under the null hypothesis `\(H_0\)` the coin is fair: `\(E(Y)=\mu_Y=0\)` (blue line)
- The variance is known: `\(Var(Y)=1\)`
- Under the alternative hypothesis `\(H_1\)` we have `\(E(Y)<0\)` or `\(E(Y)>0\)`
- What is the probability that a sample average `\(\bar Y\)` is as far away as `\(\hat Y\)` or more?
    - Here the estimate is `\(\hat Y=0.4\)` (red line)
    - If `\(\hat Y=0.4\)` is possible, then `\(\hat Y=-0.4\)` should be as well
    - Called the _p-value_ (green areas)

---

## [Coin toss average - find distribution <sup> 🔗 </sup>](http://rstudio.sh.se/content/statistics02-figs.Rmd#section-Find-distribution)

![](statistics02_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

## Central limit theorem

- Assume that `\(Y\)` has mean `\(\mu_Y\)` and variance `\(\sigma^2_Y\)`
- Then for large `\(n\)` the sample average `\(\bar Y\)` has an almost normal distribution
`$$N(\mu_Y, \sigma_Y^2/n)$$`
- As the sample variance `\(s_Y^2\)` converges to the population variance `\(\sigma_Y^2\)`, we know approximately the distribution of the sample mean for large `\(n\)`.
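
---

## CLT by simulation

The CLT above can be checked numerically. A minimal simulation sketch (in Python, though the slide figures are produced in R): draw many samples of coin tosses, and verify that the sample means have expected value `\(\mu_Y = 0\)` and variance close to `\(Var(Y)/n\)`.

```python
import random
import statistics

random.seed(1)

n = 1000     # tosses per sample
reps = 2000  # number of samples

# Each toss is +1 (heads) or -1 (tails); Var(Y) = 1 for a fair coin
means = [
    sum(random.choice((1, -1)) for _ in range(n)) / n
    for _ in range(reps)
]

# CLT: the sample means are approximately N(0, 1/n)
print(statistics.mean(means))          # close to 0
print(statistics.variance(means) * n)  # close to Var(Y) = 1
```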
---

## Important distributions

If `\(X\)` and `\(Y\)` are independent standard normal variables

- `\(e^X\)` has the *lognormal distribution*
- `\(X^2\)` has the `\(\chi^2\)` *distribution* with 1 degree of freedom
- `\(X^2+Y^2\)` has the `\(\chi^2\)` *distribution* with 2 degrees of freedom
- `\(\frac{X}{\sqrt{Y^2}}\)` has the *t distribution* with 1 degree of freedom
- `\(\frac{X^2}{Y^2}\)` has the *F distribution* with (1,1) degrees of freedom
- `\(\frac{X}{Y}\)` has the *Cauchy distribution*

---

## [t distribution <sup> 🔗 </sup>](http://rstudio.sh.se/content/statistics02-figs.Rmd#section-t-distribution)

* The t distribution (black) has fatter tails than the normal distribution (red)

![](statistics02_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

## [Chi-squared distribution <sup> 🔗 </sup>](http://rstudio.sh.se/content/statistics02-figs.Rmd#section-chi-distribution)

![](statistics02_files/figure-html/unnamed-chunk-6-1.png)<!-- -->
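
---

## Computing a p-value

A sketch of the two-sided p-value from the coin example (in Python rather than the slides' R; the function name `p_value` and the sample size of 30 are illustrative assumptions, not from the slides). Under `\(H_0\)` the mean of `\(n\)` tosses is approximately `\(N(0, 1/n)\)`, so the p-value is the probability of a draw at least as far from zero as the observed mean.

```python
import math

def p_value(y_hat, n, sigma=1.0):
    """Two-sided p-value for H0: E(Y) = 0, using the normal approximation."""
    z = abs(y_hat) / (sigma / math.sqrt(n))  # standardised sample mean
    return math.erfc(z / math.sqrt(2))       # P(|Z| >= z) for Z ~ N(0, 1)

# Observed mean 0.4 from a hypothetical 30 tosses; Var(Y) = 1 under H0
print(p_value(0.4, 30))  # about 0.028: reject H0 at the 5% level
```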