Statistical estimation


Estimators

Let θ be a parameter whose value is unknown and must be estimated from a sample (X1, …, Xn).

An estimator T is a function of the sample T = τ(X1, …, Xn). It is supposed to take values close to the parameter θ:

  1. before the data have been collected, the estimator is a random variable
  2. after the data have been collected, the value taken by the estimator is called the estimate

The empirical mean of a sample is an estimator of the theoretical mean of the variable. Before the data have been collected, the empirical mean is a random variable: $\bar{X} = \frac{X_1 + \ldots + X_n}{n}$.

Now, we collect or simulate data (x1, …, xn) from a Normal distribution with theoretical mean equal to μ = 10:

R> n<-100
R> X<-rnorm(n, mean=10, sd=1)    # sample of size n from N(10, 1)
R> mean(X)                       # empirical mean: estimate of mu
[1] 10.17888

The empirical mean (mean(X)) is a value computed from the sample X; it is the estimate of μ.

Simulating a new sample changes the estimate but the estimator remains the same (we use the same function mean, the same concept):

R> X<-rnorm(n, mean=10, sd=1)
R> mean(X)
[1] 9.944401

Here are some basic examples of estimators:
  • The empirical frequency $\bar{X} = \frac{X_1 + \ldots + X_n}{n}$ of an event is an estimator of the probability of that event.
  • The empirical mean $\bar{X} = \frac{X_1 + \ldots + X_n}{n}$ of a sample is an estimator of the theoretical mean of the variables.
  • The empirical variance $S^2=\frac{n}{n-1}\left(\frac{X_1^2+\ldots+X_n^2}{n}-\bar{X}^2\right)$ of a sample is an estimator of the theoretical variance of the variables; a quick numerical check is given below.
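
As a quick check, the formula for S² can be compared with R's built-in var() on a simulated sample (a minimal sketch; the sample size and distribution are chosen arbitrarily, and the two expressions agree for any sample):

R> n <- 50
R> X <- rnorm(n, mean=2, sd=3)
R> n/(n-1)*(mean(X^2) - mean(X)^2)     # S^2 computed from the formula above
R> var(X)                              # built-in estimator: same value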

Example: Estimator of the variance:

R> n<-100
R> X<-rnorm(n, mean=10, sd=1)
R> var(X)
[1] 0.5813947

Example: Estimator of the probability of a binary event:

R> n<-100
R> p<-0.3
R> X<-rbinom(n,1, p)
R> mean(X)
[1] 0.32

The main properties of an estimator T = τ(X1, …, Xn) are:

  • The bias of T is the difference between the expected value of T and the true value: Bias = 𝔼(T) − θ.
  • The mean squared error of T is the expected squared difference between T and the true value: MSE = 𝔼((T − θ)²); both quantities can be approximated by simulation, as sketched below.
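
A minimal sketch for the empirical mean of a Gaussian sample (the values of N, n and mu are chosen only for illustration):

R> N <- 10000; n <- 30; mu <- 5
R> Tm <- rowMeans(matrix(rnorm(n*N, mu, 1), N, n))   # N realizations of the empirical mean
R> mean(Tm) - mu                                     # approximate bias (close to 0)
R> mean((Tm - mu)^2)                                 # approximate MSE (close to 1/n)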

The estimator T is:

  • unbiased if the bias is zero (the values of T are centered on the true value),
  • asymptotically unbiased if the bias tends to 0 as the sample size increases to infinity,
  • consistent if it converges to the parameter as the sample size increases to infinity (the larger the sample, the closer the values of T to the target θ).

The empirical frequency is an unbiased and consistent estimator of the probability of an event. The empirical mean is an unbiased and consistent estimator of the theoretical mean, and the empirical variance is an unbiased and consistent estimator of the theoretical variance.

Example: Illustration of the properties of the empirical mean.

N samples of size n are drawn from the normal distribution 𝒩(μ, σ). The empirical mean is computed for each of the N samples. The boxplot of these N means is drawn, and the true value is added as a horizontal line. R code:
R> N<-1000
R> n<-100
R> mu<-10
R> X <- rnorm(n*N,mu,1)                    # all values
R> X <- matrix(X,N,n)                          # N samples as rows
R> Tmu1 <- rowMeans(X)                         # mean
R> boxplot(Tmu1)
R> abline(h=mu,col="red")                      # true value

Result: [figure: boxplot of the N empirical means, with the true value μ shown as a red horizontal line]

All the values of the boxplot are close to the true value (red line), meaning that the parameter μ is well estimated and without apparent bias.
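
Consistency can be illustrated in the same way by letting the sample size grow. A small sketch (sample sizes chosen for illustration): one boxplot of the N empirical means is drawn per sample size, and the spread shrinks as n increases.

R> N <- 1000; mu <- 10
R> sizes <- c(10, 100, 1000)
R> means <- lapply(sizes, function(n) rowMeans(matrix(rnorm(n*N, mu, 1), N, n)))
R> boxplot(means, names=sizes)                       # one boxplot per sample size
R> abline(h=mu, col="red")                           # true value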

Confidence intervals

A confidence interval at level 1 − α for a parameter is an interval, computed from the sample, that contains the true value of the parameter with probability 1 − α (usually 1 − α = 0.95 or 1 − α = 0.99).

Confidence intervals for a Gaussian sample

For a random sample from the normal distribution 𝒩(μ, σ) (Gaussian sample), one can compute an exact confidence interval for the theoretical mean μ:

Let (X1, …, Xn) be a sample of n independent identically distributed (iid) normal variables 𝒩(μ, σ).

  • If the theoretical standard deviation σ is known, a confidence interval at level 1 − α for μ is obtained by:
    $$[\bar{X}-u_{1-\alpha/2}\frac{\sigma}{\sqrt{n}};\bar{X}+u_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}]$$
    where u1 − α/2 is the quantile of order 1 − α/2 for the normal distribution 𝒩(0, 1).
  • If the theoretical standard deviation σ is unknown, the estimator S² is used instead and a confidence interval at level 1 − α for μ is obtained by:
    $$[\bar{X}-t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}};\bar{X}+t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}}]$$
    where t1 − α/2 is the quantile of order 1 − α/2 for the Student t distribution with n − 1 degrees of freedom. Both formulas are evaluated directly in R below.
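
A minimal sketch using qnorm() and qt() on a simulated Gaussian sample (values chosen for illustration; the σ-unknown interval coincides with the one returned by t.test(X)$conf.int used later):

R> n <- 100; mu <- 10; sig <- 2; alpha <- 0.05
R> X <- rnorm(n, mu, sig)
R> mean(X) + c(-1,1)*qnorm(1-alpha/2)*sig/sqrt(n)        # sigma known
R> mean(X) + c(-1,1)*qt(1-alpha/2, df=n-1)*sd(X)/sqrt(n) # sigma unknown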

Example: Illustration of the notion of confidence interval.

The application draws N samples of size n from the normal distribution 𝒩(μ, σ), with μ and σ chosen by the user. It computes the N confidence intervals for μ, assuming σ known, and represents the intervals by blue horizontal segments and the true value of μ by a red vertical line.

Click on this link to play with confidence intervals

When the confidence level is 95%, around 5% of the confidence intervals (blue segments) do not contain the (red) true value. Be careful with the interpretation of a confidence interval: the random interval $[\bar{X}-t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}};\bar{X}+t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}}]$ contains the true value μ with probability 95%, but each of the N realized intervals either contains the true value or does not. Speaking of the probability that a realized confidence interval contains the true value is meaningless.

Play with the application to see the influence of:
  • the confidence level
  • the sample size
  • the variance of the sample
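
The coverage can also be checked numerically: simulate N samples, compute the N intervals with σ known, and count the proportion that contain μ. A minimal sketch (values chosen for illustration; the proportion should be close to 0.95):

R> N <- 1000; n <- 50; mu <- 10; sig <- 2; alpha <- 0.05
R> M <- rowMeans(matrix(rnorm(n*N, mu, sig), N, n))      # N empirical means
R> half <- qnorm(1-alpha/2)*sig/sqrt(n)                  # half-width of the interval, sigma known
R> mean(M - half <= mu & mu <= M + half)                 # empirical coverage, close to 0.95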

The confidence interval of μ when σ is unknown is computed with t.test(X)$conf.int.

Example: We simulate a random sample from a normal distribution.

R> mu<-10
R> sig<-2
R> n<-100
R> X<-rnorm(n,mu,sig)

Then we compute the confidence interval of μ assuming σ unknown.

R> t.test(X)$conf.int
[1]  9.499097 10.322427
attr(,"conf.level")
[1] 0.95

The first output is the confidence interval of μ from the sample X. The second output is the confidence level, 0.95 by default.

To change the level, one can use the argument conf.level. For example, the confidence intervals at levels 0.80 and 0.99 are:

R> t.test(X, conf.level=0.80)$conf.int
[1]  9.643093 10.178431
attr(,"conf.level")
[1] 0.8

R> t.test(X, conf.level=0.99)$conf.int
[1]  9.365863 10.455661
attr(,"conf.level")
[1] 0.99

For a Gaussian sample, a confidence interval at level 1 − α for the theoretical variance σ² is obtained by:
$$[(n-1)S^2/v_{1-\alpha/2};\ (n-1)S^2/u_{\alpha/2}]$$
where uα/2 is the quantile of order α/2 for the chi-squared distribution with n − 1 degrees of freedom, and v1 − α/2 is its quantile of order 1 − α/2.

Example: Compute the confidence interval at level 0.98 for the variance of the following sample: (5.3, 5.2, 5.6, 4.9, 5.2, 4.7, 5.3, 4.8, 5.1, 5.4).

R> sample <- c(5.3,5.2,5.6,4.9,5.2,4.7,5.3,4.8,5.1,5.4)
R> ms <- mean(sample)                         # empirical mean
R> sds <- sd(sample)                          # square root of S^2
R> n <- length(sample)
R> q <- qchisq(c(0.01,0.99),df=n-1)           # chi-squared quantiles of order alpha/2 and 1-alpha/2
R> sqrt((n-1)*sds^2/rev(q))                   # confidence interval for sigma
[1] 0.180387 0.581085

Interpretation: The two values are the bounds of the confidence interval of the standard deviation σ.
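
Squaring the bounds (equivalently, omitting the square root in the last command) gives the corresponding confidence interval for the variance σ², i.e. the formula given above:

R> (n-1)*sds^2/rev(q)                 # bounds are the squares of the values above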

Confidence intervals for large samples

For a large sample (X1, …, Xn) from any distribution, an approximate confidence interval at level 1 − α for the theoretical mean is obtained by:
$$[\bar{X}-u_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}};\bar{X}+u_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}}]$$
where u1 − α/2 is the quantile of order 1 − α/2 for the normal distribution 𝒩(0, 1).
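
A minimal sketch with a non-Gaussian sample, here exponential with rate 0.5 and hence theoretical mean 2 (values chosen for illustration):

R> n <- 200; alpha <- 0.05
R> X <- rexp(n, rate=0.5)                                # theoretical mean 1/0.5 = 2
R> mean(X) + c(-1,1)*qnorm(1-alpha/2)*sd(X)/sqrt(n)      # large-sample interval for the mean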

Confidence intervals for a probability for large samples

For a large binary sample (X1, …, Xn) taking values in {0, 1}, $\bar{X}$ is the empirical frequency of an event, called A. An approximate confidence interval at level 1 − α for the probability of the event A is obtained by:
$$[\bar{X}-u_{1-\alpha/2}\frac{\sqrt{\bar{X}(1-\bar{X})}}{\sqrt{n}};\bar{X}+u_{1-\alpha/2}\frac{\sqrt{\bar{X}(1-\bar{X})}}{\sqrt{n}}]$$
where u1 − α/2 is the quantile of order 1 − α/2 for the normal distribution 𝒩(0, 1).

Example: Simulate a binary sample with probability p = 0.7 of success. The size of the sample is n = 100. Compute the confidence interval at level 0.98 for the proportion, considering that the sample is large.

R> p <- 0.7
R> n <- 100
R> SS <- sample(c(0,1),n,replace=TRUE,prob=c(1-p,p)); mean(SS)
[1] 0.67
R> freq <- mean(SS)
R> freq
[1] 0.67

Recall that for a binary sample taking values in {0, 1}, the empirical mean (mean) is an estimator of the probability of success.

R> q <- qnorm(c(0.01,0.99))          # normal quantiles of order alpha/2 and 1-alpha/2
R> sdf <- sqrt(freq*(1-freq))        # estimated standard deviation of the binary variable
R> freq+q*sdf/sqrt(n)                # confidence interval for p
[1] 0.5606122 0.7793878

Interpretation: The two values are the bounds of the confidence interval of p.
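
For comparison, R also provides the built-in prop.test(), which returns a confidence interval for a proportion; it is based on a slightly different approximation, so its bounds will not match the ones above exactly:

R> prop.test(sum(SS), n, conf.level=0.98)$conf.int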
