Statistical estimation
Estimators
Let θ be a parameter, the value of which is unknown and must be guessed from a sample (X1, …, Xn).
The empirical mean of a sample is an estimator of the theoretical mean of the variable. Before the data have been collected, the empirical mean is a random variable: $\bar{X} = \frac{X_1 + \ldots + X_n}{n}$.
Now, we collect or simulate data (x1, …, xn) from a Normal distribution with theoretical mean equal to μ = 10:
R> n<-100
R> X<-rnorm(n, mean=10, sd=1)
R> mean(X)
[1] 10.17888
The empirical mean (mean(X)) is a value computed from the sample X; it is the estimate of μ.
Simulating a new sample changes the estimate, but the estimator remains the same (we use the same function mean, the same concept):
R> X<-rnorm(n, mean=10, sd=1)
R> mean(X)
[1] 9.944401
Here are some basic examples of estimators.
Example: Estimator of the variance:
R> n<-100
R> X<-rnorm(n, mean=10, sd=1)
R> var(X)
[1] 0.5813947
Example: Estimator of the probability of a binary event:
R> n<-100
R> p<-0.3
R> X<-rbinom(n,1, p)
R> mean(X)
[1] 0.32
The empirical frequency is an unbiased, consistent estimator of the probability of an event. The empirical mean is an unbiased, consistent estimator of the theoretical mean, and the empirical variance (computed by var, which divides by n − 1) is an unbiased, consistent estimator of the theoretical variance.
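The role of the n − 1 denominator in the unbiasedness of the empirical variance can be checked by simulation. The following is a minimal sketch, not part of the original material: the seed, the number of samples N and the small sample size n are arbitrary choices made here so that the bias of the n-denominator version is visible.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
N <- 5000                        # number of simulated samples (assumed value)
n <- 10                          # small sample size, so that the bias is visible
X <- matrix(rnorm(n * N, mean = 10, sd = 1), N, n)  # N samples as rows

s2_unbiased <- apply(X, 1, var)            # var() divides by n - 1
s2_biased   <- s2_unbiased * (n - 1) / n   # same sum of squares divided by n

mean(s2_unbiased)   # should be close to the true variance, 1
mean(s2_biased)     # should be close to (n - 1) / n = 0.9
```

Averaged over many samples, the n − 1 version should land near the true variance, while the n version should be systematically too small by the factor (n − 1)/n.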
Example: Illustration of the properties of the empirical mean.
N samples of size n are drawn from the normal distribution 𝒩(μ, σ). The empirical mean is computed for each of the N samples. A boxplot of these N means is drawn, and the true value is added as a horizontal line. R code:
R> N<-1000
R> n<-100
R> mu<-10
R> X <- rnorm(n*N,mu,1) # all values
R> X <- matrix(X,N,n) # N samples as rows
R> Tmu1 <- rowMeans(X) # empirical mean of each sample
R> boxplot(Tmu1)
R> abline(h=mu,col="red") # true value
Result: boxplot of the N empirical means, with the true value μ as a red horizontal line.
All the values in the boxplot are close to the true value (red line), which indicates that the parameter μ is well estimated, without bias.
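Consistency can be illustrated in the same way by letting the sample size grow. The sketch below is an illustration under assumed settings (the seed and the three sample sizes are choices made here): it compares the simulated standard deviation of the empirical mean with the theoretical value σ/√n, with σ = 1.

```r
set.seed(2)                      # arbitrary seed, only for reproducibility
N     <- 1000                    # number of samples per sample size
mu    <- 10
sizes <- c(10, 100, 1000)        # sample sizes to compare (assumed values)

# For each sample size, draw N samples and measure the spread of the N means
spread <- sapply(sizes, function(n) {
  X <- matrix(rnorm(n * N, mu, 1), N, n)   # N samples of size n as rows
  sd(rowMeans(X))
})

# Simulated spread next to the theoretical value sigma / sqrt(n), sigma = 1
cbind(n = sizes, simulated = spread, theoretical = 1 / sqrt(sizes))
```

The simulated spread should shrink roughly like 1/√n, which is why the boxplot of the means tightens around μ as n increases.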
Confidence intervals
Confidence intervals for a Gaussian sample
For a random sample from the normal distribution 𝒩(μ, σ) (Gaussian sample), one can compute an exact confidence interval for the theoretical mean μ.
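For reference, the two standard forms of this interval at level 1 − α can be written as follows. The notation $z_{1-\alpha/2}$ for the standard normal quantile is an assumption of this sketch; $t_{1-\alpha/2}$ denotes the quantile of the Student distribution with n − 1 degrees of freedom and $S^2$ the empirical variance.

```latex
% sigma known: standard normal quantile
\left[\bar{X} - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} \,;\; \bar{X} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right]

% sigma unknown: Student quantile with n - 1 degrees of freedom
\left[\bar{X} - t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}} \,;\; \bar{X} + t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}}\right]
```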
Example: Illustration of the notion of confidence interval.
The interactive application draws N samples of size n from the normal distribution 𝒩(μ, σ), with μ and σ chosen by the user. It computes the N confidence intervals for μ assuming σ known. It represents the intervals by blue horizontal segments and the true value of μ by a red vertical line.
Click on this link to play with confidence intervals
When the confidence level is 95%, around 5% of the confidence intervals (blue segments) do not contain the (red) true value. Be careful with the interpretation of a confidence interval. The theoretical confidence interval $\left[\bar{X}-t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}};\bar{X}+t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}}\right]$, whose bounds are random variables, contains the true value μ with probability 1 − α (here 95%). Each of the N realized intervals, however, either contains the true value or does not. Speaking of the probability of containing the true value makes no sense for a realized confidence interval.
Play with the application to see the influence of:
* the confidence level
* the sample size
* the variance of the sample
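The coverage statement can also be checked without the application. The following sketch is an illustration under assumed settings (seed, N, n, μ and σ are choices made here): it builds the Student interval with σ unknown for each of N simulated samples and counts how often the true value is covered.

```r
set.seed(3)                      # arbitrary seed, only for reproducibility
N <- 2000; n <- 100              # number of samples and sample size (assumed)
mu <- 10; sig <- 2               # true parameters (assumed)
level <- 0.95

X <- matrix(rnorm(n * N, mu, sig), N, n)   # N samples as rows
m <- rowMeans(X)                           # empirical mean of each sample
s <- apply(X, 1, sd)                       # empirical standard deviation of each sample

# Half-width of the Student interval for each sample
half <- qt(1 - (1 - level) / 2, df = n - 1) * s / sqrt(n)

# TRUE when the realized interval contains the true value mu
covered <- (m - half <= mu) & (mu <= m + half)
mean(covered)                    # proportion of intervals containing mu
```

The proportion should be close to the confidence level. Each individual entry of covered is simply TRUE or FALSE, which is the point made above about realized intervals.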
Example: We simulate a random sample from a normal distribution.
R> mu<-10
R> sig<-2
R> n<-100
R> X<-rnorm(n,mu,sig)
Then we compute the confidence interval of μ assuming σ unknown.
R> t.test(X)$conf.int
[1]  9.499097 10.322427
attr(,"conf.level")
[1] 0.95
The first output is the confidence interval of μ from the sample X. The second output is the confidence level, 0.95 by default.
To change the level, one can use the argument conf.level. For example, the confidence intervals with levels 0.80 and 0.99 are:
R> t.test(X, conf.level=0.80)$conf.int
[1]  9.643093 10.178431
attr(,"conf.level")
[1] 0.8
R> t.test(X, conf.level=0.99)$conf.int
[1]  9.365863 10.455661
attr(,"conf.level")
[1] 0.99
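To see what t.test computes here, the interval can be rebuilt by hand from the Student quantile. This is a minimal sketch on a freshly simulated sample (the seed is an arbitrary choice, so the numbers differ from the outputs above); the names manual and builtin are introduced only for this illustration.

```r
set.seed(4)                      # arbitrary seed, only for reproducibility
mu <- 10; sig <- 2; n <- 100
X <- rnorm(n, mu, sig)

level <- 0.95
alpha <- 1 - level
tq <- qt(1 - alpha / 2, df = n - 1)        # Student quantile, n - 1 degrees of freedom

# Interval computed by hand: mean +/- quantile * sd / sqrt(n)
manual <- mean(X) + c(-1, 1) * tq * sd(X) / sqrt(n)

# Interval returned by t.test, with the conf.level attribute dropped
builtin <- as.numeric(t.test(X, conf.level = level)$conf.int)

all.equal(manual, builtin)       # TRUE: both computations agree
```

Raising the level increases the quantile tq and therefore widens the interval, which matches the 0.80, 0.95 and 0.99 outputs shown above.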
Example: Compute the confidence interval at level 98% for the variance of the following sample: (5.3, 5.2, 5.6, 4.9, 5.2, 4.7, 5.3, 4.8, 5.1, 5.4).
R> sample <- c(5.3,5.2,5.6,4.9,5.2,4.7,5.3,4.8,5.1,5.4)
R> ms <- mean(sample)
R> sds <- sd(sample)
R> n <- length(sample)
R> q <- qchisq(c(0.01,0.99),df=n-1)
R> sqrt((n-1)*sds^2/rev(q))
[1] 0.180387 0.581085
Because of the square root in the last line, this output is the interval for the standard deviation σ. Dropping sqrt from that line gives the interval for the variance σ².
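The computation above relies on the fact that, for a Gaussian sample, $(n-1)S^2/\sigma^2$ follows a chi-square distribution with n − 1 degrees of freedom. As a sketch, with $\chi^2_{n-1;\,p}$ denoting the quantile of order p of that distribution (a notation assumed here), the interval at level 1 − α for the variance reads:

```latex
\left[\frac{(n-1)S^2}{\chi^2_{n-1;\,1-\alpha/2}} \,;\; \frac{(n-1)S^2}{\chi^2_{n-1;\,\alpha/2}}\right]
```

With α = 0.02 and n = 10, the two quantiles are those returned by qchisq(c(0.01,0.99),df=n-1). The larger quantile gives the lower bound, hence the use of rev(q) in the code.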
Confidence intervals for large samples
Confidence intervals for a probability with a large sample
Example: Simulate a binary sample with probability p = 0.7 of success. The size of the sample is n = 100. Compute the confidence interval at level 98% for the proportion, considering that the sample is large.
R> p <- 0.7
R> n <- 100
R> SS <- sample(c(0,1),n,replace=TRUE,prob=c(1-p,p)); mean(SS)
[1] 0.67
R> freq <- mean(SS)
R> freq
[1] 0.67
Recall that for a binary sample taking values in {0, 1}, the empirical mean (mean) is an estimator of the probability of success.
R> q <- qnorm(c(0.01,0.99))
R> sdf <- sqrt(freq*(1-freq))
R> freq+q*sdf/sqrt(n)
[1] 0.5606122 0.7793878
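The same steps can be wrapped in a small function for reuse. This is a sketch only: prop_ci is a hypothetical helper name introduced here, not a base R function, and it implements the large-sample normal approximation used above. The vector x is a stand-in for a sample with 67 successes out of 100, matching the frequency obtained in the example.

```r
# Hypothetical helper: large-sample (normal approximation) confidence
# interval for a proportion, from a vector x of 0/1 values
prop_ci <- function(x, level = 0.95) {
  n     <- length(x)
  freq  <- mean(x)                           # empirical frequency of success
  alpha <- 1 - level
  q     <- qnorm(c(alpha / 2, 1 - alpha / 2)) # symmetric normal quantiles
  freq + q * sqrt(freq * (1 - freq) / n)
}

# Stand-in sample with 67 successes out of 100
x <- c(rep(1, 67), rep(0, 33))
prop_ci(x, level = 0.98)         # approximately 0.5606 0.7794, as above
```

The approximation is meant for large n and for frequencies that are not too close to 0 or 1; nothing prevents the bounds from leaving [0, 1] otherwise.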