Statistics
Estimators
Let θ be a parameter, the value of which is unknown and must be estimated from a sample (X1, …, Xn). An estimator of θ is a quantity computed from the sample that is used to approximate θ.
The empirical mean of a sample is an estimator of the theoretical mean of the variable. Before the data have been collected, the empirical mean is a random variable: $\bar{X} = \frac{X_1+ \ldots + X_n}{n}$.
Now, we collect or simulate data (x1, …, xn) from a normal distribution with theoretical mean equal to μ = 10:

R> n<-100
R> X<-rnorm(n, mean=10, sd=1)
R> mean(X)
[1] 9.88436
The empirical mean (mean(X)) is a value computed from the sample X; it is the estimate of μ. A new sample gives a new value of the estimate, obtained with the same estimator (the function mean, the same concept):
R> X<-rnorm(n, mean=10, sd=1)
R> mean(X)
[1] 10.09043

Here are some basic examples of estimators.

Example: Estimator of the variance:
R> n<-100
R> X<-rnorm(n, mean=10, sd=1)
R> var(X)
[1] 1.06025

Example: Estimator of the probability of a binary event:
R> n<-100
R> p<-0.3
R> X<-rbinom(n,1, p)
R> mean(X)
[1] 0.33
The empirical frequency is an unbiased and consistent estimator of the probability of an event. The empirical mean is an unbiased and consistent estimator of the theoretical mean, and the empirical variance is an unbiased and consistent estimator of the theoretical variance.
Example: Illustration of the properties of the empirical mean.
N samples of size n are drawn from the normal distribution 𝒩(μ, σ). The empirical mean is computed for each of the N samples. The boxplot of the N estimates is drawn and the true value is added as a horizontal line.

R code:

R> N<-1000
R> n<-100
R> mu<-10
R> X <- rnorm(n*N,mu,1) # all values
R> X <- matrix(X,N,n) # N samples as rows
R> Tmu1 <- rowMeans(X) # mean of each sample
R> boxplot(Tmu1)
R> abline(h=mu,col="red") # true value
Result:
All the values of the boxplot are close to the true value (red line), meaning that the parameter μ is well estimated, and without bias.
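The consistency claimed above can be illustrated in the same way. A minimal sketch (sample sizes chosen arbitrarily here): as the sample size n grows, the empirical mean approaches the true value μ = 10.

R> mu <- 10
R> sapply(c(10, 1000, 100000), function(n) mean(rnorm(n, mu, 1))) # estimates approach mu as n grows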
Confidence intervals
Confidence intervals for a Gaussian sample
For a random sample of the normal distribution 𝒩(μ, σ) (Gaussian sample), one can compute an exact confidence interval for the theoretical mean μ. At level $1-\alpha$, it is $[\bar{X}-z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}};\bar{X}+z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}]$ when σ is known (with $z_{1-\alpha/2}$ the quantile of the standard normal distribution), and $[\bar{X}-t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}};\bar{X}+t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}}]$ when σ is unknown (with $t_{1-\alpha/2}$ the quantile of the Student distribution with n − 1 degrees of freedom and $S^2$ the empirical variance).
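For illustration, here is a minimal sketch of the σ-known interval computed directly with qnorm (the sample and the parameter values are arbitrary):

R> n <- 100
R> mu <- 10; sigma <- 2
R> X <- rnorm(n, mu, sigma)
R> alpha <- 0.05
R> mean(X) + c(-1,1)*qnorm(1-alpha/2)*sigma/sqrt(n) # exact CI for mu, sigma known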
Example: Illustration of the notion of confidence interval.
The application draws N samples of size n from the normal distribution 𝒩(μ, σ), with μ and σ chosen by the user. It computes the N confidence intervals for μ, assuming σ known, and represents the intervals by blue horizontal segments and the true value of μ by a red vertical line.
Click on this link to play with confidence intervals
When the confidence level is 95%, around 5% of the confidence intervals (blue segments) do not contain the true value (red line). Be careful with the interpretation of a confidence interval. The theoretical confidence interval $[\bar{X}-t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}};\bar{X}+t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}}]$ contains the true value μ with probability $1-\alpha$ (here 95%). But each of the N realized intervals either contains the true value or does not. Speaking of the probability of containing the true value makes no sense for a realized confidence interval.
Play with the application to see the influence of:
- the confidence level
- the sample size
- the variance of the sample
Example: We simulate a random sample from a normal distribution.

R> mu<-10
R> sig<-2
R> n<-100
R> X<-rnorm(n,mu,sig)

Then we compute the confidence interval of μ, assuming σ unknown.
R> t.test(X)$conf.int
[1] 9.267687 9.962904
attr(,"conf.level")
[1] 0.95
The first output is the confidence interval of μ from the sample X. The second output is the confidence level, 0.95 by default.
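As a sanity check, the same interval can be recomputed by hand from the formula recalled above, using the Student quantile (a sketch reusing the sample X and size n simulated just before):

R> alpha <- 0.05
R> mean(X) + c(-1,1)*qt(1-alpha/2, df=n-1)*sd(X)/sqrt(n) # same bounds as t.test(X)$conf.int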
To change the level, one can use the argument conf.level. For example, the confidence interval at level 0.80 is
R> t.test(X, conf.level=0.80)$conf.int
[1] 9.389276 9.841315
attr(,"conf.level")
[1] 0.8
and at level 0.99:

R> t.test(X, conf.level=0.99)$conf.int
[1] 9.155184 10.075407
attr(,"conf.level")
[1] 0.99

Example: Compute the confidence interval at level 0.98 of the variance of the following sample: (5.3, 5.2, 5.6, 4.9, 5.2, 4.7, 5.3, 4.8, 5.1, 5.4).
R> sample <- c(5.3,5.2,5.6,4.9,5.2,4.7,5.3,4.8,5.1,5.4)
R> ms <- mean(sample)
R> sds <- sd(sample)
R> n <- length(sample)
R> q <- qchisq(c(0.01,0.99),df=n-1)
R> sqrt((n-1)*sds^2/rev(q))
[1] 0.180387 0.581085

Note that, because of the square root, the last line actually returns the confidence interval of the standard deviation σ; squaring its bounds gives the confidence interval of the variance.
Confidence intervals for large samples
Confidence intervals for a probability (large samples)
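For a binary sample of large size n with empirical frequency $\hat{p}$, the asymptotic confidence interval of the probability p at level $1-\alpha$ is $[\hat{p}-z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\;;\;\hat{p}+z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}]$, where $z_{1-\alpha/2}$ is the quantile of order $1-\alpha/2$ of the standard normal distribution. This is the interval computed in the example below.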
Example: Simulate a binary sample with probability p = 0.7 of success. The size of the sample is n = 100. Compute the confidence interval at level 0.98 of the proportion, considering that the sample is large.
R> p <- 0.7
R> n <- 100
R> SS <- sample(c(0,1),n,replace=TRUE,prob=c(1-p,p)); mean(SS)
[1] 0.6
R> freq <- mean(SS)
R> freq
[1] 0.6
Recall that, for a binary sample taking values in {0, 1}, the empirical mean (mean) is an estimator of the probability of success.
R> q <- qnorm(c(0.01,0.99))
R> sdf <- sqrt(freq*(1-freq))
R> freq+q*sdf/sqrt(n)
[1] 0.4860327 0.7139673
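Alternatively, R's built-in prop.test also gives a confidence interval for a proportion. Note that it is based on the Wilson score method (with continuity correction by default), so its bounds differ slightly from the asymptotic interval computed above:

R> prop.test(sum(SS), n, conf.level=0.98)$conf.int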
Vocabulary of tests
A statistical test is a decision method that helps us to examine two opposing conjectures, called hypotheses, and to validate or invalidate these hypotheses with a certain degree of confidence.
Null and alternative hypotheses
In a statistical test, one wants to examine two opposing conjectures, also called hypotheses H0 and H1. These two hypotheses are mutually exclusive and exhaustive, so that one is true to the exclusion of the other. Usually the two hypotheses are chosen such that H0 is the default or reference hypothesis, while H1 is the hypothesis one seeks evidence for. The purpose of a statistical test is to determine which of the two hypotheses is true and which is false from the information collected in a sample.
Example: Test of the mean: the null hypothesis is "the mean is zero" (H0: μ = 0) against the alternative hypothesis "the mean is not zero" (H1: μ ≠ 0).
In practice, the null hypothesis always contains the '=' sign (possibly within '≤' or '≥'), and the alternative hypothesis never contains equality.
Notion of risks
When taking a decision, two different errors can occur:
|                | Decision: H0 accepted | Decision: H0 rejected |
|----------------|-----------------------|-----------------------|
| Truth: H0 true | OK                    | first kind error      |
| Truth: H1 true | second kind error     | OK                    |
The first kind error is considered the most serious. It is the one we want to control, that is, we want the probability of a first kind error to be small. This is the notion of risk: the first kind risk, denoted α, is the probability of rejecting H0 when H0 is true; the second kind risk, denoted β, is the probability of not rejecting H0 when H1 is true.
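To illustrate the notion, the first kind risk can be approximated by simulation. A minimal sketch (with an arbitrary right-sided test of H0: μ = 0 for a Gaussian sample with known σ = 1): when H0 is true, the test rejects in about α of the cases.

R> alpha <- 0.05
R> N <- 10000; n <- 30
R> X <- matrix(rnorm(N*n, mean=0, sd=1), N, n) # N samples drawn under H0
R> T <- sqrt(n)*rowMeans(X) # test statistic, N(0,1) under H0
R> mean(T > qnorm(1-alpha)) # rejection frequency, close to alpha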
Example For an adult, the logarithm of the D-dimer concentration, denoted by X, is modeled by a normal random variable with expectation μ and standard deviation σ. The variable X is an indicator for the risk of thrombosis: it is considered that for healthy individuals, μ is −1; whereas for individuals at risk, μ is 0. In both cases, the value of σ is the same: 0.3.
- Dr. House does not want to worry his patients if there is no need to. What hypotheses H0 and H1 will he choose to test?
- Dr. Cuddy's point of view is that she'd rather worry a patient wrongly than not warn him of an actual risk. What hypotheses H′0 and H′1 will she choose to test?
Test statistic
The statistical test helps the statistician to decide between the two hypotheses, based on the information contained in the data. A decision rule is defined, based on a test statistic. The two notions are defined below.
A statistical test uses the observations to determine the statistical support for the null hypothesis H0 against the alternative H1.
It needs a test statistic: a quantity computed from the observations, whose distribution under H0 is known. Here the test statistic is the variable X itself.

Example: Dr. House will choose to reject too high values of X, i.e. to reject H0 (and decide H1) when X takes high values. Dr. Cuddy will choose to reject too low values of X, i.e. to reject H′0 (and decide H′1) when X takes low values.
One-sided or two-sided tests
According to the alternative, the test may be right-sided (reject H0 when the test statistic is too large, above a bound ℓ), left-sided (reject H0 when the test statistic is too small, below a bound ℓ′), or two-sided (reject H0 when the test statistic falls outside an interval). The values ℓ and ℓ′ are chosen to control the probability of the error of rejecting H0 wrongly. This is an important notion in statistical testing, which is developed below.
The values ℓ and ℓ′ are chosen such that the first kind risk is controlled. The standard choices are ℙH0[X > ℓ] = α for a right-sided test and ℙH0[X < ℓ′] = α for a left-sided test, where X denotes the test statistic. The first kind risk is usually α = 5%.
Example
- Give the decision rule for Dr House's test, at threshold 1%, and at threshold 5%.
The test is right-sided (Dr House will choose to reject too high values for X). The decision rule will be:
Reject H0 ⇔ X > ℓ,

where ℙH0[X > ℓ] = α.
According to the null hypothesis H0, the test statistic X follows the ${\cal N}(-1,0.3)$ distribution. The bound of the rejection region is the quantile of that distribution at level 1 − α:
R> qnorm(0.95,mean=-1,sd=0.3)
[1] -0.5065439
R> qnorm(0.99,mean=-1,sd=0.3)
[1] -0.3020956

At threshold 5%, H0 is rejected when X > −0.5065; at threshold 1%, when X > −0.3021.
- Give the decision rule for Dr Cuddy's test, at threshold 1%, and at threshold 5%.
The test is left-sided (she will choose to reject lower values of X). The decision rule will be:
Reject H′0 ⇔ X < ℓ′,

where ℙH′0[X < ℓ′] = α.
Under hypothesis H′0, the test statistic X follows the ${\cal N}(0,0.3)$ distribution. The bound of the rejection region is the quantile of that distribution at level α:
R> qnorm(0.05,mean=0,sd=0.3)
[1] -0.4934561
R> qnorm(0.01,mean=0,sd=0.3)
[1] -0.6979044

At threshold 5%, H′0 is rejected when X < −0.4935; at threshold 1%, when X < −0.6979.
- Give the decision rule of the test for the null hypothesis H″0 : μ = −1 against the alternative one H″1 : μ ≠ −1, at threshold 5%.
This is a two-sided test. The decision rule will be:
Reject H″0 ⇔ X ∉ [ℓ1, ℓ2],

where ℙH″0[X ∉ [ℓ1, ℓ2]] = 0.05.
Under hypothesis H″0, the test statistic X follows the ${\cal N}(-1,0.3)$ distribution, hence $\frac{X-(-1)}{0.3}$ follows the ${\cal N}(0,1)$ distribution. The interval [ℓ1; ℓ2] must contain 95% of the values taken by a variable with the ${\cal N}(-1,0.3)$ distribution. The interval centered at −1 is chosen:
R> qnorm(c(0.025,0.975),mean=-1,sd=0.3)
[1] -1.5879892 -0.4120108
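As a quick check, this interval indeed contains 95% of the probability under H″0:

R> diff(pnorm(c(-1.5879892,-0.4120108), mean=-1, sd=0.3))
[1] 0.95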
p-value
A last central notion in statistical tests is the p-value: the probability, under H0, of observing a value of the test statistic at least as extreme as the one actually observed. The p-value can be used as a test statistic, with the following decision rule: reject H0 when the p-value is smaller than the first kind risk α.

Example
- A patient has a value of X equal to −0.46. Find the p-value of Dr. House's test.
The p-value is the threshold at which −0.46 would be exactly the bound of the rejection region. Knowing the results of the first question, since −0.46 lies between −0.5065 and −0.3021, the p-value must be between 0.01 and 0.05. It is the probability, under H0, that the variable X is higher than −0.46, that is, the right tail at −0.46 of the ${\cal N}(-1,0.3)$ distribution:
R> pnorm(-0.46,mean=-1,sd=0.3,lower.tail=FALSE)
[1] 0.03593032

Since 0.036 is smaller than 0.05 but larger than 0.01, H0 is rejected at threshold 5% but not at threshold 1%, in agreement with the bounds found above.
- A patient has a value of X equal to −0.46. Find the p-value for the two-sided test.
The p-value is the threshold for which the observed value would be a bound of the rejection region. That rejection region is centered at −1, so the other bound would be −1 − (−0.46 − (−1)) = −1.54. The normal distribution 𝒩(−1, 0.3) being symmetric about −1, the probability of being smaller than −1.54 is equal to the probability of being larger than −0.46. The latter is
R> pnorm(-0.46,mean=-1,sd=0.3,lower.tail=FALSE)
[1] 0.03593032
The latter was calculated in the previous question; it must be doubled.
R> 2*pnorm(-0.46,mean=-1,sd=0.3,lower.tail=FALSE)
[1] 0.07186064