Statistics

Estimators

Let θ be a parameter, the value of which is unknown and must be guessed from a sample (X1, …, Xn).

An estimator T is a function of the sample T = τ(X1, …, Xn). It is supposed to take values close to the parameter θ:

  1. before the data have been collected, the estimator is a random variable
  2. after the data have been collected, the estimate is the value assigned to the estimator

The empirical mean of a sample is an estimator of the theoretical mean of the variables. Before the data have been collected, the empirical mean is a random variable: $\bar{X} = \frac{X_1+ \ldots + X_n}n$.

Now, we collect or simulate data (x1, …, xn) from a Normal distribution with theoretical mean equal to μ = 10:
R> n<-100
R> X<-rnorm(n, mean=10, sd=1)
R> mean(X)
[1] 9.88436

The empirical mean (mean(X)) is a value computed from the sample X; it is the estimate of μ.

Simulating a new sample changes the estimate but the estimator remains the same (we use the same function mean, the same concept):
R> X<-rnorm(n, mean=10, sd=1)
R> mean(X)
[1] 10.09043

Here are some basic examples of estimators:
  • The empirical frequency $\bar{X} = \frac{X_1+ \ldots + X_n}n$ of an event is an estimator of the probability of that event.
  • The empirical mean $\bar{X} = \frac{X_1+ \ldots + X_n}n$ of a sample is an estimator of the theoretical mean of the variables.
  • The empirical variance $S^2=\frac{n}{n-1}\left(\frac{X_1^2+\ldots+X_n^2}n-\bar X^2\right)$ of a sample is an estimator of the theoretical variance of the variables.
Example: Estimator of the variance:
R> n<-100
R> X<-rnorm(n, mean=10, sd=1)
R> var(X)
[1] 1.06025

Estimator of the probability of a binary event
R> n<-100
R> p<-0.3
R> X<-rbinom(n,1, p)
R> mean(X)
[1] 0.33

Properties of an estimator T = τ(X1, …, Xn) are:

  • The bias of T is the difference between the expected value of T and the true value: Bias = 𝔼(T) − θ.
  • The mean squared error of T is the expected squared difference: MSE = 𝔼((T − θ)²).

The estimator T is:

  • unbiased if the bias is null (the values of T are centered on the true value),
  • asymptotically unbiased if the bias tends to 0 as the sample size increases to infinity,
  • consistent if it converges to the parameter as the sample size increases to infinity (the larger the sample, the closer the values of T to the target θ).

The empirical frequency is an unbiased consistent estimator of the probability of an event. The empirical mean is an unbiased consistent estimator of the theoretical mean, and the empirical variance is an unbiased consistent estimator of the theoretical variance.

Example: Illustration of the properties of the empirical mean.

N samples of size n are drawn from the normal distribution 𝒩(μ, σ). The empirical mean is computed for each of the N samples. The boxplot is drawn and the true value is added as a horizontal line. R code:
R> N<-1000
R> n<-100
R> mu<-10
R> X <- rnorm(n*N,mu,1)                    # all values
R> X <- matrix(X,N,n)                          # N samples as rows
R> Tmu1 <- rowMeans(X)                         # mean
R> boxplot(Tmu1)
R> abline(h=mu,col="red")                      # true value

Result: boxplot of the N empirical means, with the true value shown as a red horizontal line.

All the values of the boxplot are close to the true value (red line), meaning that the parameter μ is well estimated, and without bias.
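As a complement, the bias and the MSE defined above can be estimated by Monte Carlo from the same N empirical means. A minimal sketch, reusing the values Tmu1 and mu computed above:
R> mean(Tmu1) - mu          # Monte-Carlo estimate of the bias: close to 0
R> mean((Tmu1 - mu)^2)      # Monte-Carlo estimate of the MSE: close to sigma^2/n = 0.01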

Confidence intervals

A confidence interval at level 1 − α for a parameter is an interval depending on the sample, that contains the true value of the parameter with probability 1 − α (usually 1 − α = 0.95 or 1 − α = 0.99).

Confidence intervals for a Gaussian sample

For a random sample of the normal distribution 𝒩(μ, σ) (Gaussian sample), one can compute an exact confidence interval for the theoretical mean μ:

Let (X1, …, Xn) be a sample of n independent identically distributed (iid) normal variables 𝒩(μ, σ).

  • If the theoretical standard deviation σ is known, a confidence interval at level 1 − α for μ is obtained by:
    $$[\bar{X}-u_{1-\alpha/2}\frac{\sigma}{\sqrt{n}};\bar{X}+u_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}]$$
    where u1 − α/2 is the quantile of order 1 − α/2 for the normal distribution 𝒩(0, 1).
  • If the theoretical standard deviation σ is unknown, the estimator S² is used and a confidence interval at level 1 − α for μ is obtained by:
    $$[\bar{X}-t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}};\bar{X}+t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}}]$$
    where t1 − α/2 is the quantile of order 1 − α/2 for the Student distribution with parameter n − 1.
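For instance, the first interval (σ known) can be computed directly with qnorm. A minimal sketch on a simulated sample, with n, mu and sig chosen for the illustration:
R> n <- 100; mu <- 10; sig <- 2
R> X <- rnorm(n, mu, sig)
R> alpha <- 0.05
R> u <- qnorm(1 - alpha/2)                    # quantile of order 1-alpha/2 of N(0,1)
R> mean(X) + c(-1, 1) * u * sig / sqrt(n)     # confidence interval for mu, sigma known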

Example: Illustration of the notion of confidence interval.

The application linked below draws N samples of size n from the normal distribution 𝒩(μ, σ), with μ and σ chosen by the user. It computes the N confidence intervals for μ, assuming σ known. It represents the intervals by blue horizontal segments and the true value of μ by a red vertical line.

Click on this link to play with confidence intervals

When the confidence level is 95%, around 5% of the confidence intervals (blue segments) do not contain the (red) true value. Be careful with the interpretation of a confidence interval. The theoretical confidence interval $[\bar{X}-t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}};\bar{X}+t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}}]$ contains the true value μ with probability 95%. But each of the N realized intervals either contains the true value or does not. The probability of containing the true value is meaningless for a realized confidence interval.

Play with the application to see the influence of:
  • the confidence level
  • the sample size
  • the variance of the sample
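The same experiment can be reproduced in R without the application. A minimal sketch (σ assumed known) that simulates N samples, computes the N intervals and counts the proportion containing μ:
R> N <- 1000; n <- 50; mu <- 10; sig <- 2
R> u <- qnorm(0.975)                              # confidence level 0.95
R> X <- matrix(rnorm(n*N, mu, sig), N, n)         # N samples as rows
R> m <- rowMeans(X)
R> mean(m - u*sig/sqrt(n) <= mu & mu <= m + u*sig/sqrt(n))   # proportion of intervals containing mu, close to 0.95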

The confidence interval of μ when σ is unknown is computed with t.test(X)$conf.int.
Example: We simulate a random sample from a normal distribution.
R> mu<-10
R> sig<-2
R> n<-100
R> X<-rnorm(n,mu,sig)

Then we compute the confidence interval of μ assuming σ unknown.
R> t.test(X)$conf.int
[1] 9.267687 9.962904
attr(,"conf.level")
[1] 0.95

The first output is the confidence interval of μ from the sample X. The second output is the confidence level, 0.95 by default.

To change the level, one can use the input conf.level. For example, the confidence intervals with levels 0.80 and 0.99 are

R> t.test(X, conf.level=0.80)$conf.int
[1] 9.389276 9.841315
attr(,"conf.level")
[1] 0.8

R> t.test(X, conf.level=0.99)$conf.int
[1]  9.155184 10.075407
attr(,"conf.level")
[1] 0.99

If the theoretical variance σ² is unknown, a confidence interval at level 1 − α for σ² is obtained by:
$$\left[\frac{(n-1)S^2}{v_{1-\alpha/2}};\frac{(n-1)S^2}{u_{\alpha/2}}\right]$$
where uα/2 is the quantile of order α/2 for the chi-squared distribution with parameter n − 1, and v1 − α/2 is its quantile of order 1 − α/2.

Example Compute the confidence interval at level 0.98 for the variance of the following sample: (5.3, 5.2, 5.6, 4.9, 5.2, 4.7, 5.3, 4.8, 5.1, 5.4).
R> sample <- c(5.3,5.2,5.6,4.9,5.2,4.7,5.3,4.8,5.1,5.4)
R> ms <- mean(sample)
R> sds <- sd(sample)
R> n <- length(sample)
R> q <- qchisq(c(0.01,0.99),df=n-1)
R> sqrt((n-1)*sds^2/rev(q))
[1] 0.180387 0.581085

Interpretation The two values are the bounds of the confidence interval of σ; squaring them gives the confidence interval of the variance σ².

Confidence intervals for large sample

For a large sample (X1, …, Xn) of any distribution, a confidence interval at level 1 − α for the mean is obtained by:
$$[\bar{X}-u_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}};\bar{X}+u_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}}]$$
where u1 − α/2 is the quantile of order 1 − α/2 for the normal distribution 𝒩(0, 1).
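As an illustration, here is a minimal sketch on a simulated sample from an exponential distribution (true mean 1/rate = 0.5), using the formula above:
R> n <- 1000
R> X <- rexp(n, rate = 2)                           # any distribution; true mean is 0.5
R> u <- qnorm(0.975)                                # level 0.95
R> mean(X) + c(-1, 1) * u * sqrt(var(X)) / sqrt(n)  # large-sample confidence interval for the mean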

Confidence intervals on probability for large sample

For a large binary sample (X1, …, Xn), $\bar{X}$ is the empirical frequency of an event, called A. The confidence interval at level 1 − α for the probability of the event A is obtained by:
$$[\bar{X}-u_{1-\alpha/2}\frac{\sqrt{\bar{X}(1-\bar{X})}}{\sqrt{n}};\bar{X}+u_{1-\alpha/2}\frac{\sqrt{\bar{X}(1-\bar{X})}}{\sqrt{n}}]$$
where u1 − α/2 is the quantile of order 1 − α/2 for the normal distribution 𝒩(0, 1).

Example Simulate a binary sample with probability p = 0.7 of success. The size of the sample is n = 100. Compute the confidence interval at level 0.98 for the proportion, considering that the sample is large.

R> p <- 0.7
R> n <- 100
R> SS <- sample(c(0,1),n,replace=TRUE,prob=c(1-p,p)); mean(SS)
[1] 0.6
R> freq <- mean(SS)
R> freq
[1] 0.6

Recall that for a binary sample taking values in {0, 1}, the empirical mean (mean) is an estimator of the probability of success.

R> q <- qnorm(c(0.01,0.99))
R> sdf <- sqrt(freq*(1-freq))
R> freq+q*sdf/sqrt(n)
[1] 0.4860327 0.7139673

Interpretation The two values are the bounds of the confidence interval of p.

Vocabulary of tests

A statistical test is a decision method that helps us to examine two opposing conjectures, called hypotheses, and to validate or invalidate these hypotheses with a certain degree of confidence.

Null and alternative hypotheses

In a statistical test, one wants to examine two opposing conjectures, also called hypotheses H0 and H1. These two hypotheses are mutually exclusive and exhaustive, so that one is true to the exclusion of the other. Usually we choose the two hypotheses such that:
  • the null hypothesis H0 is the one you must have good reasons to reject
  • the alternative H1 is the opposite: the hypothesis you must have good reasons to accept

The purpose of a statistical test is to determine, from the information collected in a sample, which of the two hypotheses is true and which is false.

Example: Test of the mean: the null hypothesis is the mean is zero (H0: μ = 0) against the alternative hypothesis the mean is not zero (H1: μ ≠ 0).

In practice, the null hypothesis always contains the '=' sign, and the alternative hypothesis never does.

Notion of risks

When taking a decision, two different errors could occur:

                 Decision: H0          Decision: H1
  Truth H0       OK                    first kind error
  Truth H1       second kind error     OK
  • The first error is the decision of rejecting H0 wrongly
  • The second error is the decision of accepting H0 wrongly

The first error is considered the most serious. This is the one we want to control, that is, we want the probability of the first error to be small. This is the notion of risk:

  • The first kind risk is the probability of the first error, usually denoted α:
    $P_{H_0}[\text{Reject } H_0]=P[\text{Reject } H_0 \text{ wrongly}]=P[\text{Reject } H_0 \text{ while } H_0 \text{ is true}]=\alpha$
  • The second kind risk is the probability of the second error, of accepting H0 wrongly, i.e. the probability of accepting H0 when the alternative hypothesis H1 is true:
    $P_{H_1}[\text{Accept } H_0]=\beta$
  • The power of the test is 1 − β. It is the probability of rightfully rejecting H0.

Example For an adult, the logarithm of the D-dimer concentration, denoted by X, is modeled by a normal random variable with expectation μ and standard deviation σ. The variable X is an indicator for the risk of thrombosis: it is considered that for healthy individuals, μ is −1; whereas for individuals at risk, μ is 0. In both cases, the value of σ is the same: 0.3.

Answer If Dr. House does not want to worry a patient, the hypothesis he considers as dangerous to reject wrongly is that the patient is not at risk, thus that his value of X (the test statistic) has expectation −1. His hypothesis H0 is μ = −1 (the patient is not at risk), that he will test against H1: μ = 0 (the patient is at risk).

Answer If Dr. Cuddy does not want to miss a patient at risk, the hypothesis she considers as dangerous to reject wrongly is that he is at risk, and that his variable X has expectation 0. Her hypothesis H0 is μ = 0 (the patient is at risk), that she will test against H1: μ = −1 (the patient is not at risk).

Test statistic

The statistical test helps the statistician to decide between the two hypotheses, based on the information contained in the data. A rule of decision is defined based on a test statistic. The two notions are defined below.

A statistical test uses the observations to determine the statistical support for the null hypothesis H0 against the alternative H1.

It needs two elements:
  • a test statistic is a function of the data, for which the probability distribution under the null hypothesis H0 is known
  • a decision rule that specifies, as a function of the values taken by the test statistic, in which cases the hypothesis H0 should be rejected

Example Dr. House will choose to reject too high values of X, i.e. to decide to reject H0 (and decide H1) when X takes high values. Dr. Cuddy will choose to reject too low values of X, i.e. to decide to reject H0 (and decide H1) when X takes low values.

One-sided or two-sided tests

According to the alternative, the test may be
  • left-sided if the decision rule is
    Reject H0 if T < ℓ
    (reject too small values)
  • right-sided if the decision rule is
    Reject H0 if T > ℓ
    (reject too large values)
  • two-sided if the decision rule is
    Reject H0 if T ∉ [ℓ,ℓ′]
    (reject too small or too large values)

The values ℓ and ℓ′ are chosen to control the probability of the error of rejecting H0 wrongly. This is an important notion in statistical testing, which is developed below.

The values ℓ and ℓ′ are chosen such that the first kind risk is controlled. The standard choices are:
  • left-sided test: ℓ is such that $P_{H_0}[T < \ell]=\alpha$
  • right-sided test: ℓ is such that $P_{H_0}[T > \ell]=\alpha$
  • two-sided test: ℓ and ℓ′ are chosen such that $P_{H_0}[T < \ell]=P_{H_0}[T > \ell']=\alpha/2$

The first kind risk is usually α = 5%.

Example

The test is right-sided (Dr. House will choose to reject too high values for X). The decision rule will be:

Reject H0  ⇔  X > ℓ ,

where $P_{H_0}[X > \ell]=\alpha$.

According to the null hypothesis H0, the test statistic X follows the ${\cal N}(-1,0.3)$ distribution. The bound of the rejection region is the quantile of that distribution at level 1 − α:

R> qnorm(0.95,mean=-1,sd=0.3)
[1] -0.5065439
R> qnorm(0.99,mean=-1,sd=0.3)
[1] -0.3020956

Interpretation Dr. House declares the patient is at risk if the logarithm of his D-dimer concentration is higher than −0.5065 for a risk of 0.05, and −0.3021 for a risk of 0.01.
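As a complement, the second kind risk and the power defined earlier can be computed for Dr. House's test: under H1 the statistic X follows the ${\cal N}(0,0.3)$ distribution. A minimal sketch with the bound at risk 5%:
R> l <- qnorm(0.95, mean = -1, sd = 0.3)      # rejection bound at risk 5%
R> beta <- pnorm(l, mean = 0, sd = 0.3)       # second kind risk P_H1[accept H0], about 0.046
R> 1 - beta                                   # power of the test, about 0.954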

The test is left-sided (Dr. Cuddy will choose to reject too low values of X). The decision rule will be:

Reject H0  ⇔  X < ℓ′ ,

where $P_{H_0}[X < \ell']=\alpha$.

Under hypothesis H0, the test statistic X follows the ${\cal N}(0,0.3)$ distribution. The bound of the rejection region is the quantile of that distribution of order α:

R> qnorm(0.05,mean=0,sd=0.3)
[1] -0.4934561
R> qnorm(0.01,mean=0,sd=0.3)
[1] -0.6979044

Interpretation Dr. Cuddy declares the patient is not at risk if the logarithm of his D-dimer concentration is lower than −0.4935 for a risk of 0.05, and −0.6979 for a risk of 0.01.

This is a two-sided test at level α = 0.05. The decision rule will be:

Reject H0  ⇔  X ∉ [ℓ1, ℓ2] ,

where $P_{H_0}[X \notin [\ell_1, \ell_2]]=0.05$.

Under hypothesis H0, the test statistic X follows the ${\cal N}(-1,0.3)$ distribution, hence $\frac{X-(-1)}{0.3}$ follows the ${\cal N}(0,1)$ distribution. The interval [ℓ1; ℓ2] must contain 95% of the values taken by a variable with ${\cal N}(-1,0.3)$ distribution. The interval centered at −1 is chosen:

R> qnorm(c(0.025,0.975),mean=-1,sd=0.3)
[1] -1.5879892 -0.4120108

Interpretation At threshold 0.05 the decision rule of the two-tailed test is:
Reject H0  ⇔  X ∉ [ − 1.588 ;   − 0.412] .
The patient is said to have a logarithm of D-dimer concentration significantly different from −1 when his variable X is either lower than −1.588 or larger than −0.412.

p-value

A last central notion in statistical tests is the p-value.
  • The p-value is the probability, under H0, of observing a value of the test statistic at least as extreme as the value actually observed.

Interpretation: the p-value measures how likely the data are, assuming the null hypothesis is true. It is the smallest first kind risk at which the observed data lead to rejecting H0.

It is used as a test statistic, with the following decision rule:
  • If the p-value is smaller than the first kind risk, reject H0
  • If the p-value is not smaller than the first kind risk, do not reject H0.
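In R, test functions return the p-value directly. A minimal sketch with t.test, testing H0: μ = 0 on a simulated sample whose true mean is 0.2 (values chosen for the illustration):
R> X <- rnorm(100, mean = 0.2, sd = 1)        # simulated sample, true mean 0.2
R> t.test(X)$p.value                          # p-value of the two-sided test of H0: mu = 0
R> t.test(X)$p.value < 0.05                   # if TRUE, reject H0 at risk 5%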

Example

The p-value is the threshold at which −0.46 would be the bound of the rejection region. Knowing the results of the first question, since −0.46 lies between −0.5065 and −0.3021, the p-value must be between 0.01 and 0.05. It is the probability, under H0, that the variable X is higher than −0.46, that is, the right tail at −0.46 of the ${\cal N}(-1,0.3)$ distribution:

R> pnorm(-0.46,mean=-1,sd=0.3,lower.tail=FALSE)
[1] 0.03593032

Interpretation For Dr. House's test, the p-value of a patient with a value of log-dimer equal to −0.46 is 0.0359. At risk 5%, the patient is declared at risk by Dr. House.

The p-value is the threshold for which the observed value would be a bound of the rejection region. That rejection region is centered at −1. The other bound should be −1 − (−0.46 − (−1)) = −1.54. The normal distribution 𝒩(−1, 0.3) being symmetric about −1, the probability of being smaller than −1.54 is equal to the probability of being larger than −0.46. The latter is

R> pnorm(-0.46,mean=-1,sd=0.3,lower.tail=FALSE)
[1] 0.03593032

The latter has already been calculated above; it must be doubled.

R> 2*pnorm(-0.46,mean=-1,sd=0.3,lower.tail=FALSE)
[1] 0.07186064

Interpretation The p-value for a two-sided test is that of a one-sided test, multiplied by 2.

For the two-sided test, the p-value of a patient with a value of log-dimer equal to −0.46 is 0.0719. At risk 5%, the patient is declared not at risk.
