Statistics
Estimators
Let θ be a parameter, the value of which is unknown and must be estimated from a sample (X1, …, Xn). An estimator of θ is a quantity computed from the sample that is used to approximate θ.
The empirical mean of a sample is an estimator of the theoretical mean of the variable. Before the data have been collected, the empirical mean is a random variable: $\bar{X} = \frac{X_1+ \ldots + X_n}{n}$.
Now, we collect or simulate data (x1, …, xn) from a normal distribution with theoretical mean equal to μ = 10:

R> n<-100
R> X<-rnorm(n, mean=10, sd=1)
R> mean(X)
[1] 9.88436
The empirical mean (mean(X)) is a value computed from the sample X; it is the estimate of μ. A new sample gives a new value of the estimate, obtained with the same estimator (the function mean, the same concept):
R> X<-rnorm(n, mean=10, sd=1)
R> mean(X)
[1] 10.09043

Here are some basic examples of estimators.

Example: Estimator of the variance:
R> n<-100
R> X<-rnorm(n, mean=10, sd=1)
R> var(X)
[1] 1.06025

Example: Estimator of the probability of a binary event:
R> n<-100
R> p<-0.3
R> X<-rbinom(n,1, p)
R> mean(X)
[1] 0.33
The empirical frequency is an unbiased and consistent estimator of the probability of an event. The empirical mean is an unbiased and consistent estimator of the theoretical mean, and the empirical variance is an unbiased and consistent estimator of the theoretical variance.
Example: Illustration of the properties of the empirical mean.
N samples of size n are drawn from the normal distribution 𝒩(μ, σ). The empirical mean is computed for each of the N samples. The boxplot of the N estimates is drawn and the true value is added as a horizontal line.

R code:

R> N<-1000
R> n<-100
R> mu<-10
R> X <- rnorm(n*N,mu,1) # all values
R> X <- matrix(X,N,n) # N samples as rows
R> Tmu1 <- rowMeans(X) # mean of each sample
R> boxplot(Tmu1)
R> abline(h=mu,col="red") # true value
Result:
All the values of the boxplot are close to the true value (red line), meaning that the parameter μ is well estimated, and without bias.
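The consistency claimed above can be illustrated in the same way. A minimal sketch (sample sizes chosen arbitrarily here): as the sample size n grows, the empirical mean approaches the true value μ = 10.

R> mu <- 10
R> sapply(c(10, 1000, 100000), function(n) mean(rnorm(n, mu, 1))) # estimates approach mu as n grows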
Confidence intervals
Confidence intervals for a Gaussian sample
For a random sample of the normal distribution 𝒩(μ, σ) (Gaussian sample), one can compute an exact confidence interval for the theoretical mean μ. At level $1-\alpha$, it is $[\bar{X}-z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}};\bar{X}+z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}]$ when σ is known (with $z_{1-\alpha/2}$ the quantile of the standard normal distribution), and $[\bar{X}-t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}};\bar{X}+t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}}]$ when σ is unknown (with $t_{1-\alpha/2}$ the quantile of the Student distribution with n − 1 degrees of freedom and $S^2$ the empirical variance).
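For illustration, here is a minimal sketch of the σ-known interval computed directly with qnorm (the sample and the parameter values are arbitrary):

R> n <- 100
R> mu <- 10; sigma <- 2
R> X <- rnorm(n, mu, sigma)
R> alpha <- 0.05
R> mean(X) + c(-1,1)*qnorm(1-alpha/2)*sigma/sqrt(n) # exact CI for mu, sigma known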
Example: Illustration of the notion of confidence interval.
The application draws N samples of size n from the normal distribution 𝒩(μ, σ), with μ and σ chosen by the user. It computes the N confidence intervals for μ, assuming σ known, and represents the intervals by blue horizontal segments and the true value of μ by a red vertical line.
Click on this link to play with confidence intervals
When the confidence level is 95%, around 5% of the confidence intervals (blue segments) do not contain the true value (red line). Be careful with the interpretation of a confidence interval. The theoretical confidence interval $[\bar{X}-t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}};\bar{X}+t_{1-\alpha/2}\frac{\sqrt{S^2}}{\sqrt{n}}]$ contains the true value μ with probability $1-\alpha$ (here 95%). But each of the N realized intervals either contains the true value or does not. Speaking of the probability of containing the true value makes no sense for a realized confidence interval.
Play with the application to see the influence of:
- the confidence level
- the sample size
- the variance of the sample
Example: We simulate a random sample from a normal distribution.

R> mu<-10
R> sig<-2
R> n<-100
R> X<-rnorm(n,mu,sig)

Then we compute the confidence interval of μ, assuming σ unknown.
R> t.test(X)$conf.int
[1] 9.267687 9.962904
attr(,"conf.level")
[1] 0.95
The first output is the confidence interval of μ from the sample X. The second output is the confidence level, 0.95 by default.
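As a sanity check, the same interval can be recomputed by hand from the formula recalled above, using the Student quantile (a sketch reusing the sample X and size n simulated just before):

R> alpha <- 0.05
R> mean(X) + c(-1,1)*qt(1-alpha/2, df=n-1)*sd(X)/sqrt(n) # same bounds as t.test(X)$conf.int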
To change the level, one can use the argument conf.level. For example, the confidence interval at level 0.80 is
R> t.test(X, conf.level=0.80)$conf.int
[1] 9.389276 9.841315
attr(,"conf.level")
[1] 0.8
and at level 0.99:

R> t.test(X, conf.level=0.99)$conf.int
[1] 9.155184 10.075407
attr(,"conf.level")
[1] 0.99

Example: Compute the confidence interval at level 0.98 of the variance of the following sample: (5.3, 5.2, 5.6, 4.9, 5.2, 4.7, 5.3, 4.8, 5.1, 5.4).
R> sample <- c(5.3,5.2,5.6,4.9,5.2,4.7,5.3,4.8,5.1,5.4)
R> ms <- mean(sample)
R> sds <- sd(sample)
R> n <- length(sample)
R> q <- qchisq(c(0.01,0.99),df=n-1)
R> sqrt((n-1)*sds^2/rev(q))
[1] 0.180387 0.581085

Note that, because of the square root, the last line actually returns the confidence interval of the standard deviation σ; squaring its bounds gives the confidence interval of the variance.
Confidence intervals for large samples
Confidence intervals for a probability (large samples)
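For a binary sample of large size n with empirical frequency $\hat{p}$, the asymptotic confidence interval of the probability p at level $1-\alpha$ is $[\hat{p}-z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\;;\;\hat{p}+z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}]$, where $z_{1-\alpha/2}$ is the quantile of order $1-\alpha/2$ of the standard normal distribution. This is the interval computed in the example below.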
Example: Simulate a binary sample with probability p = 0.7 of success. The size of the sample is n = 100. Compute the confidence interval at level 0.98 of the proportion, considering that the sample is large.
R> p <- 0.7
R> n <- 100
R> SS <- sample(c(0,1),n,replace=TRUE,prob=c(1-p,p)); mean(SS)
[1] 0.6
R> freq <- mean(SS)
R> freq
[1] 0.6
Recall that, for a binary sample taking values in {0, 1}, the empirical mean (mean) is an estimator of the probability of success.
R> q <- qnorm(c(0.01,0.99))
R> sdf <- sqrt(freq*(1-freq))
R> freq+q*sdf/sqrt(n)
[1] 0.4860327 0.7139673
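Alternatively, R's built-in prop.test also gives a confidence interval for a proportion. Note that it is based on the Wilson score method (with continuity correction by default), so its bounds differ slightly from the asymptotic interval computed above:

R> prop.test(sum(SS), n, conf.level=0.98)$conf.int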
Vocabulary of tests
A statistical test is a decision method that helps us to examine two opposing conjectures, called hypotheses, and to validate or invalidate these hypotheses with a certain degree of confidence.
Null and alternative hypotheses
In a statistical test, one wants to examine two opposing conjectures, also called hypotheses H0 and H1. These two hypotheses are mutually exclusive and exhaustive, so that one is true to the exclusion of the other. Usually the two hypotheses are chosen such that H0 is the default or reference hypothesis, while H1 is the hypothesis one seeks evidence for. The purpose of a statistical test is to determine which of the two hypotheses is true and which is false from the information collected in a sample.
Example: Test of the mean: the null hypothesis is "the mean is zero" (H0: μ = 0) against the alternative hypothesis "the mean is not zero" (H1: μ ≠ 0).
In practice, the null hypothesis always contains the '=' sign (possibly within '≤' or '≥'), and the alternative hypothesis never contains equality.
Notion of risks
When taking a decision, two different errors can occur:
|                | Decision: H0 accepted | Decision: H0 rejected |
|----------------|-----------------------|-----------------------|
| Truth: H0 true | OK                    | first kind error      |
| Truth: H1 true | second kind error     | OK                    |
The first kind error is considered the most serious. It is the one we want to control, that is, we want the probability of a first kind error to be small. This is the notion of risk: the first kind risk, denoted α, is the probability of rejecting H0 when H0 is true; the second kind risk, denoted β, is the probability of not rejecting H0 when H1 is true.
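To illustrate the notion, the first kind risk can be approximated by simulation. A minimal sketch (with an arbitrary right-sided test of H0: μ = 0 for a Gaussian sample with known σ = 1): when H0 is true, the test rejects in about α of the cases.

R> alpha <- 0.05
R> N <- 10000; n <- 30
R> X <- matrix(rnorm(N*n, mean=0, sd=1), N, n) # N samples drawn under H0
R> T <- sqrt(n)*rowMeans(X) # test statistic, N(0,1) under H0
R> mean(T > qnorm(1-alpha)) # rejection frequency, close to alpha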
Example For an adult, the logarithm of the D-dimer concentration, denoted by X, is modeled by a normal random variable with expectation μ and standard deviation σ. The variable X is an indicator for the risk of thrombosis: it is considered that for healthy individuals, μ is −1; whereas for individuals at risk, μ is 0. In both cases, the value of σ is the same: 0.3.
- Dr. House does not want to worry his patients if there is no need to. What hypotheses H0 and H1 will he choose to test?
- Dr. Cuddy's point of view is that she'd rather worry a patient wrongly than not warn him of an actual risk. What hypotheses H′0 and H′1 will she choose to test?
Test statistic
The statistical test helps the statistician to decide between the two hypotheses, based on the information contained in the data. A decision rule is defined, based on a test statistic. The two notions are defined below.
A statistical test uses the observations to determine the statistical support for the null hypothesis H0 against the alternative H1.
It needs a test statistic: a quantity computed from the observations, whose distribution under H0 is known. Here the test statistic is the variable X itself.

Example: Dr. House will choose to reject too high values of X, i.e. to reject H0 (and decide H1) when X takes high values. Dr. Cuddy will choose to reject too low values of X, i.e. to reject H′0 (and decide H′1) when X takes low values.
One-sided or two-sided tests
According to the alternative, the test may be right-sided (reject H0 when the test statistic is too large, above a bound ℓ), left-sided (reject H0 when the test statistic is too small, below a bound ℓ′), or two-sided (reject H0 when the test statistic falls outside an interval). The values ℓ and ℓ′ are chosen to control the probability of the error of rejecting H0 wrongly. This is an important notion in statistical testing, which is developed below.
The values ℓ and ℓ′ are chosen such that the first kind risk is controlled. The standard choices are ℙH0[X > ℓ] = α for a right-sided test and ℙH0[X < ℓ′] = α for a left-sided test, where X denotes the test statistic. The first kind risk is usually α = 5%.
Example
- Give the decision rule for Dr House's test, at threshold 1%, and at threshold 5%.
The test is right-sided (Dr House will choose to reject too high values for X). The decision rule will be:
Reject H0 ⇔ X > ℓ,

where ℙH0[X > ℓ] = α.
According to the null hypothesis H0, the test statistic X follows the ${\cal N}(-1,0.3)$ distribution. The bound of the rejection region is the quantile of that distribution at level 1 − α:
R> qnorm(0.95,mean=-1,sd=0.3)
[1] -0.5065439
R> qnorm(0.99,mean=-1,sd=0.3)
[1] -0.3020956

At threshold 5%, H0 is rejected when X > −0.5065; at threshold 1%, when X > −0.3021.
- Give the decision rule for Dr Cuddy's test, at threshold 1%, and at threshold 5%.
The test is left-sided (she will choose to reject lower values of X). The decision rule will be:
Reject H′0 ⇔ X < ℓ′,

where ℙH′0[X < ℓ′] = α.
Under hypothesis H′0, the test statistic X follows the ${\cal N}(0,0.3)$ distribution. The bound of the rejection region is the quantile of that distribution at level α:
R> qnorm(0.05,mean=0,sd=0.3)
[1] -0.4934561
R> qnorm(0.01,mean=0,sd=0.3)
[1] -0.6979044

At threshold 5%, H′0 is rejected when X < −0.4935; at threshold 1%, when X < −0.6979.
- Give the decision rule of the test for the null hypothesis H″0 : μ = −1 against the alternative one H″1 : μ ≠ −1, at threshold 5%.
This is a two-sided test. The decision rule will be:
Reject H″0 ⇔ X ∉ [ℓ1, ℓ2],

where ℙH″0[X ∉ [ℓ1, ℓ2]] = 0.05.
Under hypothesis H″0, the test statistic X follows the ${\cal N}(-1,0.3)$ distribution, hence $\frac{X-(-1)}{0.3}$ follows the ${\cal N}(0,1)$ distribution. The interval [ℓ1; ℓ2] must contain 95% of the values taken by a variable with the ${\cal N}(-1,0.3)$ distribution. The interval centered at −1 is chosen:
R> qnorm(c(0.025,0.975),mean=-1,sd=0.3)
[1] -1.5879892 -0.4120108
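As a quick check, this interval indeed contains 95% of the probability under H″0:

R> diff(pnorm(c(-1.5879892,-0.4120108), mean=-1, sd=0.3))
[1] 0.95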
p-value
A last central notion in statistical tests is the p-value: the probability, under H0, of observing a value of the test statistic at least as extreme as the one actually observed. The p-value can be used as a test statistic, with the following decision rule: reject H0 when the p-value is smaller than the first kind risk α.

Example
- A patient has a value of X equal to −0.46. Find the p-value of Dr. House's test.
The p-value is the threshold at which −0.46 would be exactly the bound of the rejection region. Knowing the results of the first question, since −0.46 lies between −0.5065 and −0.3021, the p-value must be between 0.01 and 0.05. It is the probability, under H0, that the variable X is higher than −0.46, that is, the right tail at −0.46 of the ${\cal N}(-1,0.3)$ distribution:
R> pnorm(-0.46,mean=-1,sd=0.3,lower.tail=FALSE)
[1] 0.03593032

Since 0.036 is smaller than 0.05 but larger than 0.01, H0 is rejected at threshold 5% but not at threshold 1%, in agreement with the bounds found above.
- A patient has a value of X equal to −0.46. Find the p-value for the two-sided test.
The p-value is the threshold for which the observed value would be a bound of the rejection region. That rejection region is centered at −1, so the other bound would be −1 − (−0.46 − (−1)) = −1.54. The normal distribution 𝒩(−1, 0.3) being symmetric about −1, the probability of being smaller than −1.54 is equal to the probability of being larger than −0.46. The latter is
R> pnorm(-0.46,mean=-1,sd=0.3,lower.tail=FALSE)
[1] 0.03593032
The latter was calculated in the previous question; it must be doubled.
R> 2*pnorm(-0.46,mean=-1,sd=0.3,lower.tail=FALSE)
[1] 0.07186064