Probabilities

Random experiments
Introduction to randomness
- Random variables
Probability distributions
Discrete distributions
- Bernoulli distribution
- Binomial distribution
Continuous distributions
Fluctuation interval
- Fluctuation interval for the normal distribution
Limit theorems

Random experiments

Introduction to randomness

A random experiment may have different outcomes when repeated. Let us give some definitions.

event: true or false according to the outcome of a random experiment.
independence: If two random experiments are independent, any event depending on the first is independent from any event depending on the second.
probability of an event: proportion of individuals for which the event is true.

The two main rules to compute the probability of an event are:

disjoint events: The probability of union of two disjoint events (i.e. which cannot be both true) is the sum of the probabilities of the two events.
independent events: The probability of the intersection of two independent events is the product of the probabilities of the two events.

In R, values of a random experiment can be obtained by simulation. The first function to simulate the realisation of a random experiment is the sample function.

The function sample draws random samples of outcomes.

sample(x): random permutation of the values in vector x.
sample(x,n): random subset of size n out of vector x (no repetitions, n cannot be larger than the length of x).
sample(x,n,replace=TRUE): random sample of size n from the values in vector x, with equal probabilities (possible repetitions).
sample(x,n,replace=TRUE,prob=p): random sample of size n from the values in vector x, with probability distribution specified in vector p.

Example Let us define the vector x=(0,1).

R> x<-c(0,1)

sample draws a random permutation of x.

R> sample(x)
[1] 0 1
R> sample(x)
[1] 1 0

Try to repeat 10 times. How many times did you get each of the two permutations?

Let us now sample 100 random values from x, repetitions allowed:

R> A <- sample(x,100,replace=TRUE)             # random sample

table(A) gives the number of 0 and 1 obtained in A. The sum of the two absolute frequencies is equal to 100, the number of random values that have been drawn.

R> table(A)
A
 0  1 
44 56

If you run again the instruction sample, you will obtain different absolute frequencies.

R> A <- sample(x,100,replace=TRUE)             # random sample
R> table(A)
A
 0  1 
53 47

Example We can also draw values from a qualitative variable. Set x=(a,b,c).

R> x<-c("a", "b", "c")

A random permutation of x is obtained with

R> sample(x)
[1] "a" "c" "b"

A random subset of x is obtained with

R> sample(x,2)
[1] "b" "c"

We can also draw values of x with possible repetitions, as before:

R> A <- sample(x,100,replace=TRUE)
R> table(A)
A
 a  b  c 
35 28 37

Random variables

A random variable X is a value resulting from an experiment, the outcome of which is (regarded as) unpredictable.

The probability distribution of a random variable permits to calculate the probability of any event depending of that random variable.

Example

Random variables are categorized in two families, discrete and continuous.

Discrete random variable: can take a finite or countable (usually integer) number of different values.
Continuous random variable: can take any value over an interval (usually [0, 1], or the positive half line, or the whole real line).

Probability distributions

In this section, we define the main characteristics of a random variable. These notions are illustrated below for each distribution. The notions below depend on the random variables only through their probability distributions. They are called theoretical, as opposed to empirical when evaluated over a statistical sample.

In what follows X and Y denote numeric random variables.

Theoretical mean

Let us start with the most intuitive characteristic: the mean.

The mean of a probability distribution is its center of gravity.

The theoretical mean of a random variable X is the mean of its probability distribution.

It is denoted by 𝔼(X).

The following rules apply for the mean:

The mean of the sum of two random variables is the sum of their means:
𝔼(X + Y)=𝔼(X)+𝔼(Y)
The mean of a rescaled variable is the rescaled mean:
𝔼(aX)=a𝔼(X)
The mean of the product of two independent variables X and Y is the product of their means:
𝔼(XY)=𝔼(X)×𝔼(Y)

Theoretical variance

The variance of a probability distribution is the mean of square minus the square of mean.

It is denoted by Var(X).

The following rules apply for the variance:

Var(X)=𝔼(X²)−(𝔼(X))² *Var(aX)=a²Var(X)
if X and Y are independent, Var(X + Y)=Var(X)+Var(Y)

The theoretical standard deviation is the square root of the variance.

Cumulative distribution function

The cumulative distribution function (cdf) maps a value x onto the probability to be less or equal to the value: F(x)=Prob(X ≤ x).

Quantile function

The quantile function is the inverse of the cumulative distribution function. It maps a probability u onto the value x such that u = F(x).

Discrete distributions

A discrete probability distribution assigns a probability to the elements of a set, the sum of all probabilities being equal to 1.

One has to define the set of elements {x₁, …, x_i, …} and their associated probabilities Prob(x_i)=p_i and $p_i = 1 $.

Note that its mean is 𝔼(X) = ∑ x_i p_i.

Classical discrete distributions are Bernoulli, binomial, Geometric, Hypergeometric. We detail below the two first ones.

Bernoulli distribution

The Bernoulli distribution with parameter p assigns the probability p to 1 (success), and 1 − p to 0 (failure).

Its mean is p, its variance is p(1 − p).

Simulation from the Bernoulli distribution

A sample of the Bernoulli distribution can be drawn in R using the function rbinom (random draw from the binomial distribution, see below) with parameter n set to 1.

Example

R> p<-0.5
R> rbinom(1,1,p)
[1] 1
R> rbinom(10,1,p)
 [1] 0 0 1 1 1 0 0 0 1 0

Only two values can be drawn (0 and 1). Two runs of the same code give two different random samples. It is very unlikely to obtain the same sequence of 0 and 1 if you run the code on your own computer.

The empirical proportion of 1 changes with the sample, but is close to the theoretical proportion p when the size of the sample is high:

R> prop.table(table(rbinom(100,1,p)))

   0    1 
0.55 0.45 
R> prop.table(table(rbinom(1000,1,p)))

    0     1 
0.517 0.483 
R> prop.table(table(rbinom(10000,1,p)))

     0      1 
0.5015 0.4985

Increasing the size of the sample gives a proportion of 1 that converges to the theoretical proportion (mean) p. This is the law of large numbers (see below).

R computes the probability, cumulative distribution, and quantile functions for standard families of probability distributions.

For the bernoulli distribution,

Its density function, also called probability function for a discrete variable is dbinom(x,1,p).
Its cumulative distribution function (cdf) is pbinom(x,1,p): input a value x, output a probability P.
Its quantile function is qbinom(P,1,p): input a probability P, output a value x.

Example The probability of success (value 1) of a Bernoulli distribution with parameter p=0.2is

R> dbinom(1,1,0.2)
[1] 0.2

The probability of failure is

R> dbinom(0,1,0.2)
[1] 0.8

Binomial distribution

The binomial distribution with parameters n and p, denoted by ℬ(n, p) is the distribution of the number of successes in n independent trials, when the probability of success is p.

Its mean is np, its variance is np(1 − p).

Example From past experience, it is known that a certain surgery has a 90% chance to succeed. This surgery is going to be performed on 5 patients. Let X be the random variable equal to the number of successes out of the 5 attempts. X follows a binomial distribution with parameters n = 5 and p = 0.9.

Simulation from the binomial distribution

A sample of the binomial distribution can be drawn in R using the function rbinom.

Example Simulate one value of the binomial distribution with parameters 5 and 0.9.

R> n<-5
R> p<-0.9
R> x<-rbinom(1,n,p)
R> x
[1] 5

Interpretation x is the number of successes among 5 surgeries.

We can also compute the probability to have 0 success, exactly 1 success (among the 5 surgeries), 2 succcesses, etc. These theoretical probabilities are computed with the function pbinom.

For the binomial distribution

Its density function is dbinom(x,n,p).
Its cumulative distribution function (cdf) is pbinom(x,n,p): input a value x, output a probability P.
Its quantile function is qbinom(P,n,p): input a probability P, output a value x.

Example Apply functions dbinom, pbinom, qbinom for the surgery example.

R> dbinom(4,n,p)
[1] 0.32805

Interpretation The probability to obtain exactly 4 successes among the 5 surgeries is 32.805%.

R> dbinom(2,n,p)
[1] 0.0081

Interpretation The probability to obtain exactly 2 successes among the 5 surgeries is 0.81%.

Cumulative probabilities are the probabilities to obtain at most a certain value, or at least a certain value. They are calculated with pbinom, using the option lower.tail=FALSE for the condition at least.

R> pbinom(4,n,p)
[1] 0.40951

Interpretation The probability to obtain at most 4 successes among the 5 surgeries is 40.951%.

R> pbinom(3,n,p, lower.tail=FALSE)
[1] 0.91854

Interpretation The probability to obtain at least 3 successes among the 5 surgeries is 91.854%.

The (theoretical) cumulative distribution has to be compared to the empirical cumulative distribution function (ecdf) of a binomial sample.

Code R :

R> X<-rbinom(1000,n,p)
R> plot(ecdf(X))
R> curve(pbinom(x,n,p), col='red', add=TRUE)

Résultat :

Interpretation The cumulative distribution function is the theoretical counterpart of the ecdf. When the sample size increases, they are very close.

Continuous distributions

A continuous probability distribution assigns a probability to the subsets of an interval.
Its density function is a positive function f with integral equal to 1.
The density is the derivative of the cumulative distribution function.
The mean is the integral of xf(x).

Uniform distribution

The uniform distribution on the interval (0, 1) has a constant density, equal to 1 over that interval. It has mean 1/2, and variance 1/12.

Example Simulating from the uniform distribution on the interval [0, 5] is performed with the function runif:

R> runif(1,0,5)
[1] 1.803786

For the uniform distribution on the interval (min, max)

Its density function is dunig(x,min,max).
Its cumulative distribution function (cdf) is punif(x,min,max): input a value x, output a probability p.
Its quantile function is qunif(p,min,max): input a probability p, output a value x.

Normal distribution

The standard normal distribution has density

$$f(x) = \frac{1}{\sqrt{2\pi}}\exp^{-x^2/2}$$

over the whole real line.

It has mean 0, and variance 1 (it is centered and reduced).

The normal or Gaussian distribution 𝒩(μ, σ) has mean μ, standard deviation σ.

Example Simulate 1000 values of the normal distribution with parameters μ = 1 and σ = 10.

R> mu<-1
R> sig<-10
R> x<-rnorm(1000, mu, sig)
R> mean(x)
[1] 1.145378

Interpretation mu is the theoretical mean of the sample. mean(x) is the empirical mean of the sample. They are close.

Code R :

R> hist(x)

Résultat :

Interpretation The histogram of a (large) Gaussian sample is symetric around the theoretical mean.

For the normal distribution

Its density function is dnorm(x,mean=mu,sd=sigma).
Its cumulative distribution function (cdf) is pnorm(x,mean=mu,sd=sigma): input a value x, output a probability p.
Its quantile function is qnorm(p,mean=mu,sd=sigma): input a probability p, output a value x.

The theoretical density is close to the histogram represented in the relative frequencies scale. Code R :

R> hist(x, prob=TRUE)
R> curve(dnorm(x, mu, sig), -40, 30, col="red", add=TRUE)

Résultat :

Interpretation The density is the theoretical counterpart of the histogram. When the sample size increases, they are very close. The density can thus be interpreted as the probability of a very small intervals.

The (theoretical) cumulative distribution has to be compared to the empirical cumulative distribution function (ecdf) of a binomial sample. Code R :

R> X<-rnorm(1000,mu,sig)
R> plot(ecdf(X))
R> curve(pnorm(x,mu,sig), col='red', add=TRUE)

Résultat :

Interpretation The cumulative distribution function is the theoretical counterpart of the ecdf. When the sample size increases, they are very close.

Let us continue with some properties of the normal distribution.

The family of normal distributions is invariant through scaling.

If X follows the 𝒩(μ, σ), then $\frac{X-\mu}{\sigma}$ (z-scores) follows the 𝒩(0, 1).
If Y follows the 𝒩(0, 1) then μ + σY follows the 𝒩(μ, σ).

The family of normal distributions is invariant through linear combinations of independent variables.

If X and Y are two independent random variables, with respective distributions ${\cal N}(\mu_x,\sigma_x)$ and ${\cal N}(\mu_y,\sigma_y)$, then:

aX + bY follows the ${\cal N}\left(a\mu_x+b\mu_y,\sqrt{a^2\sigma_x^2+b^2\sigma_y^2}\right)$,
in particular X − Y follows the ${\cal N}\left(\mu_x-\mu_y,\sqrt{\sigma_x^2+\sigma_y^2}\right)$.

The function qqnorm draws the empirical quantiles of a sample against the quantiles of the standard normal distribution.

Example Let us simulate a sample X from the normal distribution 𝒩(0, 1) and draw its qqplot. Code R :

R> X<-rnorm(1000)
R> qqnorm(X)
R> abline(0,1)

Résultat :

Interpretation The qqnorm of a Gaussian sample is close to a straight line. It means that the quantiles of the sample are close to the quantiles of a Gaussian distribution.

Due to invariance through scaling, if the sample is drawn from the normal distribution 𝒩(μ, σ), the plot is close to the straight line with intercept μ and slope σ. Code R :

R> mu<-5
R> sig<-3
R> X<-rnorm(1000, mu, sig)
R> qqnorm(X)
R> abline(mu,sig)

Résultat :

Example The total nutritional intake (in Kcal per day) in the control population has mean 2970 and standard deviation 251. Among runners, the mean is 3350 and the standard deviation 223.

Consider a person taken at random in the control population. Would you say that the chances that the intake is smaller than 3000 are larger than 1/2?

R> muC<-2970
R> sdC<-251
R> pnorm(3000, muC,sdC)
[1] 0.5475691

Interpretation The probability of an intake to be smaller than 3000 is 54.76%, larger than 1/2.

What proportion of intakes in the control population are smaller than 2600?

R> pnorm(2600, muC,sdC)
[1] 0.07022685

Interpretation Only 7% of intakes in the control population are smaller than 2600.

What proportion of intakes in the runners population are smaller than 2600?

R> muR<-3350
R> sdR<-223
R> pnorm(3000, muR,sdR)
[1] 0.05826496

Interpretation Only 5.8% of intakes in the runner population are smaller than 2600.

Which lower bound is such that 1% of the control population is below?

R> qnorm(0.01, muC,sdC)
[1] 2386.087

Interpretation 1% of the control population is below 2386.

Which upper bound is such that 1% of runners are above?

R> qnorm(0.01, muR,sdR,lower.tail=FALSE)
[1] 3868.776

Interpretation 1% of the runners is the below 3869.

We continue the presentation of continuous distributions with two families, Student and Chi-squared, that are linked to the Gaussian distribution.

Student distribution

The Student distribution is defined from a Gaussian sample X₁, …, X_n, that is from n variables that are independent and identically distributed from a Gaussian distribution 𝒩(μ, σ). Let us denote $\overline{X}$ the empirical mean of the sample, and S² its empirical variance. The Student distribution is defined as follows:

The transformed variable $\displaystyle{\sqrt{n}\,\frac{\overline{X}-\mu}{\sqrt{S^2}}}$ follows the Student T distribution 𝒯(n − 1) with n − 1 degrees of freedom.

Its mean is 0. Its variance is (n − 1)/(n − 2) when n ≥ 3.

The Student distribution is used for theoretical mathematical study of statistical notions such than confidence intervals, statistical tests.

The R functions associated to the Student distribution are named:

For the Student distribution

Its density function is dt(x,df=n).
Its cumulative distribution function (cdf) is pt(x,df=n): input a value x, output a probability p.
Its quantile function is qt(p,df=n): input a probability p, output a value x.

Example Simulate a sample from a Student distribution with 10 degrees of freedom (using the function rt) and represent its distribution: Code R :

R> X<-rt(1000, df=5)
R> hist(X, prob=TRUE, xlim=c(-5,5), breaks=30)

Résultat :

Interpretation The histogram is symetric, looks like the histogram of a Gaussian sample, but with more heavy tails.

Let us try to compare both distributions, the empirical distribution is the histogram, the theoretical density of the Student distribution is in red, and the theoretical density of a standard normal distribution is in blue. Code R :

R> X<-rt(1000, df=5)
R> hist(X, prob=TRUE, xlim=c(-5,5), breaks=30)
R> curve(dt(x,df=5), col="red", add=TRUE)
R> curve(dnorm(x), col="blue", add=TRUE)

Résultat :

Interpretation The Student distribution has more heavy tails than the Gaussian distribution.

When the degree of freedom increases, the difference with the normal distribution decreases: Code R :

R> X<-rt(1000, df=100)
R> hist(X, prob=TRUE, xlim=c(-5,5), breaks=30)
R> curve(dt(x, df=100), col="red", add=TRUE)
R> curve(dnorm(x), col="blue", add=TRUE)

Résultat :

Interpretation The Student distribution with a large degree of freedom is close to the standard normal distribution

The difference between the Student and the Gaussian distributions can also been viewed via the qqnorm representations. Code R :

R> X<-rt(1000, df=5)
R> qqnorm(X)
R> abline(0,1)

Résultat :

Increasing the degree of freedom yields to a qqnorm close to a straight line. Code R :

R> X<-rt(1000, df=100)
R> qqnorm(X)
R> abline(0,1)

Résultat :

Chi-squared distribution

The transformed variable $\displaystyle{(n-1)\,\frac{S^2}{\sigma^2}}$ follows the chi-squared distribution 𝒳²(n − 1) with n − 1 degrees of freedom.

Its mean is n − 1. Its variance is 2(n − 1).

The Chi-squared distribution is used for theoretical mathematical study of statistical notions such than confidence intervals, statistical tests.

The R functions associated to the Chi-squared distribution are named:

For the Chi-squared distribution

Its density function is dchisq(x,df=n).
Its cumulative distribution function (cdf) is pchisq(x,df=n): input a value x, output a probability p.
Its quantile function is qchisq(p,df=n): input a probability p, output a value x.

Example Simulate a sample from a Chi-squared distribution with 10 degrees of freedom (using the function rchisq) and represent its distribution: Code R :

R> X<-rchisq(1000, df=5)
R> hist(X, prob=TRUE)
R> curve(dchisq(x, df=5), add=TRUE, col="red")

Résultat :

Interpretation A Chi-squared variable takes only positive values. The histogram is right skewed.

Increasing the degree of freedom translates the distribution: Code R :

R> X<-rchisq(1000, df=10)
R> hist(X, prob=TRUE)
R> curve(dchisq(x, df=10), add=TRUE, col="red")

Résultat :

Fluctuation interval

A fluctuation interval with level q is such that the probability of values inside is q. It can be:

Two-sided: (x₁, x₂). The probability to be below x₁ and the probability to be above x₂ are both equal to (1 − q)/2.
Left-sided: (min, x₁). The probability to be below x₁ is q. The left bound min is the minimal value of the distribution (it can be −∞).
Right-sided: (x₂, max) The probability to be above x₁ is q. The right bound max is the maximal value of the distribution (it can be +∞).

Fluctuation interval for the normal distribution

The normal.fluctuation function has been developed to illustrate and understand the fluctuation interval for a normal distribution. It takes a probability, the parameters of the normal distribution and the alternative (two-sided, less, greater). It returns the two quantiles that define the fluctuation interval.

The two-sided fluctuation interval with level 0.5 of the standard normal distribution is computed [ − 0.6745, 0.6745].

R> normal.fluctuation(0.5, mean=0, sd=1)

Alt text

For level 0.8, it is [ − 1.2816, 1.2816].

Alt text

For level 0.95, it is the well-known [ − 1.96, 1.96].

Alt text

Interpretation The two-sided fluctuation interval of the standard normal distribution is symetric around 0. It increases with the level.

The left-sided fluctuation interval with level 0.95 is ] − ∞, 1.6449].

Alt text

Interpretation 95% of the values are below 1.6449.

The right-sided fluctuation interval with level 0.95 is [1.6449, +∞[.

Alt text

Limit theorems

Statistics are based on two powerful limit theorems, called the law of large numbers and the central limit theorem.

Law of large numbers

The Law of Large Numbers implies that over a large number of independent repetitions of a given experiment, the relative frequency of an event gets close to its theoretical probability.

Example Define x = (0, 1) and for n=1000, we simulate a sample of n random values from x.

R> x<-c(0,1)
R> n<-1000
R> A<-sample(x,n,replace=TRUE)

Then, we compute the vector of cumulative sums of A, divide it by 1:n to compute the consecutive means of A.

R> M<-cumsum(A)/1:n

Then we plot M against 1:n as blue dots. The law of large numbers says that these empirical means (M) converge to the theoretical mean, which is equal to 0.5. Code R :

R> plot(1:n,M, col="blue", pch='.')
R> abline(h=0.5,col="red")

Résultat :

Interpretation The empirical mean (blue) converges to the true theoretical mean (red). This is the law of large numbers.

Central limit theorem

The Central Limit Theorem says that, if a large number of independent random variables having the same distribution are added, their sum approximately follows a normal distribution.

Normal approximations

As a consequence of the central limit theorem, some distributions can be approximated by the normal distribution with same mean and standard deviation.

The binomial distribution ℬ(n, p) is close to the normal distribution $\mathcal{N}\left(np,\sqrt{np(1-p)}\right)$, for n large.
The Student distribution 𝒯(n) is close to the normal distribution 𝒩(0, 1), for n large.
The chi-squared distribution 𝒳²(n) is close to the normal distribution $\mathcal{N}\left(n,\sqrt{2n}\right)$, for n large.
For a large sample from any distribution with mean μ, standard deviation σ, the z-scores $\displaystyle{\sqrt{n}\,\frac{\overline{X}-\mu}{\sigma}}$ and $\displaystyle{\sqrt{n}\,\frac{\overline{X}-\mu}{\sqrt{S^2}}}$ approximately follow the 𝒩(0, 1).

Example From past experience, it is known that a certain surgery has a 90% chance to succeed. This surgery is performed by a certain clinic 400 times each year. Let N be the number of successes next year.

The exact model for N is a binomial distribution with parameters n = 400 and p = 0.9 ($(n,p)). The normal approximation is 𝒩(360, 6). The two densities are represented (blue for the binomial density, red for the normal density): Code R :

R> n<-400
R> p<-0.9
R> curve(dnorm(x,mean=n*p, sd=sqrt(n*p*(1-p))), from=320, to=n, col="red", ylab="")
R> points(dbinom(0:n,n,p), col="blue")

Résultat :

Let us compute the probabilities for the clinic to perform successfully the surgery at least 345 times, in the exact and approximate models.

R> n<-400
R> p<-0.9
R> ms<-n*p
R> sds<-sqrt(n*p*(1-p))
R> pbinom(344,n,p,lower.tail=FALSE)            # exact model
[1] 0.9933626
R> pnorm(344,ms,sds,lower.tail=FALSE)          # approximate model
[1] 0.9961696

The insurance accepts to cover a certain number of failed surgeries; that number has only a 1% chance to be exceeded. Let us compute this number, in the exact and approximate models?

R> n-qbinom(0.01,n,p)                          # exact model
[1] 55
R> n-qnorm(0.01,ms,sds)                        # approximate model
[1] 53.95809