Two sample tests

Association of two variables

We focus on testing the association or dependence of two variables observed on a given population. The question is: does the distribution of values of the first variable vary, given the values of the second? According to the nature (continuous, discrete, or binary) of the two variables, many tests can be used. We shall present only a few of them.

Continuous against binary

Consider a continuous variable, denoted by Cont (e.g. age), and a binary variable, denoted by Bin (e.g. gender). Descriptive statistics consist in computing the summaries of Cont for each value of Bin, then displaying box plots of Cont against Bin:

by(Cont, Bin, summary)

boxplot(Cont~Bin)

If there is an association between Cont and Bin, the two summaries are different; of the two box plots, one is above the other, or one is larger than the other, etc.

Example In the titanic data set, study the link between the age and the gender.

R> titanic <- read.table("data/titanic.csv",header=TRUE,sep=";")
R> G <- titanic[,"gender"]
R> A <- titanic[,"age"]
R> by(A,G,summary)
G: F
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1667 19.0000 27.0000 28.6871 38.0000 76.0000 
------------------------------------------------------------ 
G: M
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.3333 21.0000 28.0000 30.5852 39.0000 80.0000

R code:
R> boxplot(A~G)

Result: figure (box plots of A by G).

The two means are different (28.69 years for women, 30.59 for men); the two box plots are close, even if the distribution of ages for men is slightly more right-skewed.

The distributions can also be compared through plots of the empirical cumulative distribution function (ecdf). Denote by X and Y the two subsamples of Cont, according to the values of Bin (e.g. X: ages of women, Y: ages of men). The ecdfs are obtained with

plot(ecdf(X),col="red")

lines(ecdf(Y),col="blue")

Back to the example

R code:
R> Aw <- A[G=="F"]                             # ages of women
R> Am <- A[G=="M"]                             # ages of men
R> plot(ecdf(Aw),col="red")                    # ages of women
R> lines(ecdf(Am),col="blue")                  # ages of men

Result: figure (ecdfs of Aw in red and Am in blue).

The two ecdfs are very close.

The summaries and the plots are not sufficient to decide whether the mean ages of men and women are different. To decide whether the link between age and gender is significant, three tests can be applied.

The link between a continuous variable and a binary variable can be tested by three tests. Denote by X and Y the two subsamples of the continuous variable Cont, according to the values of the binary variable Bin.

  • Student’s t-test: The null hypothesis H0 is: “the means of X and Y are equal”. The command is

t.test(X,Y,alternative)

or equivalently

t.test(Cont~Bin, alternative)

  • Fisher’s F-test: The null hypothesis H0 is: “the variances of X and Y are equal”. The command is

var.test(X,Y,alternative)

  • Kolmogorov-Smirnov test: The null hypothesis H0 is: “the ecdf’s of X and Y are equal”. The command is

ks.test(X,Y, alternative)

In all three cases, the alternative is one of two.sided (the default), less, and greater. The interpretation of less and greater is relative to the first sample X.

Recall that if the ecdf of X is less than (below) that of Y, the values of X tend to be larger than those of Y.
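
A small sketch on simulated data (hypothetical values, not the titanic example) illustrates this convention:

set.seed(1)
X <- rnorm(200, mean = 1)             # X is stochastically larger than Y
Y <- rnorm(200, mean = 0)
plot(ecdf(X), col = "red")            # the red ecdf lies below / to the right of the blue one
lines(ecdf(Y), col = "blue")
ks.test(X, Y, alternative = "less")   # "less": the CDF of X lies below that of Y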

Example In the titanic data set, study the link between the age and the gender.

R> t.test(Aw,Am)

	Welch Two Sample t-test

data:  Aw and Am
t = -2.0497, df = 798.36, p-value = 0.04072
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.71595773 -0.08036699
sample estimates:
mean of x mean of y 
 28.68707  30.58523 

R> t.test(A~G)

	Welch Two Sample t-test

data:  A by G
t = -2.0497, df = 798.36, p-value = 0.04072
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.71595773 -0.08036699
sample estimates:
mean in group F mean in group M 
       28.68707        30.58523

The two instructions are equivalent.

The p-value is 0.04072. At the 5% level, we reject H0: the two means are significantly different.

The mean age of women is less than that of men. To test whether this difference is significant, we apply a one-sided test:

R> t.test(A~G, alternative="less")

	Welch Two Sample t-test

data:  A by G
t = -2.0497, df = 798.36, p-value = 0.02036
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
       -Inf -0.3731636
sample estimates:
mean in group F mean in group M 
       28.68707        30.58523

The p-value is 0.02036. At the 5% level, we reject H0: women were on average significantly younger than men.

We can also compare the two variances and the two ecdfs, with Fisher's F (variance) test and the Kolmogorov-Smirnov test.

R> var.test(Aw,Am)

	F test to compare two variances

data:  Aw and Am
F = 1.0419, num df = 387, denom df = 657, p-value = 0.6442
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.8740042 1.2473141
sample estimates:
ratio of variances 
          1.041945 

R> ks.test(Aw,Am,alternative="greater")

	Two-sample Kolmogorov-Smirnov test

data:  Aw and Am
D^+ = 0.084636, p-value = 0.03029
alternative hypothesis: the CDF of x lies above that of y

The p-value of Fisher's test (var.test) is 0.6442. At the 5% level, we do not reject H0: the variances of the ages of men and women are not significantly different. The p-value of the Kolmogorov-Smirnov test is 0.03029. At the 5% level, we reject H0: women were significantly younger than men.

Continuous against discrete

This section considers the case of a continuous variable and a discrete variable (not necessarily binary).

Consider a continuous variable, denoted by Cont (e.g. heart rate), and a discrete variable, denoted by Disc (e.g. type of disease). Here again, the first thing to do is to display summaries of Cont by values of Disc, and box plots of Cont against Disc.

by(Cont,Disc,summary)

boxplot(Cont~Disc)

Example In the titanic data set, study the link between the age and the class.

R code:
R> titanic <- read.table("data/titanic.csv",header=TRUE,sep=";")
R> P <- titanic[, "pclass"]
R> A <- titanic[,"age"]
R> by(A,P,summary)
R> boxplot(A~P)

Result: summaries of A by P, and figure (box plots of A by P).

The three box plots are different. The first class passengers were overall older than the second and third class passengers. The age distributions of classes 2 and 3 are right-skewed.

The discrete variable Disc divides the population into as many groups as it has distinct values: if there are 3 types of disease, then there are 3 groups of patients, one per type of disease. If there is no dependence between Cont and Disc, the means of Cont within the different groups should be equal.

The link between a continuous variable and a discrete variable can be tested by one-way analysis of variance, or anova. The null hypothesis is: “the means within the groups are all equal”. There is only one possible alternative, which is: “there are differences between group means”.

oneway.test(Cont~Disc)

If there are only two groups, the variable is binary, and the tests of the previous section apply. In that case, the two-sided T-test will return exactly the same p-value as the one-way anova.
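
As a quick check on the titanic example (assuming the age A and gender G defined in the previous section are still loaded), the following two commands should return the same p-value:

oneway.test(A~G)   # one-way anova with a binary factor (Welch version by default)
t.test(A~G)        # two-sided Welch t-test: same p-value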

If the p-value is small, the conclusion is that there are significant differences between the means of the different groups. Those differences should be visible on the box plot. Differences between the means of two particular groups, or between the mean of one group and the mean in the rest of the population, can be tested using a one-sided T-test.

Without giving mathematical details, we shall now explain the expression analysis of variance. The global variance of Cont can be written as the sum of two variances: global variance = variance between groups + variance within groups

1.  The variance between groups is the variance of the means of groups: the smaller the differences between means, the smaller the variance between groups.

2.  The variance within groups is the mean of the variances of the groups.


If all group means are equal (the null hypothesis of the test), then the variance between groups is zero. The higher the ratio of the first variance to the second, the more significant the differences between the means. That ratio is returned by the function oneway.test, together with the corresponding p-value.
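
The decomposition can be checked numerically. The sketch below assumes the titanic vectors A (age) and P (class) of the example, and uses population variances (dividing by n rather than n-1):

vpop    <- function(x) mean((x - mean(x))^2)            # population variance
n       <- length(A)
sizes   <- tapply(A, P, length)                         # group sizes
means   <- tapply(A, P, mean)                           # group means
between <- sum(sizes * (means - mean(A))^2) / n         # variance between groups
within  <- sum(sizes * tapply(A, P, vpop)) / n          # variance within groups
c(global = vpop(A), decomposition = between + within)   # the two values coincide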

Back to the example In the titanic data set, study the link between the age and the class.

R> oneway.test(A~P)

	One-way analysis of means (not assuming equal variances)

data:  A and P
F = 99.967, num df = 2.00, denom df = 546.72, p-value < 2.2e-16

The p-value is very small (< 2.2e-16). We reject H0: the three passenger classes have significantly different mean ages.

The next step is to test the difference between two specific classes. This can be done with a T-test.

R> A1 <- A[P==1]                               # ages of first class
R> A2 <- A[P==2]                               # ages of second class
R> A3 <- A[P==3]                               # ages of third class passengers
R> t.test(A1,c(A2,A3),alternative="greater")   # first against others

	Welch Two Sample t-test

data:  A1 and c(A2, A3)
t = 13.01, df = 454.27, p-value < 2.2e-16
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 11.12341      Inf
sample estimates:
mean of x mean of y 
 39.15992  26.42290 

R> t.test(A1,A2,alternative="greater")         # first against second

	Welch Two Sample t-test

data:  A1 and A2
t = 7.9947, df = 542.78, p-value = 3.932e-15
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 7.663748      Inf
sample estimates:
mean of x mean of y 
 39.15992  29.50670 

R> t.test(A1,A3,alternative="greater")         # first against third

	Welch Two Sample t-test

data:  A1 and A3
t = 14.129, df = 499.8, p-value < 2.2e-16
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 12.67057      Inf
sample estimates:
mean of x mean of y 
 39.15992  24.81637 

R> t.test(A2,c(A1,A3))                         # second against others

	Welch Two Sample t-test

data:  A2 and c(A1, A3)
t = -0.50225, df = 475.13, p-value = 0.6157
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.450859  1.453016
sample estimates:
mean of x mean of y 
 29.50670  30.00563 

R> t.test(A2,A3,alternative="greater")         # second against third

	Welch Two Sample t-test

data:  A2 and A3
t = 4.6948, df = 470.7, p-value = 1.753e-06
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 3.043793      Inf
sample estimates:
mean of x mean of y 
 29.50670  24.81637

First class passengers were on average significantly older than those of the second and third classes. Second class passengers were on average significantly older than those of the third class.

Continuous against continuous

The dependence between two continuous variables X and Y, observed over the same population, will be tested through their correlation. The first thing to do is a scatter plot of the two variables.

plot(X,Y)

If the correlation is positive, the two variables tend to vary in the same direction: if one increases, the other one does too. If the correlation is negative, the variables vary in opposite directions. The closer the correlation is to plus or minus 1, the closer the scatter plot is to a straight line, and the stronger the association of the two variables.
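
As an illustration on simulated data (hypothetical variables), a positive and a negative correlation produce increasing and decreasing scatter plots, respectively:

set.seed(1)
U  <- rnorm(100)
V1 <-  2*U + rnorm(100)    # positive correlation: increasing scatter plot
V2 <- -2*U + rnorm(100)    # negative correlation: decreasing scatter plot
cor(U, V1); cor(U, V2)
plot(U, V1); plot(U, V2)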

The link between two continuous variables can be tested by a correlation test. The null hypothesis is: “the correlation is equal to 0” (X and Y are uncorrelated). The testing command is:

cor.test(X,Y,alternative)

The alternative is one of two.sided (the default: the correlation is not 0), less (the correlation is negative), and greater (the correlation is positive).

Example 2 In the tauber data set, study the link between age, height and weight.

R> TA <- read.table("data/tauber.csv", header=TRUE,sep=";")
R> head(TA)
  gender age height weight
1      F  68    113   20.0
2      M  74    116   17.5
3      M  69    120   23.0
4      M  72    121   25.0
5      M  73    114   17.0
6      M  65    115   20.0

The scatter plots of each pair of variables are obtained with the following R code:
R> A <- TA[,"age"]
R> H <- TA[,"height"]
R> W <- TA[,"weight"]
R> AHW <- TA[,2:4]
R> pairs(AHW)

Result: figure (pairwise scatter plots of age, height and weight).

The plot is equivalent to

R> plot(AHW)

or

R> plot(A,H)
R> plot(A,W)
R> plot(H,W)

The scatter plots have a linear shape.

Their correlation is obtained with

R> cor(AHW)
             age    height    weight
age    1.0000000 0.3671279 0.2214119
height 0.3671279 1.0000000 0.7186317
weight 0.2214119 0.7186317 1.0000000

The correlation between age and height is 0.3671279. The correlation between height and weight is 0.7186317.

The correlation between age and height is smaller than that between height and weight, but we cannot directly conclude whether they are significantly different from zero. There is no universal threshold on the correlation scale above which a correlation can be declared different from zero: a correlation test is needed to decide.

R> cor.test(A,H,alternative="greater")

	Pearson's product-moment correlation

data:  A and H
t = 21.214, df = 2889, p-value < 2.2e-16
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
 0.3403532 1.0000000
sample estimates:
      cor 
0.3671279 

R> cor.test(A,W,alternative="greater")

	Pearson's product-moment correlation

data:  A and W
t = 12.204, df = 2889, p-value < 2.2e-16
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
 0.1921155 1.0000000
sample estimates:
      cor 
0.2214119 

R> cor.test(H,W,alternative="greater")

	Pearson's product-moment correlation

data:  H and W
t = 55.546, df = 2889, p-value < 2.2e-16
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
 0.7035029 1.0000000
sample estimates:
      cor 
0.7186317

The three p-values are very small. We reject H0. The three correlations are significantly greater than 0.

Discrete against discrete

The conditional distributions of two discrete variables X and Y can be visualized using contingency tables and bar plots.
  • conditional frequencies of Y, given each value of X

barplot(prop.table(table(X,Y),1),beside=TRUE)

  • conditional frequencies of X, given each value of Y

barplot(prop.table(table(Y,X),1),beside=TRUE)

If the two variables are independent (null hypothesis), then the conditional frequencies of one variable should not depend on the values of the other. Visually, the corresponding bars should have similar heights across the groups of the bar plot.

Example In the titanic data set, let us compute the conditional frequencies of the passenger class given survival:

R> S<-titanic[ ,"survived"]
R> prop.table(table(S,P),1)
     P
S             1         2         3
  no  0.1663974 0.2358643 0.5977383
  yes 0.4238876 0.2693208 0.3067916

R code:
R> barplot(prop.table(table(S,P),1),beside=TRUE)

Result: figure (bar plot of the conditional frequencies).

Interpretation The conditional frequencies of the classes differ between survivors and non-survivors: third class passengers are over-represented among the non-survivors (about 60%), while first class passengers are over-represented among the survivors (about 42%).

The link between two discrete variables X and Y is studied with the chi-squared test:

H0: “X and Y are independent” and H1: “X and Y are dependent”.

The command is:

chisq.test(table(X,Y))

The parameter (degrees of freedom) of the chi-squared distribution is the product of the number of values of X minus one, times the number of values of Y minus one. For instance, if there are 3 values in X and 4 in Y, the parameter is df = (3-1)*(4-1) = 6. The function chisq.test computes df automatically.
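
As a sketch, the degrees of freedom can be recovered from the dimensions of the contingency table (here with the titanic variables S and P used in the example below):

tab <- table(S, P)                   # 2 rows (survived) and 3 columns (class)
(nrow(tab) - 1) * (ncol(tab) - 1)    # df = (2-1)*(3-1) = 2, as in the test output below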

Example Let us test the link between class and survival in the titanic data set.

R> chisq.test(table(S,P))

	Pearson's Chi-squared test

data:  table(S, P)
X-squared = 107.5, df = 2, p-value < 2.2e-16

Interpretation The p-value is very small, H0 is rejected: the two variables are dependent. There exists a link between the class of passengers and their survival.

Binary against binary

In the specific case of two binary variables, one can also use a proportion test to decide whether the proportion of the population with a given value for X is the same, over the two values of Y.

The link between two binary variables is studied with the proportion test.

Two populations are defined by the two values of Y; let n1 and n2 denote their sizes. The variable X defines which observations count as “successes”. If x1 and x2 are the numbers of successes, the observed proportions of successes are x1/n1 and x2/n2 in the two populations, respectively.

The null hypothesis of the proportion test is: “the proportions of successes in the two populations are equal”.

The command is:

prop.test(c(x1,x2),c(n1,n2),alternative)

As usual, the alternative is one of two.sided (the default), less, and greater.

Example In the titanic data set, we can test the link between survival and gender with a proportion test:

R> SF<-S[G=="F"]
R> SM<-S[G=="M"]
R> x1<-sum(SF=="yes")
R> n1<-length(SF)
R> x2<-sum(SM=="yes")
R> n2<-length(SM)
R> prop.test(c(x1,x2),c(n1,n2))

	2-sample test for equality of proportions with continuity correction

data:  c(x1, x2) out of c(n1, n2)
X-squared = 300.5, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 0.4924883 0.6023320
sample estimates:
   prop 1    prop 2 
0.7525773 0.2051672

Interpretation The p-value is very small, H0 is rejected: the proportion of survivors is significantly different for women and men.
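
As a side check (a sketch, assuming the variables G and S of this example), the two-sided proportion test on two groups is the same continuity-corrected chi-squared test as chisq.test applied to the corresponding 2x2 table, so the X-squared value above should be reproduced by:

chisq.test(table(G, S))   # same Yates-corrected chi-squared statistic as prop.test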

We can refine the analysis of the link between the two variables by testing a one-sided alternative: did women survive more than men? Since the women's data (x1, n1) are given first in the command, the alternative to test is greater, i.e. H1: x1/n1 > x2/n2:

R> prop.test(c(x1,x2),c(n1,n2), alternative="greater")

	2-sample test for equality of proportions with continuity correction

data:  c(x1, x2) out of c(n1, n2)
X-squared = 300.5, df = 1, p-value < 2.2e-16
alternative hypothesis: greater
95 percent confidence interval:
 0.5009889 1.0000000
sample estimates:
   prop 1    prop 2 
0.7525773 0.2051672

Interpretation The p-value is very small, H0 is rejected: women survived significantly more than men.

Another possibility exists for two binary variables, which is preferable: Fisher's exact test. Explaining it will be the occasion to present the odds-ratio, which is widely used in applications to medicine.

In the standard presentation, the two binary variables are understood as “exposure” (e.g. smoking or not) and “disease” (e.g. lung cancer or not). The contingency table contains the frequencies of 4 categories of patients:

             Diseased   Healthy
exposed         De         He
unexposed       Du         Hu

The odds of disease among exposed patients are De/He. The odds among unexposed patients are Du/Hu.

The odds-ratio is:

$$OR = \frac{De/He}{Du/Hu}$$

If disease and exposure are independent, the odds-ratio is equal to one. If exposure favors the disease, the odds-ratio is greater than one. If exposure protects against the disease, the odds-ratio is less than one.
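
As a sketch, the sample odds-ratio can be computed directly from the 2x2 contingency table (here with the titanic variables G and S used below; note that fisher.test reports a conditional maximum likelihood estimate, which may differ slightly from this sample value):

tab <- table(G, S)                                  # rows: gender, columns: survived (no/yes)
(tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])   # sample odds-ratio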

Fisher's exact test compares the null hypothesis “the odds-ratio is equal to 1” (exposure and disease are independent) with one of the alternatives:

  • two.sided: the odds-ratio is not equal to 1 (exposure and disease are not independent)

  • less: the odds-ratio is less than 1 (exposure protects against disease)

  • greater: the odds-ratio is greater than 1 (exposure favors disease).

The command is:

fisher.test(table(X,Y),alternative)

The odds-ratio, a confidence interval, and the p-value are returned.

Example We can apply the Fisher test to the titanic data set:

R> table(G,S)
   S
G    no yes
  F  96 292
  M 523 135
R> fisher.test(table(G,S))

	Fisher's Exact Test for Count Data

data:  table(G, S)
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.06225653 0.11556419
sample estimates:
odds ratio 
0.08513036

Interpretation The p-value is very small, H0 is rejected. Survival and Gender are dependent.

The odds-ratio (0.08) is less than 1. We can now test whether being a woman protected against death. The alternative is H1: the odds-ratio is less than 1:

R> fisher.test(table(S,G), alternative="less")

	Fisher's Exact Test for Count Data

data:  table(S, G)
p-value < 2.2e-16
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
 0.0000000 0.1103079
sample estimates:
odds ratio 
0.08513036

Interpretation The p-value is very small, H0 is rejected. The odds-ratio (0.08) is significantly less than 1: being a woman significantly protected against death.

Linear regression

The scope of this part is slightly different from the previous part, in the sense that the notion of association is not understood as symmetric anymore. One variable, called response, is considered as dependent, explained, or predicted by the others, which are considered as explanatory variables or predictors.

A regression model is a function of predictors, as close as possible to the response.

There exist several types of regression models, depending on the type of the response variable. We consider here the case of two continuous variables X and Y, observed over the same population.
  • X is the explanatory variable, or predictor

  • Y is the response

Simple linear regression

We now give the definition of a simple linear model.

The simple linear regression of Y onto X consists in writing Y as a linear function of X:
Y = a + bX + e

  • a is the intercept: it is the value of the function at the origin x = 0, or the height of the intersection with the y-axis.

  • b is the slope: it measures by how much the straight line goes up or down when x increases by 1.

  • e is the residual: it is the difference between the response Y and the values of the model a + bX.

The graph of a linear function x ↦ a + bx is a straight line. If Y was exactly equal to a + bX, all points would be aligned on the scatterplot, and the correlation would be equal to +1 or −1. This never happens in practice. The distance between Y and the straight line is the value of the residual.
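
A small simulated sketch (the values of a, b and of the residual spread are hypothetical) may help visualize the model and its residuals:

set.seed(1)
x <- runif(100, 0, 10)
e <- rnorm(100, sd = 2)                 # residuals
y <- 3 + 1.5*x + e                      # intercept a = 3, slope b = 1.5
plot(x, y)                              # points scatter around the line
abline(a = 3, b = 1.5, col = "red")     # the true line y = a + b x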

Estimation of the linear model

The linear regression consists in finding the best straight line: the one for which the residuals are as small as possible, obtained by minimizing the Mean Squared Error (MSE), the mean of the squared residuals.

The regression line of Y onto X has the optimal slope and intercept given by
$$\hat b = cov(X,Y)/var(X)$$

$$\hat a = \bar{Y} - \hat b *\bar{X}$$

This is calculated in R by:

slope <- cov(X,Y)/var(X)

intercept <- mean(Y)-slope*mean(X)

Example The data set mcas.csv is an excerpt from the Massachusetts Test Score Data Set, from the Massachusetts Comprehensive Assessment System (MCAS), Massachusetts Department of Education, and the 1990 U.S. Census. Data from 195 schools are presented. The variables include the test score (score4), the per capita income (income), the regular spending per pupil (regspend), the total spending per pupil (totspend), the special need spending (specspend), the special education pupils (special), the pupils per teacher ratio (pupratio), and the average teacher salary (avgsalary).

Before computing the regression line of the score onto the income, the link between the two variables is represented with a scatter plot.

R code:
R> mcas <- read.table("data/mcas.csv",header=TRUE,sep=";")
R> Y <- mcas[,"score4"]
R> X <- mcas[,"income"]
R> plot(X,Y)

Result: figure (scatter plot of the score against the income).

Interpretation The pattern of dots follows an increasing trend, which indicates a positive correlation between the score and the income.

We now compute the regression line. R code:
R> slope  <- cov(X,Y)/var(X)
R> intercept <- mean(Y)-slope*mean(X)
R> plot(X,Y)
R> abline(a=intercept,b=slope, col="red")

Result: figure (scatter plot with the regression line in red).

Graphically, the regression line passes through the scatter plot, and contains the point with coordinates (mean(X),mean(Y)), which is the center of gravity of the scatter plot.

Interpretation The intercept is 677.102 and the slope is 1.734. It means that

  • without any income per capita, the baseline score is 677.102,

  • when the per capita income increases by 1, the score increases by 1.734.

Note that the intercept and the slope are estimates; as such, they have a certain precision, measured by standard errors. Confidence intervals can then be deduced.

Coefficient of determination

The quality of the regression is assessed by the coefficient of determination, named R-squared.

The coefficient of determination R-squared, defined as the ratio of the variance of the fitted values to the variance of the response, is an indicator of the goodness of fit of the linear model.

The intuition of R-squared is as follows. Suppose you want to predict Y. If you do not know the X's, then the best prediction is $\bar{Y}$, the empirical mean; the variability corresponding to this prediction is the total variation. However, if the model is used for the prediction, the prediction error is reduced to the residual variation.

Interpretation It is a number between 0 and 1, understood as the proportion of the variance explained by the model: the closer to 1, the better the explanation.

Note that as more variables are added to the model, R-squared can only increase. The adjusted coefficient of determination aims to correct this deficiency.
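
As a sketch, for a simple linear regression R-squared can be recovered in two equivalent ways (using the variables X and Y of the example, and the function lm presented in the next section):

cor(X, Y)^2                                        # squared correlation
fit <- lm(Y ~ X)
1 - sum(residuals(fit)^2) / sum((Y - mean(Y))^2)   # 1 - residual SS / total SS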

Test of the slope

The significance test of the regression is the two-sided test of slope=0.
H0: b = 0   against   H1: b ≠ 0.

The statistic is based on the estimator of the slope $\hat b$ and its variance. The test returns the same p-value as the two-sided correlation test cor.test(X,Y).

Interpretation Accepting the null hypothesis b = 0 means declaring that there is no linear dependence of Y on X, in other words that the linear regression is not meaningful. Rejecting it means declaring that the regression is meaningful.
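
This equivalence can be checked directly (a sketch, using the variables X and Y of the example and the function lm presented in the next section):

summary(lm(Y ~ X))$coefficients["X", "Pr(>|t|)"]   # p-value of the slope test
cor.test(X, Y)$p.value                             # same p-value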

Function lm in R

All calculations are done by the command lm, for linear model. It requires a model formula that describes the link between the response variable Y and the explanatory variable X.

A linear model of Y onto X in R is fitted with

lm(Y~X)

The tilde symbol can be read as "is modeled as a function of".

After a model has been fitted with lm, a number of generic functions can be applied to it (see the list further below).

More details are given by:

summary(lm(Y~X))

Back to the example

R> fit<-lm(Y~X)
R> fit

Call:
lm(formula = Y ~ X)

Coefficients:
(Intercept)            X  
    677.102        1.734  

R> summary(fit)

Call:
lm(formula = Y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.121  -6.656   0.959   7.476  30.793 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 677.1016     2.9348  230.71   <2e-16 ***
X             1.7341     0.1517   11.44   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12 on 193 degrees of freedom
Multiple R-squared:  0.4039,	Adjusted R-squared:  0.4008 
F-statistic: 130.8 on 1 and 193 DF,  p-value: < 2.2e-16

Interpretation The summary recalls

  • the model lm(formula = Y ~ X)

  • the main characteristics of the residuals: they should be centered, symmetric

  • the estimates of the coefficients Intercept and slope (on the row X) (677.1016 and 1.7341 in this example), their standard errors (measuring their precision) (2.9348 and 0.1517) and their associated tests (<2e-16: very small p-values in this example, the two hypotheses H0 are rejected);

  • the Residual standard error: the standard deviation of the residuals (12 in this example)

  • the Multiple and Adjusted R-squared: the higher they are, the better the model (in this example, R-squared is 0.4039: only 40% of the variability of Y is explained by X).

  • the F-statistic and its p-value, which test the nullity of the slope (the same test as the test of the slope in a simple linear model; here the p-value is < 2.2e-16).

Other generic functions can be applied to the fitted model:

  • plot(fit): produces diagnostic plots for model checking (see below)
  • coef(fit): extracts the model coefficients from the estimated model
  • fitted(fit): extracts the fitted values, predicted by the model for the values of the explanatory variables of the data set

Example

R> coef(fit)
(Intercept)           X 
 677.101607    1.734141

produces a vector with the two estimated values, 677.1016 for the intercept and 1.7341 for the slope.

Model diagnostics

The analysis of the residuals is a very important step, and a part of what is called model validation.

  • The residuals are the differences between the observed value yi for subject i and the fitted value $\hat a + \hat b\, X_i$:

residuals(fit)

  • The standardized residuals are the rescaled residuals: the residuals divided by their estimated standard deviation:

rstandard(fit)

For a good model, the residuals must be approximately independent and identically distributed. If any systematic variation is detected, this indicates a lack of fit.

The typical plots are a histogram of the standardized residuals, the residuals against X, the residuals against the fitted values, and the fitted values against the observed response.

Example. R code:
R> hist(rstandard(fit))

Result: figure (histogram of the standardized residuals).

R code:
R> plot(X,residuals(fit))
R> abline(h=0)

Result: figure (residuals against X).

R code:
R> plot(fitted(fit),residuals(fit))
R> abline(h=0)

Result: figure (residuals against the fitted values).

R code:
R> plot(Y, fitted(fit), xlim=c(650, max(Y)), ylim=c(650,max(Y)))
R> abline(0,1)

Result: figure (fitted values against Y).

Interpretation The histogram is symmetric. The plots of the residuals against X and against the fitted values show no structure. But in the plot of the fitted values against Y, the points do not lie close to the diagonal: the linear model is not very good.

Prediction

The main goal of a linear regression is prediction. The idea is to guess which value of Y would correspond to a certain unobserved value of X, denoted by x*.

The prediction of the model for a value x* is


$$Y^*=\hat a+\hat b\, x^*$$

Two types of precision intervals around the point prediction must be distinguished.

  • confidence interval on a + bx*: it answers the question “where should the average response Y for X = x* be found?”.

  • prediction interval at x*: it answers the question “where should one particular response Y for X = x* be found?”.

Intuitively, the average response can be estimated more precisely than one particular value, because the prediction of a particular value must also take the variability of the residuals into account. The confidence interval on a + bx* is therefore always narrower than the prediction interval at x*.

In R, the predictions and the two types of intervals are obtained with the function predict, applied to the fitted model and to a data frame containing the new values of the predictor.

Example

R> new<-data.frame(X=40)
R> predict(fit,new, interval="prediction")
       fit      lwr      upr
1 746.4672 721.8821 771.0523
R> predict(fit,new, interval="confidence")
       fit      lwr      upr
1 746.4672 739.8178 753.1167

Interpretation The predicted score for a per capita income of 40 is about 746.5. As expected, the prediction interval (721.9 to 771.1) is wider than the confidence interval (739.8 to 753.1).

Note that the predictions are accurate only for values of x* close to the observed data: the lengths of the prediction and confidence intervals increase when x* moves away from the scatter plot. Extrapolated predictions should therefore be interpreted cautiously.
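
To visualize how both intervals widen away from the bulk of the data, here is a sketch (assuming the fitted model fit and the variables X and Y of the example):

grid      <- data.frame(X = seq(min(X), max(X), length.out = 50))
conf_band <- predict(fit, grid, interval = "confidence")
pred_band <- predict(fit, grid, interval = "prediction")
plot(X, Y)
matlines(grid$X, conf_band[, c("lwr", "upr")], col = "red",  lty = 2)   # narrower band
matlines(grid$X, pred_band[, c("lwr", "upr")], col = "blue", lty = 3)   # wider band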

Multiple linear regression

Simple linear regression extends to several predictors, say p. The idea is to write the response Y as a linear combination of the predictors.


response = intercept + slope1 * predictor1 + slope2 * predictor2 + ... + slopep * predictorp + e

Say there are two predictors X1 and X2 (p=2). All calculations are made by the function lm. The command is:

lyx <- lm(Y~X1+X2)

The object lyx returned by lm contains much information about the regression. The most useful quantities are displayed by:

summary(lyx)

In particular, summary displays the estimated coefficients with their standard errors and individual tests, the residual standard error, the multiple and adjusted R-squared, and the global F-test.

Back to the example

Run a linear model of the score Y onto the regular spending per pupil RS and the total spending per pupil TS. We start with a simple linear model with only RS, then we fit the model with the two predictors.

R> RS <- mcas[,"regspend"]
R> TS <- mcas[,"totspend"]
R> lm1 <- lm(Y~RS)
R> summary(lm1)

Call:
lm(formula = Y ~ RS)

Residuals:
    Min      1Q  Median      3Q     Max 
-55.672  -7.587   1.523   8.835  29.590 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.952e+02  6.062e+00 114.682   <2e-16 ***
RS          3.047e-03  1.302e-03   2.339   0.0204 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.33 on 193 degrees of freedom
Multiple R-squared:  0.02757,	Adjusted R-squared:  0.02253 
F-statistic: 5.471 on 1 and 193 DF,  p-value: 0.02036

R> lm2 <- lm(Y~RS+TS)
R> summary(lm2)

Call:
lm(formula = Y ~ RS + TS)

Residuals:
    Min      1Q  Median      3Q     Max 
-60.220  -7.786   0.254   8.665  34.007 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 702.178617   6.199061 113.272  < 2e-16 ***
RS            0.018747   0.004580   4.093 6.26e-05 ***
TS           -0.014762   0.004139  -3.567 0.000456 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.88 on 192 degrees of freedom
Multiple R-squared:  0.08799,	Adjusted R-squared:  0.07849 
F-statistic: 9.262 on 2 and 192 DF,  p-value: 0.0001445

Interpretation

The effect of RS alone (in the simple linear model lm1) is very small (3.047e-03), with a p-value of 0.0204. When adjusted for TS, the total spending (in the multiple linear model lm2), the effect of RS is 0.0187, with a very small p-value (6.26e-05). This means that for two schools with the same value of TS, an increase of 1 in RS is associated with an increase of 0.0187 in the score.

It is often of interest to compare two nested models. Suppose an initial model, say lm(Y~X1), is extended with one or more predictors into a new model, say lm(Y~X1+X2). The proportion of explained variance can only increase when new predictors are added: is that increase significant?

The answer is given by an anova that compares the two models.

Comparison of two nested models

Run the smaller model:

lms <- lm(Y~X1)

Run the larger model:

lml <- lm(Y~X1+X2)

Comparison of the two models

anova(lms,lml)

Among other quantities, a p-value, labelled Pr(>F), is returned. If it is above the threshold, the larger model does not bring significant new information compared to the smaller one. If the p-value is small, the larger model explains the response significantly better.

Back to the example

The objective is to select the best model. We start by comparing the two previous models

R> anova(lm(Y~RS),lm(Y~RS+TS))
Analysis of Variance Table

Model 1: Y ~ RS
Model 2: Y ~ RS + TS
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1    193 45339                                  
2    192 42521  1    2817.4 12.721 0.0004564 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Interpretation The p-value of the anova is very small (0.0004564). We reject the smaller model: the model with both RS and TS explains the score Y better than the model with only RS.

Then we compare the full model, with all available predictors, to a model with only the total spending and the income (TS and IN).

R> SS <- mcas[,"specspend"]    # special need spending
R> SP <- mcas[,"special"]      # special education pupils
R> RA <- mcas[,"pupratio"]     # pupils per teacher ratio
R> IN <- mcas[,"income"]       # per capita income
R> AV <- mcas[,"avgsalary"]    # average teacher salary
R> lmfull <- lm(Y~RS+SS+TS+SP+RA+IN+AV)
R> lm3 <- lm(Y~TS+IN)
R> anova(lm3,lmfull)
Analysis of Variance Table

Model 1: Y ~ TS + IN
Model 2: Y ~ RS + SS + TS + SP + RA + IN + AV
  Res.Df   RSS Df Sum of Sq     F   Pr(>F)    
1    192 25947                                
2    187 21712  5    4235.1 7.295 2.88e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Interpretation The p-value of the anova is very small (2.88e-06). We reject the smaller model: the full model explains the score Y better than the model with only TS and IN.

To select the best model, one could consider all the partial models and compare them with anova by hand. Another way is to use variable selection with the function step, which automatically compares a sequence of models.

Back to the example

R> lmfull <- lm(Y~RS+SS+TS+SP+RA+IN+AV)
R> step(lmfull, trace=0)

Call:
lm(formula = Y ~ TS + SP + RA + IN + AV)

Coefficients:
(Intercept)           TS           SP           RA           IN           AV  
 733.541925    -0.006207    -0.468174    -2.168482     1.770759     0.590422  

R> summary(step(lmfull, trace=0))

Call:
lm(formula = Y ~ TS + SP + RA + IN + AV)

Residuals:
    Min      1Q  Median      3Q     Max 
-30.349  -6.255   0.217   6.286  36.364 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 733.541925  12.629220  58.083  < 2e-16 ***
TS           -0.006207   0.001098  -5.656 5.66e-08 ***
SP           -0.468174   0.222641  -2.103   0.0368 *  
RA           -2.168482   0.390027  -5.560 9.11e-08 ***
IN            1.770759   0.177932   9.952  < 2e-16 ***
AV            0.590422   0.337999   1.747   0.0823 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.76 on 189 degrees of freedom
Multiple R-squared:  0.5311,	Adjusted R-squared:  0.5187 
F-statistic: 42.81 on 5 and 189 DF,  p-value: < 2.2e-16

Interpretation The best model includes TS, SP, RA, IN and AV. R-squared is 0.53: 53% of the variability of the score is explained by these predictors.
