Two sample tests
Association of two variables
We focus on testing the association or dependence of two variables observed on a given population. The question is: does the distribution of values of the first variable vary, given the values of the second? Depending on the nature (continuous, discrete, or binary) of the two variables, many tests can be used. We shall present only a few of them.
Continuous against binary
Consider a continuous variable, denoted by Cont (e.g. age), and a binary variable, denoted by Bin (e.g. gender). Descriptive statistics consists in computing the summaries of Cont by values of Bin, then displaying box plots of Cont against Bin:
by(Cont, Bin, summary)
boxplot(Cont~Bin)
If there is an association between Cont and Bin, the two summaries are different; of the two box plots, one is above the other, or one is larger than the other, etc.
Example. In the titanic data set, study the link between the age and the gender.
R> titanic <- read.table("data/titanic.csv",header=TRUE,sep=";")
R> G <- titanic[,"gender"]
R> A <- titanic[,"age"]
R> by(A,G,summary)
G: F
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1667 19.0000 27.0000 28.6871 38.0000 76.0000
------------------------------------------------------------
G: M
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.3333 21.0000 28.0000 30.5852 39.0000 80.0000
R> boxplot(A~G)
[Figure: box plots of age by gender]
The distributions can also be represented through plots of the empirical distribution function (ecdf). Denote by X and Y the two subsamples of Cont, according to the values of Bin (e.g. X: age of women, Y: age of men). The ecdf are obtained with
plot(ecdf(X),col="red")
lines(ecdf(Y),col="blue")
Back to the example
R> Aw <- A[G=="F"]             # ages of women
R> Am <- A[G=="M"]             # ages of men
R> plot(ecdf(Aw),col="red")    # ages of women
R> lines(ecdf(Am),col="blue")  # ages of men
[Figure: empirical distribution functions of age for women (red) and men (blue)]
The summaries and the plots are not sufficient to decide whether the mean ages of men and women are significantly different. To decide if the link between age and gender is significant, three tests can be applied.
In all three cases, the alternative is one of two.sided (the default), less, greater. The interpretation of less and greater is relative to the first variable X:
- two.sided: the mean, variance, or ecdf of X is not equal to that of Y,
- less: the mean, variance, or ecdf of X is less than that of Y,
- greater: the mean, variance, or ecdf of X is greater than that of Y.
Recall that if the ecdf of X is less than that of Y, the values of X tend to be larger than those of Y.
Example (continued). In the titanic data set, test whether the link between the age and the gender is significant.
R> t.test(Aw,Am)

	Welch Two Sample t-test

data:  Aw and Am
t = -2.0497, df = 798.36, p-value = 0.04072
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.71595773 -0.08036699
sample estimates:
mean of x mean of y
 28.68707  30.58523

R> t.test(A~G)

	Welch Two Sample t-test

data:  A by G
t = -2.0497, df = 798.36, p-value = 0.04072
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.71595773 -0.08036699
sample estimates:
mean in group F mean in group M
       28.68707        30.58523

The two instructions are equivalent.
The mean age of women is less than that of men. To test whether the difference is significant, we apply a one-sided test:
R> t.test(A~G, alternative="less")

	Welch Two Sample t-test

data:  A by G
t = -2.0497, df = 798.36, p-value = 0.02036
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
       -Inf -0.3731636
sample estimates:
mean in group F mean in group M
       28.68707        30.58523
We can also compare the two variances and the two ecdf, through Fisher's F test (for variances) and the Kolmogorov-Smirnov test.
R> var.test(Aw,Am)

	F test to compare two variances

data:  Aw and Am
F = 1.0419, num df = 387, denom df = 657, p-value = 0.6442
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.8740042 1.2473141
sample estimates:
ratio of variances
          1.041945

R> ks.test(Aw,Am,alternative="greater")

	Two-sample Kolmogorov-Smirnov test

data:  Aw and Am
D^+ = 0.084636, p-value = 0.03029
alternative hypothesis: the CDF of x lies above that of y
Continuous against discrete
This section considers the case of a continuous variable and a discrete variable (not necessarily binary).
Consider a continuous variable, denoted by Cont (e.g. heart rate), and a discrete variable, denoted by Disc (e.g. type of disease). Here again, the first thing to do is to display summaries of Cont by values of Disc, and box plots of Cont against Disc.
by(Cont,Disc,summary)
boxplot(Cont~Disc)
Example. In the titanic data set, study the link between the age and the class.
R> titanic <- read.table("data/titanic.csv",header=TRUE,sep=";")
R> P <- titanic[,"pclass"]
R> A <- titanic[,"age"]
R> by(A,P,summary)
R> boxplot(A~P)
[Figure: box plots of age by passenger class]
The discrete variable Disc divides the population into as many groups as it has distinct values: if there are 3 types of disease, then there are 3 groups of patients, one per type of disease. If there is no dependence between Cont and Disc, the means of Cont within the different groups should be equal.
If there are only two groups, the variable is binary, and the tests of the previous section apply. In that case, the two-sided T-test will return exactly the same p-value as the one-way anova.
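This equality can be checked directly (a quick sketch using the age A and gender G from the previous section):
t.test(A ~ G)$p.value        # two-sided Welch t-test
oneway.test(A ~ G)$p.value   # Welch one-way anova: same p-value when there are two groups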
If the p-value is small, the conclusion is that there are significant differences between the means of the different groups. Those differences should be visible on the box plot. Differences between the means of two particular groups, or between the mean of one group and the mean in the rest of the population, can be tested using a one-sided T-test.
Without giving mathematical details, we shall now explain the expression "analysis of variance". The global variance of Cont can be written as the sum of two variances:

global variance = variance between groups + variance within groups

1. The variance between groups is the variance of the group means: the smaller the differences between means, the smaller the variance between groups.
2. The variance within groups is the mean of the group variances.

If all group means are equal (null hypothesis of the test), then the variance between groups is null. The higher the ratio of the first variance to the second, the more significant the differences between the means. That ratio is returned by the function oneway.test, together with the corresponding p-value.
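To see the decomposition concretely, it can be recomputed by hand with sums of squares (a minimal sketch using the generic variables Cont and Disc of this section, e.g. A and P in the example below; note that oneway.test divides each term by its degrees of freedom and applies a Welch correction, so its F statistic is not exactly this ratio):
m_g     <- ave(Cont, Disc)              # group mean attached to each observation
within  <- sum((Cont - m_g)^2)          # within-group sum of squares
between <- sum((m_g - mean(Cont))^2)    # between-group sum of squares
total   <- sum((Cont - mean(Cont))^2)   # total sum of squares
all.equal(total, between + within)      # TRUE: the decomposition holds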
Back to the example. In the titanic data set, study the link between the age and the class.
R> oneway.test(A~P)

	One-way analysis of means (not assuming equal variances)

data:  A and P
F = 99.967, num df = 2.00, denom df = 546.72, p-value < 2.2e-16
The next step is to test the difference between two specific classes. This can be done with a T-test.
R> A1 <- A[P==1]  # ages of first class passengers
R> A2 <- A[P==2]  # ages of second class passengers
R> A3 <- A[P==3]  # ages of third class passengers
R> t.test(A1,c(A2,A3),alternative="greater")  # first against others

	Welch Two Sample t-test

data:  A1 and c(A2, A3)
t = 13.01, df = 454.27, p-value < 2.2e-16
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 11.12341      Inf
sample estimates:
mean of x mean of y
 39.15992  26.42290

R> t.test(A1,A2,alternative="greater")  # first against second

	Welch Two Sample t-test

data:  A1 and A2
t = 7.9947, df = 542.78, p-value = 3.932e-15
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 7.663748      Inf
sample estimates:
mean of x mean of y
 39.15992  29.50670

R> t.test(A1,A3,alternative="greater")  # first against third

	Welch Two Sample t-test

data:  A1 and A3
t = 14.129, df = 499.8, p-value < 2.2e-16
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 12.67057      Inf
sample estimates:
mean of x mean of y
 39.15992  24.81637

R> t.test(A2,c(A1,A3))  # second against others

	Welch Two Sample t-test

data:  A2 and c(A1, A3)
t = -0.50225, df = 475.13, p-value = 0.6157
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.450859  1.453016
sample estimates:
mean of x mean of y
 29.50670  30.00563

R> t.test(A2,A3,alternative="greater")  # second against third

	Welch Two Sample t-test

data:  A2 and A3
t = 4.6948, df = 470.7, p-value = 1.753e-06
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 3.043793      Inf
sample estimates:
mean of x mean of y
 29.50670  24.81637
Continuous against continuous
The dependence between two continuous variables X and Y, observed over the same population, will be tested through their correlation. The first thing to do is a scatter plot of the two variables.
plot(X,Y)
If the correlation is positive, the two variables tend to vary in the same direction: if one increases, the other one does too. If the correlation is negative, the variables vary in opposite directions. The closer the correlation is to plus or minus 1, the closer the scatter plot to a straight line, and the stronger the association of the two variables.
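As a reminder, the Pearson correlation is the covariance rescaled by the two standard deviations; it can be computed by hand (a minimal sketch, X and Y being the two continuous variables):
cov(X, Y) / (sd(X) * sd(Y))   # identical to cor(X, Y)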
The correlation test is performed with cor.test. As usual, the alternative is one of:
- two.sided: the correlation is not equal to 0 (X and Y are correlated),
- less: the correlation is less than 0 (X and Y are negatively correlated),
- greater: the correlation is greater than 0 (X and Y are positively correlated).
Example 2. In the tauber data set, study the link between age, height, and weight.
R> TA <- read.table("data/tauber.csv", header=TRUE,sep=";")
R> head(TA)
  gender age height weight
1      F  68    113   20.0
2      M  74    116   17.5
3      M  69    120   23.0
4      M  72    121   25.0
5      M  73    114   17.0
6      M  65    115   20.0

Scatter plots of each pair of variables are displayed with
R> A <- TA[,"age"]
R> H <- TA[,"height"]
R> W <- TA[,"weight"]
R> AHW <- TA[,2:4]
R> pairs(AHW)
[Figure: pairwise scatter plots of age, height, and weight]
The plot is equivalent to
R> plot(AHW)
or
R> plot(A,H) R> plot(A,W) R> plot(H,W)
Their correlations are obtained with
R> cor(AHW)
             age    height    weight
age    1.0000000 0.3671279 0.2214119
height 0.3671279 1.0000000 0.7186317
weight 0.2214119 0.7186317 1.0000000
The correlation between age and height is smaller than that between height and weight, but we cannot directly conclude whether they are significantly different from zero. There is no universal threshold on the correlation scale above which a correlation can be declared different from zero. We need to apply a correlation test to decide.
R> cor.test(A,H,alternative="greater")

	Pearson's product-moment correlation

data:  A and H
t = 21.214, df = 2889, p-value < 2.2e-16
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
 0.3403532 1.0000000
sample estimates:
      cor
0.3671279

R> cor.test(A,W,alternative="greater")

	Pearson's product-moment correlation

data:  A and W
t = 12.204, df = 2889, p-value < 2.2e-16
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
 0.1921155 1.0000000
sample estimates:
      cor
0.2214119

R> cor.test(H,W,alternative="greater")

	Pearson's product-moment correlation

data:  H and W
t = 55.546, df = 2889, p-value < 2.2e-16
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
 0.7035029 1.0000000
sample estimates:
      cor
0.7186317
Discrete against discrete
The conditional distributions of two discrete variables X and Y can be visualized using contingency tables and bar plots. If the two variables are independent (null hypothesis), the conditional frequencies of one variable should not depend on the values of the other. Visually, the bar heights should be equal on the bar plot.
Example. In the titanic data set, let us compute the conditional frequencies of the passenger class given survival:
R> S <- titanic[,"survived"]
R> prop.table(table(S,P),1)
     P
S             1         2         3
  no  0.1663974 0.2358643 0.5977383
  yes 0.4238876 0.2693208 0.3067916
R> barplot(prop.table(table(S,P),1),beside=TRUE)
[Figure: bar plot of the conditional frequencies of class given survival]
The test of independence is the chi-squared test. The parameter (degrees of freedom) of the chi-squared distribution is the number of values of X minus one, times the number of values of Y minus one. For instance, if there are 3 values of X and 4 of Y, the parameter will be df = (3-1)*(4-1) = 6. The function chisq.test automatically calculates df.
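For illustration, the statistic, degrees of freedom, and p-value can be recomputed by hand from a contingency table (a sketch using table(S, P) from the example below; for tables larger than 2x2, chisq.test applies no continuity correction, so the values coincide):
tab <- table(S, P)
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)  # expected counts under independence
X2 <- sum((tab - expected)^2 / expected)                  # chi-squared statistic
df <- (nrow(tab) - 1) * (ncol(tab) - 1)                   # degrees of freedom
pchisq(X2, df, lower.tail = FALSE)                        # p-value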
Example. Let us test the link between class and survival in the titanic data set.
R> chisq.test(table(S,P))

	Pearson's Chi-squared test

data:  table(S, P)
X-squared = 107.5, df = 2, p-value < 2.2e-16
Binary against binary
In the specific case of two binary variables, one can also use a proportion test to decide whether the proportion of the population with a given value for X is the same, over the two values of Y.
Here x1 and x2 denote the counts of the value of interest in each group, and n1 and n2 the group sizes. As usual, the alternative is one of:
- two.sided: the proportion x1/n1 is not equal to the proportion x2/n2,
- less: the proportion x1/n1 is less than the proportion x2/n2,
- greater: the proportion x1/n1 is greater than the proportion x2/n2.
Example. In the titanic data set, we can test the link between survival and gender with a proportion test:
R> SF <- S[G=="F"]
R> SM <- S[G=="M"]
R> x1 <- sum(SF=="yes")
R> n1 <- length(SF)
R> x2 <- sum(SM=="yes")
R> n2 <- length(SM)
R> prop.test(c(x1,x2),c(n1,n2))

	2-sample test for equality of proportions with continuity correction

data:  c(x1, x2) out of c(n1, n2)
X-squared = 300.5, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 0.4924883 0.6023320
sample estimates:
   prop 1    prop 2
0.7525773 0.2051672
We can continue analysing the link between the two variables by testing a one-sided alternative: did women survive more often than men? The women's data (x1, n1) are given first in the call, so the alternative to test is greater, i.e. H1: x1/n1 > x2/n2:
R> prop.test(c(x1,x2),c(n1,n2), alternative="greater")

	2-sample test for equality of proportions with continuity correction

data:  c(x1, x2) out of c(n1, n2)
X-squared = 300.5, df = 1, p-value < 2.2e-16
alternative hypothesis: greater
95 percent confidence interval:
 0.5009889 1.0000000
sample estimates:
   prop 1    prop 2
0.7525773 0.2051672
Another possibility exists for two binary variables, which is often preferable: Fisher's exact test. Explaining it is the occasion to present the odds-ratio, which is widely used in medical applications.
In the standard presentation, the two binary variables are understood as “exposure” (e.g. smoking or not) and “disease” (e.g. lung cancer or not). The contingency table contains the frequencies of 4 categories of patients:
|           | Diseased | Healthy |
|-----------|----------|---------|
| exposed   | a        | b       |
| unexposed | c        | d       |

The odds-ratio is defined as (a×d)/(b×c): it compares the odds of disease among the exposed (a/b) to the odds of disease among the unexposed (c/d).
The function fisher.test returns the odds-ratio, a confidence interval, and the p-value.
Example. We can apply Fisher's test to the titanic data set:
R> table(G,S)
   S
G    no yes
  F  96 292
  M 523 135
R> fisher.test(table(G,S))

	Fisher's Exact Test for Count Data

data:  table(G, S)
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.06225653 0.11556419
sample estimates:
odds ratio
0.08513036
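The sample odds-ratio can also be read directly from the 2x2 table as (a*d)/(b*c) (a sketch; fisher.test reports a conditional maximum-likelihood estimate, so its value is close to, but not exactly equal to, this ratio):
tab <- table(G, S)
(tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])   # (96*135)/(292*523), about 0.085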
The odds-ratio (0.085) is less than 1. We can now test whether being a woman protected from death. The alternative is H1: the odds-ratio is less than 1:
R> fisher.test(table(S,G), alternative="less")

	Fisher's Exact Test for Count Data

data:  table(S, G)
p-value < 2.2e-16
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
 0.0000000 0.1103079
sample estimates:
odds ratio
0.08513036
Linear regression
The scope of this part is slightly different from the previous part, in the sense that the notion of association is not understood as symmetric anymore. One variable, called response, is considered as dependent, explained, or predicted by the others, which are considered as explanatory variables or predictors.
There exist several types of regression models, depending on the type of the response variable. We consider here the case of two continuous variables X and Y, observed over the same population.

Simple linear regression
We now give the definition of the simple linear model: Y = a + bX + e, where a is the intercept, b the slope, and e the residual. The graph of a linear function x ↦ a + bx is a straight line. If Y were exactly equal to a + bX, all points would be aligned on the scatter plot, and the correlation would be equal to +1 or −1. This never happens in practice. The distance between Y and the straight line is the value of the residual.
Estimation of the linear model
The linear regression consists in finding the best straight line: the one for which the residuals are as small as possible, obtained by minimizing the Mean Squared Error (MSE), i.e. the mean of the squared residuals.
Example. The data set mcas.csv is an excerpt from the Massachusetts Test Score Data Set, collected through the Massachusetts Comprehensive Assessment System (MCAS), Massachusetts Department of Education, 1990 U.S. Census. Data from 195 schools are presented. The variables are:
- regspend: spending per pupil, regular
- specspend: spending per pupil, special needs
- totspend: spending per pupil, total
- special: special education pupils
- income: per capita income
- pupratio: pupils per teacher
- avgsalary: average teacher salary
- score4: 4th grade score (math + english + science)
Before computing the regression line of the score onto the income, the link between the two variables is represented with a scatter plot.
R> mcas <- read.table("data/mcas.csv",header=TRUE,sep=";")
R> Y <- mcas[,"score4"]
R> X <- mcas[,"income"]
R> plot(X,Y)
[Figure: scatter plot of 4th grade score against per capita income]
We now compute the regression line.
R> slope <- cov(X,Y)/var(X)
R> intercept <- mean(Y)-slope*mean(X)
R> plot(X,Y)
R> abline(a=intercept,b=slope, col="red")
[Figure: scatter plot with the fitted regression line in red]
Note that the intercept and the slope are estimates; as such, they have a certain precision, measured by standard errors. Confidence intervals can then be deduced.
Coefficient of determination
The quality of the regression is assessed by the coefficient of determination, named R-squared.
The intuition behind R-squared is as follows. Suppose you want to predict Y. If you do not know the X's, then the best prediction is $\bar{Y}$, the empirical mean; the variability of this prediction is the total variation. If instead the model is used for the prediction, the prediction error is reduced to the residual variation. R-squared is the proportion of the total variation explained by the model: R-squared = 1 − (residual variation)/(total variation). Note that as more variables are added to the model, R-squared increases mechanically. The adjusted coefficient of determination aims to correct this deficiency.
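As an illustration, R-squared can be computed by hand for the regression of the score on the income (a minimal sketch using Y and X from the mcas example; the same model is fitted with lm further below):
fit <- lm(Y ~ X)               # same model as in the example below
rss <- sum(residuals(fit)^2)   # residual variation
tss <- sum((Y - mean(Y))^2)    # total variation
1 - rss / tss                  # matches the Multiple R-squared of summary(fit)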
Test of the slope
The significance test of the regression is the two-sided test of slope = 0:
H0: b = 0   against   H1: b ≠ 0.
The statistic is based on the estimator of the slope $\hat b$ and its variance. The test returns the same p-value as the two-sided correlation test cor.test(X,Y).
Function lm in R
All calculations are done by the command lm, for linear model. It requires a model formula that describes the link between the response variable Y and the explanatory variable X.
The tilde symbol can be read as "is modeled as a function of".
After a model has been fitted using lm, a number of generic functions exist (e.g. summary, coef, residuals, fitted, predict).
More details are given by:
summary(lm(Y~X))
Back to the example
R> fit <- lm(Y~X)
R> fit

Call:
lm(formula = Y ~ X)

Coefficients:
(Intercept)            X
    677.102        1.734

R> summary(fit)

Call:
lm(formula = Y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max
-40.121  -6.656   0.959   7.476  30.793

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 677.1016     2.9348  230.71   <2e-16 ***
X             1.7341     0.1517   11.44   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12 on 193 degrees of freedom
Multiple R-squared:  0.4039,	Adjusted R-squared:  0.4008
F-statistic: 130.8 on 1 and 193 DF,  p-value: < 2.2e-16
Example. The command
R> coef(fit)
(Intercept)           X
 677.101607    1.734141
produces a vector with the two estimated values, 677.1016 and 1.7341, of the intercept and the slope.
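The standard errors mentioned earlier translate into confidence intervals for the intercept and the slope, which can be obtained with confint (a quick sketch on the fitted object fit):
confint(fit)                 # 95 percent confidence intervals for intercept and slope
confint(fit, level = 0.99)   # any other confidence level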
Model diagnostics
The analysis of the residuals is a very important step, and a part of what is called model validation.
For a good model, the residuals must be approximately independent and identically distributed. If any systematic variation is detected, this indicates a lack of fit.
The typical plots are:
- histogram of the residuals: should be symmetric around 0;
- scatter plot of residuals against the variable X: should not show any structure, symmetric around the x-axis;
- scatter plot of residuals against fitted values: should not show any structure, symmetric around the x-axis;
- scatter plot of fitted values against the response variable Y: should follow the straight line y = x.
R> hist(rstandard(fit))
[Figure: histogram of standardized residuals]
R> plot(X,residuals(fit))
R> abline(h=0)
[Figure: residuals against X]
R> plot(fitted(fit),residuals(fit))
R> abline(h=0)
[Figure: residuals against fitted values]
R> plot(Y, fitted(fit), xlim=c(650, max(Y)), ylim=c(650,max(Y)))
R> abline(0,1)
[Figure: fitted values against observed values, with the line y = x]
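Note that R also provides built-in diagnostic plots for lm objects, which cover most of the checks above in a single call (a convenient complement to the manual plots):
par(mfrow = c(2, 2))   # arrange the four diagnostic plots on one page
plot(fit)              # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))   # reset the layout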
Prediction
The main goal of a linear regression is prediction. The idea is to guess which value of Y would correspond to a certain unobserved value of X, denoted by x* (xstar).
Two types of precision intervals around the point prediction must be distinguished.
Intuitively, the average a + bx* is more precise than the prediction of one particular value, because its uncertainty does not include the variability of individual residuals. The confidence interval on a + bx* is therefore always narrower than the prediction interval at x*.
In R:
- predict(fit) produces the predicted values; with the arguments se.fit and interval, it also returns the standard errors of the predictions and confidence or prediction intervals;
- predict(fit,new) produces predicted values for X = xstar, where new=data.frame(X=xstar).
Example
R> new <- data.frame(X=40)
R> predict(fit,new, interval="prediction")
       fit      lwr      upr
1 746.4672 721.8821 771.0523
R> predict(fit,new, interval="confidence")
       fit      lwr      upr
1 746.4672 739.8178 753.1167
Note that the predictions are accurate only for values close to the observed range of X. The lengths of the prediction and confidence intervals increase as x* moves away from the data. Extrapolations should therefore be interpreted cautiously.
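The widening of the intervals away from the data can be visualised by predicting over a grid of values of X (a sketch based on the fit above; the prediction band is always wider than the confidence band):
grid <- data.frame(X = seq(min(X), max(X), length.out = 100))
ci <- predict(fit, grid, interval = "confidence")
pi <- predict(fit, grid, interval = "prediction")
plot(X, Y)
lines(grid$X, ci[, "fit"], col = "red")            # regression line
lines(grid$X, ci[, "lwr"], col = "red", lty = 2)   # confidence band
lines(grid$X, ci[, "upr"], col = "red", lty = 2)
lines(grid$X, pi[, "lwr"], col = "blue", lty = 3)  # prediction band (wider)
lines(grid$X, pi[, "upr"], col = "blue", lty = 3)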
Multiple linear regression
Simple linear regression extends to several predictors, say p. The idea is to write the response Y as a linear combination of the predictors:
response = intercept + slope1 * predictor1 + slope2 * predictor2 + ... + slopep * predictorp + e
Say there are two predictors X1 and X2 (p=2). All calculations are made by the function lm. The command is:
lyx <- lm(Y~X1+X2)
The object lyx returned by lm contains a lot of information about the regression. The most useful quantities are displayed by:
summary(lyx)
In particular (see also the sketch after this list):
- Residuals: a summary of the residuals (the residuals themselves are in lyx$residuals).
- Coefficients: in that matrix, the first column, named Estimate, contains the estimates of the intercept and the slopes; the column Pr(>|t|) gives the p-values of a two-sided significance test for each coefficient. The smaller the p-value, the more the corresponding predictor contributes to the response Y.
- Multiple R-squared: the proportion of the response variance which is explained by the model.
- Adjusted R-squared: the same, adjusted for the number of predictors (the proportion of explained variance necessarily increases when new predictors are added, even if they bring no information on the response).
- p-value: measures the overall significance of the model.
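The same quantities can also be extracted programmatically from the fitted object (a sketch, with lyx as defined above):
coef(lyx)                    # estimated intercept and slopes
summary(lyx)$coefficients    # estimates, standard errors, t values, p-values
summary(lyx)$r.squared       # multiple R-squared
summary(lyx)$adj.r.squared   # adjusted R-squared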
Back to the example
Run a linear model of the score Y onto the regular spending per pupil RS and the total spending per pupil TS. We start with a simple linear model with only RS, then fit the model with the two predictors.
R> RS <- mcas[,"regspend"]
R> TS <- mcas[,"totspend"]
R> lm1 <- lm(Y~RS)
R> summary(lm1)

Call:
lm(formula = Y ~ RS)

Residuals:
    Min      1Q  Median      3Q     Max
-55.672  -7.587   1.523   8.835  29.590

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.952e+02  6.062e+00 114.682   <2e-16 ***
RS          3.047e-03  1.302e-03   2.339   0.0204 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.33 on 193 degrees of freedom
Multiple R-squared:  0.02757,	Adjusted R-squared:  0.02253
F-statistic: 5.471 on 1 and 193 DF,  p-value: 0.02036

R> lm2 <- lm(Y~RS+TS)
R> summary(lm2)

Call:
lm(formula = Y ~ RS + TS)

Residuals:
    Min      1Q  Median      3Q     Max
-60.220  -7.786   0.254   8.665  34.007

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 702.178617   6.199061 113.272  < 2e-16 ***
RS            0.018747   0.004580   4.093 6.26e-05 ***
TS           -0.014762   0.004139  -3.567 0.000456 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.88 on 192 degrees of freedom
Multiple R-squared:  0.08799,	Adjusted R-squared:  0.07849
F-statistic: 9.262 on 2 and 192 DF,  p-value: 0.0001445
It is often of interest to compare two nested models. Suppose some initial model, say lm(Y~X1), is completed by one or more predictors into a new model, say lm(Y~X1+X2). The proportion of explained variance can only increase when new predictors are added: is that increase significant?
The answer is given by an anova that compares the two models.
Among other quantities, a p-value, labelled Pr(>F), is returned. If it is above the threshold, the larger model does not bring any new information compared to the smaller one. If the p-value is small, the larger model is a better explanation of the response.
Back to the example
The objective is to select the best model. We start by comparing the two previous models:
R> anova(lm(Y~RS),lm(Y~RS+TS))
Analysis of Variance Table

Model 1: Y ~ RS
Model 2: Y ~ RS + TS
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)
1    193 45339
2    192 42521  1    2817.4 12.721 0.0004564 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Then we compare the full model, with all available predictors, to a model with only totspend and income.
R> SS <- mcas[,"specspend"]  # special need spending
R> SP <- mcas[,"special"]    # special education pupils
R> RA <- mcas[,"pupratio"]   # pupils per teacher ratio
R> IN <- mcas[,"income"]     # per capita income
R> AV <- mcas[,"avgsalary"]  # average teacher salary
R> lmfull <- lm(Y~RS+SS+TS+SP+RA+IN+AV)
R> lm3 <- lm(Y~TS+IN)
R> anova(lm3,lmfull)
Analysis of Variance Table

Model 1: Y ~ TS + IN
Model 2: Y ~ RS + SS + TS + SP + RA + IN + AV
  Res.Df   RSS Df Sum of Sq      F   Pr(>F)
1    192 25947
2    187 21712  5    4235.1  7.295 2.88e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
To select the best model, one would have to consider all the partial models and compare them with anova by hand. Another way is to use variable selection with the function step, which automatically compares a sequence of models.
Back to the example
R> lmfull <- lm(Y~RS+SS+TS+SP+RA+IN+AV)
R> step(lmfull, trace=0)

Call:
lm(formula = Y ~ TS + SP + RA + IN + AV)

Coefficients:
(Intercept)           TS           SP           RA           IN           AV
 733.541925    -0.006207    -0.468174    -2.168482     1.770759     0.590422

R> summary(step(lmfull, trace=0))

Call:
lm(formula = Y ~ TS + SP + RA + IN + AV)

Residuals:
    Min      1Q  Median      3Q     Max
-30.349  -6.255   0.217   6.286  36.364

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 733.541925  12.629220  58.083  < 2e-16 ***
TS           -0.006207   0.001098  -5.656 5.66e-08 ***
SP           -0.468174   0.222641  -2.103   0.0368 *
RA           -2.168482   0.390027  -5.560 9.11e-08 ***
IN            1.770759   0.177932   9.952  < 2e-16 ***
AV            0.590422   0.337999   1.747   0.0823 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.76 on 189 degrees of freedom
Multiple R-squared:  0.5311,	Adjusted R-squared:  0.5187
F-statistic: 42.81 on 5 and 189 DF,  p-value: < 2.2e-16