Regression models

Linear regression
Logistic regression
Survival analysis

The scope of this chapter is slightly different from the chapter Two sample tests, in the sense that the notion of association is not understood as symmetric anymore. One variable, called response, is considered as dependent, explained, or predicted by the others, which are considered as explanatory variables or predictors.

A regression model is a function of predictors, as close as possible to the response.

There exist several types of regression models, depending on the type of the response variable:

Continuous response: linear model, that can be split into two sub-types: linear regression model (continuous predictors) and ANOVA (discrete predictors)
Binary response: logistic regression model
Ordinal discrete response: polytomic regression model
Continous but censored response: survival model

We begin with linear regression.

Linear regression

We consider here the case of two continuous variables X and Y, observed over the same population.

X is the explanatory variable, or predictor
Y is the response

Definition of the simple linear model

We now give the definition of a simple linear model.

The simple linear regression of Y onto X consists in writing Y as a linear function of X:
Y = a + bX + e

a is the intercept: it is the value of the function at the origin x = 0, or the height of the intersection with the y-axis.
b is the slope: it measures by how much the straight line goes up or down when x increases by 1.
e is the residual: it is the difference between the response Y and the values of the model a + bX.

The graph of a linear function x ↦ a + bx is a straight line. If Y was exactly equal to a + bX, all points would be aligned on the scatterplot, and the correlation would be equal to +1 or −1. This never happens in practice. The distance between Y and the straight line is the value of the residual.

Estimation of the linear model

The linear regression model consists in finding the best straight line. The best model is the model such that the residuals are made as small as possible, by minimizing the Mean Squared Error MSE, which is the mean of squares of residuals.

The line regression line of Y onto X has the optimal slope and intercept given by
$$\hat b = cov(X,Y)/var(X)$$

$$\hat a = \bar{Y} - \hat b * \bar{X}$$

This is calculated in R by:

slope <- cov(X,Y)/var(X)

intercept <- mean(Y)-slope*mean(X)

Example Data set mcas.csv is an excerpt from the Massaschusets Test Score Data Set, by the Massachusetts Comprehensive Assessment System (MCAS), Massachusets Department of Education, 1990 U.S. Census. Data from 195 schools are presented. The variables are:

regspend: spending per pupil, regular
specspend: spending per pupil, special needs totspend: spending per pupil, total special: special education pupils
income: per capita income
studratio: pupils per teacher
avgsalary: average teacher salary
score4: 4th grade score (math+english+science)

Before computing the regression line of the score onto the income, the link between the two variables is represented with a scatter plot.

Code R :

R> mcas <- read.table("data/mcas.csv",header=TRUE,sep=";")
R> Y <- mcas[,"score4"]
R> X <- mcas[,"income"]
R> plot(X,Y)

Résultat :

Interpretation The pattern of dots follows a trend. It indicates a positive correlation between the score and the income.

We now compute the regression line. Code R :

R> slope  <- cov(X,Y)/var(X)
R> intercept <- mean(Y)-slope*mean(X)
R> plot(X,Y)
R> abline(a=intercept,b=slope, col="red")

Résultat :

Graphically, the regression line passes through the scatter plot, and contains the point with coordinates (mean(X),mean(Y)), which is the center of gravity of the scatter plot.

Interpretation The intercept is 677.102 and the slope is 1.734. It means that

without any income per capita, the baseline score is 677.102,
when the per capita income increases by 1, the score increases by 1.734.

Note that the intercept and the slope are estimates; as such, they have a certain precision, measured by standard errors. Confidence intervals can then be deduced.

Coefficient of determination

The quality of the regression is assessed by the coefficient of determination, named R-squared.

The coefficient of determination R-squared, defined as ratio of the variance of the model to the variance of the response, is an indicator of goodness of fit of the linear model.

The intuition of R − squared is as follows. Suppose you wan to predict Y. If you do not know the X's, then the best prediction is $\bar{Y}$, the empirical mean. The variability corresponding to this prediction is expressed by the total variation. However, if the model is utilized for the prediction, then the prediction error is reduced to the residual variation.

Interpretation It is a number between 0 and 1, understood as the proportion of the variance explained by the model: the closer to 1, the better the explanation.

Note that as more variables are added to the model, R − squared will increase. The adjusted coefficient of determination aims to correct this deficiency.

Test of the slope

The significance test of the regression is the two-sided test of slope=0.
H₀ : b = 0 H₁ : b ≠ 0.

The statistic is based on the estimator of the slope $\hat b$ and its variance. The test returns the same p-value as the two-sided correlation test cor.test(X,Y).

Interpretation Accepting the null hypothesis slope=0, means declaring that there is no dependence of Y on X, or else, the linear regression is not meaningful. Rejecting slope=0, means declaring that the regression is meaningful.

Function `lm` in `R`

All calculations are done by the command lm for linear model. It requires a model formula, that describes the link betwwen the response variable Y and the explanatory variable X.

A linear model of Y onto X in R is fitted with

lm(Y~X)

The tilde symbol can be read as "is modeled as a function of".

After a model has been fitted using lm, a number of generic functions exists.

More details are given by:

summary(lm(Y~X))

Back to the example

R> fit<-lm(Y~X)
R> fit

Call:
lm(formula = Y ~ X)

Coefficients:
(Intercept)            X  
    677.102        1.734  

R> summary(fit)

Call:
lm(formula = Y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.121  -6.656   0.959   7.476  30.793 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 677.1016     2.9348  230.71   <2e-16 ***
X             1.7341     0.1517   11.44   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12 on 193 degrees of freedom
Multiple R-squared:  0.4039,    Adjusted R-squared:  0.4008 
F-statistic: 130.8 on 1 and 193 DF,  p-value: < 2.2e-16

Interpretation The summary recalls

the model lm(formula = Y ~ X)
the main characteristics of the residuals: they should be centered, symmetric
the estimates of the coefficients Intercept and slope (on the row X) (677.1016 and 1.7341 in this example), their standard errors (measuring their precision) (2.9348 and 0.1517) and their associated tests (<2e-16: very small p-values in this example, the two hypotheses H₀ are rejected);
the Residual standard error: the standard deviation of the residuals (12 in this example)
the Multiple and Adjusted R-squared: the highest their are, the best is the model (in this example, R-squared is 0.4039, only 40% of the variability of Y is explained by X).
the F-statistic and its p-value that tests the nullity of the slope (same test than the test of the slope for simple linear model)(p-value is < 2.2e-16).

plot(fit): produces diagnostic plots for model checking (see below)
coef(fit): extracts model coefficients from the estimated model
fitted(fit): extracts fitted values, predicted by the model for the values of the explanatory variables of the data set

Example

R> coef(fit)
(Intercept)           X 
 677.101607    1.734141

produces a vector with the two estimated values677.1016 and 1.7341 of slope and intercept.

Model diagnostics

The analysis of the residuals is a very important step, and a part of what is called model validation.

The residuals denote the difference between the observed value y_i for subject i and the fitted value $X_i \times \hat b$:

residuals(fit)

The standardized residuals are the rescaled residuals: the residuals divided by their estimated standard deviation:

rstandard(fit)

For a good model, the residuals must be approximatively independent and identically distributed. If any systematic variation is detected, this indicates a lack-of-fit.

The typical plots are

histogram of the residuals: should be symmetric around 0.
scatter plot of residuals against variables X: should not have any structure, symmetric around the x-axis.
scatter plot of residuals against fitted values: should not have any structure, symmetric around the x-axis.
scatter plot of fitted values against response variable Y: should follow the straight line y = x

Example Code R :

R> hist(rstandard(fit))

Résultat :

Code R :

R> plot(X,residuals(fit))
R> abline(h=0)

Résultat :

Code R :

R> plot(fitted(fit),residuals(fit))
R> abline(h=0)

Résultat :

Code R :

R> plot(Y, fitted(fit), xlim=c(650, max(Y)), ylim=c(650,max(Y)))
R> abline(0,1)

Résultat :

Interpretation The histogram is symmetric. The plots of the residuals onto X and onto the fitted values show no structure. But the plot of Y onto the fitted values is not aligned, the linear model is not very good.

Prediction

The main goal of a linear regression is prediction. The idea is to guess which value of Y would correspond to a certain unobserved value of X, denoted by x^* (xstar).

The prediction of the model for a value x^* is

$$Y^*=\hat a+\hat b\, x^*$$

Two types of precision intervals around the point prediction must be distinguished.

confidence interval on a + bx^*: it answers the question ``where should the average response Y for X = x^* be found?''.
prediction interval at x^*: it answers the question ``where should one particular response Y for X = x^* be found?''.

Intuitively, the average is more precise than the prediction of one particular value, because the average does not take the variability of residuals into account. The confidence interval on a + bx^* is always narrower than the prediction interval at x^*.

In R,

predict(fit) produces predicted values, standard errors of the predictions and confidence and prediction intervals
predict(fit,new) produces predicted values for X = xstar (new=data.frame(X=xstar)).

Example

R> new<-data.frame(X=40)
R> predict(fit,new, interval="prediction")
       fit      lwr      upr
1 746.4672 721.8821 771.0523
R> predict(fit,new, interval="confidence")
       fit      lwr      upr
1 746.4672 739.8178 753.1167

Interpretation The prediction score for an income of 40 is 746.

Note that the predictions are accurate for values close to the scatter plot. The length of the prediction and confidence intervals increases when x^* is far from the scatter plot. Predictions should therefore be interpreted cautiously.

Multiple linear regression

Simple linear regression extends to several predictors, say p. The idea is to write the response Y as a linear combination of the predictors.

response = intercept + slope1 * predictor1 + slope2 * predictor2 + ...slopep * predictorp + e

Say there are two predictors X1 and X2 (p=2). All calculations are made by the function lm. The command is:

lyx <- lm(Y~X1+X2)

The object lyx returned by lm contains many informations about the regression. The most useful quantities are displayed by:

summary(lyx)

In particular:

Residuals: a summary of the residuals (the residuals themselves are in lyx$residuals).
Coefficients: in that matrix, the first column named Estimates contains the estimates of intercept and slopes;
Pr(>|t|): gives the p-values of a two-sided significance test for each coefficient. The smaller the p-value, the more the corresponding predictor contributes to the response Y.
Multiple R-squared: the proportion of the response variance which is explained by the model.
Adjusted R-squared: the same, ajusted for the number of predictors (the proportion of explained variance must increase if new predictors are added, even if they do not bring any information on the response).
p-value: measures the overall significance of the model.

Back to the example

Run a linear model of the score Y onto the regular spending per pupil RS and the total spending per pupil TS. We start by a simple linear model with only RS, then we the two predictors.

R> RS <- mcas[,"regspend"]
R> TS <- mcas[,"totspend"]
R> lm1 <- lm(Y~RS)
R> summary(lm1)

Call:
lm(formula = Y ~ RS)

Residuals:
    Min      1Q  Median      3Q     Max 
-55.672  -7.587   1.523   8.835  29.590 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.952e+02  6.062e+00 114.682   <2e-16 ***
RS          3.047e-03  1.302e-03   2.339   0.0204 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.33 on 193 degrees of freedom
Multiple R-squared:  0.02757,   Adjusted R-squared:  0.02253 
F-statistic: 5.471 on 1 and 193 DF,  p-value: 0.02036

R> lm2 <- lm(Y~RS+TS)
R> summary(lm2)

Call:
lm(formula = Y ~ RS + TS)

Residuals:
    Min      1Q  Median      3Q     Max 
-60.220  -7.786   0.254   8.665  34.007 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 702.178617   6.199061 113.272  < 2e-16 ***
RS            0.018747   0.004580   4.093 6.26e-05 ***
TS           -0.014762   0.004139  -3.567 0.000456 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.88 on 192 degrees of freedom
Multiple R-squared:  0.08799,   Adjusted R-squared:  0.07849 
F-statistic: 9.262 on 2 and 192 DF,  p-value: 0.0001445

Interpretation

The effect of RS alone (in the simple linear model lm1) is very small (3.047e-03) with a p-value of 0.0204. When adjusted to TS the total spending (in the multiple linear model lm2), the effect of RS is 0.0187 with a very small p-value (6e-05). This means that for two schools with the same value of TS, an increase of 1 in RS has an effect of 0.0187 on the score.

It is often of interest to compare two nested models. Suppose some initial model, say lm(Y~X1), is completed by one or more predictors into a new model, say lm(Y~X1+X2). The proportion of explained variance can only increase, when new predictors are added: is that increase significant?

The answer is given by an anova that compares the two models.

Comparison of two nested models

Run the smaller model:

lms <- lm(Y~X1)

Run the larger model:

lml <- lm(Y~X1+X2)

Comparison of the two models

anova(lms,lml)

Among other quantities, a p-value, labelled Pr(>F), is returned. If it is above the threshold, the larger model does not bring any new information compared to the smaller. If the p-value is small, the larger model is a better explanation of the response.

Back to the example

The objective is to select the best model. We start by comparing the two previous models

R> anova(lm(Y~RS),lm(Y~RS+TS))
Analysis of Variance Table

Model 1: Y ~ RS
Model 2: Y ~ RS + TS
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1    193 45339                                  
2    192 42521  1    2817.4 12.721 0.0004564 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Interpretation The p-value of the anova is very small (0.0004564). We reject the smallest model. The model with both RS and TS explains better the score Y than the model with only RS.

Then we compare the full model with all available predictors with a model with only income, totspend.

R> SS <- mcas[,"specspend"]    # special need spending
R> SP <- mcas[,"special"]      # special education pupils
R> RA <- mcas[,"pupratio"]     # pupils per teacher ratio
R> IN <- mcas[,"income"]       # per capita income
R> AV <- mcas[,"avgsalary"]    # average teacher salary
R> lmfull <- lm(Y~RS+SS+TS+SP+RA+IN+AV)
R> lm3 <- lm(Y~TS+IN)
R> anova(lm3,lmfull)
Analysis of Variance Table

Model 1: Y ~ TS + IN
Model 2: Y ~ RS + SS + TS + SP + RA + IN + AV
  Res.Df   RSS Df Sum of Sq     F   Pr(>F)    
1    192 25947                                
2    187 21712  5    4235.1 7.295 2.88e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Interpretation The p-value of the anova is very small (2.88e-06). We reject the smallest model. The full model explains better the score Y than the model with only TS and IN.

To select the best model, one would have to consider all the partial models and compare them with anova by hand. Another way is to use selection variable using function step that compares automatically a certain sequence of models.

Back to the example

R> lmfull <- lm(Y~RS+SS+TS+SP+RA+IN+AV)
R> step(lmfull, trace=0)

Call:
lm(formula = Y ~ TS + SP + RA + IN + AV)

Coefficients:
(Intercept)           TS           SP           RA           IN           AV  
 733.541925    -0.006207    -0.468174    -2.168482     1.770759     0.590422  

R> summary(step(lmfull, trace=0))

Call:
lm(formula = Y ~ TS + SP + RA + IN + AV)

Residuals:
    Min      1Q  Median      3Q     Max 
-30.349  -6.255   0.217   6.286  36.364 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 733.541925  12.629220  58.083  < 2e-16 ***
TS           -0.006207   0.001098  -5.656 5.66e-08 ***
SP           -0.468174   0.222641  -2.103   0.0368 *  
RA           -2.168482   0.390027  -5.560 9.11e-08 ***
IN            1.770759   0.177932   9.952  < 2e-16 ***
AV            0.590422   0.337999   1.747   0.0823 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.76 on 189 degrees of freedom
Multiple R-squared:  0.5311,    Adjusted R-squared:  0.5187 
F-statistic: 42.81 on 5 and 189 DF,  p-value: < 2.2e-16

Interpretation The best model includes TS, SP, RA, IN and AV. R-squared is 0.53: 53% of the variability of the score is explained by these predictors.

Regression in high dimension

When predictors are levels of expression for a set of genes, one wants to include them in a regression model and to select the most influent ones. When the variable to predict Y is of length n, at most n − 1 coefficients slope can be estimated. Moreover, the solution is not unique and it overfits the data as well.

If the number of predictors, say p, is larger than n, the function lm computes the n − 1 first coefficients slope and sets the other at NA.

Example We use the two data sets LenzT and LenzI. We study the link between the level of expression of genes and the age. Calculation of all the coefficients is time-consuming when p is large. To reduce this time, we subsample 1000 genes and apply a linear regression model of age onto the genes.

Note that contrary to the chapter "multiple testing", Y is now LenzI and X is LenzT (table with the levels of expression).

R> X<-readRDS("data/LenzT.rds")
R> Y<-readRDS("data/LenzI.rds")
R> names(Y)
 [1] "gender" "age"    "diagno" "status" "follup" "regim"  "ecog"   "stage" 
 [9] "ldhrat" "extnod"
R> dim(X)
[1]   414 17290
R> rg <- sample(colnames(X), 1000)
R> X <- X[,rg]
R> Ya<-Y$age
R> res<-lm(Ya~X)

Display a summary of the coefficients:

R> summary(res$coefficients)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
-310.111  -12.614   -0.384   15.562   11.324 7085.253      587

Interpretation Only n − 1 = 413 coefficients have been estimated (corresponding arbitrarily to the first columns of X and p − (n − 1)=1000 − 413 = 587 have been set to NA.

This makes interpretation of the final model impossible.

One alternative is to assume that the model is sparse.

Sparsity assumption

Consider a linear regression model with n observations for the response Y and p predictors X.

A model is sparse if only k<n parameters are nonzero in the true linear model.

This is a reasonable assumption. Indeed, the amount of information to estimate a parameter in a regression model is n/p. If p is much larger than n and if the true model is not sparse, then the amount of information is too small to estimate accurately the parameters. On the other hand, if k<n, we can estimate accurately these k parameters.

The difficulty is that we don't know in advance which k of the p predictors have no influence and should be set to zero. Even without this information, it turns out that we can still estimate the sparse model reasonably well.

The approach that allows to estimate a sparse model is called regularization approach or penalised approach.

Recall that in a non sparse model, the parameters are estimated by minimizing only the Mean Squared Error:

min (||y − intercept − slope1 * predictor1 − ...slopep * predictorp||²)

In a sparse model, we penalise the criteria to force some coefficients to be zero. Several penalties or regularisations have been introduced. We focus on what is called the lasso penalty.

Lasso penalised approach

Consider a linear regression model with n observations for the response Y and p predictors X.

A lasso estimation is obtained by minimizing

min (||y_i − intercept − slope1 * predictor1 − ...slopep * predictorp||² + λ(|slope1|+…+|slopep|)

where λ is a regularization parameter.

The lasso penalty is the additional term λ(|slope1|+…+|slopep|) in the criteria. This criteria is small when a lot of coefficients slope are set to 0. Minimizing this criteria thus induces sparsity.

The degree of sparsity depends on the value of λ. When λ is large, a lot of coefficients are set to zero. When λ is small, a lot of coefficients are non-zero.

Lasso regression is implemented in the function glmnet from the R package glmnet.

fit<-glmnet(X,Y)

We recommend not to extract the information directly from fit.

The coefficients and the degree of sparsity when λ varies can be vizualized by executing the plot function.

Back to the example First install the glmnet package and fit the model with the first 500 genes:

R> library(glmnet)
R> require(glmnet)
R> X<-readRDS("data/LenzT.rds")
R> X <- X[,1:500]
R> fit_lasso<-glmnet(X, Ya)

Then plot the fit: Code R :

R> plot(fit_lasso, xvar="lambda")

Résultat :

Interpretation

Each curve corresponds to a predictor. It shows the estimated values of slope1,..., slopep when log(λ) increases.

This plot is often called the regularisation path.

The more the curves (predictors) go to 0 on the right of the plot, the more influent they are on the response variable Y.

To annotate the curves, we add label=TRUE in the plot command: Code R :

R> plot(fit_lasso, label=TRUE, xvar="lambda")

Résultat :

On the left, we obtain the coordinate of each curve (more easy to read directly in RStudio). In this example, almost all predictors (genes) are set to 0 "late". It means that a large set of genes are highly correlated to age.

The next question in penalisation approach is to choose the "best" value for λ and then to extract, for this specific value of λ the genes that have a nonzero coefficient. This can be done by cross-validation, using all the information from the data.

Cross-validation

The regularisation parameter λ in the penalisation approach can be selected automatically by cross-validation.

The function cv.glmnet computes this best λ:

cvfit = cv.glmnet(X, Y)

The nonzero coefficients can be displayed using function print_glmnet_coefs below.

Cross-validation's idea is the following. The sample is randomly partitioned into nfolds equal sized subsamples. Of the nfolds subsamples, one is considered as a validation subsample and the estimation procedure is applied on the rest of the data, and the quality of the estimation is evaluated on the validation subsample. Then the procedure is repeated nfolds times, considering each subsample as the validation subsample. By default, the function cv.glmnet uses nfolds=10.

Back to the example

For this example, we use the entire matrix X (p=17290). First source the function print_glmnet_coefs

R> print_glmnet_coefs <- function(cvfit, s="lambda.min") {
+   ind <- which(coef(cvfit, s=s) != 0)
+   df <- data.frame(
+     feature=rownames(coef(cvfit, s=s))[ind],
+     coeficient=coef(cvfit, s=s)[ind]
+   )
+   return(df)
+ }

Then we run the cross-validation procedure and then display the nonzero coefficients:

R> X<-readRDS("data/LenzT.rds")
R> cvfit <- cv.glmnet(X, Ya)
R> print_glmnet_coefs(cvfit)
       feature  coeficient
1  (Intercept) 76.48296833
2       ADGRA3 -0.07082232
3       ANTXR2  0.05599604
4        OR8G1  0.97062384
5       SLC1A3  0.20199187
6       DNAJB1 -0.04299496
7        GSTP1 -0.19974436
8        RRAGA -1.55676845
9        NR1H3  0.18701081
10       FGF13 -0.38157241
11        SCO2  0.35296707
12    SERPINA6  0.14462150
13     GUCY1A2 -0.06936678
14       FAM3A -0.27200807
15       LRRN3 -0.34751758
16    TMEM255A -0.16716399
17        ODAM  0.40342082
18       NRARP -0.50492261
19     PHACTR3 -0.02561929
20        FUOM  0.46349781
21    HEPACAM2 -0.01813235

Interpretation

This procedure provides a list of "genes significantly linked to Y".

In our example, 13 coefficients are nonzero (including the intercept). That means 12 genes are selected as being linked with Y.

Another criteria to select λ would give another list of candidates. For example, changing the number of subsamples nfolds in the cross-validation procedure changes the final list.

These lists should be considered as propositions of candidates, not as a definitive list of significant genes. The role of the biologist is of tremendous importance to sort this list.

Logistic regression

Logistic regression is a regression model where the response is binary. Typically, this covers the case of outcomes such as dead/alive, cancer/healty, smoker/non-smoker. The two possible responses are coded 0 and 1.

The logistic model estimates the probability of response 1 based on one or more predictors, exactly in the same spirit than for the linear regression.

Linear regression can not be applied to binary response because it assumes that the response takes positive and negative values. To use the same formalism, logistic regression transforms the probability p of obtaining 1, p is a number between [0, 1], to the real scale thanks to the logit function:

The simple logistic regression of binary response Y onto X consists in writing the logit probability that Y = 1 as a linear function of X:
logit(P(Y = 1)) = a + bX

a is the intercept: it is the value of the function at the origin x = 0, or the height of the intersection with the y-axis.
b is the regression coefficient. If X is binary variable, it is interpreted as the logarithm of an odds-ratio: it is the ratio of the probability of success (Y=1) over the probability of failure for the second class of X over the ratio of the probability of success (Y=1) over the probability of failure for the first class of X.

Logistic regression is implemented in function glm:

glm(Y~X, family=binomial(link = "logit"))

Estimation of the parameters a and b is performed by maximizing the likelihood function (not detailed here). Statistical test of the influence of X onto Y are available:
H₀ : b = 0 H₁ : b ≠ 0.

The statistic is based on the slope $\hat b$ and its variance.

Residuals are called deviance residuals in logistic regression.

Example In the titanic data set, let us study the link between survival and gender.

R> titanic <- read.table("data/titanic.csv",header=TRUE,sep=";")
R> S <- titanic[ ,"survived"]
R> G <- titanic[ ,"gender"]
R> table(S,G)
     G
S       F   M
  no   96 523
  yes 292 135
R> G<-ifelse(G=="F",1,0)
R> fit <- glm(S~G, family=binomial(link="logit"))      # logistic model of S onto G
R> summary(fit)

Call:
glm(formula = S ~ G, family = binomial(link = "logit"))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6713  -0.6777  -0.6777   0.7540   1.7798  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.35431    0.09654  -14.03   <2e-16 ***
G            2.46671    0.15219   16.21   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1414.6  on 1045  degrees of freedom
Residual deviance: 1102.0  on 1044  degrees of freedom
AIC: 1106

Number of Fisher Scoring iterations: 4

Interpretation The first row gives the model. The second part displays a summary of the residuals. Then, the table with the coefficients is similar to the output of function lm. The first column gives the estimated coefficients, the last column the p-values of the test on the slope.

In this example, p-value is <2e-16. Survival is significantly different between women and men

We can then compute the odds-ratio, which is the exponential of the second coefficient b:

R> exp(coefficients(fit)[2])
       G 
11.78364

Interpretation The odds-ratio is 2.47: the odds for women is 2.47 larger than odds for men

Multiple predictors can be included in a logistic regression model in a similar way than for a linear model: glm(Y~X1+X2+..., family=binomial(link = "logit"))

When the number of predictors is not high (less than the number of individuals), selection of the best model is obtained with function step.

When the number of predictors is high, a penalisation approach using glmnet has to be used:

glmnet(X, Y, family="binomial")

Survival analysis

Survival regression model is a regression model for duration data, that is the waiting time until the occurrence of a well-defined event (death, remission, etc).

They have the specificity of being positive and right-skewed. Moreover observations are usually censored, in the sense that for some units the event of interest has not occurred at the time the data are analyzed.

Censored data

For some subjects, the event of interest has occured and the exact duration time is known.
For others, it has not occured and we only know that the duration time exceeds the observation time.

Two variables encoded this information:

a time variable often called time or followup
a binary variable equal to 1 if the event occurs, often called status

Duration data should not be analysed with standard linear regression models, that assume the normality of the response. We rather recommend the use of specific tools, namely the Cox model. The quantity that is studied in a Cox model is not directly the response, called T, but a caracterisation of the distribution of T, called the hazard function.

Hazard function

The hazard function of the duration time T is the instantaneous rate of occurence of the event. It is defined as

$$\lambda(t)=\lim_{dt\rightarrow 0} \frac{Pr(t\leq T < t+dt | T\geq t)}{dt} $$

The numerator is the conditional probability that the event will occur in the interval [t, t + dt) given that it has not occured before. The denominator is the width of the interval. Dividing one bythe other we obtain a rate of event occurence per unit of time. Taking the limit as the width of the interval goes down to zero, we obtain an instantaneous rate of occurence.

Cox model

The Cox model of T onto predictor X is

λ(t|X)=λ₀(t)exp(slope * X)

where

λ₀(t) is the hazard base function
slope is the coefficient. It can be interpreted as the logarithm of the Hazard Ratio. When X is binary, the Hazard ratio compares the risk of event for participants with X = 1 to participants with X = 0.

If the hazard ratio for a predictor is close to 1 (the coefficient is close to 0), the predictor does not affect survival. If the hazard ratio is less than 1, the predictor is protective (the survival improves for subjects with X = 1 to participants with X = 0). If the hazard ratio is greater than 1, the predictor is associated with an increase risk (the survival decreases for subjects with X = 1 to participants with X = 0).

Cox model is implemented in the coxph function, from package survival.

coxph(Surv(time,status)~X)

Example In the LenzI dataset, variable follup is a duration time variable, censored and the associated binary status variable is status. The status variable entering function coxph has to be 0/1 encoded.

R> Y<-readRDS("data/LenzI.rds")
R> Fo <- Y$follup
R> S <- ifelse(Y$status=="dead",1,0)
R> G <- Y$gender
R> library(survival)
R> require(survival)
R> fit <- coxph(Surv(Fo,S)~G)
R> summary(fit)
Call:
coxph(formula = Surv(Fo, S) ~ G)

  n= 396, number of events= 156 
   (18 observations deleted due to missingness)

         coef exp(coef) se(coef)     z Pr(>|z|)
Gmale 0.02092   1.02115  0.16180 0.129    0.897

      exp(coef) exp(-coef) lower .95 upper .95
Gmale     1.021     0.9793    0.7436     1.402

Concordance= 0.503  (se = 0.021 )
Likelihood ratio test= 0.02  on 1 df,   p=0.9
Wald test            = 0.02  on 1 df,   p=0.9
Score (logrank) test = 0.02  on 1 df,   p=0.9

Interpretation The summary gives

the estimated coefficient slope, equal to 0.02092 in this example
the exponential of the coefficient, interpreted as the hazard ratio. Close to 1 in this example
the p-value of the significance test of the slope (last column). p=0.8971 in this example.

Survival is not significantly different between men and women.

Multiple predictors can be included in cox model, the syntax is the standard one.

When the number of predictors is not high (less than the number of individuals), selection of the best model is obtained with function step.

When the number of predictors is high, a penalisation approach using glmnet has to be used:

glmnet(X, cbind(time=time, status=status), family="cox")

Note that glmnet does not allow nul values in time.

Regression models

Regression models

Linear regression

Definition of the simple linear model

Estimation of the linear model

Coefficient of determination

Test of the slope

Function lm in R

Model diagnostics

Prediction

Multiple linear regression

Regression in high dimension

Logistic regression

Survival analysis

Function `lm` in `R`