Descriptive statistics
Before starting with data analysis, one should be aware of the different types of data and of the ways to organize data in computer files.
Dataset and variables
Some basic terms
For a given study, a target population has to be specified: to which subjects will the results be generalized or applied?
In general, a sample should be representative of the target population.
The unit of observation (the individual) is often a human, sometimes an animal, a plant or anything else.
Example: LenzI.rds is a dataset. The unit of the study is a patient, and the sample is the collection of the 414 patients. The population is all patients with diffuse large B cell lymphoma.
The numeric results obtained from the dataset will be used to draw conclusions about the target population.
Example: The variables of LenzI include
- gender: a qualitative binary variable
- age: a quantitative continuous variable
- diagno: a qualitative categorical variable
- status: a qualitative binary variable
- follup: a quantitative continuous variable
- ecog: a quantitative discrete variable
R> LenzI<-readRDS("data/LenzI.rds")
R> names(LenzI)
[1] "gender" "age"    "diagno" "status" "follup" "regim"  "ecog"   "stage"
[9] "ldhrat" "extnod"
Organization of data
A dataset is usually organized in the form of a data matrix, also viewed as a table or a spreadsheet. Usually, the rows correspond to the individuals, on which the variables are observed. Each column then contains the values of one statistical variable over all individuals.
Example: A data matrix representing sex (1-male; 0-female), age, number of children, weight (kg), and height (cm) of 7 individuals. Each row of such a matrix represents one observation. All rows have the same length: the same data has been recorded for all individuals. Each column represents one variable. For instance, WEIGHT is the name of a variable, representing the body weight (in kg) of an individual.
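As a minimal sketch, such a data matrix can be stored in R as a data frame. All values below are invented for illustration and are not those of the original table; the column names other than WEIGHT are assumptions as well.

```r
# Hypothetical data matrix: 7 individuals (rows), 5 variables (columns).
# Every value is invented for illustration only.
mydata <- data.frame(
  SEX      = c(1, 0, 0, 1, 1, 0, 1),                      # 1 = male, 0 = female
  AGE      = c(34, 27, 45, 52, 19, 63, 38),               # years
  CHILDREN = c(2, 0, 3, 1, 0, 4, 2),                      # number of children
  WEIGHT   = c(78.5, 61.0, 70.2, 85.3, 66.8, 58.4, 92.1), # kg
  HEIGHT   = c(180, 165, 170, 176, 182, 158, 188)         # cm
)

dim(mydata)    # number of rows (individuals) and columns (variables)
mydata[3, ]    # one row = one observation
mydata$WEIGHT  # one column = one variable
```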
Opening a dataset usually requires the following steps:
- Understand the experimental setting, the meaning of individuals and variables.
- Load the dataset using the function read.table and give it a name.
- Display the first six rows by head(mydata). Check if the dataset has been correctly imported. Display the number of rows and columns by dim(mydata).
- Sort variables into three types:
  - identifier: usually a name for each individual (e.g. names of countries, two-letter codes for states, ...). Identifiers are not considered as statistical variables.
  - discrete: all qualitative variables are discrete. A numerical variable that can only take a fixed set of values is discrete. The number of different values of a discrete variable is usually much smaller than the number of individuals.
  - continuous: numerical variables which can take any value in some interval are continuous. The number of different values of a continuous variable is usually (almost) as large as the number of individuals.
If you are not sure whether a numerical variable X should be treated as discrete or continuous, ask yourself whether any value between the maximum and the minimum could be taken: if the answer is yes, then X is continuous. If it is still unclear, display X: if the table is large, treat X as continuous.
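The rule of thumb above can be sketched as a small helper that counts the distinct observed values. The function suggest_type and its threshold max_levels are hypothetical (they are not part of R), and the toy data are simulated; the result is only a suggestion to be checked against the meaning of each variable.

```r
# Hypothetical helper (not part of base R): suggest a type for a variable
# from its number of distinct observed values. The threshold is arbitrary.
suggest_type <- function(x, max_levels = 10) {
  n_distinct <- length(unique(x[!is.na(x)]))
  if (!is.numeric(x) || n_distinct <= max_levels) "discrete" else "continuous"
}

# Simulated toy data, for illustration only.
set.seed(1)
toy <- data.frame(
  group    = sample(c("A", "B"), 100, replace = TRUE), # qualitative
  children = rpois(100, lambda = 1.5),                 # counts
  weight   = rnorm(100, mean = 70, sd = 10)            # measurements
)

sapply(toy, function(x) length(unique(x)))  # distinct values per variable
sapply(toy, suggest_type)                   # suggested type per variable
```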
Example: Load the dataset hypoxy.
R> HY <- read.table("data/hypoxy.csv", header=TRUE, dec=",")
R> head(HY)
   Level Name_Prot Hypoxy Training N_Rat Location
1 0.9843      RyR2     No       No    N1       TA
2 0.9419      RyR2     No       No    N2       TA
3 0.7761      RyR2     No       No    N5       TA
4 0.8668      RyR2     No       No    N7       TA
5 1.2249      RyR2     No       No    N9       TA
6 1.2061      RyR2     No       No   N10       TA
R> dim(HY)
[1] 214   6
Level is a continuous variable; Name_Prot, Hypoxy, Training, N_Rat, Location are qualitative variables.
Summarizing/presenting variables
Qualitative variable
Let X be a qualitative variable. Mean and standard deviation do not make much sense. To see the distribution of the variable, we can compute the absolute frequency (count) and the relative frequency (proportion) of each of its values.
Example: From the dataset HY, we can compute the absolute frequencies of the variable Hypoxy.
R> Hy<-HY$Hypoxy
R> table(Hy)
Hy
 No Yes
109 105
The absolute frequency (number) of patients for each diagnosis is obtained with:
R> LenzI<-readRDS("data/LenzI.rds")
R> D<-LenzI$diagno
R> table(D)
D
         ABC          GCB Unclassified
         167          183           64
We can also compute the relative frequency.
Example: Relative frequencies of rats without and with hypoxia:
R> prop.table(table(Hy))
Hy
       No       Yes
0.5093458 0.4906542
Relative frequencies of diagnosis:
R> prop.table(table(D))
D
         ABC          GCB Unclassified
   0.4033816    0.4420290    0.1545894
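Absolute and relative frequencies are often reported side by side. The sketch below rebuilds a diagnosis vector from the counts printed above, so that it runs without the LenzI file, and gathers counts and percentages in one table; the object name freq and its column names are illustrative choices.

```r
# Rebuild a diagnosis vector from the counts printed above
# (the order of the individuals is arbitrary and irrelevant here).
D <- rep(c("ABC", "GCB", "Unclassified"), times = c(167, 183, 64))

counts <- table(D)
freq <- data.frame(
  diagnosis = names(counts),
  n         = as.vector(counts),                           # absolute frequencies
  percent   = round(100 * as.vector(prop.table(counts)), 1) # relative frequencies
)
freq
```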
A qualitative variable can be represented graphically by a bar plot (function barplot). Pie charts also exist (pie()) but are usually misleading and should be avoided.
R> barplot(table(D), main="Absolute frequencies of diagnosis", ylab="frequency")
Result: [bar plot of the absolute frequencies of diagnosis]
R> barplot(sort(table(D)), main="Absolute frequencies of diagnosis", ylab="frequency")
Result: [bar plot of the absolute frequencies of diagnosis, bars sorted by increasing frequency]
Discrete quantitative variable
Let X be a discrete quantitative variable.
Example: Description of the number of extranodal sites in the LenzI dataset.
R> E<-LenzI$extnod
R> table(E)
E
  0   1   2   3   4   5
238 115  19   8   2   1
R> round(prop.table(table(E)),3)
E
    0     1     2     3     4     5
0.621 0.300 0.050 0.021 0.005 0.003
Note that the relative frequencies are rounded to 3 decimal places (function round).
extnod has missing values.
R> mean(E)
[1] NA
R> table(is.na(E))

FALSE  TRUE
  383    31
We can compute the mean of the existing values by removing the missing values. R removes them when the option na.rm = TRUE is passed to functions such as mean and sd.
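The effect of missing values can be checked on a small vector. The values below are invented and are not taken from LenzI; the calls used (mean, is.na, table with useNA) are standard base R.

```r
# Toy vector with two missing values (invented for illustration).
x <- c(0, 1, NA, 2, 0, NA, 1)

mean(x)                    # NA: a single missing value propagates
mean(x, na.rm = TRUE)      # mean of the 5 observed values
sum(is.na(x))              # number of missing values
table(x, useNA = "ifany")  # frequency table that also counts the NAs
```

By default table() silently drops missing values, which is why the counts of a frequency table may add up to fewer individuals than the dataset contains.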
R> mean(E, na.rm = TRUE)
[1] 0.4960836
R> sd(E, na.rm=TRUE)
[1] 0.7687348
Let us now compute the quartiles.
R> median(E, na.rm=TRUE)
[1] 0
R> quantile(E, c(0.25, 0.75), na.rm=TRUE)
25% 75%
  0   1
The function summary computes the three quartiles, the mean, the minimum and maximum values, and the number of missing values.
R> summary(E)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 0.0000  0.0000  0.0000  0.4961  1.0000  5.0000      31
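For a discrete variable, these summaries can also be recovered from the frequency table alone. The sketch below uses the counts of extnod printed above (missing values excluded); the object names values, counts and E_obs are illustrative.

```r
# Frequency table of extnod as printed above (31 missing values excluded).
values <- 0:5
counts <- c(238, 115, 19, 8, 2, 1)
n <- sum(counts)

# Mean and standard deviation as frequency-weighted sums.
m <- sum(values * counts) / n
s <- sqrt(sum(counts * (values - m)^2) / (n - 1))
c(n = n, mean = m, sd = s)

# Equivalent: expand the table back into one value per individual.
E_obs <- rep(values, times = counts)
quantile(E_obs, c(0.25, 0.5, 0.75))
```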
A discrete variable can be represented by a bar plot (function barplot).
R> barplot(table(E), ylab="absolute frequency", xlab="number of sites")
R> barplot(prop.table(table(E)), ylab="proportion", xlab="number of sites")
Result: [bar plots of the absolute frequencies and of the proportions of the number of sites]
Continuous quantitative variable
Let X be a continuous quantitative variable.
Example: Description of patient age in the LenzI dataset.
R> A<-LenzI$age
R> summary(A)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  14.00   52.00   62.50   61.14   73.75   92.00
R> sd(A)
[1] 15.44559
R> IQR(A)
[1] 21.75
R code (histograms of simulated samples with different distribution shapes):
R> par(mfrow=c(2,2))
R> hist(rnorm(1000, 10,1), prob=TRUE, xlab="", main = "symmetric", cex.main=3)
R> hist(rexp(5000, 10), prob=TRUE, xlab="", main = "skewed right", cex.main=3)
R> hist(1-rexp(5000, 10), prob=TRUE, xlab="", main = "skewed left", cex.main=3)
R> hist(c(rnorm(1000, 10,1), rnorm(1000, 15,1)), prob=TRUE, xlab="", main = "bimodal", cex.main=3)
Result: [four histograms titled "symmetric", "skewed right", "skewed left" and "bimodal"]
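A simple numerical companion to these histograms is to compare the mean and the median: for the simulated exponential sample the mean lies above the median (right skew), and for its mirror image it lies below (left skew). This is a sketch on simulated data, with an arbitrary seed; the comparison is a rough indication, not a formal test of skewness.

```r
# Simulated samples, same distributions as in the histograms above.
set.seed(42)                     # arbitrary seed, for reproducibility
right <- rexp(5000, rate = 10)   # skewed right
left  <- 1 - right               # mirror image: skewed left

c(mean = mean(right), median = median(right))  # mean above median
c(mean = mean(left),  median = median(left))   # mean below median
```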
Example: Distribution of the variable age in the LenzI dataset.
R> plot(ecdf(A))
R> abline(v=median(A), col="red")
R> abline(h=0.5, col="red")
R> abline(v=quantile(A, 0.25), col="green")
R> abline(h=0.25, col="green")
R> abline(v=quantile(A, 0.75), col="blue")
R> abline(h=0.75, col="blue")
Result: [plot of the empirical cumulative distribution function of age, with the median and quartiles marked]
One can read the median (red) and the first and third quartiles (green and blue) on the ecdf plot.
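Reading a quantile on the ecdf plot amounts to inverting the empirical distribution function. The sketch below checks this on simulated ages (not the LenzI data), using quantile with type = 1, which R documents as the inverse of the empirical distribution function; the default type interpolates and can give slightly different values.

```r
# Simulated ages, for illustration only.
set.seed(123)
age <- round(rnorm(200, mean = 60, sd = 15))

Fn <- ecdf(age)                  # empirical cumulative distribution function
p  <- c(0.25, 0.5, 0.75)
q  <- quantile(age, p, type = 1) # smallest observed value x with Fn(x) >= p

Fn(q)        # each value is at least the corresponding p
Fn(q - 0.5)  # just below q, the ecdf is still under p
```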
Summarizing/presenting couples of variables
Two qualitative or discrete variables
Let (X, Y) be a pair of discrete variables, observed on the same population. Their distribution can be described by the joint distribution (frequency of each combination of values) and by the conditional distributions (row or column percentages).
Example: Joint distribution of variables regim and stage of the LenzI dataset.
R> R<-LenzI$regim
R> S<-LenzI$stage
R> table(R,S)
        S
R         1  2  3  4
  CHOP   28 55 39 58
  R-CHOP 38 67 58 63
R> prop.table(table(R,S)) # joint relative frequencies
        S
R                 1          2          3          4
  CHOP   0.06896552 0.13546798 0.09605911 0.14285714
  R-CHOP 0.09359606 0.16502463 0.14285714 0.15517241
A barplot represents the joint distribution with the option beside.
R> barplot(table(R,S), beside=TRUE, legend.text=c("CHOP", "R-CHOP"))
Result: [grouped bar plot of the counts by stage, one bar per regimen]
R> R<-LenzI$regim
R> S<-LenzI$stage
R> prop.table(table(R,S),1) # row percentages
        S
R                1         2         3         4
  CHOP   0.1555556 0.3055556 0.2166667 0.3222222
  R-CHOP 0.1681416 0.2964602 0.2566372 0.2787611
R> prop.table(table(R,S),2) # column percentages
        S
R                1         2         3         4
  CHOP   0.4242424 0.4508197 0.4020619 0.4793388
  R-CHOP 0.5757576 0.5491803 0.5979381 0.5206612
The proportion of patients with stage 2 among patients with regimen R-CHOP is 29.65%. The proportion of patients with regimen CHOP among patients with stage 4 is 47.93% (52.07% for R-CHOP).
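The two proportions quoted above can be recomputed directly from the contingency table. The sketch below re-enters the counts printed earlier, so that it runs without the LenzI file; the object names tab, row_pct and col_pct are illustrative.

```r
# Contingency table of regimen (rows) by stage (columns), as printed above.
tab <- as.table(matrix(
  c(28, 55, 39, 58,
    38, 67, 58, 63),
  nrow = 2, byrow = TRUE,
  dimnames = list(R = c("CHOP", "R-CHOP"), S = c("1", "2", "3", "4"))
))

margin.table(tab, 1)  # marginal counts of regimen
margin.table(tab, 2)  # marginal counts of stage

row_pct <- prop.table(tab, margin = 1)  # distribution of stage given regimen
col_pct <- prop.table(tab, margin = 2)  # distribution of regimen given stage

round(100 * row_pct["R-CHOP", "2"], 2)  # stage 2 among R-CHOP patients
round(100 * col_pct["CHOP", "4"], 2)    # CHOP among stage 4 patients
```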
The visualisation of the conditional frequencies shows that the conditional distributions are not very different.
R code:
R> TRS<-prop.table(table(R,S),1) # row percentages
R> barplot(TRS, beside=TRUE, legend.text=c("CHOP", "R-CHOP"))
Result: [grouped bar plot of the row percentages by stage, one bar per regimen]
A qualitative and a quantitative variable
Let X be a qualitative or discrete variable and Y a continuous variable, observed on the same population. The distribution of Y can be described separately for each value of X (conditional distribution of Y given X).
Example: Conditional distribution of the follow-up given the regimen, from the LenzI dataset.
R> Fo<-LenzI$follup
R> by(Fo, R, summary)
R: CHOP
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00    0.94    2.68    4.15    6.78   21.78
------------------------------------------------------------
R: R-CHOP
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.890   2.120   2.402   3.650  10.290
The mean of followup among patients with regimen CHOP is 4.15, while the mean among patients with regimen R-CHOP is 2.402. The two conditional distributions are rather different. This is confirmed by the two boxplots of the follow-up given the type of regimen:
R code:
R> boxplot(Fo~R)
Result: [side-by-side boxplots of the follow-up for regimens CHOP and R-CHOP]
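Besides by(), group-wise summaries can be obtained with tapply() or aggregate(). The sketch below uses a tiny invented dataset (not LenzI) so that the results are easy to check by hand; the names g, y and d are illustrative.

```r
# Toy data (invented): a continuous variable y observed in two groups.
g <- factor(c("A", "A", "A", "B", "B", "B", "B"))
y <- c(1.0, 2.0, 6.0, 0.5, 1.5, 2.5, 3.5)
d <- data.frame(g = g, y = y)

tapply(d$y, d$g, mean)               # one mean per group
tapply(d$y, d$g, median)             # one median per group
aggregate(y ~ g, data = d, FUN = sd) # same idea, result as a data frame
```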
Two quantitative variables
Let (X, Y) be a pair of continuous variables, observed on the same population.
Example: Correlation of two transcriptomes from the LenzT dataset:
R> LenzT<-readRDS("data/LenzT.rds")
R> dim(LenzT)
[1]   414 17290
R> G1<-LenzT[1,]
R> G2<-LenzT[2,]
R> cor(G1,G2)
[1] 0.868667
The correlation is high and positive, as confirmed by the scatter plot, which shows a strong structure.
R code:
R> plot(G1,G2)
Result: [scatter plot of G2 against G1]
R> D1<-LenzT[,"DDR1"]
R> R2<-LenzT[,"RFC2"]
R> plot(D1,R2)
Result: [scatter plot of R2 against D1]
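The value returned by cor() is, by default, the Pearson correlation coefficient. As a sketch on simulated data (not the LenzT data, and with an arbitrary seed), it can be recomputed from its definition; the Spearman rank correlation is shown as an alternative when the relation is monotone but not linear.

```r
# Simulated pair of continuous variables with a positive linear relation.
set.seed(2024)
x <- rnorm(100)
y <- 0.8 * x + rnorm(100, sd = 0.5)

# Pearson correlation from its definition:
# sum of cross-products of deviations over the product of their norms.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_manual                         # same value as cor(x, y)
cor(x, y)                        # Pearson correlation (default method)
cor(x, y, method = "spearman")   # rank-based alternative
```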