Descriptive statistics

Dataset and variables
- Some basic terms
- Organization of data
Summarizing/presenting variables
Summarizing/presenting pairs of variables

Before starting with data analysis, one should be aware of different types of data and ways to organize data in computer files.

Dataset and variables

Some basic terms

Population: an aggregate of subjects (creatures, things, cases and so on).
Sample: collection of subjects in the study.
Observation: a study unit or subject or an individual.

For a given study, a target population has to be specified: to which subjects will the results be generalized or applied?

In general, a sample should be representative of the target population.

The observation is often a human, sometimes also an animal, plant or anything else.

Example: LenzI.rds is a dataset. The unit of the study is a patient, the sample is the collection of the 414 patients. The population is all the patients with diffuse large B cell lymphoma.

Variable: a characteristic measured or recorded for each subject in the sample.

A statistical variable is quantitative or numerical if it makes sense to add or multiply its values. A numerical variable is continuous if it can take any value in a given interval. It is discrete if the set of values it can take is fixed (e.g. a given finite set of integers).
A statistical variable which is not quantitative is qualitative. A qualitative variable is always discrete. It may be binary (true or false, male or female, ...), categorical, or ordinal (if its values can be ordered).
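
In R, these types roughly correspond to how a variable is stored. The following is a minimal sketch with made-up values (the vectors are hypothetical and are not taken from LenzI):

```r
# Hypothetical values, for illustration only
age    <- c(34.5, 61.2, 47.0)                     # quantitative, continuous
nsites <- c(0L, 2L, 1L)                           # quantitative, discrete
gender <- factor(c("male", "female", "female"))   # qualitative, binary
stage  <- factor(c("II", "I", "III"),
                 levels = c("I", "II", "III"),
                 ordered = TRUE)                  # qualitative, ordinal

is.numeric(age)       # TRUE: arithmetic on the values makes sense
nlevels(gender)       # 2 categories
stage[1] > stage[2]   # TRUE: ordered levels can be compared ("II" > "I")
```

Storing qualitative variables as factors lets functions such as table() and barplot() treat them as categories rather than as numbers.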

Dataset: a set of values of all variables of interest for all individuals in the study.

The numeric results obtained from the dataset will be used to draw conclusions about the target population.

Example: Some of the variables of LenzI are

gender: a qualitative binary variable
age: a quantitative continuous variable
diagno: a qualitative categorical variable
status: a qualitative binary variable
follup: a quantitative continuous variable
ecog: a quantitative discrete variable

R> LenzI<-readRDS("data/LenzI.rds")
R> names(LenzI)
 [1] "gender" "age"    "diagno" "status" "follup" "regim"  "ecog"   "stage" 
 [9] "ldhrat" "extnod"

Organization of data

A dataset is usually organized in the form of a data matrix, also viewed as a table or a spreadsheet. Usually, the rows correspond to the individuals, over which the variables are observed. Then each column contains the values of a statistical variable, over all individuals.

Example: A data matrix representing sex (1-male; 0-female), age, number of children, weight (kg), and height (cm) of 7 individuals:

Each row of such a matrix represents one observation. All rows have the same length: the same data has been recorded for all individuals. Each column represents one variable. For instance, WEIGHT is the name of a variable, representing the body weight (in kg) of an individual.
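
As an illustration of this layout, here is a sketch of such a data matrix stored as an R data frame. The values are invented for the example and do not reproduce the original table; the column names other than WEIGHT are also assumptions.

```r
# Invented values, for illustration only (not the original table)
mydata <- data.frame(
  SEX      = c(1, 0, 0, 1, 1, 0, 1),               # 1 = male, 0 = female
  AGE      = c(34, 45, 29, 61, 52, 38, 47),        # years
  CHILDREN = c(0, 2, 1, 3, 2, 0, 1),               # number of children
  WEIGHT   = c(78, 62, 55, 84, 90, 67, 73),        # kg
  HEIGHT   = c(180, 165, 160, 175, 182, 170, 177)  # cm
)
dim(mydata)     # 7 rows (individuals), 5 columns (variables)
mydata[3, ]     # one row = one observation
mydata$WEIGHT   # one column = one variable
```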

Opening a dataset usually requires the following steps:

Understand the experimental setting, the meaning of individuals and variables.
Load the dataset using the function read.table and give it a name.
Display the first six rows by head(mydata). Check if the dataset has been correctly imported. Display the number of rows and columns by dim(mydata).
Sort variables into three types:
- identifier: usually a name for each individual (e.g. names of countries, two-letter codes for states, ...). Identifiers are not considered as statistical variables.
- discrete: all qualitative variables are discrete. A numerical variable whose possible values form a fixed set is discrete. The number of different values of a discrete variable is usually much smaller than the number of individuals.
- continuous: numerical variables which can take any value in some interval are continuous. The number of different values of a continuous variable is usually (almost) as large as the number of individuals.

If you are not sure whether a numerical variable X should be treated as discrete or continuous, ask yourself whether any value between the minimum and the maximum could be taken: if the answer is yes, then X is continuous. If it is still unclear, display table(X): if the table is large, treat X as continuous.
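
The rule of thumb above can be checked by counting distinct values. A minimal sketch on simulated data (the variables x_disc and x_cont are hypothetical, generated only for this illustration):

```r
set.seed(1)                               # reproducible simulated data
n <- 200
x_disc <- rpois(n, lambda = 1)            # counts: few distinct values
x_cont <- rnorm(n, mean = 60, sd = 15)    # measurements: almost all distinct

length(unique(x_disc))   # much smaller than n
length(unique(x_cont))   # equal (or close) to n
table(x_disc)            # a short table suggests treating the variable as discrete
```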

Example: Load the dataset hypoxy.

R> HY <- read.table("data/hypoxy.csv", header=TRUE, dec=",")
R> head(HY)
   Level Name_Prot Hypoxy Training N_Rat Location
1 0.9843      RyR2     No       No    N1       TA
2 0.9419      RyR2     No       No    N2       TA
3 0.7761      RyR2     No       No    N5       TA
4 0.8668      RyR2     No       No    N7       TA
5 1.2249      RyR2     No       No    N9       TA
6 1.2061      RyR2     No       No   N10       TA
R> dim(HY)
[1] 214   6

Level is a continuous variable; Name_Prot, Hypoxy, Training, N_Rat, Location are qualitative variables.

Summarizing/presenting variables

Qualitative variable

Let X be a qualitative variable. Mean and standard deviation do not make much sense. To see the distribution of the variable, we can compute:

table(X): absolute frequency of each value, i.e. the number of individuals for which X takes that value.

Example:

From the dataset HY, we can compute the absolute frequencies of the variable Hypoxy.

R> Hy<-HY$Hypoxy
R> table(Hy)
Hy
 No Yes 
109 105

Interpretation: There are 105 rats with hypoxia and 109 without.

The absolute frequency (number) of patients for each diagnosis is obtained with:

R> LenzI<-readRDS("data/LenzI.rds")
R> D<-LenzI$diagno
R> table(D)
D
         ABC          GCB Unclassified 
         167          183           64

Interpretation: 167 patients have the diagnosis ABC.

We can also compute the relative frequency:

prop.table(table(X)): relative frequency of each value, i.e. its absolute frequency divided by the sample size n. It is the proportion or percentage of individuals with that value.

Example: Relative frequencies of rats without and with hypoxia:

R> prop.table(table(Hy))
Hy
       No       Yes 
0.5093458 0.4906542

The proportion of rats with hypoxia is 0.49.

Relative frequencies of diagnosis:

R> prop.table(table(D))
D
         ABC          GCB Unclassified 
   0.4033816    0.4420290    0.1545894

Interpretation: 44.2% of the patients have the diagnosis GCB.
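
The relative frequencies are simply the counts divided by the sample size. As a sketch, the diagnosis vector can be rebuilt from the absolute frequencies printed above (so that the example does not need the data file) and the division done by hand:

```r
# Rebuild the diagnosis vector from the absolute frequencies printed above
D <- rep(c("ABC", "GCB", "Unclassified"), times = c(167, 183, 64))
n <- length(D)                          # sample size: 414
table(D) / n                            # relative frequencies by hand
prop.table(table(D))                    # same result
round(100 * prop.table(table(D)), 1)    # percentages, one decimal place
```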

A qualitative variable can be graphically represented by a barplot:

barplot(): a bar plot represents frequencies as vertical bars.

Pie charts also exist (pie()) but are usually misleading and should be avoided.

Example: Barplot of the diagnosis. R code:

R> barplot(table(D), main="Absolute frequencies of diagnosis", ylab="frequency")

Result:

Interpretation: GCB is the most represented diagnosis. This is the mode.

One could sort the frequencies to ease the reading. R code:

R> barplot(sort(table(D)), main="Absolute frequencies of diagnosis", ylab="frequency")

Result:

- A graph should communicate useful information more efficiently than numeric summaries.
- A good graph should have a good information-to-ink ratio: avoid fancy details that do not add information but make the graph more complicated (larger, more colorful).
- Pay attention to the scale of the graph!

Discrete quantitative variable

Let X be a discrete quantitative variable.

A discrete quantitative variable can be described by

table(X): absolute frequencies
prop.table(table(X)): relative frequencies
mean(X): empirical mean, the arithmetic average of the variable, i.e. the sum of all the values divided by the sample size:
$$\mbox{mean}(X) = \frac1n\sum_{i=1}^n X_i$$
median(X): smallest value such that at least 50% of the values are smaller or equal; the middle point of the ordered data
var(X), sd(X): empirical variance and standard deviation. The standard deviation can be understood as an average distance from the values to their mean; it reflects the variability of the sample. It is the square root of the variance, which is formally defined as
$$\mbox{var}(X) = \frac1{n-1}\sum_{i=1}^n (X_i-\mbox{mean}(X))^2$$
The standard deviation gives a sense of how dispersed (spread out) the data in the sample are around the mean.
quantile(X, c(0.25, 0.75)): first and third quartiles, the smallest values such that at least 25% and 75% of the values are smaller or equal.
IQR(X): interquartile range, the span between the first and third quartiles
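
The formulas for the mean and the variance can be checked by hand against the built-in functions. A minimal sketch on a small made-up sample (the vector x is hypothetical):

```r
x <- c(2, 4, 4, 5, 7, 8)           # small made-up sample
n <- length(x)
m <- sum(x) / n                    # empirical mean, same as mean(x)
v <- sum((x - m)^2) / (n - 1)      # empirical variance, same as var(x)
s <- sqrt(v)                       # standard deviation, same as sd(x)
c(mean = m, var = v, sd = s)
all.equal(c(m, v, s), c(mean(x), var(x), sd(x)))   # TRUE
```

Here the mean is 5 and the squared deviations sum to 24, so the variance is 24/5 = 4.8.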

Example: Description of the number of extranodal sites in the LenzI dataset.

R> E<-LenzI$extnod
R> table(E)
E
  0   1   2   3   4   5 
238 115  19   8   2   1 
R> round(prop.table(table(E)),3)
E
    0     1     2     3     4     5 
0.621 0.300 0.050 0.021 0.005 0.003

Interpretation: 62.1% of patients have no extranodal sites.

Note that the relative frequencies are rounded to 3 decimal places (function round).

The computation of the mean, median, variance and standard deviation is not straightforward because the variable extnod has missing values.

R> mean(E)
[1] NA
R> table(is.na(E))

FALSE  TRUE 
  383    31

Interpretation: There are 31 missing values in variable E.

We can compute the mean of the existing values by removing the missing values. R removes them automatically when the option na.rm = TRUE is passed to functions such as mean and sd.

R> mean(E, na.rm = TRUE)
[1] 0.4960836
R> sd(E, na.rm=TRUE)
[1] 0.7687348

Interpretation: The average number of extranodal sites is 0.50. This is consistent with the proportions of patients with 0 and 1 extranodal site (62.1% and 30% respectively), even though the maximal value is 5. The standard deviation is 0.77, which is small compared to the range of possible values (from 0 to 5 sites). A small standard deviation means that the data are concentrated around the mean.
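
To see what na.rm = TRUE does, the variable can be rebuilt from the counts printed above (383 observed values and 31 missing ones). This is only a sketch that avoids loading the data file; the order of the values differs from the real dataset, which does not affect the mean.

```r
# Rebuild extnod from the counts printed above: 383 observed values, 31 missing
E <- c(rep(0:5, times = c(238, 115, 19, 8, 2, 1)), rep(NA, 31))
mean(E)                                  # NA: one missing value is enough
mean(E, na.rm = TRUE)                    # mean of the 383 observed values
mean(E[!is.na(E)])                       # equivalent: drop missing values explicitly
sum(E, na.rm = TRUE) / sum(!is.na(E))    # 190 sites over 383 patients
```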

Let us now compute the quartiles.

R> median(E, na.rm=TRUE)
[1] 0
R> quantile(E, c(0.25, 0.75), na.rm=TRUE)
25% 75% 
  0   1

Interpretation: The median is 0: at least 50% of the patients have 0 extranodal sites. The third quartile is 1: at least 75% of the patients have at most 1 extranodal site.
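
Note that median() and the default quantile() interpolate between ordered values, so on some samples they differ slightly from the definition "smallest value such that at least 50% of the values are smaller or equal". A small sketch on a made-up sample; type = 1 is the quantile() option that follows that definition exactly:

```r
x <- c(1, 2, 3, 4)             # made-up sample of even size
median(x)                      # 2.5: average of the two middle values
quantile(x, 0.5, type = 1)     # 2: smallest value with at least 50% of values <= it
mean(x <= 2)                   # 0.5: proportion of values <= 2
```

On the extnod example both conventions give the same answer, since the middle values are all equal to 0.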

The function summary computes the three quartiles, the mean, the minimum and maximum values, and the number of missing values.

R>  summary(E)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.0000  0.4961  1.0000  5.0000      31

A discrete variable can be represented by a barplot.

R code:

R> barplot(table(E), ylab="absolute frequency", xlab="number of sites")
R> barplot(prop.table(table(E)), ylab="proportion", xlab="number of sites")

Result:

Interpretation: The mode of the variable is 0 sites. The distribution is not symmetric.

Continuous quantitative variable

Let X be a continuous quantitative variable.

A continuous quantitative variable can be described by

mean(X): empirical mean, the arithmetic average of the variable, i.e. the sum of all the values divided by the sample size
median(X): smallest value such that at least 50% of the values are smaller or equal; the middle point of the ordered data
var(X), sd(X): empirical variance and standard deviation. The standard deviation can be understood as an average distance from the values to their mean; it reflects the variability of the sample. It is formally defined as the square root of the variance, the variance being the mean of squares minus the square of the mean, that difference being multiplied by n/(n-1).
quantile(X, c(0.25, 0.75)): first and third quartiles, the smallest values such that at least 25% and 75% of the values are smaller or equal.
IQR(X): interquartile range, the span between the first and third quartiles
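
The "mean of squares minus square of the mean" expression of the variance can be verified numerically. A minimal sketch on made-up values (the vector x is hypothetical, not the LenzI ages):

```r
x <- c(14, 52, 61, 63, 74, 92)                  # made-up ages
n <- length(x)
v <- (mean(x^2) - mean(x)^2) * n / (n - 1)      # mean of squares minus square of mean
all.equal(v, var(x))                            # TRUE
all.equal(sqrt(v), sd(x))                       # TRUE
q <- quantile(x, c(0.25, 0.75))                 # first and third quartiles
all.equal(IQR(x), unname(q[2] - q[1]))          # TRUE: IQR is Q3 - Q1
```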

Example: Description of patient age in the LenzI data set.

R> A<-LenzI$age
R> summary(A)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  14.00   52.00   62.50   61.14   73.75   92.00 
R> sd(A)
[1] 15.44559
R> IQR(A)
[1] 21.75

Interpretation: The mean age is 61.14 years and the standard deviation (average distance to the mean) is 15.45 years. The data are rather dispersed. 50% of the patients are older than 62.5 years, and only 25% are younger than 52 years.

A continuous quantitative variable can be represented by

boxplot(X): a boxplot, a graphical representation of the median and quartiles. It gives an overview of the distribution of the data.
plot(ecdf(X)): the empirical cumulative distribution function, on which the quantiles can be read
hist(X): a histogram. The range of the variable is divided into consecutive intervals of equal length and the number of observations in each interval is counted.
hist(X, prob=TRUE): a density histogram, in which the bar areas sum to 1. It gives a rough sense of the distribution of the data. Distributions can be symmetric, unimodal, skewed right, skewed left, bimodal, etc.

R code:

R> par(mfrow=c(2,2))
R> hist(rnorm(1000, 10, 1), prob=TRUE, xlab="", main="symmetric", cex.main=3)
R> hist(rexp(5000, 10), prob=TRUE, xlab="", main="skewed right", cex.main=3)
R> hist(1-rexp(5000, 10), prob=TRUE, xlab="", main="skewed left", cex.main=3)
R> hist(c(rnorm(1000, 10, 1), rnorm(1000, 15, 1)), prob=TRUE, xlab="", main="bimodal", cex.main=3)

Result:

Example: Distribution of the variable age in the LenzI dataset.

Interpretation: the limits of the box are the first and third quartiles, and the bold line is the median. Outliers are drawn as circles; they are the values lying more than 3/2 times the interquartile range below the first quartile or above the third quartile. The bottom whisker is the smallest value excluding the outliers, and the top whisker is the greatest value excluding the outliers. We can see that the distribution is left skewed: some values are below the bottom whisker.
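
The numbers behind a boxplot can be inspected with boxplot.stats(). The sketch below uses a made-up sample with one extreme value; note that R builds the box from "hinges", which are close to (but not always identical with) the quartiles returned by quantile().

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # made-up sample with one extreme value
b <- boxplot.stats(x)                   # statistics used by boxplot(), coef = 1.5
b$stats                                 # lower whisker, lower hinge, median, upper hinge, upper whisker
box <- b$stats[4] - b$stats[2]          # length of the box
b$stats[2] - 1.5 * box                  # lower limit: values below it are drawn as circles
b$stats[4] + 1.5 * box                  # upper limit: values above it are drawn as circles
b$out                                   # 30 is the only outlier
```

The upper whisker stops at 9, the greatest value that is not an outlier.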

Interpretation: The variable age is left skewed, unimodal.

R code:

R> plot(ecdf(A))
R> abline(v=median(A), col="red")
R> abline(h=0.5, col="red")
R> abline(v=quantile(A, 0.25), col="green")
R> abline(h=0.25, col="green")
R> abline(v=quantile(A, 0.75), col="blue")
R> abline(h=0.75, col="blue")

Result:

One can read the median (red), the first and third quartiles (green and blue) on the ecdf plot.
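
The function returned by ecdf() can also be evaluated directly: at a point t it gives the proportion of values smaller than or equal to t. A minimal sketch on a made-up sample (the name Fn is arbitrary):

```r
x <- c(1, 2, 3, 4)                           # made-up sample
Fn <- ecdf(x)                                # Fn(t) = proportion of values <= t
Fn(2)                                        # 0.5
Fn(3.5)                                      # 0.75
quantile(x, c(0.25, 0.5, 0.75), type = 1)    # smallest t with Fn(t) >= p: 1, 2, 3
```

Reading a quantile on the plot amounts to inverting Fn: start from the level p on the vertical axis and find the first value at which the curve reaches it.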

Summarizing/presenting pairs of variables

Two qualitative or discrete variables

Let (X, Y) be a pair of discrete variables, observed on the same population. Some definitions to describe the joint distribution of the two variables:

The joint absolute frequency of a pair of values (x, y) is the number of individuals for which X is x and Y is y.
table(X,Y): contingency table, a rectangular array containing the joint absolute frequencies.
prop.table(table(X,Y)): joint relative frequencies. The joint relative frequency of a pair of values (x, y) is the proportion of individuals for which X is x and Y is y. It is the joint absolute frequency divided by the sample size n.
prop.table(table(X,Y),1): conditional frequencies of Y given X. The conditional frequency of a value y of Y, given a value x of X, is the joint absolute frequency of (x, y) divided by the absolute frequency of x; in other words, the proportion of individuals for which Y is y among those for which X is x.
The two variables are associated if the conditional distributions of one given the value of the other are different.

Example: Joint distribution of variables regim and stage of LenzI dataset.

R> R<-LenzI$regim
R> S<-LenzI$stage
R> table(R,S)
        S
R         1  2  3  4
  CHOP   28 55 39 58
  R-CHOP 38 67 58 63
R> prop.table(table(R,S)) # joint relative frequencies
        S
R                 1          2          3          4
  CHOP   0.06896552 0.13546798 0.09605911 0.14285714
  R-CHOP 0.09359606 0.16502463 0.14285714 0.15517241

Interpretation: The proportion of patients with regimen CHOP and stage 2 is 13.5%.

With the option beside=TRUE, a barplot represents the joint distribution with the bars drawn side by side.

R code:

R> barplot(table(R,S), beside=TRUE, legend.text=c("CHOP", "R-CHOP"))

Result:

Interpretation:

The conditional distribution is informative when studying the association between the two variables:

R> R<-LenzI$regim
R> S<-LenzI$stage
R> prop.table(table(R,S),1) # row percentages
        S
R                1         2         3         4
  CHOP   0.1555556 0.3055556 0.2166667 0.3222222
  R-CHOP 0.1681416 0.2964602 0.2566372 0.2787611
R> prop.table(table(R,S),2) # column percentages
        S
R                1         2         3         4
  CHOP   0.4242424 0.4508197 0.4020619 0.4793388
  R-CHOP 0.5757576 0.5491803 0.5979381 0.5206612

The proportion of patients with stage 2 among patients with regimen R-CHOP is 29.6%. The proportion of patients with regimen CHOP among patients with stage 4 is 47.9%.
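
The joint, marginal and conditional frequencies can all be recomputed by hand from the contingency table. The sketch below rebuilds the table from the counts printed above, so that it runs without the data file (the object name tab is arbitrary):

```r
# Contingency table rebuilt from the joint absolute frequencies printed above
tab <- matrix(c(28, 55, 39, 58,
                38, 67, 58, 63),
              nrow = 2, byrow = TRUE,
              dimnames = list(R = c("CHOP", "R-CHOP"), S = as.character(1:4)))
n <- sum(tab)                       # 406 patients counted in the table
tab / n                             # joint relative frequencies, as prop.table(tab)
rowSums(tab)                        # marginal frequencies of the regimen: 180, 226
colSums(tab)                        # marginal frequencies of the stage: 66, 122, 97, 121
tab / rowSums(tab)                  # row percentages, as prop.table(tab, 1)
sweep(tab, 2, colSums(tab), "/")    # column percentages, as prop.table(tab, 2)
```

The total is 406 rather than 414 because table() leaves out, by default, the individuals with a missing value in one of the two variables.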

The visualisation of the conditional frequencies shows that the conditional distributions are not very different. R code:

R> TRS<-prop.table(table(R,S),1) # row percentages
R> barplot(TRS, beside=TRUE, legend.text=c("CHOP", "R-CHOP"))

Result:

A qualitative and a quantitative variable

Let X be a qualitative or discrete variable and Y a continuous variable, observed on the same population. Some tools to describe the joint distribution of the two variables:

by(Y,X, summary): conditional distribution of Y, given a value x of X
boxplot(Y~X): boxplot of the conditional distribution of Y, given a value x of X
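
These two functions can be tried on any data frame. The sketch below uses R's built-in iris data purely as a stand-in, since it needs no data file; tapply() is added as a lighter alternative when only one statistic per group is wanted.

```r
# Built-in iris data, used here only as a stand-in for LenzI
Y <- iris$Sepal.Length       # continuous variable
X <- iris$Species            # qualitative variable with 3 levels
by(Y, X, summary)            # one summary of Y per level of X
m <- tapply(Y, X, mean)      # conditional means only
m
boxplot(Y ~ X, xlab = "species", ylab = "sepal length (cm)")
```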

Example: Conditional distribution of the variable follup (follow-up) given the regimen, from the LenzI dataset.

R> Fo<-LenzI$follup
R> by(Fo, R, summary)
R: CHOP
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.94    2.68    4.15    6.78   21.78 
------------------------------------------------------------ 
R: R-CHOP
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.890   2.120   2.402   3.650  10.290

The mean follow-up among patients with regimen CHOP is 4.15, while the mean among patients with regimen R-CHOP is 2.40. The two conditional distributions are rather different. This is confirmed by the two boxplots of the follow-up given the type of regimen.

R code:

R> boxplot(Fo~R)

Result:

Two quantitative variables

Let (X, Y) be a pair of continuous variables, observed on the same population.

plot(X,Y): the scatter plot represents pairs of values, understood as coordinates of points in the plane.
cov(X,Y): covariance of X and Y, the mean of products minus the product of means, that difference being multiplied by n/(n-1), where n is the sample size.
cor(X,Y): correlation coefficient of X and Y, the covariance divided by the product of the standard deviations. It is a dimensionless quantity, always between −1 and +1, which describes the relationship between the two variables. If the correlation is positive, the two variables tend to vary in the same direction: if one increases, the other one does too. If the correlation is negative, the variables vary in opposite directions. The closer the correlation is to plus or minus 1, the closer the scatter plot is to a straight line, and the stronger the association of the two variables.
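
The definitions of the covariance and of the correlation can be checked against cov() and cor(). A minimal sketch on made-up values (x and y are hypothetical):

```r
x <- c(1, 2, 3, 4, 5)        # made-up values
y <- c(2, 1, 4, 3, 6)
n <- length(x)
cxy <- (mean(x * y) - mean(x) * mean(y)) * n / (n - 1)   # covariance
rxy <- cxy / (sd(x) * sd(y))                             # correlation
all.equal(cxy, cov(x, y))                # TRUE
all.equal(rxy, cor(x, y))                # TRUE
all.equal(cor(2 * x + 10, y), rxy)       # TRUE: unchanged by a change of units
```

Here the mean of products is 11.6 and the product of means is 9.6, so the covariance is 2 × 5/4 = 2.5.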

Example: Correlation of two transcriptomes from LenzT dataset:

R> LenzT<-readRDS("data/LenzT.rds")
R> dim(LenzT)
[1]   414 17290
R> G1<-LenzT[1,]
R> G2<-LenzT[2,]
R> cor(G1,G2)
[1] 0.868667

The correlation is high and positive, as confirmed by the scatter plot, which shows a clear structure. R code:

R> plot(G1,G2)

Result:

On the contrary, the scatter plot of two genes (DDR1 and RFC2) reveals no clear pattern. R code:

R> D1<-LenzT[,"DDR1"]
R> R2<-LenzT[,"RFC2"]
R> plot(D1,R2)

Result: