Data exploration

Getting started with R
Opening a dataset
- Some basic terms
- Organization of data
Vocabulary of statistics
Summarizing/presenting couples of variables

Getting started with R

First of all:

Install R from the R Project website. R is open-source (and free) software.
Install RStudio from the RStudio website. RStudio lets you run R in a more user-friendly environment. It is also open-source (and free) software.

The RStudio screen is divided into several panes.

The console (lower left corner) is where you can type commands and see their output. Run a command with Enter.
The script (upper left corner) is where you can write a 'script' to save all your commands. Send them to the console in the lower left corner with Ctrl+Enter (Windows) or Cmd+Enter (Mac).
The workspace tab (upper right corner) shows all the active objects: any object, value, function or anything else you create during your R session. The history tab keeps a record of all previous commands.
The files tab (lower right corner) shows all the files and folders in your default workspace, as in a PC/Mac file window. The plots tab displays the graphs you produce. The packages tab lists the packages or add-ons needed to run certain processes. For additional information, see the help tab.

Create an object

An object can be created with the "assign" operator <-, which is written as an arrow made of a less-than sign and a minus sign.

R>  n <- 10

One of the simplest commands is to type the name of an object to display its content.

R> n
[1] 10

The digit 1 within brackets indicates that the display starts at the first element of n.

R is case sensitive (lower case different from capital letters).

R> x <-1
R> X <-10
R> x
[1] 1
R> X
[1] 10

Note that you can type an expression without assigning its value to an object: the result is displayed on the console but not stored in memory.

R> (10 + 2) * 5
[1] 60

Basic R commands

a:b creates a vector of values from a to b.

Example: to create the vector (1,2,3,4,5), use the command

R> 1:5
[1] 1 2 3 4 5

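The resulting vector has 5 elements. Arithmetic operators can be used, but note that : is evaluated before them, so parentheses are needed to build the sequence from 1 to 5-1: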

R> 1:5-1
[1] 0 1 2 3 4
R> 1:(5-1)
[1] 1 2 3 4

c(X,Y) concatenates two or more vectors or values in a row to make a new vector.

Example: add a value 6 to the vector X equal to (1,2,3,4,5):

R> X <- 1:5
R> c(X,6)
[1] 1 2 3 4 5 6

rep(a, n) creates a vector of n elements, all equal to a.

Example: create a vector with 10 elements equal to 1:

R> rep(1, 10)
 [1] 1 1 1 1 1 1 1 1 1 1

cbind(X,Y) binds two or more vectors with same length, as columns.

Example: First create the vector Y <- c(1,4,9, 16, 25). Check that Y has the same length as X, then bind the two vectors X and Y:

R> Y <- c(1,4,9, 16, 25)
R> length(Y)
[1] 5
R> cbind(X,Y)
     X  Y
[1,] 1  1
[2,] 2  4
[3,] 3  9
[4,] 4 16
[5,] 5 25

rbind(X,Y) binds two or more vectors with same length, as rows.

Example: Bind the two vectors X and Y as rows:

R> rbind(X,Y)
  [,1] [,2] [,3] [,4] [,5]
X    1    2    3    4    5
Y    1    4    9   16   25

matrix(vec, nrow) creates a matrix with nrow rows from the values of vector vec, filled column by column.

Example: Create a matrix with elements from 1 to 6, with 2 rows and 3 columns:

R> A <- matrix(1:6,2)
R> A
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

dim(XY) number of rows and columns of a matrix XY.

Example:

R> XY <- cbind(1:10,11:20)
R> dim(XY)
[1] 10  2

Accessing the values of an object

X[ind] extracts from vector X, coordinates specified by ind. The entry ind may be:

a vector of integers specifying the indices to be selected,
a vector of Booleans (the TRUE coordinates will be selected),
a vector of names to be selected (if X has names).

Example: display the third element of X and Y:
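
A minimal sketch of this example, assuming X and Y are still the vectors defined above (X <- 1:5 and Y <- c(1, 4, 9, 16, 25)). The names given to X at the end are purely illustrative.

```r
X <- 1:5
Y <- c(1, 4, 9, 16, 25)

# Integer index: third element of each vector
X[3]
Y[3]

# Vector of indices: first and third elements of Y
Y[c(1, 3)]

# Boolean index: keep the elements of Y larger than 5
Y[Y > 5]

# Name index: names must be set first (illustrative names)
names(X) <- c("a", "b", "c", "d", "e")
X["c"]
```

Here X[3] is 3 and Y[3] is 9.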

XY[row,col] extracts from matrix or data frame XY, cells corresponding to particular rows and columns. The entries row and col may be:

empty (understood as “all” rows or columns),
a vector of integers specifying the indices to be selected,
a vector of Booleans (the TRUE coordinates will be selected),
a vector of row or column names to be selected.

Example:
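
A short sketch of these indexing modes, assuming XY is the matrix with 10 rows and 2 columns created above with cbind(1:10, 11:20):

```r
XY <- cbind(1:10, 11:20)

# Single cell: row 3, column 2
XY[3, 2]

# Empty column index: all columns of rows 1 and 2
XY[1:2, ]

# Empty row index: the whole first column, returned as a vector
XY[, 1]

# Boolean row index: rows whose first column is larger than 8
XY[XY[, 1] > 8, ]
```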

head(XY) first 6 rows of matrix or data frame XY.

Example:

R> head(XY)
     [,1] [,2]
[1,]    1   11
[2,]    2   12
[3,]    3   13
[4,]    4   14
[5,]    5   15
[6,]    6   16

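A$a extracts the element named a from an object A, for instance column a of a data frame. It does not work on a matrix, which is why A is first converted with as.data.frame in the example.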

Example:

R> colnames(A) <- c("a", "b", "c")
R>  A
     a b c
[1,] 1 3 5
[2,] 2 4 6
R>  A<-as.data.frame(A)
R>  A$a
[1] 1 2

A few functions

sum(X) adds all values in vector X. If X is Boolean (TRUE/FALSE), the number of TRUE’s is returned.

Example:

R> sum(X)
[1] 15
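
Because TRUE counts as 1 and FALSE as 0, summing a comparison counts how many elements satisfy it. A minimal sketch, assuming X is the vector 1:5 defined earlier:

```r
X <- 1:5

# X > 2 is FALSE FALSE TRUE TRUE TRUE, so the sum counts three TRUE values
sum(X > 2)

# mean() of a Boolean vector gives the proportion of TRUE values
mean(X > 2)
```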

cumsum(X) returns the vector of cumulative sums for the values in X: first, first plus second, first plus second plus third, etc.

Example:

R> cumsum(X)
[1]  1  3  6 10 15

rowSums(XY), colSums(XY) sums in each row, in each column of a matrix XY.

Example:

R> rowSums(XY)
 [1] 12 14 16 18 20 22 24 26 28 30
R> colSums(XY)
[1]  55 155

Operators

Arithmetic operators

+ addition
- subtraction
* multiplication
/ division
^ power

R> 2^2
[1] 4

Comparison operators

< less than
> greater than
<= less than or equal to
>= greater than or equal to
== equal to
!= not equal to

Example

R> x <- 0.5
R> (0 < x)
[1] TRUE
R> x <- 1:3
R> y <- 1:3
R> (x == y)
[1] TRUE TRUE TRUE

Logical operators

!x logical NOT
x & y logical AND (element by element)
x && y logical AND for single values
x | y logical OR (element by element)
x || y logical OR for single values

Example:

R> x<-1:6
R> y<-4:9
R> (x>5)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE
R> ! (x>5)
[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
R> (x<3)&(y>4)
[1] FALSE  TRUE FALSE FALSE FALSE FALSE
R> (x<5)&(y>4)
[1] FALSE  TRUE  TRUE  TRUE FALSE FALSE
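
The difference between the elementwise operator & and the single-value operator && can be sketched as follows, reusing the vectors x and y from the example above:

```r
x <- 1:6
y <- 4:9

# & compares element by element and returns a vector of the same length
(x < 3) & (y > 4)

# && expects single values and returns a single TRUE or FALSE;
# it is the form normally used inside if (...)
(length(x) == 6) && (x[1] == 1)

# any() and all() reduce a Boolean vector to a single value
any(x > 5)
all(x > 5)
```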

Reading data in a file

To read and write files, R uses the working directory. The command getwd() (get working directory) returns this directory, and setwd("C:/data") changes it. If a file is not in the working directory, its path must be given.

Data can be read with the function read.table or scan. The function read.table creates a data frame. For instance, a file named datafile.csv can be read:

R> mydata <- read.table("data/datafile.csv")

In that command, mydata is the name you choose for the data frame. By default, the variables of the data frame are named V1, V2, .... They can be accessed individually as vectors by mydata$V1, mydata$V2, ... or by mydata[, 1], mydata[, 2], ...; mydata["V1"], mydata["V2"], ... return one-column data frames instead.

All the options of the function read.table are described in its help file. For example, if the first line of the file contains the names of the variables, use the option header=TRUE; if the cells are separated by semicolons, use the option sep=";":

R> XY <- read.table("data/datafile.csv",header=TRUE,sep=";")

Example: To load the data set called bosson.csv, run the following instruction:

R> B <- read.table("data/bosson.csv", header=TRUE, sep=";")

To check if the data were correctly loaded, use the function head that displays the first 6 rows of the dataset:

R> head(B)
  country gender aneurysm    bmi risk
1 Vietnam      M       21 21.094    0
2 Vietnam      M       27 19.031    0
3 Vietnam      M       28 20.313    0
4 Vietnam      F       33 17.778    0
5  France      F       34 21.604    0
6 Vietnam      F       35 21.096    0

The first six rows are displayed. The first two variables are categorical, the others are numerical.
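
For readers who do not have bosson.csv at hand, the following sketch writes the six rows displayed above to a temporary file and reads them back. The temporary file and the object name B6 are only illustrations; the real file is assumed to use the same header line and the same ; separator.

```r
# Six rows copied from the head(B) output above, in ';'-separated form
lines <- c(
  "country;gender;aneurysm;bmi;risk",
  "Vietnam;M;21;21.094;0",
  "Vietnam;M;27;19.031;0",
  "Vietnam;M;28;20.313;0",
  "Vietnam;F;33;17.778;0",
  "France;F;34;21.604;0",
  "Vietnam;F;35;21.096;0"
)

# Write them to a temporary file (stand-in for data/bosson.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(lines, tmp)

# header = TRUE: first line holds the variable names; sep = ";": cell separator
B6 <- read.table(tmp, header = TRUE, sep = ";")

dim(B6)      # number of rows and columns
str(B6)      # type of each variable
B6$bmi       # one column as a vector
B6[, "bmi"]  # same column, selected by name

unlink(tmp)  # remove the temporary file
```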

Opening a dataset

Before starting with data analysis, one should be aware of different types of data and ways to organize data in computer files.

Some basic terms

Population: an aggregate of subjects (creatures, things, cases and so on).
Sample: collection of subjects in the study.
Observation: a study unit or subject or an individual.

For a given study, a target population has to be specified: to which subjects will the results be generalized or applied?

In general, a sample should be representative of the target population.

The observation is often a human, sometimes also an animal, plant or anything else.

Example: In the Bosson dataset, the unit of the study is a patient, the sample is the collection of the 209 patients. The population is all the patients with aneurysm.

Variable: a characteristic measured or recorded for each subject in the sample.

A statistical variable is quantitative or numerical if it makes sense to add or multiply its values. A numerical variable is continuous if it can take any value in a given interval. It is discrete if the set of values it can take is fixed (e.g. a given finite set of integers).
A statistical variable which is not quantitative is qualitative. A qualitative variable is always discrete. It may be binary (true or false, male or female, ...), categorical, or ordinal (if its values can be ordered).

Dataset: a set of values of all variables of interest for all individuals in the study.

The numeric results obtained from the dataset will be used to draw conclusions about the target population.

Example: The variables of Bosson are

country a qualitative categorical variable
gender a qualitative categorical variable
aneurysm a quantitative continuous variable
bmi a quantitative continuous variable
risk a quantitative discrete variable

R> B <- read.table("data/bosson.csv", header=TRUE, sep=";")
R> names(B)
[1] "country"  "gender"   "aneurysm" "bmi"      "risk"

Organization of data

A dataset is usually organized in the form of a matrix, also viewed as a table or a spreadsheet. Usually, the rows correspond to the individuals, over which statistical variables are observed. Then each column contains the values of a statistical variable, over all individuals.

Example: A data matrix representing sex (1-male; 0-female), age, number of children, weight (kg), and height (cm) of 7 individuals:

Each row of such a matrix represents one observation. All rows have the same length: the same data has been recorded for all individuals. Each column represents one variable. For instance, WEIGHT is the name of a variable, representing the body weight (in kg) of an individual.

Opening a dataset usually requires the following steps:

Understand the experimental setting, the meaning of individuals and variables.
Load the dataset using the function read.table and give it a name.
Display the first six rows by head(mydata). Check if the dataset has been correctly imported. Display the number of rows and columns by dim(mydata).
Sort variables into three types:
- identifier: usually a name for each individual (e.g. names of countries, two-letter codes for states, ...). Identifiers are not considered as statistical variables.
- discrete: all qualitative variables are discrete. A numerical variable for which the different values are fixed, is discrete. The number of different values of a discrete variable is usually much smaller than the number of individuals.
- continuous: numerical variables which can take any value in some interval are continuous. The number of different values of a continuous variable is usually (almost) as large as the number of individuals.

If you are not sure whether a numerical variable X should be treated as discrete or continuous, ask yourself whether any value between the maximum and the minimum could be taken: if the answer is yes, then X is continuous. If it is still unclear, display X: if the table is large, treat X as continuous.
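
One rough way to apply this rule of thumb is to compare the number of distinct values with the number of individuals. The sketch below uses only the six bmi and risk values shown by head(B) above; on the full dataset the same calls would be applied to B$bmi and B$risk.

```r
# Values copied from the first six rows of the Bosson dataset
bmi  <- c(21.094, 19.031, 20.313, 17.778, 21.604, 21.096)
risk <- c(0, 0, 0, 0, 0, 0)

# Number of individuals
length(bmi)

# Number of distinct values of each variable
length(unique(bmi))   # as many distinct values as individuals: continuous
length(unique(risk))  # far fewer distinct values than individuals: discrete

# The frequency table of a continuous variable is large and uninformative
table(bmi)
```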

Example: Load the dataset Bosson.

R> B <- read.table("data/bosson.csv", header=TRUE, sep=";")
R> head(B)
  country gender aneurysm    bmi risk
1 Vietnam      M       21 21.094    0
2 Vietnam      M       27 19.031    0
3 Vietnam      M       28 20.313    0
4 Vietnam      F       33 17.778    0
5  France      F       34 21.604    0
6 Vietnam      F       35 21.096    0
R> dim(B)
[1] 209   5

country and gender are categorical variables; aneurysm, bmi and risk are quantitative variables.

Vocabulary of statistics

Qualitative variable

Let X be a qualitative variable. Mean and standard deviation do not make much sense. To see the distribution of the variable, we can compute:

table(X) absolute frequency of a value, i.e. the number of individuals for which X takes that value.

Example:

From the dataset B, we can compute the absolute frequency of the variable country.

R> C<-B$country
R> table(C)
C
 France Vietnam 
     99     110

Interpretation: There are 110 Vietnamese people and 99 French.

The absolute frequency (number) of male/female is obtained with:

R> G<-B$gender
R> table(G)
G
  F   M 
 51 158

Interpretation: 51 patients are female, 158 are male.

We can also compute the relative frequency:

prop.table(table(X)) relative frequency of a value, i.e. its absolute frequency, divided by the sample size n. It is the proportion or percentage of individuals with that value.

Example: Relative frequencies of the number of risk factors:

R> Ri<-B$risk
R> prop.table(table(Ri))
Ri
          0           1           2           3           4           5 
0.191387560 0.387559809 0.291866029 0.105263158 0.019138756 0.004784689

The proportion of patients without any risk factor is 0.19.

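Relative frequencies of gender:

R> prop.table(table(G))
G
        F         M 
0.2440191 0.7559809

Interpretation: About 24% of the patients are female and 76% are male. In the same way, prop.table(table(C)) gives the proportion of Vietnamese patients, 110/209, i.e. about 53%.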

A qualitative variable can be graphically represented by a barplot:

barplot() A bar plot represents frequencies as vertical bars.

Pie charts also exist (pie()) but are usually misleading and should be avoided.

Example: Barplot of the number of risk factors. R code:

R> barplot(table(Ri), main="Absolute frequencies of number of risk factors", ylab="frequency")

Result:

Interpretation: Patients with one risk factor are the most represented. One risk factor is the mode.

A graph should communicate useful information more efficiently than any numeric summary.
A good graph should have a good information-to-ink ratio: avoid fancy details that add no information but make the graph more complicated (larger, more colorful).
Pay attention to the scale of the graph!

Discrete quantitative variable

Let X be a discrete quantitative variable.

A discrete quantitative variable can be described by

table(X) absolute frequencies
prop.table(table(X)) relative frequencies
mean(X) empirical mean: arithmetic average of the variable, sum of all the values divided by the total sample size:
$$\mbox{mean}(X) = \frac1n\sum_{i=1}^n X_i$$
median(X) smallest value such that at least 50% of the values are smaller or equal, middle point of ordered data
var(X), sd(X) empirical variance, standard deviation: the standard deviation can be understood as an average distance from the values to their mean; it reflects the variability of the sample. The variance is formally defined as
$$\mbox{var}(X) = \frac1{n-1}\sum_{i=1}^n (X_i-\mbox{mean}(X))^2$$
and the standard deviation is the square root of the variance. It gives a sense of how dispersed (spread out) the data in your sample are around the mean.
quantile(X, c(0.25, 0.75)) first and third quartiles, smallest values such that at least 25% and 75% of the values are smaller or equal.
IQR(X) interquartile range is the span between the first and third quartiles

Example: Description of the number of risk factors in the Bosson dataset.

R> Ri<-B$risk
R> table(Ri)
Ri
 0  1  2  3  4  5 
40 81 61 22  4  1 
R> round(prop.table(table(Ri)),3)
Ri
    0     1     2     3     4     5 
0.191 0.388 0.292 0.105 0.019 0.005

Interpretation: 10.5% of patients have three risk factors.

Note that the relative frequencies are rounded with only 3 digits (function round).

Computation of the mean and standard deviation of the number of risk factors:

R> mean(Ri)
[1] 1.38756
R> sd(Ri)
[1] 1.003857

Interpretation: The average number of risk factors is 1.39. The standard deviation is about 1, which is large compared to the range of possible values (from 0 to 5 risk factors). A large standard deviation means that the data are not concentrated around the mean.
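
These values can be checked against the definitions. The sketch below rebuilds the risk variable from the absolute frequencies displayed above (the order of the patients is lost, which does not affect the mean or the variance) and applies the two formulas directly.

```r
# Rebuild the 209 values of risk from the table of absolute frequencies
Ri <- rep(0:5, times = c(40, 81, 61, 22, 4, 1))
n <- length(Ri)

# Empirical mean: sum of the values divided by the sample size
m <- sum(Ri) / n

# Empirical variance with the n - 1 denominator, and standard deviation
v <- sum((Ri - m)^2) / (n - 1)
s <- sqrt(v)

# The built-in functions give the same results
all.equal(m, mean(Ri))
all.equal(v, var(Ri))
all.equal(s, sd(Ri))
```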

Let us now compute the quartiles.

R> median(Ri)
[1] 1
R> quantile(Ri, c(0.25, 0.75))
25% 75% 
  1   2

Interpretation: The median is 1: at least 50% of the patients have at most one risk factor. The third quartile is 2: at least 75% of the patients have at most 2 risk factors.
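
The cumulative relative frequencies make this reading of the quartiles explicit. As a sketch, the risk variable is rebuilt here from the absolute frequencies shown earlier in this example:

```r
# Rebuild risk from its absolute frequencies (0 to 5 risk factors)
Ri <- rep(0:5, times = c(40, 81, 61, 22, 4, 1))

# Cumulative proportions: share of patients with at most k risk factors
cum <- cumsum(prop.table(table(Ri)))
round(cum, 3)

# The median is the first value whose cumulative proportion reaches 0.5
median(Ri)

# First and third quartiles
quantile(Ri, c(0.25, 0.75))
```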

The function summary computes the three quartiles, the mean, the minimum and maximum values, and the number of missing values (if any).

R>  summary(Ri)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   1.000   1.388   2.000   5.000

A discrete variable can be represented by a barplot.

R code:

R> barplot(table(Ri), ylab="absolute frequency", xlab="number of risk factors")
R> barplot(prop.table(table(Ri)), ylab="proportion", xlab="number of risk factors")

Result:

Interpretation: The mode of the variable is 1 risk factor. The distribution is not symmetric.

Continuous quantitative variable

Let X be a continuous quantitative variable.

A continuous quantitative variable can be described by

mean(X) empirical mean: arithmetic average of the variable, sum of all the values divided by the total sample size
median(X) smallest value such that at least 50% of the values are smaller or equal, middle point of ordered data
var(X), sd(X) empirical variance, standard-deviation: standard deviation can be understood as an average distance from the values to their mean, it reflects the variability of the sample. It is formally defined as the square root of the variance, the variance being the mean of squares minus the square of mean, that difference being multiplied by n/(n-1).
quantile(X, c(0.25, 0.75)) first and third quartiles, smallest values such that at least 25% and 75% of the values are smaller or equal.
IQR(X) interquartile range is the span between the first and third quartiles

Example: Description of patient body mass index (bmi) in the Bosson data set.

R> b<-B$bmi
R> summary(b)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  13.33   19.23   22.21   22.76   26.37   36.16 
R> sd(b)
[1] 4.294087
R> IQR(b)
[1] 7.138

Interpretation: The mean bmi is 22.76 and the standard deviation (average distance to the mean) is 4.29. The data are rather dispersed. 50% of the patients have a bmi of at most 22.21, and only 25% of the patients have a bmi larger than 26.37.

A continuous quantitative variable can be represented by

boxplot(X) a boxplot: a graphical representation of median and quartiles. It gives an overview of the distribution of the data.
plot(ecdf(X)) the empirical cumulative distribution function, from which the quantiles can be read
hist(X) a histogram: the scale of the variable is divided into consecutive intervals of equal length and the number of observations in each interval is counted.
hist(X, prob= TRUE) a density histogram: the bar areas sum to 1. It gives a rough sense of the distribution of the data. Distributions can be symmetric, unimodal, right skewed, left skewed, bimodal, etc.

R code:

R> par(mfrow=c(2,2))
R> hist(rnorm(1000, 10,1), prob=TRUE, xlab="", main = "symmetric", cex.main=3)
R> hist(rexp(5000, 10), prob=TRUE, xlab="", main = "right skewed", cex.main=3)
R> hist(1-rexp(5000, 10), prob=TRUE, xlab="", main = "left skewed", cex.main=3)
R> hist(c(rnorm(1000, 10,1), rnorm(1000, 15,1)), prob=TRUE, xlab="", main = "bimodal", cex.main=3)

Result:

Example: Distribution of the variable bmi in the Bosson dataset.

Interpretation: in the boxplot, the limits of the box are the first and third quartiles, and the bold line is the median. Outliers are represented by circles; they are defined as values lying more than 3/2 times the interquartile range below the first quartile or above the third quartile. In this example, there are no outliers. We can see that the distribution is almost symmetric.

Interpretation: The variable bmi is right skewed, unimodal.
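
The usual boxplot rule for outliers can be checked by hand. The sketch below plugs in the quartiles, interquartile range, minimum and maximum reported by summary(b) and IQR(b) above; note that boxplot() computes its box from hinges, which can differ slightly from these quartiles.

```r
# Values reported above for bmi
q1    <- 19.23
q3    <- 26.37
iqr   <- 7.138
b_min <- 13.33
b_max <- 36.16

# Fences: points beyond 1.5 * IQR from the box are drawn as outliers
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
c(lower = lower, upper = upper)

# Both extremes lie inside the fences, so no outlier is drawn
b_min >= lower
b_max <= upper
```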

R code:

R> plot(ecdf(b))
R> abline(v=median(b), col="red")
R> abline(h=0.5, col="red")
R> abline(v=quantile(b, 0.25), col="green")
R> abline(h=0.25, col="green")
R> abline(v=quantile(b, 0.75), col="blue")
R> abline(h=0.75, col="blue")

Result:

One can read the median (red), the first and third quartiles (green and blue) on the ecdf plot.

Summarizing/presenting couples of variables

Two qualitative or discrete variables

Let (X, Y) be a couple of discrete variables, observed on the same population. Some definitions to describe the distribution of the two variables:

The joint absolute frequency of a couple of values (x, y) is the number of individuals for which X is x and Y is y.
table(X,Y) : contingency table, a rectangular array containing joint absolute frequencies.
prop.table(table(X,Y)): joint relative frequency of a couple of values (x, y) is the proportion of individuals for which X is x and Y is y. It is the joint absolute frequency divided by the sample size n.
prop.table(table(X,Y),1): conditional frequency of a value y of Y, given a value x of X, is the joint absolute frequency of (x, y), divided by the absolute frequency of x; or else, the proportion of individuals for which Y is y among those for which X is x.
The two variables are associated if the conditional distributions of one given the value of the other are different.

Example: Joint distribution of variables country and risk factors of the Bosson dataset.

R> C<-B$country
R> Ri<-B$risk
R> table(C,Ri)
         Ri
C          0  1  2  3  4  5
  France  17 29 32 16  4  1
  Vietnam 23 52 29  6  0  0
R> prop.table(table(C,Ri)) # joint relative frequencies
         Ri
C                   0           1           2           3           4
  France  0.081339713 0.138755981 0.153110048 0.076555024 0.019138756
  Vietnam 0.110047847 0.248803828 0.138755981 0.028708134 0.000000000
         Ri
C                   5
  France  0.004784689
  Vietnam 0.000000000

Interpretation: The proportion of patients from Vietnam and without any risk factor is 11%.

A barplot represents the joint distribution with the option beside.

R code:

R> barplot(table(C,Ri), beside=TRUE, legend.text=c("France", "Vietnam"))

Result:

Interpretation: the mode is not the same for the two distributions, the French mode is 2, the Vietnamese mode is 1.

The conditional distribution is informative when studying the association between the two variables:

R> C<-B$country
R> Ri<-B$risk
R> prop.table(table(C,Ri),1) # row percentages
         Ri
C                  0          1          2          3          4          5
  France  0.17171717 0.29292929 0.32323232 0.16161616 0.04040404 0.01010101
  Vietnam 0.20909091 0.47272727 0.26363636 0.05454545 0.00000000 0.00000000
R> prop.table(table(C,Ri),2) # column percentages
         Ri
C                 0         1         2         3         4         5
  France  0.4250000 0.3580247 0.5245902 0.7272727 1.0000000 1.0000000
  Vietnam 0.5750000 0.6419753 0.4754098 0.2727273 0.0000000 0.0000000

The proportion of patients with 2 risk factors among French patients is 32.32%. The proportion of Vietnamese patients among patients with 1 risk factor is 64.20%.
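
The same conditional frequencies can be recomputed from the contingency table alone. The sketch below re-enters the counts displayed above as a matrix (the object names tab, row_pct and col_pct are illustrative) and checks that each conditional distribution sums to 1.

```r
# Contingency table of country by number of risk factors, copied from above
tab <- matrix(c(17, 29, 32, 16, 4, 1,
                23, 52, 29,  6, 0, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(C = c("France", "Vietnam"),
                              Ri = as.character(0:5)))

# Marginal counts: patients per country, patients per number of risk factors
rowSums(tab)
colSums(tab)

# Conditional distribution of risk given country: each row sums to 1
row_pct <- prop.table(tab, 1)
rowSums(row_pct)

# Conditional distribution of country given risk: each column sums to 1
col_pct <- prop.table(tab, 2)
colSums(col_pct)

round(100 * row_pct["France", "2"], 2)   # French patients with 2 risk factors
round(100 * col_pct["Vietnam", "1"], 2)  # Vietnamese among 1 risk factor
```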

The visualisation of the conditional frequencies shows that the conditional distributions are slightly different. R code:

R> TCRi<-prop.table(table(C,Ri),1) # row percentages
R> barplot(TCRi, beside=TRUE, legend.text=c("France", "Vietnam"))

Result:

A qualitative and a quantitative variable

Let X be a qualitative or discrete variable and Y a continuous variable, observed on the same population. Some definitions to describe the distribution of the two variables

by(Y,X, summary): conditional distribution of Y, given a value x of X
boxplot(Y~X): boxplot of the conditional distribution of Y, given a value x of X

Example: Conditional distribution of variables bmi given the country, from the Bosson dataset.

R> b<-B$bmi
R> by(b, C, summary)
C: France
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  17.82   23.97   26.22   25.94   27.43   36.16 
------------------------------------------------------------ 
C: Vietnam
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  13.33   18.08   19.75   19.90   22.01   27.34

The mean of bmi among French patients is 25.94, while the mean among Vietnamese patients is 19.90. The two conditional distributions are rather different. This is confirmed by the two boxplots of the bmi given the country:

R code:

R> boxplot(b~C)

Result:

Two quantitative variables

Let (X, Y) be a couple of continuous variables, observed on the same population.

plot(X,Y): the scatter plot represents couples of values, understood as coordinates of points in the plane.
cov(X,Y): covariance of X and Y, the mean of products minus the product of means, that difference being multiplied by n/(n-1), where n is the sample size.
cor(X,Y): correlation coefficient of X and Y is the covariance, divided by the product of standard deviations. It is a dimensionless quantity, always between −1 and +1, which describes the relationship between the two variables. If the correlation is positive, the two variables tend to vary in the same direction: if one increases, the other one does too. If the correlation is negative, the variables vary in opposite directions. The closer the correlation is to plus or minus 1, the closer the scatter plot to a straight line, and the stronger the association of the two variables.
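
The two definitions can be verified numerically. As a sketch, the vectors X and Y from the first section are reused here (they are not part of the Bosson dataset); cov_manual and cor_manual are illustrative names.

```r
X <- 1:5
Y <- c(1, 4, 9, 16, 25)
n <- length(X)

# Covariance: mean of products minus product of means, times n / (n - 1)
cov_manual <- (mean(X * Y) - mean(X) * mean(Y)) * n / (n - 1)
all.equal(cov_manual, cov(X, Y))

# Correlation: covariance divided by the product of standard deviations
cor_manual <- cov_manual / (sd(X) * sd(Y))
all.equal(cor_manual, cor(X, Y))

# The correlation is dimensionless: rescaling a variable leaves it unchanged
all.equal(cor(X, Y), cor(100 * X, Y))
```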

Example: Correlation of the bmi and the aneurysm size from the Bosson dataset:

R> b<-B$bmi
R> A<-B$aneurysm
R> cor(b,A)
[1] 0.2386002

The correlation is positive but weak, as confirmed by the scatter plot. R code:

R> plot(b,A)

Result: