We will start out by installing and loading the very usefull collection of R packages designed for data analysis, called tidyverse
. All packages share an underlying design philosophy, grammar, and data structures.
install.packages('tidyverse')
library('tidyverse')
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
For this data manipulation workflow, we’ll use the boot::melanoma
package. The data consist of measurements made on patients with malignant melanoma. Each patient had their tumour removed by surgery at the Department of Plastic Surgery, University Hospital of Odense, Denmark during the period 1962 to 1977, and is documented in ?boot
.
install.packages('boot')
library('boot')
The first step would be to discover what’s in the data.
head
function to view first few rows from the data set.
head(melanoma)
## time status sex age year thickness ulcer
## 1 10 3 1 76 1972 6.76 1
## 2 30 3 1 56 1968 0.65 0
## 3 35 2 1 41 1977 1.34 0
## 4 99 3 0 71 1968 2.90 0
## 5 185 1 1 52 1965 12.08 1
## 6 204 1 1 28 1971 4.84 1
glimpse
to view the structure of the imported data
glimpse(melanoma)
## Observations: 205
## Variables: 7
## $ time <dbl> 10, 30, 35, 99, 185, 204, 210, 232, 232, 279, 295, 355…
## $ status <dbl> 3, 3, 2, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, …
## $ sex <dbl> 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, …
## $ age <dbl> 76, 56, 41, 71, 52, 28, 77, 60, 49, 68, 53, 64, 68, 63…
## $ year <dbl> 1972, 1968, 1977, 1968, 1965, 1971, 1972, 1974, 1968, …
## $ thickness <dbl> 6.76, 0.65, 1.34, 2.90, 12.08, 4.84, 5.16, 3.22, 12.88…
## $ ulcer <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
summary
to get some descriptive statistics
summary(melanoma)
## time status sex age
## Min. : 10 Min. :1.00 Min. :0.0000 Min. : 4.00
## 1st Qu.:1525 1st Qu.:1.00 1st Qu.:0.0000 1st Qu.:42.00
## Median :2005 Median :2.00 Median :0.0000 Median :54.00
## Mean :2153 Mean :1.79 Mean :0.3854 Mean :52.46
## 3rd Qu.:3042 3rd Qu.:2.00 3rd Qu.:1.0000 3rd Qu.:65.00
## Max. :5565 Max. :3.00 Max. :1.0000 Max. :95.00
## year thickness ulcer
## Min. :1962 Min. : 0.10 Min. :0.000
## 1st Qu.:1968 1st Qu.: 0.97 1st Qu.:0.000
## Median :1970 Median : 1.94 Median :0.000
## Mean :1970 Mean : 2.92 Mean :0.439
## 3rd Qu.:1972 3rd Qu.: 3.56 3rd Qu.:1.000
## Max. :1977 Max. :17.42 Max. :1.000
dplyr
The powerfull package dplyr
aims to provide a function for each basic step of data manipulation. The functions aids in performing most of the typical data manipulation operations:
filter()
to select cases based on their values.arrange()
to reorder the cases.select()
and rename()
to select or rename variables based on their names.mutate()
and transmute()
to add new variables that are functions of existing variables.summarise()
to condense multiple values to a single value.sample_n()
and sample_frac()
to take random samples.pipe operator(%>%)
to pipe together operations.filter()
filter()
allows you to select a subset of rows in a data frame. Like all dplyr
functions, the first argument is the tibble (or data frame). The second and subsequent arguments refer to variables within that data frame, selecting rows where the expression is TRUE
.
For example, we can select all male subjects with age higher than or qual to 60:
head(filter(melanoma, age>=60 & sex == 1))
## time status sex age year thickness ulcer
## 1 10 3 1 76 1972 6.76 1
## 2 210 1 1 77 1972 5.16 1
## 3 426 1 1 63 1970 4.84 1
## 4 493 3 1 72 1971 12.56 1
## 5 621 1 1 72 1972 7.06 1
## 6 629 1 1 95 1968 5.48 1
We use head() to only display the first 6 rows
R
provides a suite of operators that can be used when filtering data, including >, >=, <, <=, !=, ==, & (and), | (or)
arrange()
arrange()
works similarly to filter()
except that instead of filtering or selecting rows, it reorders them. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For instance we can arrange the melanoma dataframe by age first and subsequently by ulcer thinkness:
head(arrange(melanoma, age, thickness))
## time status sex age year thickness ulcer
## 1 3385 2 0 4 1968 2.74 0
## 2 3776 2 1 12 1967 7.09 1
## 3 469 1 0 14 1969 2.42 1
## 4 1710 2 1 15 1973 0.58 0
## 5 858 1 0 16 1967 3.56 0
## 6 4479 2 0 19 1965 1.13 1
We can use desc()
to order a column in descending order:
head(arrange(melanoma, desc(age)))
## time status sex age year thickness ulcer
## 1 629 1 1 95 1968 5.48 1
## 2 667 1 0 89 1968 13.85 1
## 3 826 3 0 86 1965 8.54 1
## 4 1970 2 1 84 1972 4.09 1
## 5 1690 1 1 83 1971 1.62 0
## 6 1919 2 1 83 1972 3.54 0
select()
Often you work with large datasets with many columns but only a few are actually of interest. select()
allows you to zoom in on a useful subset using operations that usually only work on numeric variable positions:
select()
is used for choosing variables based on the subset criteria. For instance to display the variables sex, age and year:
head(select(melanoma, sex, age, year))
## sex age year
## 1 1 76 1972
## 2 1 56 1968
## 3 1 41 1977
## 4 0 71 1968
## 5 1 52 1965
## 6 1 28 1971
or we can select all variables between age an ulcer:
head(select(melanoma, age:ulcer))
## age year thickness ulcer
## 1 76 1972 6.76 1
## 2 56 1968 0.65 0
## 3 41 1977 1.34 0
## 4 71 1968 2.90 0
## 5 52 1965 12.08 1
## 6 28 1971 4.84 1
or all variables except sex and age:
head(select(melanoma, -(sex:age)))
## time status year thickness ulcer
## 1 10 3 1972 6.76 1
## 2 30 3 1968 0.65 0
## 3 35 2 1977 1.34 0
## 4 99 3 1968 2.90 0
## 5 185 1 1965 12.08 1
## 6 204 1 1971 4.84 1
To rename columns we can use rename()
:
head(rename(melanoma, ulceration = ulcer))
## time status sex age year thickness ulceration
## 1 10 3 1 76 1972 6.76 1
## 2 30 3 1 56 1968 0.65 0
## 3 35 2 1 41 1977 1.34 0
## 4 99 3 0 71 1968 2.90 0
## 5 185 1 1 52 1965 12.08 1
## 6 204 1 1 28 1971 4.84 1
mutate()
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This can be done using the function mutate()
For instance we can create the new column sex_str where we specify the sex as a string, either male or female:
head(mutate(melanoma, sex_str = ifelse(sex==0, 'female', 'male')))
## time status sex age year thickness ulcer sex_str
## 1 10 3 1 76 1972 6.76 1 male
## 2 30 3 1 56 1968 0.65 0 male
## 3 35 2 1 41 1977 1.34 0 male
## 4 99 3 0 71 1968 2.90 0 female
## 5 185 1 1 52 1965 12.08 1 male
## 6 204 1 1 28 1971 4.84 1 male
or we can create a new column of tumour thickness in cm:
head(mutate(melanoma, thickness_cm = thickness/10))
## time status sex age year thickness ulcer thickness_cm
## 1 10 3 1 76 1972 6.76 1 0.676
## 2 30 3 1 56 1968 0.65 0 0.065
## 3 35 2 1 41 1977 1.34 0 0.134
## 4 99 3 0 71 1968 2.90 0 0.290
## 5 185 1 1 52 1965 12.08 1 1.208
## 6 204 1 1 28 1971 4.84 1 0.484
mutate()
appends the new variable to the data frame. If you only want to keep the new variables, use transmute()
:
head(transmute(melanoma, sex_str = ifelse(sex==0, 'female', 'male'), thickness_cm = thickness/10))
## sex_str thickness_cm
## 1 male 0.676
## 2 male 0.065
## 3 male 0.134
## 4 female 0.290
## 5 male 1.208
## 6 male 0.484
When carrying out function call inside of other functions it can become a bit awkward and difficult to read and manage. To circumvent this we can use the
dplyr
pipe operator %>%
. It is used to tie multiple operations together. This makes it easy, especially when we need to perform various operations on a dataset to derive the results.
For instance we can read melanoma %>% select(sex, age)
from left to right — from the melanoma dataset, select sex and age variables. Meaning that we pipe the output from one call as input for the next.
melanoma %>% select(sex, age) %>% filter(age >= 50 & sex == 1) %>% head()
## sex age
## 1 1 76
## 2 1 56
## 3 1 52
## 4 1 77
## 5 1 63
## 6 1 72
Using pipe in the above makes the code easier to read compared to:
head(select(filter(melanoma, age >= 50 & sex == 1 ), sex,age))
## sex age
## 1 1 76
## 2 1 56
## 3 1 52
## 4 1 77
## 5 1 63
## 6 1 72
summarise()
and group_by
group_by
is used to group data together based on one or more columns. It is mostly useful when used along with a summarizing function to derive aggregated values for each group. For instance we can group by sex and count the observation by using the tally()
function:
melanoma %>% group_by(sex) %>% tally()
## # A tibble: 2 x 2
## sex n
## <dbl> <int>
## 1 0 126
## 2 1 79
The power of group_by()
comes when it is used together with the summarize()
function which collapses each group into a single-row summary of that group by applying an aggregating or summary function to each group: So to view the mean tumour thickness for each sex:
melanoma %>% group_by(sex) %>% summarise(mean_thickness = mean(thickness))
## # A tibble: 2 x 2
## sex mean_thickness
## <dbl> <dbl>
## 1 0 2.49
## 2 1 3.61
or we can calculate some statistic measures for the parameter sex.
melanoma %>%
group_by(sex) %>%
summarise(mean_thickness = mean(thickness),
sd_thickness = sd(thickness),
n_thickness = n(),
SE_thickness = sd(thickness)/sqrt(n())) %>%
mutate(lower_ci = mean_thickness - qt(1 - (0.05 / 2), n_thickness - 1) * SE_thickness,
upper_ci = mean_thickness + qt(1 - (0.05 / 2), n_thickness - 1) * SE_thickness)
## # A tibble: 2 x 7
## sex mean_thickness sd_thickness n_thickness SE_thickness lower_ci
## <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 0 2.49 2.75 126 0.245 2.00
## 2 1 3.61 3.16 79 0.355 2.90
## # … with 1 more variable: upper_ci <dbl>
sample_n()
and sample_frac()
You can use sample_n()
and sample_frac()
to take a random sample of rows: use sample_n()
for a fixed number and sample_frac()
for a fixed fraction
melanoma %>% sample_n(10)
## time status sex age year thickness ulcer
## 1 1648 2 0 55 1973 1.29 0
## 2 1933 1 0 77 1972 1.94 0
## 3 1075 1 1 66 1971 3.54 1
## 4 4688 2 0 42 1965 0.48 0
## 5 1512 2 0 77 1973 0.16 0
## 6 1548 1 0 61 1972 1.62 0
## 7 858 1 0 16 1967 3.56 0
## 8 1525 3 0 76 1970 1.29 1
## 9 2075 2 1 55 1972 2.58 1
## 10 185 1 1 52 1965 12.08 1
melanoma %>% sample_frac(0.1)
## time status sex age year thickness ulcer
## 1 4001 2 0 69 1967 2.10 0
## 2 2062 1 0 52 1965 3.06 0
## 3 3154 3 1 49 1969 1.62 0
## 4 3667 2 0 42 1967 3.22 0
## 5 1654 2 1 20 1973 0.97 0
## 6 2426 2 0 69 1971 3.87 0
## 7 2156 2 0 45 1972 0.97 0
## 8 2460 2 0 47 1971 0.64 0
## 9 4926 2 0 50 1964 2.26 0
## 10 204 1 1 28 1971 4.84 1
## 11 833 1 0 56 1971 2.58 1
## 12 3278 2 1 54 1969 2.58 0
## 13 279 1 0 68 1971 7.41 1
## 14 1634 2 0 68 1973 1.37 0
## 15 1641 2 0 57 1973 0.81 0
## 16 210 1 1 77 1972 5.16 1
## 17 1252 1 0 58 1971 3.87 1
## 18 355 3 0 64 1972 0.16 1
## 19 2227 2 0 51 1971 0.10 0
## 20 2108 1 0 58 1969 1.76 1
Use replace = TRUE
to perform a bootstrap sample. If needed, you can weight the sample with the weight
argument.
tidyr
The goal of tidyr is to help you create tidy data. Tidy data is data where:
Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis. A cheat sheet to tidyr
can be found here
There are four major function that can be used to tidying data:
gather()
takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer.spread()
takes two columns (key & value), and spreads into multiple columns: it makes “long” data wider.unite()
does the reverse of separate. It unites multiple columns into single columnseparate()
splits apart one column into multiple columns, by splitting wherever a separator character appears. E.g., we can split the variable year
into two variables, called century
and year
. Here we specify in the seperator that we want to seperate at the second position of the year.
melanoma %>% separate(year, into = c('century', 'year'), sep = 2) %>% head()
## time status sex age century year thickness ulcer
## 1 10 3 1 76 19 72 6.76 1
## 2 30 3 1 56 19 68 0.65 0
## 3 35 2 1 41 19 77 1.34 0
## 4 99 3 0 71 19 68 2.90 0
## 5 185 1 1 52 19 65 12.08 1
## 6 204 1 1 28 19 71 4.84 1
Subsequently we can append the 00 to the century with mutate()
.
df <- melanoma %>% separate(year, into = c('century', 'year'), sep = 2)
head(df)
## time status sex age century year thickness ulcer
## 1 10 3 1 76 19 72 6.76 1
## 2 30 3 1 56 19 68 0.65 0
## 3 35 2 1 41 19 77 1.34 0
## 4 99 3 0 71 19 68 2.90 0
## 5 185 1 1 52 19 65 12.08 1
## 6 204 1 1 28 19 71 4.84 1
With the function unite()
, we can reverse the seperation.
df %>% unite(year, century, year) %>% head()
## time status sex age year thickness ulcer
## 1 10 3 1 76 19_72 6.76 1
## 2 30 3 1 56 19_68 0.65 0
## 3 35 2 1 41 19_77 1.34 0
## 4 99 3 0 71 19_68 2.90 0
## 5 185 1 1 52 19_65 12.08 1
## 6 204 1 1 28 19_71 4.84 1
ggplot2
It’s hard to succinctly describe how ggplot2 works because it embodies a deep philosophy of visualisation. However, in most cases you start with ggplot()
, supply a dataset and aesthetic mapping (with aes()
). You then add on layers (i.e. plot types like geom_point()
to create a scatterplot or geom_histogram()
to create a histogram), scales (like scale_colour_brewer()
), faceting specifications (like facet_wrap()
) and coordinate systems (like coord_flip()
).
The general syntax of using ggplot2 will look like this:
ggplot(data = <default data set>,
aes(x = <default x axis variable>,
y = <default y axis variable>,
... <other default aesthetic mappings>),
... <other plot defaults>) +
geom_<geom type>(aes(size = <size variable for this geom>,
... <other aesthetic mappings>),
data = <data for this point geom>,
stat = <statistic string or function>,
position = <position string or function>,
color = <"fixed color specification">,
<other arguments, possibly passed to the _stat_ function) +
scale_<aesthetic>_<type>(name = <"scale label">,
breaks = <where to put tick marks>,
labels = <labels for tick marks>,
... <other options for the scale>) +
theme(plot.background = element_rect(fill = "gray"),
... <other theme elements>)
It might seem a little daunting but essentially one can create usefull plots by understanding just the first, and most important, couple of layers.
Check out the ggplot cheatsheet here
We start out by specifying the dataset we are working with to the ggplot
function. Then we supply the variables that we want to plot in the aes()
function:
ggplot(data = melanoma, mapping = aes(x = thickness))
Next we can add the next layer, the plot type. All plot types starts with geom_. We pipe together the layers using +
, similarly to the pipe operator previously used.
ggplot(data = melanoma, mapping = aes(x = thickness)) + geom_histogram()
In the
geom_histogram()
layer we can specify the color and fill color of bins. We can add axis labels and title with xlab()
, ylab
and ggtitle()
.
ggplot(data = melanoma, mapping = aes(x = thickness)) + geom_histogram(color = 'Blue', fill = 'Lightblue') + xlab('Tumour thickness in mm') + ylab('Count') + ggtitle('Histogram of tumour thickness')
We can use geom_boxplot()
to create a boxplot depicting the categorical variable status against the numerical age. To get a variable description of the melanoma dataset you can type ?melanoma
in the console. We can color the categorical groups by supplying the aes()
function:
ggplot(data = melanoma, mapping = aes(x = factor(status), y = age)) + geom_boxplot(aes(fill=factor(status))) + xlab('Patient status') + ylab('Patient age')
To create a scatter plot we supply the geom_point()
function. We can create a scatter plot of the numerical variables age
and thickness
colored by sex
:
ggplot(data = melanoma, mapping = aes(x = age, y = thickness)) + geom_point(aes(color=factor(sex))) + xlab('Patient age') + ylab('Tumour thickness')
readr
The goal of readr is to provide a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes. To accurately read a rectangular dataset with readr you combine two pieces: a function that parses the overall file, and a column specification. The column specification describes how each column should be converted from a character vector to the most appropriate data type, and in most cases it’s not necessary because readr will guess it for you automatically.
readr supports seven file formats with seven read_
functions:
read_csv()
: comma separated (CSV) filesread_tsv()
: tab separated filesread_delim()
: general delimited filesread_fwf()
: fixed width filesread_table()
: tabular files where columns are separated by white-space.read_log()
: web log filesIn many cases, these functions will just work: you supply the path to a file and you get a tibble dataframa back. The following example loads a sample file bundled with readr:
read_csv('test.csv',col_names = TRUE)
You can also specify the data type of every column loaded in data using the code below:
read_csv("iris.csv", col_types = list(
Sepal.Length = col_double(),
Sepal.Width = col_double(),
Petal.Length = col_double(),
Petal.Width = col_double(),
Species = col_factor(c("setosa", "versicolor", "virginica"))