Data Manipulation in R

Installing packages

We will start out by installing and loading the very usefull collection of R packages designed for data analysis, called tidyverse. All packages share an underlying design philosophy, grammar, and data structures.

install.packages('tidyverse')

library('tidyverse')

## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

For this data manipulation workflow, we’ll use the boot::melanoma package. The data consist of measurements made on patients with malignant melanoma. Each patient had their tumour removed by surgery at the Department of Plastic Surgery, University Hospital of Odense, Denmark during the period 1962 to 1977, and is documented in ?boot.

install.packages('boot')

library('boot')

First look at the data

The first step would be to discover what’s in the data.

head function to view first few rows from the data set.

head(melanoma)

##   time status sex age year thickness ulcer
## 1   10      3   1  76 1972      6.76     1
## 2   30      3   1  56 1968      0.65     0
## 3   35      2   1  41 1977      1.34     0
## 4   99      3   0  71 1968      2.90     0
## 5  185      1   1  52 1965     12.08     1
## 6  204      1   1  28 1971      4.84     1

glimpse to view the structure of the imported data

glimpse(melanoma)

## Observations: 205
## Variables: 7
## $ time      <dbl> 10, 30, 35, 99, 185, 204, 210, 232, 232, 279, 295, 355…
## $ status    <dbl> 3, 3, 2, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, …
## $ sex       <dbl> 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, …
## $ age       <dbl> 76, 56, 41, 71, 52, 28, 77, 60, 49, 68, 53, 64, 68, 63…
## $ year      <dbl> 1972, 1968, 1977, 1968, 1965, 1971, 1972, 1974, 1968, …
## $ thickness <dbl> 6.76, 0.65, 1.34, 2.90, 12.08, 4.84, 5.16, 3.22, 12.88…
## $ ulcer     <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

summary to get some descriptive statistics

summary(melanoma)

##       time          status          sex              age       
##  Min.   :  10   Min.   :1.00   Min.   :0.0000   Min.   : 4.00  
##  1st Qu.:1525   1st Qu.:1.00   1st Qu.:0.0000   1st Qu.:42.00  
##  Median :2005   Median :2.00   Median :0.0000   Median :54.00  
##  Mean   :2153   Mean   :1.79   Mean   :0.3854   Mean   :52.46  
##  3rd Qu.:3042   3rd Qu.:2.00   3rd Qu.:1.0000   3rd Qu.:65.00  
##  Max.   :5565   Max.   :3.00   Max.   :1.0000   Max.   :95.00  
##       year        thickness         ulcer      
##  Min.   :1962   Min.   : 0.10   Min.   :0.000  
##  1st Qu.:1968   1st Qu.: 0.97   1st Qu.:0.000  
##  Median :1970   Median : 1.94   Median :0.000  
##  Mean   :1970   Mean   : 2.92   Mean   :0.439  
##  3rd Qu.:1972   3rd Qu.: 3.56   3rd Qu.:1.000  
##  Max.   :1977   Max.   :17.42   Max.   :1.000

Data Manipulation using `dplyr`

The powerfull package dplyr aims to provide a function for each basic step of data manipulation. The functions aids in performing most of the typical data manipulation operations:

filter() to select cases based on their values.
arrange() to reorder the cases.
select() and rename() to select or rename variables based on their names.
mutate() and transmute() to add new variables that are functions of existing variables.
summarise() to condense multiple values to a single value.
sample_n() and sample_frac() to take random samples.
pipe operator(%>%) to pipe together operations.

Filter rows with `filter()`

filter() allows you to select a subset of rows in a data frame. Like all dplyr functions, the first argument is the tibble (or data frame). The second and subsequent arguments refer to variables within that data frame, selecting rows where the expression is TRUE.

For example, we can select all male subjects with age higher than or qual to 60:

head(filter(melanoma, age>=60 & sex == 1))

##   time status sex age year thickness ulcer
## 1   10      3   1  76 1972      6.76     1
## 2  210      1   1  77 1972      5.16     1
## 3  426      1   1  63 1970      4.84     1
## 4  493      3   1  72 1971     12.56     1
## 5  621      1   1  72 1972      7.06     1
## 6  629      1   1  95 1968      5.48     1

We use head() to only display the first 6 rows

R provides a suite of operators that can be used when filtering data, including >, >=, <, <=, !=, ==, & (and), | (or)

Arrange rows with `arrange()`

arrange() works similarly to filter() except that instead of filtering or selecting rows, it reorders them. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For instance we can arrange the melanoma dataframe by age first and subsequently by ulcer thinkness:

head(arrange(melanoma, age, thickness))

##   time status sex age year thickness ulcer
## 1 3385      2   0   4 1968      2.74     0
## 2 3776      2   1  12 1967      7.09     1
## 3  469      1   0  14 1969      2.42     1
## 4 1710      2   1  15 1973      0.58     0
## 5  858      1   0  16 1967      3.56     0
## 6 4479      2   0  19 1965      1.13     1

We can use desc() to order a column in descending order:

head(arrange(melanoma, desc(age)))

##   time status sex age year thickness ulcer
## 1  629      1   1  95 1968      5.48     1
## 2  667      1   0  89 1968     13.85     1
## 3  826      3   0  86 1965      8.54     1
## 4 1970      2   1  84 1972      4.09     1
## 5 1690      1   1  83 1971      1.62     0
## 6 1919      2   1  83 1972      3.54     0

Select columns with `select()`

Often you work with large datasets with many columns but only a few are actually of interest. select() allows you to zoom in on a useful subset using operations that usually only work on numeric variable positions:

select() is used for choosing variables based on the subset criteria. For instance to display the variables sex, age and year:

head(select(melanoma, sex, age, year))

##   sex age year
## 1   1  76 1972
## 2   1  56 1968
## 3   1  41 1977
## 4   0  71 1968
## 5   1  52 1965
## 6   1  28 1971

or we can select all variables between age an ulcer:

head(select(melanoma, age:ulcer))

##   age year thickness ulcer
## 1  76 1972      6.76     1
## 2  56 1968      0.65     0
## 3  41 1977      1.34     0
## 4  71 1968      2.90     0
## 5  52 1965     12.08     1
## 6  28 1971      4.84     1

or all variables except sex and age:

head(select(melanoma, -(sex:age)))

##   time status year thickness ulcer
## 1   10      3 1972      6.76     1
## 2   30      3 1968      0.65     0
## 3   35      2 1977      1.34     0
## 4   99      3 1968      2.90     0
## 5  185      1 1965     12.08     1
## 6  204      1 1971      4.84     1

To rename columns we can use rename():

head(rename(melanoma, ulceration = ulcer))

##   time status sex age year thickness ulceration
## 1   10      3   1  76 1972      6.76          1
## 2   30      3   1  56 1968      0.65          0
## 3   35      2   1  41 1977      1.34          0
## 4   99      3   0  71 1968      2.90          0
## 5  185      1   1  52 1965     12.08          1
## 6  204      1   1  28 1971      4.84          1

Add new columns with `mutate()`

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This can be done using the function mutate() For instance we can create the new column sex_str where we specify the sex as a string, either male or female:

head(mutate(melanoma, sex_str = ifelse(sex==0, 'female', 'male')))

##   time status sex age year thickness ulcer sex_str
## 1   10      3   1  76 1972      6.76     1    male
## 2   30      3   1  56 1968      0.65     0    male
## 3   35      2   1  41 1977      1.34     0    male
## 4   99      3   0  71 1968      2.90     0  female
## 5  185      1   1  52 1965     12.08     1    male
## 6  204      1   1  28 1971      4.84     1    male

or we can create a new column of tumour thickness in cm:

head(mutate(melanoma, thickness_cm = thickness/10))

##   time status sex age year thickness ulcer thickness_cm
## 1   10      3   1  76 1972      6.76     1        0.676
## 2   30      3   1  56 1968      0.65     0        0.065
## 3   35      2   1  41 1977      1.34     0        0.134
## 4   99      3   0  71 1968      2.90     0        0.290
## 5  185      1   1  52 1965     12.08     1        1.208
## 6  204      1   1  28 1971      4.84     1        0.484

mutate() appends the new variable to the data frame. If you only want to keep the new variables, use transmute():

head(transmute(melanoma, sex_str = ifelse(sex==0, 'female', 'male'), thickness_cm = thickness/10))

##   sex_str thickness_cm
## 1    male        0.676
## 2    male        0.065
## 3    male        0.134
## 4  female        0.290
## 5    male        1.208
## 6    male        0.484

Pipe operator

When carrying out function call inside of other functions it can become a bit awkward and difficult to read and manage. To circumvent this we can use the
dplyr pipe operator %>%. It is used to tie multiple operations together. This makes it easy, especially when we need to perform various operations on a dataset to derive the results.

For instance we can read melanoma %>% select(sex, age) from left to right — from the melanoma dataset, select sex and age variables. Meaning that we pipe the output from one call as input for the next.

melanoma %>% select(sex, age) %>% filter(age >= 50 & sex == 1) %>% head()

##   sex age
## 1   1  76
## 2   1  56
## 3   1  52
## 4   1  77
## 5   1  63
## 6   1  72

Using pipe in the above makes the code easier to read compared to:

head(select(filter(melanoma, age >= 50 & sex == 1 ), sex,age))

##   sex age
## 1   1  76
## 2   1  56
## 3   1  52
## 4   1  77
## 5   1  63
## 6   1  72

Group and summarise values with `summarise()` and `group_by`

group_by is used to group data together based on one or more columns. It is mostly useful when used along with a summarizing function to derive aggregated values for each group. For instance we can group by sex and count the observation by using the tally() function:

melanoma %>% group_by(sex) %>% tally()

## # A tibble: 2 x 2
##     sex     n
##   <dbl> <int>
## 1     0   126
## 2     1    79

The power of group_by() comes when it is used together with the summarize()function which collapses each group into a single-row summary of that group by applying an aggregating or summary function to each group: So to view the mean tumour thickness for each sex:

melanoma %>% group_by(sex) %>% summarise(mean_thickness = mean(thickness))

## # A tibble: 2 x 2
##     sex mean_thickness
##   <dbl>          <dbl>
## 1     0           2.49
## 2     1           3.61

or we can calculate some statistic measures for the parameter sex.

melanoma %>% 
  group_by(sex) %>%
  summarise(mean_thickness = mean(thickness), 
            sd_thickness = sd(thickness), 
            n_thickness = n(),  
            SE_thickness = sd(thickness)/sqrt(n())) %>% 
            mutate(lower_ci = mean_thickness - qt(1 - (0.05 / 2), n_thickness - 1) * SE_thickness,
            upper_ci = mean_thickness + qt(1 - (0.05 / 2), n_thickness - 1) * SE_thickness)

## # A tibble: 2 x 7
##     sex mean_thickness sd_thickness n_thickness SE_thickness lower_ci
##   <dbl>          <dbl>        <dbl>       <int>        <dbl>    <dbl>
## 1     0           2.49         2.75         126        0.245     2.00
## 2     1           3.61         3.16          79        0.355     2.90
## # … with 1 more variable: upper_ci <dbl>

Randomly sample rows with `sample_n()` and `sample_frac()`

You can use sample_n() and sample_frac() to take a random sample of rows: use sample_n() for a fixed number and sample_frac() for a fixed fraction

melanoma %>% sample_n(10)

##    time status sex age year thickness ulcer
## 1  1648      2   0  55 1973      1.29     0
## 2  1933      1   0  77 1972      1.94     0
## 3  1075      1   1  66 1971      3.54     1
## 4  4688      2   0  42 1965      0.48     0
## 5  1512      2   0  77 1973      0.16     0
## 6  1548      1   0  61 1972      1.62     0
## 7   858      1   0  16 1967      3.56     0
## 8  1525      3   0  76 1970      1.29     1
## 9  2075      2   1  55 1972      2.58     1
## 10  185      1   1  52 1965     12.08     1

melanoma %>% sample_frac(0.1)

##    time status sex age year thickness ulcer
## 1  4001      2   0  69 1967      2.10     0
## 2  2062      1   0  52 1965      3.06     0
## 3  3154      3   1  49 1969      1.62     0
## 4  3667      2   0  42 1967      3.22     0
## 5  1654      2   1  20 1973      0.97     0
## 6  2426      2   0  69 1971      3.87     0
## 7  2156      2   0  45 1972      0.97     0
## 8  2460      2   0  47 1971      0.64     0
## 9  4926      2   0  50 1964      2.26     0
## 10  204      1   1  28 1971      4.84     1
## 11  833      1   0  56 1971      2.58     1
## 12 3278      2   1  54 1969      2.58     0
## 13  279      1   0  68 1971      7.41     1
## 14 1634      2   0  68 1973      1.37     0
## 15 1641      2   0  57 1973      0.81     0
## 16  210      1   1  77 1972      5.16     1
## 17 1252      1   0  58 1971      3.87     1
## 18  355      3   0  64 1972      0.16     1
## 19 2227      2   0  51 1971      0.10     0
## 20 2108      1   0  58 1969      1.76     1

Use replace = TRUE to perform a bootstrap sample. If needed, you can weight the sample with the weight argument.

Tidying data using `tidyr`

The goal of tidyr is to help you create tidy data. Tidy data is data where:

Each variable is in a column.
Each observation is a row.
Each value is a cell.

Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis. A cheat sheet to tidyr can be found here

There are four major function that can be used to tidying data:

gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer.
spread() takes two columns (key & value), and spreads into multiple columns: it makes “long” data wider.
``separate()``` splits a column into multiple columns.
unite() does the reverse of separate. It unites multiple columns into single column

separate() splits apart one column into multiple columns, by splitting wherever a separator character appears. E.g., we can split the variable year into two variables, called century and year. Here we specify in the seperator that we want to seperate at the second position of the year.

melanoma %>% separate(year, into = c('century', 'year'), sep = 2) %>% head()

##   time status sex age century year thickness ulcer
## 1   10      3   1  76      19   72      6.76     1
## 2   30      3   1  56      19   68      0.65     0
## 3   35      2   1  41      19   77      1.34     0
## 4   99      3   0  71      19   68      2.90     0
## 5  185      1   1  52      19   65     12.08     1
## 6  204      1   1  28      19   71      4.84     1

Subsequently we can append the 00 to the century with mutate().

df <- melanoma %>% separate(year, into = c('century', 'year'), sep = 2)
head(df)

##   time status sex age century year thickness ulcer
## 1   10      3   1  76      19   72      6.76     1
## 2   30      3   1  56      19   68      0.65     0
## 3   35      2   1  41      19   77      1.34     0
## 4   99      3   0  71      19   68      2.90     0
## 5  185      1   1  52      19   65     12.08     1
## 6  204      1   1  28      19   71      4.84     1

With the function unite(), we can reverse the seperation.

df %>% unite(year, century, year) %>% head()

##   time status sex age  year thickness ulcer
## 1   10      3   1  76 19_72      6.76     1
## 2   30      3   1  56 19_68      0.65     0
## 3   35      2   1  41 19_77      1.34     0
## 4   99      3   0  71 19_68      2.90     0
## 5  185      1   1  52 19_65     12.08     1
## 6  204      1   1  28 19_71      4.84     1

Data visualization using `ggplot2`

It’s hard to succinctly describe how ggplot2 works because it embodies a deep philosophy of visualisation. However, in most cases you start with ggplot(), supply a dataset and aesthetic mapping (with aes()). You then add on layers (i.e. plot types like geom_point() to create a scatterplot or geom_histogram()to create a histogram), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()).

The general syntax of using ggplot2 will look like this:

ggplot(data = <default data set>, 
       aes(x = <default x axis variable>,
           y = <default y axis variable>,
           ... <other default aesthetic mappings>),
       ... <other plot defaults>) +

       geom_<geom type>(aes(size = <size variable for this geom>, 
                      ... <other aesthetic mappings>),
                  data = <data for this point geom>,
                  stat = <statistic string or function>,
                  position = <position string or function>,
                  color = <"fixed color specification">,
                  <other arguments, possibly passed to the _stat_ function) +

  scale_<aesthetic>_<type>(name = <"scale label">,
                     breaks = <where to put tick marks>,
                     labels = <labels for tick marks>,
                     ... <other options for the scale>) +

  theme(plot.background = element_rect(fill = "gray"),
        ... <other theme elements>)

It might seem a little daunting but essentially one can create usefull plots by understanding just the first, and most important, couple of layers.

Check out the ggplot cheatsheet here

We start out by specifying the dataset we are working with to the ggplot function. Then we supply the variables that we want to plot in the aes() function:

ggplot(data = melanoma, mapping = aes(x = thickness))

Next we can add the next layer, the plot type. All plot types starts with geom_. We pipe together the layers using +, similarly to the pipe operator previously used.

ggplot(data = melanoma, mapping = aes(x = thickness)) + geom_histogram()

In the geom_histogram() layer we can specify the color and fill color of bins. We can add axis labels and title with xlab(), ylab and ggtitle().

ggplot(data = melanoma, mapping = aes(x = thickness)) + geom_histogram(color = 'Blue', fill = 'Lightblue') + xlab('Tumour thickness in mm') + ylab('Count') + ggtitle('Histogram of tumour thickness')

We can use geom_boxplot() to create a boxplot depicting the categorical variable status against the numerical age. To get a variable description of the melanoma dataset you can type ?melanoma in the console. We can color the categorical groups by supplying the aes()function:

ggplot(data = melanoma, mapping = aes(x = factor(status), y = age)) + geom_boxplot(aes(fill=factor(status))) + xlab('Patient status') + ylab('Patient age')

To create a scatter plot we supply the geom_point() function. We can create a scatter plot of the numerical variables age and thickness colored by sex:

ggplot(data = melanoma, mapping = aes(x = age, y = thickness)) + geom_point(aes(color=factor(sex))) + xlab('Patient age') + ylab('Tumour thickness')

Reading data using `readr`

The goal of readr is to provide a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes. To accurately read a rectangular dataset with readr you combine two pieces: a function that parses the overall file, and a column specification. The column specification describes how each column should be converted from a character vector to the most appropriate data type, and in most cases it’s not necessary because readr will guess it for you automatically.

readr supports seven file formats with seven read_ functions:

read_csv(): comma separated (CSV) files
read_tsv(): tab separated files
read_delim(): general delimited files
read_fwf(): fixed width files
read_table(): tabular files where columns are separated by white-space.
read_log(): web log files

In many cases, these functions will just work: you supply the path to a file and you get a tibble dataframa back. The following example loads a sample file bundled with readr:

read_csv('test.csv',col_names = TRUE)

You can also specify the data type of every column loaded in data using the code below:

read_csv("iris.csv", col_types = list(
      Sepal.Length = col_double(),
      Sepal.Width = col_double(),
      Petal.Length = col_double(),
      Petal.Width = col_double(),
      Species = col_factor(c("setosa", "versicolor", "virginica"))

Data Manipulation in R

Simon Kønig

21/02/2020

Installing packages

First look at the data

Data Manipulation using `dplyr`

Filter rows with `filter()`

Arrange rows with `arrange()`

Select columns with `select()`

Add new columns with `mutate()`

Pipe operator

Group and summarise values with `summarise()` and `group_by`

Randomly sample rows with `sample_n()` and `sample_frac()`

Tidying data using `tidyr`

Data visualization using `ggplot2`

Reading data using `readr`

References

Data Manipulation in R

Simon Kønig

21/02/2020

Installing packages

First look at the data

Data Manipulation using dplyr

Filter rows with filter()

Arrange rows with arrange()

Select columns with select()

Add new columns with mutate()

Pipe operator

Group and summarise values with summarise() and group_by

Randomly sample rows with sample_n() and sample_frac()

Tidying data using tidyr

Data visualization using ggplot2

Reading data using readr

References

Data Manipulation using `dplyr`

Filter rows with `filter()`

Arrange rows with `arrange()`

Select columns with `select()`

Add new columns with `mutate()`

Group and summarise values with `summarise()` and `group_by`

Randomly sample rows with `sample_n()` and `sample_frac()`

Tidying data using `tidyr`

Data visualization using `ggplot2`

Reading data using `readr`