4.summarise-and-combine-live-coding-script.Rmd

---
title: "Summarising and combining data -- live coding script"
author: "Chandra Chilamakuri, Matt Eldridge and Ashley Sawle"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
  html_document:
    toc: yes
    toc_float: yes
---

## Introduction

We will now turn our attention to some of the more advanced operations you can
perform on data frames using **dplyr**.

<p>
Especially I will be taking you through three important dplyr fucntions.
<p>


- **summarise**
    - Summarising data by applying a function to specified column(s)

- **group_by**
    - Grouping rows with shared or common values prior to summarising for each group

- **join**
    - Joining matching rows from two data frames based on matching values for specified columns


<p>
Before we start summarise and combine operations, let's clean the environment and start fresh.
<p>

### Load the tidyverse

```{r }
library(tidyverse)
```

### Read the patients data into R enveronment

Read in the patients data frame using `read_tsv` from the `readr` package.
`read_tsv` imports tab-delimited files (tsv = tab-separated values).


```{r }

patients <- read_tsv("patient-data-cleaned.txt")

```

## Summarising

We can compute summary statistics for selected columns in our dataset using the
`summarise` verb. For example, we could use `summarise` to calculate the
average height of patients in our `patients` data.

Let's first remind ourselves what the `patients` dataset looks like.


```{r}
patients
```

Now let’s compute the mean height of a patient.


```{r}
summarise(patients, mean(Height))
```

If you prefer Oxford spelling, in which -ize is preferred to -ise, you’re in
luck as dplyr accommodates the alternative spelling.

Note that the result is a data frame consisting in this case of a single row
and a single column, unlike the more usual way of calculating the mean value
for a vector or column, which results in a single numeric value (actually in R
this is numeric vector of length 1).

```{r}
mean(patients$Height)
```

Returning a data frame might be quite useful, particularly if we’re summarising
multiple columns or using more than one function, for example computing the
average and standard deviation.

```{r}

summarise(patients, average = mean(Height), standard_deviation = sd(Height))

```


`summarise` collapses data into a single row of values. Notice how we also
named the output columns in this last example.


`summarise` can take any R function that takes a vector of values and returns a
single value. Some of the more useful functions include:

* `min` minimum value
* `max` maximum value
* `sum` sum of values
* `mean` mean value
* `sd` standard deviation
* `median` median value
* `IQR` the interquartile range
* `n_distinct` the number of distinct values
* `n` the number of observations (*Note: this is a special function that
  doesn't take a vector argument, i.e. column*)

It is also possible to summarise using a function that takes more than one
value, i.e. from multiple columns. For example, we could compute the
correlation between patient height and Weight:

```{r}
summarise(patients, correlation = cor(Height, Weight))
```

### Summarising multiple columns

We can apply the same function to all columns using `summarise_all`.


```{r}
summarise_all(patients, mean)
```

While this seems to work, there are several warnings about inputs to the `mean`
function not being numerical. Can you see why?

A look at the documentation for the `summarise_all` function (type
'?summarise_all' at the command prompt or use the Help viewer in RStudio) shows
that there are two related functions, `summarise_at` and `summarise_if`, either
of which can be used to specify the columns for which we would like to
calculate the average values.

`summarise_at` allows us to select the columns on which to operate using an
additional `vars` argument.


```{r}
summarise_at(patients, vars(Height, Weight), mean)

```


`summarise_if` provides another option and works well in cases where the
operation should be applied to all columns that meet certain criteria. In this
case, we want to calculate the averages for all columns with numeric values.

```{r}
summarise_if(patients, is.numeric, mean)
```

It is possible to summarise using more than one function in which case a list of
functions needs to be provided.

```{r}
summarise_at(patients, vars(Height, Weight), list(mean, sd))
```

of if you want some control over naming the output columns.

```{r}
patients %>%
  select(Height, Weight) %>%
  summarise_all(list(average =  mean, stdev = sd))

```


Just like with the `select` operation, we can instead specify those columns to
exclude.

```{r}

summarise_at(patients, vars(-ID:-Smokes, -State:-Died), mean)

```


Or we can use one of the helper functions to choose which columns to operate
on.


```{r}

summarise_at(patients, vars(starts_with('BM')), mean)

summarise_at(patients, vars(ends_with('ight')), mean)

# use regular expression to select columns on which to operate
summarise_at(patients, vars(matches('[HW]eight')), mean)

# use one_of if you have a vector of column names

columns <- c('Height', 'Weight', 'Age')

summarise_at(patients, vars(one_of(columns)), mean)

# alternatively, and more concisely, just pass the vector of columns names

summarise_at(patients, vars(columns), mean)

```

### Exercise - summarising

See separate R markdown document.

## Grouping

While the `summarise` function is useful on its own, it becomes really powerful
when applied to groups of observations within a dataset. For example, suppose
we want to compute the mean height for patients from different states.  We
could take each state in turn and `filter` the data frame to only contain rows
for a given state, then apply `summarise`, but that would be somewhat
cumbersome. In `dplyr`, the `group_by` function allows this to be done in one
simple step.

This works best in a workflow using the `%>%` pipe symbol, so as quick
reminder, the following are identical ways of computing the average height of
patients.

```{r}

summarise(patients, mean(Height))

patients %>% summarise(mean(Height))

```


Now let’s do the same thing but treating each group separately.

```{r}

patients %>% group_by(State) %>% summarise(mean(Height))
```


As before, we can summarise multiple observations.


```{r}

patients %>% group_by(State) %>% summarise(n(), mean(Height), sd(Height))

```

We can make the output more presentable by renaming the columns and using the
`round` function to round to a specified number of significant figures. Note
the use of backticks (` ) for specifying column names that contain spaces.

```{r}

patients %>%
  group_by(State) %>%
  summarise(
    N = n(),
    `Average height` = mean(Height),
    `Stdev height` = sd(Height)
  ) %>%
  mutate_at(vars(`Average height`, `Stdev height`), round, digits = 1)
```


A shorthand for summarise(n()) for counting the number of observations of each group is available in the form of the count function.

```{r}

patients %>% group_by(State) %>% summarise(N = n())

patients %>% group_by(State) %>% count()

```


`group_by` can also be used in conjunction with other `dplyr` verbs.

```{r}
patients %>% group_by(State) %>% summarise_at(vars(Height, Weight), mean)
```

```{r, fig.width=12}
patients %>%
    group_by(State) %>%
    summarise_at(vars(Height, Weight), mean) %>%
    gather(measure, mean, -State) %>%
    ggplot(aes(x = State, y = mean, color = measure, group = measure)) +
      geom_point() +
      geom_line() +
      facet_wrap(~measure, scales = "free_y")
```


Returning to one of the earlier examples, we can also compute the correlation between Height  and Weight on a per-group basis.

```{r}
patients %>% group_by(State) %>% summarise(correlation = cor(Height, Weight))
```


### `group_by` - the technical details


You might be wondering what's going on _under the hood_ with this `group_by` function. The help page for `group_by` is a little on the technical side but essentially tells us that the data frame we pass it gets converted into a `grouped_df` data frame object. `dplyr` functions that operate on a `grouped_df` object know to treat this in a special way, operating on each group separately. The following sequence of R commands might help to make this a bit clearer.

First let's take a look at the class of the patients data frame.

```{r}
class(patients)
```

Now we'll create a grouped version with `group_by` and look at its class.

```{r}
patients_grouped <- patients %>%  group_by(State)
class(patients_grouped)
```


The `groups` function lets us see the groups.


```{r}
groups(patients_grouped)
```


The `ungroup` function removes the grouping.

```{r}
patients_ungrouped <- ungroup(patients_grouped)
class(patients_ungrouped)
groups(patients_ungrouped)

```

`ungroup` can be quite helpful in more complicated chains of `dplyr` operations where the grouping is only required for some of the steps and would have unintended consequences for subsequent operations within the chain.

### Exercise - Grouping

See separate R markdown document.


## Joining


In many real life situations, data are spread across multiple tables or spreadsheets. Usually this occurs because different types of information about a subject, e.g. a patient, are collected from different sources. It may be desirable for some analyses to combine data from two or more tables into a single data frame based on a common column, for example, an attribute that uniquely identifies the subject.

`dplyr` provides a set of join functions for combining two data frames based on matches within specified columns. These operations are very similar to carrying out join operations between tables in a relational database using SQL.

### `left_join`

To illustrate join operations we'll first consider the most common type, a "left join". In the schematic below the two data frames share a common column, V1. We can combine the two data frames into a single data frame by matching rows in the first data frame with those in the second data frame that share the same value of V1.

![dplyr left join](images/dplyr_left_join.png)


`left_join` returns all rows from the first data frame regardless of whether there is a match in the second data frame. Rows with no match are included in the resulting data frame but have `NA` values in the additional columns coming from the second data frame.

Here's an example in which details about members of the Beatles and Rolling Stones are contained in two tables, using data frames conveniently provided by `dplyr`. The name column identifies each of the band members and is used for matching rows from the two tables.

```{r}
band_members
band_instruments
```


```{r}
left_join(band_members, band_instruments, by = "name")
```


`right_join` is similar but returns all rows from the second data frame that have a match with rows in the first data frame based on the specified column.

```{r}
right_join(band_members, band_instruments, by = "name")
```


### `inner_join`

It is also possible to return only those rows where matches could be made. The `inner_join` function accomplishes this.

![dplyr inner join](images/dplyr_inner_join.png)

```{r}
inner_join(band_members, band_instruments, by = "name")
```


### `full_join`

We've seen how missing rows from one table can be retained in the joined data frame using `left_join` or `right_join` but sometimes data for a given subject may be missing from either of the tables and we still want that subject to appear in the combined table. A `full_join` will return all rows and all columns from the two tables and where there are no matching values, `NA` values are used to fill in the missing values.

![dplyr full join](images/dplyr_full_join.png)


```{r}
full_join(band_members, band_instruments, by = "name")
```


### Joining on columns with different headers

The columns that are used for joining two tables do not have to have the same name, e.g.

```{r}
band_instruments2
left_join(band_members, band_instruments2, by = c("name" = "artist"))
```


### Multiple matches in join operations

Where there are multiple rows in one or both of the two tables for the thing that is being joined, these will be represented in the combined table. All combinations of the matching rows will be included.

```{r}
df1 <- data_frame(
  name = c("Mick",   "John",    "Paul",    "Paul"),
  band = c("Stones", "Beatles", "Beatles", "Wings")
)
df2 <- data_frame(
  name =  c("John",   "Paul", "Paul",   "Keith"),
  plays = c("guitar", "bass", "guitar", "guitar")
)

left_join(df1, df2, by = "name")

```


### Joining by matching on multiple columns


```{r}

df1 <- data_frame(
  forename = c("Mick",   "John",    "Paul",      "John"),
  surname =  c("Jagger", "Lennon",  "McCartney", "Squire"),
  band =     c("Stones", "Beatles", "Beatles",   "Roses")
)

df2 <- data_frame(
  forename = c("John",   "Paul",      "Keith",    "John"),
  surname =  c("Lennon", "McCartney", "Richards", "Squire"),
  plays =    c("guitar", "bass",      "guitar",   "guitar")
)
left_join(df1, df2, by = c("forename", "surname"))

```


### Filtering joins

A variation on the join operations we've considered are `semi_join` and `anti_join` that filter the rows in one table based on matches or lack of matches to rows in another table.

```{r}
# semi_join returns all rows from the first table where there are matches in the other table
semi_join(band_members, band_instruments, by = "name")

```

```{r}
# anti_join returns all rows where there is no match in the other table
anti_join(band_members, band_instruments, by = "name")
```

### Exercise - combining data

The exercise uses a more realistic dataset, building on the patients data frame we've already been working with.

The patients are all part of a diabetes study and have had their blood glucose concentration and diastolic blood pressure measured on several dates. The aim is to compare the average blood pressures of smokers and non-smokers.

See separate R markdown document.