Data manipulation basics with tidyverse – Part 2 – Basic functions

In part 1 we saw how to use pipes to pass data between functions so that we can write R code like a sentence. The second impressive thing with tidyverse is the grammar for manipulating data. The way the functions are structured and named in tidyverse gives us a consistent way of writing R code which is clear, concise and readable. Except for very few cases, I almost always find myself using just 5 basic function with tidyverse ,

  • select – select columns
  • filter – select records
  • mutate – modify columns
  • summarise – combine records
  • arrange – arrange records

consider the sample table below,

| name  |  year | sex | town   |
 ----------------------------
|  A    |  1998 |  M  | London |
|  B    |  1995 |  M  | Berlin |
|  C    |  1994 |  F  | London |
|  D    |  2000 |  F  | Madrid |
|  E    |  1995 |  M  | Berlin |

1) Select function is to select vertical columns from the table. for example, select(year,sex,town) will return table with just the the three columns selected. We can even rename the columns as we select them (or use rename() as well), select( year, sex, city = town)

|  year | sex | city   |
 ---------------------
|  1998 |  M  | London |
|  1995 |  M  | Berlin |
|  1994 |  F  | London |
|  2000 |  F  | Madrid |
|  1995 |  M  | Berlin |

2) Filter function is to select records based on a criteria. for example, filter(year < 2000) will select only records where year is less than 2000. we can even combine multiple criteria with logical operators & (and) and | (or),

|  year | sex | city   |
 ---------------------
|  1998 |  M  | London |
|  1995 |  M  | Berlin |
|  1994 |  F  | London |
|  1995 |  M  | Berlin |

3) Mutate function modifies columns. for example, mutate(age = 2018 - year) will create a new column with name age and calculate it based on year,

|  year | sex | city   |  age |
 -------------------------------
|  1998 |  M  | London |  20  |
|  1995 |  M  | Berlin |  22  |
|  1994 |  F  | London |  23  |
|  1995 |  M  | Berlin |  22  |

4) Summarise is a two-part function which combines records based on one or more columns based on a formula (function). for example, if we need average age of people in cities according to gender, we can do – group_by(city,sex) %>% summarise(average.age=mean(age)) gives us,

|  city  | sex |  average.age |
-------------------------------
| Berlin |  M  |       23     |
| London |  F  |       24     |
| London |  M  |       20     |

5) Arrage function arranges records based on the value in the columns specified. for example, arrange(average.age) gives us,

|  city  | sex |  average.age |
-------------------------------
| London |  M  |      20      |
| Berlin |  M  |      23      |
| London |  F  |      24      |

I have found that most of the data manipulation can be done combining these 5 functions in tidyverse and the best part is that the resulting code translates really well to english. All the stuff we did earlier can be written down in a single line, clearly without any intermediate objects or referring to the data we are working on repeatedly. For example,

people %>% 
    select(year, sex, city) %>%
    filter(year < 2000) %>%
    mutate(age = 2018 - year) %>%
    group_by(city, sex) %>% summarise(average.age = mean(age))
    arrange(average.age)

which translates to,
Take the people table, select year, sex and city columns, filter for records where year is less than 2000, calculate age column from year, group the table by city and age and find out average age for the groups and arrange the records by age.

One thought on “Data manipulation basics with tidyverse – Part 2 – Basic functions”

Leave a Reply