Visualising flows as Sankey diagrams with R

This post is about making quick and easy Sankey diagrams with R (and the networkD3 package) for exploring data. All we need to do is understand how to convert the data into a network; the rest is easy. We’ll create a random sample data set that records which room each person was in at three points in the day (morning, afternoon and evening) and then visualise how people flow between rooms over time. We’ll use the tidyverse packages, which I mentioned in earlier posts.

First we need to create a random set of data. We do this by generating 100 random names and assigning each person to one of 5 rooms at random for each of the three time periods.

# load required libraries
library(randomNames)
library(tidyverse)

# generate people names
people <- randomNames(100, which.names = 'first')
# generate a set of rooms
rooms <- paste(rep("Room", 5), 1:5)
# populate data-set by combining both
morning <- sample(rooms, 100, replace=TRUE)
afternoon <- sample(rooms, 100, replace=TRUE)
evening <- sample(rooms, 100, replace=TRUE)
data <- data.frame( people, morning, afternoon, evening)

head(data) #gives us
  people    morning afternoon evening
1 Symone    Room 3  Room 3    Room 4
2 Adrian    Room 5  Room 1    Room 2
3 Orlando   Room 3  Room 4    Room 2
4 Cristal   Room 5  Room 4    Room 2
5 Emily     Room 4  Room 1    Room 4
6 Elizabeth Room 4  Room 2    Room 4
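
Because the rooms are sampled at random, your table will differ from the one above on every run. If you want repeatable results, one option is to fix the random seed first. In this minimal sketch the seed value 42 is arbitrary, and the placeholder names (used only to avoid depending on randomNames) are assumptions for illustration.

```r
# Assumption: an arbitrary fixed seed so that repeated runs give the same data
set.seed(42)

# Placeholder names instead of randomNames(), to avoid the extra dependency
people <- paste("Person", 1:100)
rooms <- paste("Room", 1:5)

data <- data.frame(
  people,
  morning   = sample(rooms, 100, replace = TRUE),
  afternoon = sample(rooms, 100, replace = TRUE),
  evening   = sample(rooms, 100, replace = TRUE)
)

# One row per person, one column per time period plus the name
dim(data)
```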

Now that we have the data, we can calculate how people move between rooms from morning to evening. We’ll build a network in which each node is a room at a particular time period and each link carries the number of people moving between two such nodes.

# first we count the number of people moving
# from morning to afternoon between each pair of rooms
# we label the rooms uniquely for morning and
# afternoon by adding "m_" and "a_"
mor_to_aft <- data %>% 
    mutate(
          from = paste0("m_", morning),
          to = paste0("a_", afternoon)) %>% 
    group_by(from, to) %>% 
    summarise(people = n())

# we do the same for afternoon to evening
aft_to_eve <- data %>% 
    mutate(
          from = paste0("a_", afternoon),
          to = paste0("e_", evening)) %>% 
    group_by(from, to) %>% 
    summarise(people = n())

# and we combine both to create links data
links <- bind_rows(mor_to_aft, aft_to_eve)
links # gives us
      from       to   people
1 m_Room 1 a_Room 1      6
2 m_Room 1 a_Room 2      2
3 m_Room 1 a_Room 3      1
4 m_Room 1 a_Room 4      6
5 m_Room 1 a_Room 5      2
6 m_Room 2 a_Room 1      3
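
A quick way to convince yourself that the counting is right is to check that the link values for one transition add up to the number of people. The sketch below does this on a tiny hand-made data set (the `toy` data frame is hypothetical) and uses dplyr's `count()`, which is shorthand for the `group_by()` plus `summarise()` pair used above.

```r
library(dplyr)

# Tiny hand-made data set (hypothetical) standing in for `data`
toy <- data.frame(
  people    = c("A", "B", "C", "D"),
  morning   = c("Room 1", "Room 1", "Room 2", "Room 2"),
  afternoon = c("Room 2", "Room 2", "Room 1", "Room 2"),
  stringsAsFactors = FALSE
)

# count() groups by from/to, counts the rows and returns ungrouped data
toy_links <- toy %>%
  mutate(from = paste0("m_", morning),
         to   = paste0("a_", afternoon)) %>%
  count(from, to, name = "people")

toy_links

# Each person appears in exactly one link, so the counts sum to nrow(toy)
sum(toy_links$people) == nrow(toy)
```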

Now we need to make the nodes. We do that by finding all the unique room labels in the links and indexing them from 0, because networkD3 passes these indices to D3 in JavaScript, where arrays are zero-based.

# collect every distinct node label and give each a zero-based id
nodes <- c(links$from, links$to) %>% 
    unique() %>% 
    data.frame(name = .) %>% 
    mutate(id = row_number() - 1)

Now we have to join these indices onto the links so that networkD3 can relate the two objects.

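# the two joins both add an `id` column, which dplyr
# disambiguates as id.x (source) and id.y (target)
links <- links %>%
    left_join(nodes, by = c("from" = "name")) %>%
    left_join(nodes, by = c("to" = "name")) %>%
    ungroup() %>%
    select(from = id.x, to = id.y, people)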
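
If the two joins feel heavy, the same lookup can be sketched in base R with `match()`, which returns the 1-based position of each label in the node list. The miniature `links_toy` and `nodes_toy` tables below are hypothetical stand-ins for the real ones; the final check guards against endpoints that failed to find a node.

```r
# Hypothetical miniature links table with character endpoints
links_toy <- data.frame(
  from   = c("m_Room 1", "m_Room 2", "a_Room 1"),
  to     = c("a_Room 1", "a_Room 1", "e_Room 2"),
  people = c(3, 2, 5),
  stringsAsFactors = FALSE
)

# Nodes in order of first appearance, with zero-based ids
node_names <- unique(c(links_toy$from, links_toy$to))
nodes_toy  <- data.frame(name = node_names,
                         id = seq_along(node_names) - 1,
                         stringsAsFactors = FALSE)

# match() returns 1-based positions, so subtract 1 to get zero-based ids
links_idx <- data.frame(
  from   = match(links_toy$from, nodes_toy$name) - 1,
  to     = match(links_toy$to,   nodes_toy$name) - 1,
  people = links_toy$people
)

# Every endpoint should have found a node, and ids must be valid row offsets
stopifnot(!anyNA(links_idx), all(links_idx$to < nrow(nodes_toy)))
links_idx
```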

That completes the data preparation. We now have a network of time-specific rooms linked by the people moving between them. It can be plotted with,

library(networkD3)
sankeyNetwork(links, nodes, "from", "to", "people", NodeID = "name")

which produces,

[Figure: Sankey diagram of people moving between rooms across morning, afternoon and evening]
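
The call above relies on the default appearance. As a rough sketch of how the plot could be tuned, the example below names every argument and adds a few of `sankeyNetwork()`'s optional ones (`NodeGroup`, `units`, `fontSize`, `nodeWidth`). The miniature inputs and the `period` column are hypothetical, and the values chosen are arbitrary.

```r
library(networkD3)

# Hypothetical miniature inputs in the same shape as `nodes` and `links`
nodes_toy <- data.frame(
  name = c("m_Room 1", "m_Room 2", "a_Room 1", "a_Room 2"),
  stringsAsFactors = FALSE
)
links_toy <- data.frame(
  from   = c(0, 0, 1),   # zero-based row offsets into nodes_toy
  to     = c(2, 3, 3),
  people = c(4, 1, 5)
)

# Derive the time period from the "m_"/"a_" prefix, to colour nodes by period
nodes_toy$period <- substr(nodes_toy$name, 1, 1)

p <- sankeyNetwork(
  Links = links_toy, Nodes = nodes_toy,
  Source = "from", Target = "to", Value = "people",
  NodeID = "name", NodeGroup = "period",
  units = "people", fontSize = 12, nodeWidth = 30
)
p
```

Colouring by period rather than by node name makes it easier to see at a glance which column of the diagram belongs to which time of day.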

Here we can clearly see which rooms had the most people at a given time, where those people came from, and where they went in the next session. The same technique can be used to produce far more complex diagrams that visualise interactions at multiple levels.
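
To go beyond three time periods without copying the same block for every transition, the link-building step can be wrapped in a function that walks over consecutive pairs of columns. This is only a sketch: `make_links()` is a hypothetical helper, not part of any package, and it prefixes each room with the full column name instead of "m_", "a_" and "e_".

```r
library(dplyr)

# Hypothetical helper: build links for every pair of consecutive columns
make_links <- function(df, cols) {
  steps <- seq_len(length(cols) - 1)
  bind_rows(lapply(steps, function(i) {
    data.frame(
      from = paste(cols[i], df[[cols[i]]], sep = "_"),
      to   = paste(cols[i + 1], df[[cols[i + 1]]], sep = "_"),
      stringsAsFactors = FALSE
    ) %>%
      count(from, to, name = "people")
  }))
}

# Hand-made data with four time periods
toy <- data.frame(
  morning   = c("Room 1", "Room 1", "Room 2"),
  afternoon = c("Room 2", "Room 2", "Room 1"),
  evening   = c("Room 1", "Room 2", "Room 2"),
  night     = c("Room 1", "Room 1", "Room 1"),
  stringsAsFactors = FALSE
)

all_links <- make_links(toy, c("morning", "afternoon", "evening", "night"))
all_links
```

The result has the same from/to/people shape as `links` before the ids are joined on, so the node and plotting steps apply unchanged.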
