Since I started learning R and moved away from proprietary data analysis and GIS packages, I have been amazed by the availability of free datasets and tools that enable people to make awesome analyses and visualisations. Today we’ll look into a simple exercise of taking open data from different sources and combining them using open-source and free tools to produce maps which help us understand the data better.
We’ll use data from two sources: a tabular file on the distribution of national institutes of higher education in India from here, and a shape file containing geographic information on the boundaries of Indian states from here. I have cleaned and prepared both datasets, which can be downloaded here. First we need to download the archive and extract it to our working directory. Once that is done we can start combining and plotting the data.
# Loading the data
data <- read.csv("institutions.csv")
library(rgdal) # library for reading shape files
states <- readOGR(".", "india_state")
Note that, in readOGR, the first parameter is the folder in which the shape file is kept (it should be “.” if we unzipped the files directly into the working directory) and the second one is the name of the shape file without its extension.
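If the shape file ends up somewhere else, the two arguments can be derived from its path. This is only a small sketch in base R; the path below is a hypothetical example, not something from the downloaded archive.

```r
# Hypothetical path to the shape file; adjust to wherever the archive was unzipped.
shp_path <- file.path(".", "india_state.shp")

# Folder that holds the shape file (first argument to readOGR).
dsn <- dirname(shp_path)

# File name without the ".shp" extension (second argument to readOGR).
layer <- tools::file_path_sans_ext(basename(shp_path))

print(dsn)
print(layer)

# These could then be passed on as readOGR(dsn, layer).
```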
library(dplyr) # provides %>% and mutate()
# calculating the total number of institutions and
# the number of institutions per 10 million people
data <- data %>%
  mutate(Total = rowSums(.[, 3:10])) %>%
  mutate(Totalppm = Total / Population)
# merging the data into the shapefile
states <- states %>% merge(data, by = "state")
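A join like this only works when the state names are spelled identically in both tables, so it can be worth checking for mismatches first. Here is a minimal sketch of that check using plain data frames in base R. The state names, counts and populations are invented placeholders, and the population column is assumed to be in millions, which may not match the real file.

```r
# Toy stand-ins for the real tables; every value here is invented.
shape_attrs <- data.frame(state = c("Alpha", "Beta", "Gamma"),
                          stringsAsFactors = FALSE)
institutes <- data.frame(state = c("Alpha", "Beta", "Gama"),  # note the misspelling
                         IIT = c(1, 1, 1),
                         IIM = c(1, 0, 0),
                         Population = c(33.4, 1.5, 42.0),     # assumed: millions
                         stringsAsFactors = FALSE)

# Total institutions per state, then a rate per 10 million people.
institutes$Total <- rowSums(institutes[, c("IIT", "IIM")])
institutes$Totalppm <- institutes$Total / (institutes$Population / 10)

# Names present in one table but not in the other will not join.
missing_in_table <- setdiff(shape_attrs$state, institutes$state)
missing_in_shape <- setdiff(institutes$state, shape_attrs$state)
print(missing_in_table)
print(missing_in_shape)

# A left join keeps every row of the first table; unmatched states get NA.
merged <- merge(shape_attrs, institutes, by = "state", all.x = TRUE)
print(merged)
```

Any name that shows up in either setdiff result needs to be corrected in one of the tables before the merge, otherwise that state ends up with missing values on the map.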
Now that we have successfully merged the data into the shape file, we can check the result by typing states@data at the console. Next we need to map the data using tmap. First of all we load the library by running library(tmap). This is a complex package with a lot of dependencies, so it might take a while to download and install. As a first step we plot just the borders of the states by running,
tm_shape(states) + tm_borders(col = "grey")
We can add labels to the map by adding a tm_text layer to it. Notice that the order of the layers is important, since later layers are drawn on top of earlier ones.
tm_shape(states) +
  tm_borders(col = "grey") +
  tm_text("state", root = 10, size = "AREA")
Now we can plot a single variable on the map as the fill colour for the states. For example, if we want to highlight all the states with an IIT, we do,
tm_shape(states) +
  tm_fill("IIT") +
  tm_borders(col = "grey") +
  tm_text("state", root = 10, size = "AREA")
We can plot multiple variables on the same plot side by side by just passing a vector of variables to compare. Notice that we switched off the legend on one of the layers here by setting legend.size.show = FALSE.
tm_shape(states) +
  tm_fill(c("IIT", "IIM")) +
  tm_borders(col = "grey") +
  tm_text("state", root = 10, size = "AREA", legend.size.show = FALSE)
Finally we can plot the number of institutions per population by,
tm_shape(states) +
  tm_fill("Totalppm", title = "Institutions per 10m people") +
  tm_borders(col = "grey") +
  tm_text("state", root = 10, size = "AREA", legend.size.show = FALSE)
This map is not very informative because the small Himalayan states with low populations skew the whole distribution. So we have to classify the data ourselves by supplying a custom breaks parameter,
tm_shape(states) +
  tm_fill("Totalppm", breaks = (1:10) / 2, title = "Institutions per 10m people") +
  tm_borders(col = "grey") +
  tm_text("state", root = 10, size = "AREA", legend.size.show = FALSE)
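To see why custom breaks help, here is a rough illustration of the same idea outside the map, using base R's cut on made-up numbers. The rates are invented, and cut is only a stand-in for whatever classification tmap does internally, so treat it as a sketch of the principle rather than of the package's behaviour.

```r
# Invented rates: most values are small, one low-population state is an outlier.
rate <- c(0.4, 0.8, 1.1, 1.6, 2.3, 3.0, 24.0)

# Five equal-width classes over the full range: the outlier stretches the bins,
# so almost every state lands in the first class.
equal_bins <- cut(rate,
                  breaks = seq(min(rate), max(rate), length.out = 6),
                  include.lowest = TRUE)
print(table(equal_bins))

# Custom breaks as in the map call above: 0.5, 1.0, ..., 5.0.
custom_bins <- cut(rate, breaks = (1:10) / 2, include.lowest = TRUE)
print(table(custom_bins, useNA = "ifany"))
```

With the equal-width classes, six of the seven values share a single class, while the custom breaks spread the typical values over separate classes. In cut, values that fall outside the breaks become NA, so it is worth checking how states below 0.5 or above 5 appear on the final map.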
Now this shows a possible north-south divide in the distribution of institutions per person. This may be because most of the national institutions in the North are located in Delhi, while in the South, Bangalore, Mumbai and Chennai compete for them.
That completes the post for today. To summarise, we took tabular data, joined it with geographic data, plotted the result and uncovered new information which is not present in either of them individually!