It is highly unlikely that anyone who has worked with geographic data in the last five years has not come across the OpenStreetMap project. The project, which started in 2004 as a crowd-sourced effort to create a free and open geographic database of the world, has since grown into a movement in its own right, with more than a million users and volunteers, and in some cases it provides better coverage and data quality than commercial data providers. The project is even more exciting for someone with a background of working in developing countries, where the biggest problem is the availability of geographic data in the first place.

OpenStreetMap – Advantages
The biggest strength of OSM is that it is open, which gives every person in the world complete freedom to create, edit and consume the database. It is also a non-profit project, funded through donations and run by a community of volunteers, which keeps it free from the influences and pressures of the market and gives it access to a potentially unlimited amount of contributed data which, in my opinion, cannot be matched by any commercial survey. It also remains the only real hope for professionals in geographic information sciences who need geographic data for developing countries.
[vimeo http://vimeo.com/56374742 w=700 & h=400]
OpenStreetMap – Disadvantages
In spite of the clear advantages mentioned above, the OSM project has its own share of challenges. The first is quality and coverage: being a crowd-sourced project, it is impossible for OSM to maintain the quality of the database within strict standards. The second is standardization, a problem stemming from its free tagging policy, which is itself the backbone of the database's richness; the general (and largely true) assumption is that the community will monitor and balance itself in the long run to maintain the quality and standard of the database. The third is the size and complexity of the data generated: being a global, general-purpose project, OSM produces a huge and complex database compared to the regional, specific data-sets collected and distributed by commercial institutions and governments.
As noted in the last post, after putting a lot of energy into understanding the database and the project over the last three months, it was time to find ways of utilizing the database by extracting meaningful data for geographic analysis, and to finalize a proper toolkit for doing it. The toolkit had to be open and free like the OSM data itself, flexible and versatile enough to accommodate a variety of formats, and powerful enough to handle the scale and complexity of the data-set: requirements which 'R' seemed to fulfill perfectly.

R Project for Statistical Computing
R is a programming language for statistical analysis and visualization developed in 1993 by Robert Gentleman and Ross Ihaka. It was introduced to me in the GISS module of the M.Res course at CASA, UCL, as a free, command-line alternative to ArcGIS for geographic analysis and visualization. Though it was a bit difficult to grasp the concept of a command-line based system and to negotiate the steep learning curve demanded by the language in the beginning, the advantages of R were apparent after using it for some time. The first and foremost advantage is that it is open and free (as in both lunch and freedom) compared to the equally powerful but extremely costly commercial GIS packages. The second is the flexibility and versatility offered in terms of supported input and output data formats from almost every field of study (biology, economics, geography, etc.), made possible by the extensive support of a community of developers who write specialized packages extending the functionality of the language. Considering all of the above, R was the clear choice as the central tool of the toolkit I intended to build for the extraction, manipulation and analysis of OSM data.
Initial Attempts
I have to admit that my first attempt to use R to analyze and visualize OSM data was a complete disaster. It was done for a piece of course work in which I was trying to produce a land use map of a city from an OSM extract for that city using R. I was working with the package 'osmar', and within a few hours of experimenting it was apparent that there were a lot of problems with my approach. The first problem was that the data-set was huge at the city level. Though I knew this beforehand, I never expected it to be unmanageable. During my initial runs, R took as long as 15 minutes to load the data and sometimes gave "memory not sufficient" errors as well, so I had to restrict my attempt to smaller cities with smaller data-sets. The second problem was coverage: as you can clearly see in the map below, the land use information was not complete and left huge holes in the map I was trying to create. The final one was the tags, which were neither standardized nor consistent, so when I tried printing a land use map it had 42 categories of land use and made no sense at all. To make any sense of the data I would have had to manually sort all these categories into a standard classification, which is not feasible for bigger cities like London.
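For reference, the workflow I was attempting looked roughly like the sketch below (a minimal example using 'osmar' against the live OSM API; the centre coordinates, box size and tag are placeholders rather than the exact values I used):

library(osmar)

# Download a small bounding box straight from the OSM API
# (centre coordinates and box size are placeholders)
src <- osmsource_api()
bb <- center_bbox(-0.1278, 51.5074, 1000, 1000)
osm <- get_osm(bb, source = src)   # slow and memory-hungry for larger areas

# Pull out every way tagged with 'landuse' and convert it to sp polygons
lu_ids <- find(osm, way(tags(k == "landuse")))
lu <- subset(osm, ids = find_down(osm, way(lu_ids)))
lu_sp <- as_sp(lu, "polygons")

# The free tagging policy shows up here: dozens of distinct landuse values
table(subset(lu$ways$tags, k == "landuse")$v)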

Problems faced
After this exercise I realized the need to find a way to break down and filter the OSM data into more manageable parts, in terms of both geography and tags, and also to calculate basic statistics on the filtered data. For example, even though 'osmar' provides ways to import OSM data in XML format and convert it to 'sp' objects, it cannot filter the data geographically beyond a bounding box. Moreover, the OSM API for downloading such data restricts itself to 20,000 features, which is too small a scale for many practical purposes. 'osmar' also imports all the features at once, without any option to select the features you want based on tags or type (polygons, lines, etc.). The resulting object is also devoid of any projection information, which makes it harder to do any geographic analysis on it.
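The lack of a projection, at least, is straightforward to work around by hand: OSM data is in WGS84 longitude/latitude, so one can attach that CRS and reproject before measuring anything. A minimal sketch, assuming the 'sp', 'rgdal' and 'rgeos' packages and continuing from the 'lu_sp' object of the earlier example (the UTM zone is chosen purely as an illustration for London):

library(sp)
library(rgdal)
library(rgeos)

# 'osmar' output carries no CRS, so declare WGS84 first...
proj4string(lu_sp) <- CRS("+proj=longlat +datum=WGS84")

# ...then reproject to a metric system (UTM zone 30N as an example for London)
lu_utm <- spTransform(lu_sp, CRS("+proj=utm +zone=30 +datum=WGS84 +units=m"))

# With a projected object, area calculations become meaningful (square metres)
sum(gArea(lu_utm, byid = TRUE))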
Customized function
This experience, along with the final course work in the GISS module, gave me the opportunity to develop a function (aosm, for Analyze OpenStreetMap) combining the functionality provided by the 'osmar' package in R with OSM tools such as osmosis (a Java-based tool for manipulating OSM data). The original plan was to create a function producing a precise plot of geographic data from OSM data, with options to filter the data geographically (polygons) and by tags (highways, amenities, etc.), but I decided to keep the plotting part out of the function to keep it flexible, and to add a small analysis component to make it more useful. So the overall plan is now as shown in the flow chart below. I wanted the function to be aware of the environment (to check whether local data is present and convert it to suitable formats), flexible (in terms of input data) and extendable, with an option to add more functionality later on. The toolkit envisioned here uses regional OSM extracts for the base data, osmosis for filtering the data, 'osmar' for importing and converting it, and 'maptools'/'rgeos' for geographic analysis, and is expected to output an 'sp' object with the data preserved in it for further analysis.
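As a rough illustration of the filtering step the function wraps, osmosis can be driven from within R through a system call to clip a regional extract to a .poly boundary and keep only the ways carrying a given tag; the file names and tag below are placeholders:

# Clip a regional extract to a boundary polygon and keep only landuse ways
cmd <- paste(
  "osmosis",
  "--read-xml file=greater-london.osm",
  "--bounding-polygon file=camden.poly",
  "--tag-filter accept-ways landuse=*",
  "--used-node",
  "--write-xml file=camden-landuse.osm"
)
system(cmd)

# The much smaller filtered file can then be read back in with 'osmar'
library(osmar)
camden <- get_osm(complete_file(), source = osmsource_file("camden-landuse.osm"))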

The function will look like this:

object.name <- aosm("city", "filter-polygon", "tags", "analysis", "type")

where 'city' is the name of the city for which the data has to be downloaded, 'filter-polygon' is the name of the .poly file denoting the specific geographic area one is concerned with (e.g. boroughs in London), 'tags' is the key and the values within that key which need to be extracted, 'analysis' is the type of analysis to run on the result (e.g. 'default' returns an 'sp' object, 'utm' returns a projected 'sp' object, 'area' returns the sum of the areas of the features, etc.) and 'type' denotes the type of the features (lines, points, polygons).
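To give an idea of how it would be used in practice, a call might look something like this (purely illustrative, since the function is still a draft; the polygon file and tag are hypothetical):

# Hypothetical call: parks within the Camden boundary, returning their total area
camden_parks <- aosm("london", "camden.poly", "leisure=park", "area", "polygons")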
I have already started working on the function and I am in the final stages of producing a first usable draft. I will update the blog with the code and the results as soon as the code is usable.