A relational database is a much more convenient and efficient way to store complex relational data than flat files on the file system, especially when we are dealing with a mix of tabular and spatial data. Based on my experience, I have put together a super simple tutorial for setting up a PostgreSQL + R environment for data analysis.
```bash
# Install postgresql and postgis from pacman
sudo pacman -S postgresql
sudo pacman -S postgis

# Change ownership of the postgres folder to the postgres user
sudo chown -c -R postgres:postgres /var/lib/postgres

# Initialise the first database as the postgres user
sudo runuser -l postgres -c "initdb -D '/var/lib/postgres/data'"

# Start the database service
sudo systemctl start postgresql

# Create a role in the database for yourself
createuser -s -U postgres --interactive # enter your username

# Create a default database for yourself
createdb username # the db name is the same as the username

# Now we can connect to the database interactively
psql
```
That completes the database installation. Next, let us connect from R to read and write data.
```r
# Install and load the necessary packages
install.packages("RPostgreSQL")
library(RPostgreSQL)

# Create a connection object
driver <- dbDriver("PostgreSQL")
connection <- dbConnect(driver, host = "localhost")

# Generating random sample data to work with
sample <- data.frame(x = rnorm(1000, 0, 10), y = rnorm(1000, 0, 10))

# Saving the dataframe in the database as a table
# parameters are connection, "table name", dataframe object
dbWriteTable(connection, "sample", sample)

# Querying data from the database
sample <- dbGetQuery(connection, "select * from sample;")
sample <- dbGetQuery(connection, "select * from sample where x < 0;")

# Appending more rows to the existing table
newdf <- data.frame(x = rnorm(500, 0, 10), y = rnorm(500, 0, 10))
dbWriteTable(connection, "sample", newdf, append = TRUE)
```
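Since we installed PostGIS in the first step, it is worth enabling it here too: the extension has to be activated once per database before any spatial types or functions become available. Below is a minimal sketch of doing this from the same R connection; it assumes the superuser role we created earlier with the `-s` flag.

```r
# Enable PostGIS in the current database (requires a superuser role,
# which the -s flag to createuser granted us above)
dbGetQuery(connection, "CREATE EXTENSION IF NOT EXISTS postgis;")

# Confirm the extension is active by asking for its version
dbGetQuery(connection, "SELECT PostGIS_version();")

# List the tables present in the database so far
dbListTables(connection)
```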
That's it! Now we have all our data in the database and can do our analysis in R. The advantages of this approach are:
- Storage is efficient. The size of the data in a database is much smaller than storing it on disk as CSVs.
- Complex queries are fast. Provided we index the data properly, filtering, joining and sorting data is much faster than loading a CSV and doing the same in R (see the sketch after this list).
- Interoperability. Once we are comfortable using databases with R, we can connect directly to much larger open databases over the internet to do the same analysis. Alternatively, we can connect other programs to the database to access the same data.
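To illustrate the first two points, here is a minimal sketch using the connection from above: checking how much space the table actually occupies, and adding an index to speed up queries on the `x` column. The index name `sample_x_idx` is just an illustrative choice.

```r
# Check how much space the sample table occupies on disk
dbGetQuery(connection, "SELECT pg_size_pretty(pg_total_relation_size('sample'));")

# Index the x column so filters and sorts on it can use an index scan
# (sample_x_idx is an arbitrary, illustrative name)
dbGetQuery(connection, "CREATE INDEX sample_x_idx ON sample (x);")

# This query can now use the index instead of scanning the whole table
negatives <- dbGetQuery(connection, "select * from sample where x < 0 order by x;")

# Close the connection and unload the driver when we are done
dbDisconnect(connection)
dbUnloadDriver(driver)
```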
The second part, showing how to allow remote connections, is here.