Setting up a PostgreSQL + R environment for data analysis (Arch Linux) – Part 1

A relational database is a much more convenient and efficient way to store complex relational data than flat files on the file system, especially when dealing with a mix of tabular and spatial data. Based on my experience, I have put together a simple tutorial for setting up a PostgreSQL + R environment for data analysis.

# Install postgresql and postgis from pacman
 sudo pacman -S postgresql postgis

# Change ownership of postgres folder to the postgres user
 sudo chown -c -R postgres:postgres /var/lib/postgres

# Initiate the first database as postgres user
 sudo runuser -l postgres -c "initdb -D '/var/lib/postgres/data'"

# Start the database service
 sudo systemctl start postgresql

# Create a database role for yourself
 createuser -s -U postgres --interactive # and enter your username

# Create a default database for yourself
 createdb username # the db name is the same as your username

# Now we can connect to the database interactively
 psql

That completes the database installation. Next, we connect from R to read and write data.

# Install and load the necessary packages
install.packages("RPostgreSQL")
library(RPostgreSQL)

# Create a connection object
# dbname is the database created earlier; the user defaults to the logged-in user
driver <- dbDriver("PostgreSQL")
connection <- dbConnect(driver, dbname="username", host="localhost")
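To confirm the connection works before writing anything, we can list the tables in the database; `dbListTables` is part of the DBI interface that RPostgreSQL implements:

```r
# Check that the connection is alive and see what tables exist
# (returns a character vector of table names, empty for a fresh database)
dbListTables(connection)
```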

# Generating random sample data to work with
sample <- data.frame(x=rnorm(1000,0,10), y=rnorm(1000,0,10))

# Saving the dataframe in the database as a table
# parameters are connection, "table name", dataframe object
# row.names=FALSE keeps R's row names out of the table
dbWriteTable(connection, "sample", sample, row.names=FALSE)

# Querying data from the database
sample <- dbGetQuery(connection,"select * from sample;")
sample <- dbGetQuery(connection,"select * from sample where x < 0;")
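Since the filtering already happens in the database, we can push aggregations there too and only pull the summary back into R. A small sketch against the `sample` table created above:

```r
# Compute summary statistics in the database instead of loading all rows into R
dbGetQuery(connection, "select count(*) as n, avg(x) as mean_x, avg(y) as mean_y from sample;")
```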

# Appending more rows to the existing table
newdf <- data.frame(x=rnorm(500,0,10), y=rnorm(500,0,10))
dbWriteTable(connection, "sample", newdf, append=TRUE, row.names=FALSE)
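When we are done, it is good practice to close the connection and unload the driver so idle connections do not accumulate on the server. A minimal cleanup, using the same `connection` and `driver` objects created above:

```r
# Drop the sample table if it is no longer needed (optional)
dbRemoveTable(connection, "sample")

# Close the connection and release the driver
dbDisconnect(connection)
dbUnloadDriver(driver)
```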

That's it! Now we have all our data in the database and can do our analysis in R. The advantages of this approach are:

  1. Storage is efficient. The same data takes up much less space in a database than stored on disk as CSVs.
  2. Complex queries are fast. Provided we index the data properly, filtering, joining and sorting in the database is much faster than loading a CSV and doing it in R.
  3. Interoperability. Once we are comfortable using databases with R, we can connect directly to much larger open databases over the internet and run the same analysis. Alternatively, other programs can connect to the database to access the same data.
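To make the second point concrete, indexes can be created directly from R by sending plain SQL. A sketch, assuming the `sample` table from earlier; the index name `sample_x_idx` is just an illustrative choice:

```r
# Create an index on the x column so filters like "where x < 0" can use it
dbSendQuery(connection, "create index sample_x_idx on sample (x);")
```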

The second part, showing how to allow remote connections, is here.

