Parallelisation of data processing pipelines in shell using ‘xargs’ (and parallel)

I like scripting with shell a lot since it is simple, available almost everywhere, and fast and efficient. Recently, while reading this, I came across a way to make data processing several times faster using the ‘xargs’ command. xargs is the command which takes input from a stdin stream and passes it to a chosen command as command line arguments. When combined with the -P switch it can also run the resulting commands simultaneously as independent processes, which makes sure that all the processors/cores in the system are used at the same time.
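
As a throwaway illustration of the -P switch (unrelated to the pipeline below), the command here starts four sleep processes at the same time, so it finishes in roughly 4 seconds (the longest sleep) instead of the 10 seconds a serial run would take,

seq 1 4 | xargs -n1 -P4 sleep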

As an example, let's consider a slightly complex counting algorithm I made for my research. It is an R script which reads CSV data from stdin, analyses it and writes a CSV result to stdout. The pipeline looks something like this,

cat input.csv | ./count > output.csv
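
The contents of count don't matter for the rest of this post; any script that reads from stdin and writes to stdout will behave the same way. If you want to follow along without the R script, a hypothetical stand-in could be a small shell wrapper around awk like this (only a sketch, not my actual script),

#!/bin/sh
# stand-in for ./count: reads CSV on stdin, writes a one-line CSV summary to stdout
awk -F, 'NR > 1 { n++ } END { print "rows," n }'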

This takes around 20 seconds to complete. Now imagine there are 2 files, input_1.csv and input_2.csv, in the data folder and we want to run the count script on both of them. The obvious way to do this is manually,

cat input_1.csv | ./count > output_1.csv
cat input_2.csv | ./count > output_2.csv

This takes 40 seconds. This can also be written in a for loop for scalability.

for i in $(ls data); do cat data/$i | ./count > data/${i}_out; done

Update: feedback from reddit suggests that the ls command is not appropriate here; it is better to use for i in data/*.
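
With that change (and with the file name quoted), the loop becomes something like,

for i in data/*; do cat "$i" | ./count > "${i}_out"; done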

This also takes 40 seconds since the files are processed serially rather than simultaneously. If we look at the output of htop, only one core is being used for the processing. Any modern computer has anywhere from 2 to 8 cores which could be used to run the script on the files simultaneously. This can be done with ‘xargs’ as shown below,

find data/ -name "input*" -print0 | xargs -0 -n1 -P0 sh -c 'cat "$@" | ./count > "$@_out"' _

This does the exact same thing as the one before but takes only half the time, 20 seconds, because it uses both cores. The find command finds all the files starting with “input” in the data folder and prints them out separated by null characters (-print0). xargs reads this null-separated list (-0), takes one argument at a time (-n1), starts as many processes as it can at once (-P0) and executes the sh -c ‘pipeline’ command for each. All the pipelines start simultaneously and the work spreads across the cores available in the machine. Within the pipeline we can refer to the argument (the file name) as $@.
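
If you would rather cap the number of simultaneous jobs at the number of cores instead of starting one process per file, -P also accepts an explicit count; on GNU systems nproc reports the number of available cores, so a variant would be,

find data/ -name "input*" -print0 | xargs -0 -n1 -P"$(nproc)" sh -c 'cat "$@" | ./count > "$@_out"' _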

The change from 40 seconds to 20 seconds doesn't seem like much here, but when I used this on a server with 24 cores on 500 files, the difference was from 2:48 hours to 7 minutes!

Update: Feedback from reddit suggests that there is a cleaner way of doing this using GNU parallel.

find data/ -name "input*" | parallel "cat {} | ./count > {}_out"

This can be extended to multiple machines via ssh as well, thus giving us a simple cluster!
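
As a sketch of that (server1 and server2 are placeholder host names, passwordless ssh is assumed, and ./count must exist on the remote machines), parallel can transfer each input file to a remote host, run the pipeline there and return the result using the --trc shorthand,

find data/ -name "input*" | parallel -S server1,server2 --trc {}_out "cat {} | ./count > {}_out"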
