Sunday, 25 May 2014

Adding command line options to your shell scripts


You have a new data set, which has a number of samples in it which need to be processed. You choose one sample to experiment with, and step by step you figure out a suitable list of commands to go from raw data to final result. You then realise you have to do this analysis for every sample, so finding some way to automate the process is now the sensible thing to do.

There are many different ways to automate a set of commands in Unix. Options range from full-blown workflow engines (Taverna, Kepler), to pipeline systems (BPIPE, Ruffus, Rubra), dependency methods (make, Ant, Maven), and plain scripting languages (Bash, Python, Perl).

The oldest and arguably simplest method is to use a shell script. All this really means is putting the series of commands you need to run into a text file, and telling the shell to run them all. Although some may baulk at the idea of a shell script, they have the advantage of being very portable and lightweight, and have many modern features that people don't realise they possess. I will assume the use of the BASH shell, as it is available on most Unix environments, and is the default on Linux and Mac OS X.

A shell script

Type the following into a text editor and save it with the name "biox":

#!/bin/bash
echo -n "Hello $USER, today is "
date +%A

You've created a BASH script called "biox" which is meant to print a greeting to the screen.

Running it

Let's try and run our new script:

% biox
biox: command not found

That didn't work. Ah, that's right, the current directory isn't in my PATH, so I need to refer to it directly:

% ./biox
bash: ./biox: Permission denied 

That didn't work either. Ah yes, only files with "execute" permission can be run:

% chmod +x biox
% ./biox
Hello torsten, today is Sunday

Success! An alternative way to run it is to pass it directly to the BASH command as an input script:

% bash biox
Hello torsten, today is Sunday

The advantage of making it executable is that it looks and feels like a standard Unix command, and you don't have to remember what scripting language to run it with. The shell figures that out automatically for you from the magic first line #!/bin/bash of the script.
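To see the magic first line in action, the sketch below writes a tiny script, marks it executable and runs it. The filename demo.sh is arbitrary; $BASH simply reports which interpreter ended up running the script:

```shell
#!/bin/bash
# create a two-line script whose first line names its interpreter
cat > demo.sh <<'EOF'
#!/bin/bash
echo "run by: $BASH"
EOF
chmod +x demo.sh   # grant execute permission
./demo.sh          # the kernel reads #!/bin/bash and launches bash for us
rm demo.sh         # tidy up
```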

Script Parameters

The motivation for writing a shell script was to automate the same set of commands for many samples. This means we need some way to tell it which data files to run on each time, without having to create a separate script for each sample.

Imagine we have lots of genome sequence runs, each one for a different Staphylococcus isolate from a hospital MRSA outbreak. Each sample's data is a paired-end Illumina run consisting of two files: Strain_R1.fastq.gz and Strain_R2.fastq.gz. Below is a script called assemble with 4 parameters which will automate the assembly of a given isolate:

#!/bin/bash

if [ $# -lt 4 ]; then
  echo "Usage: $0 outputFolder kmerValue Read1 Read2"
  exit 1
fi

DIR=$1
KVALUE=$2
LEFT=$3
RIGHT=$4

velveth "$DIR" "$KVALUE" -shortPaired -fmtAuto -separate "$LEFT" "$RIGHT"
velvetg "$DIR" -exp_cov auto -cov_cutoff auto -very_clean yes
echo "Results are in: $DIR"

The first "if" block checks that we provided 4 parameters, and if not, it prints a usage message and exits. The next 4 lines copy the 4 positional command line parameters into permanent, more understandable variables. The next 2 lines run velveth and velvetg accordingly, using options that keep the script as flexible as possible. The final line just prints a message stating where the results are.

% ./assemble
Usage: ./assemble outputFolder kmerValue Read1 Read2

% ./assemble MRSA_J19d 51 J19d_R1.fq.gz J19d_R2.fq.gz
(lots of velveth output here)
(lots of velvetg output here)
Results are in: MRSA_J19d

It is easy to envisage modifying this script to include pre-steps like read clipping, filtering and stitching; and post-steps like gap-filling and scaffolding. The opportunities for automation are endless.
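With parameters in place, the same script can be applied to every sample in a simple loop. The sketch below assumes the paired-end file-naming convention described above, and derives each strain name from its R1 file using BASH's suffix-stripping parameter expansion:

```shell
#!/bin/bash
# run the assemble script on every sample in the current directory
for R1 in *_R1.fastq.gz ; do
  [ -e "$R1" ] || continue             # skip if nothing matched the glob
  STRAIN="${R1%_R1.fastq.gz}"          # strip the suffix to get the strain name
  R2="${STRAIN}_R2.fastq.gz"           # the matching second read file
  ./assemble "ASM_${STRAIN}" 51 "$R1" "$R2"
done
```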

Script Options

Clearly the ability to pass positional command line parameters to your script turns it from a bunch of fixed commands into a more generic re-usable tool. To make it even more flexible however you need command line options. At this point, many people would give up on BASH and move to a scripting language such as Python or Perl. However you may be surprised to know that BASH has inbuilt support for the getopt style of command line options!

Below is a newer version of our assembly script. It only has 3 mandatory parameters now (folder, R1, R2). The kmer value now defaults to 51, but can be changed using the -k option. We also add a new option -c to set the coverage cutoff, which defaults to auto.

#!/bin/bash

# set our defaults for the options
KVALUE=51
CUTOFF=auto

# parse the options
while getopts 'c:k:' opt ; do
  case $opt in
    c) CUTOFF=$OPTARG ;;
    k) KVALUE=$OPTARG ;;
  esac
done

# skip over the processed options
shift $((OPTIND-1))

# check for mandatory positional parameters
if [ $# -lt 3 ]; then
  echo "Usage: $0 [options] outputFolder Read1 Read2"
  echo "Options: -k kmervalue (def: $KVALUE) | -c cov_cutoff (def: $CUTOFF)"
  exit 1
fi

DIR=$1
LEFT=$2
RIGHT=$3

# do the assembly
echo "Assembling with K=$KVALUE and cutoff=$CUTOFF"
velveth "$DIR" "$KVALUE" -shortPaired -fmtAuto -separate "$LEFT" "$RIGHT"
velvetg "$DIR" -exp_cov auto -cov_cutoff "$CUTOFF" -very_clean yes
echo "Results are in: $DIR"

As you can see, adding command line options requires little effort, and makes the script more flexible and professional.

% ./assemble
Usage: ./assemble [options] outputFolder Read1 Read2
Options: -k kmervalue (def: 51) | -c cov_cutoff (def: auto)

% ./assemble -k 83 -c 3 MRSA_J19d J19d_R1.fq.gz J19d_R2.fq.gz
Assembling with K=83 and cutoff=3
(lots of velveth output here)
(lots of velvetg output here)
Results are in: MRSA_J19d

Error handling

The example scripts in this post only do a small amount of error checking: they ensure the correct number of command line parameters was provided, but that is it. They don't check that all the options and parameters were valid, e.g. that the kmer value was an integer and not too small or too big, that the read files actually exist and are readable, or whether the output directory already exists. Nor do they check that the velveth and velvetg commands ran successfully, or catch the situation where the user presses CTRL-C in the middle of the script!
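For instance, a few extra checks near the top of the assemble script could catch bad input early. The sketch below is illustrative only: the variable names mirror the script above, and the stand-in reads file is created with mktemp just so the example runs on its own:

```shell
#!/bin/bash
KVALUE=51            # would normally come from the -k option
LEFT=$(mktemp)       # stand-in for a real Read1 file, just for this sketch

# the kmer value must be a plain positive integer
case "$KVALUE" in
  ''|*[!0-9]*) echo "ERROR: kmer value '$KVALUE' is not a positive integer" >&2
               exit 1 ;;
esac

# the read file must exist and be readable
if [ ! -r "$LEFT" ]; then
  echo "ERROR: can not read file '$LEFT'" >&2
  exit 1
fi

echo "Input checks passed"
rm -f "$LEFT"
```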

The simplest thing you can do is add the following line to the top of your script:

set -e

This will cause the script to exit/fail if any of the commands that are run within the script fail (i.e. they return a non-zero exit code). More detailed error handling is beyond the scope of this article, but is something you should consider when automating the processing of huge data sets.
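A fuller defensive preamble might look like the sketch below. The `set -u`, `set -o pipefail` and `trap` lines go beyond the plain `set -e` discussed above, but are common companions to it; the trap addresses the CTRL-C scenario mentioned earlier:

```shell
#!/bin/bash
set -e            # stop the script if any command fails (non-zero exit code)
set -u            # treat use of an unset variable as an error
set -o pipefail   # a pipeline fails if any stage fails, not just the last one

# tidy up and report if the user presses CTRL-C (SIGINT) mid-run
trap 'echo "Interrupted - cleaning up" >&2; exit 130' INT

echo "Running safely"
```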


BASH scripts are still useful in today's bioinformatics world. They are a simple, portable way to automate your analyses. Smart use of command line parameters and command line options will make them even more flexible and potentially shareable with other bioinformaticians. Add in some error handling and defensive programming and you will produce quality tools without the need for higher level or more advanced scripting systems.

Further reading