You have a new data set, which has a number of samples in it which need to be processed. You choose one sample to experiment with, and step by step you figure out a suitable list of commands to go from raw data to final result. You realise you have to do this analysis for every sample now, so choosing some way to automate the process is now the sensible thing to do.
There are many different ways to automate a set of commands in Unix: full-blown workflow engines (Taverna, Kepler), pipeline systems (Bpipe, Ruffus, Rubra), dependency-based tools (make, Ant, Maven), or scripting languages (Bash, Python, Perl).
The oldest and arguably simplest method is to use a shell script. All this really means is putting the series of commands you need to run into a text file, and telling the shell to run them all. Although some may baulk at the idea of a shell script, they have the advantage of being very portable and lightweight, and they offer many modern features that people don't realise are there. I will assume the use of the BASH shell, as it is available in most Unix environments, and is the default on Linux and Mac OS X.
A shell script
Type the following into a text editor and save it with the name "biox":
#!/bin/bash
echo -n "Hello $USER, today is "
date +%A
You've created a BASH script called "biox" which is meant to print a greeting to the screen.
Let's try and run our new script:
% biox
biox: command not found
That didn't work. Ah, that's right, the current directory isn't in my PATH, so I need to refer to it directly:
% ./biox
bash: ./biox: Permission denied
That didn't work either. Ah yes, only files with "execute" permission can be run:
% chmod +x biox
% ./biox
Hello torsten, today is Sunday
Success! An alternative way to run it is to pass it directly to the BASH command as an input script:
% bash biox
Hello torsten, today is Sunday
The advantage of making it executable is that it looks and feels like a standard Unix command, and you don't have to remember which scripting language to run it with. The shell figures that out automatically for you from the magic first line of the script: #!/bin/bash
The motivation for writing a shell script was to automate the same set of commands for many samples. This means we need some way to tell it which data files to run on each time, without having to create a separate script for each sample.
Imagine we have lots of genome sequence runs, each one for a different Staphylococcus isolate from a hospital MRSA outbreak. Each sample's data is a paired-end Illumina run consisting of two files: Strain_R1.fastq.gz and Strain_R2.fastq.gz. Below is a script called "assemble" with 4 parameters which will automate the assembly of a given isolate:
#!/bin/bash
if [ $# -lt 4 ]; then
  echo "Usage: $0 outputFolder kmerValue Read1 Read2"
  exit 1
fi
DIR="$1"
KVALUE="$2"
LEFT="$3"
RIGHT="$4"
velveth "$DIR" "$KVALUE" -shortPaired -fmtAuto -separate "$LEFT" "$RIGHT"
velvetg "$DIR" -exp_cov auto -cov_cutoff auto -very_clean yes
echo "Results are in: $DIR"
The first "if" block checks that we provided 4 parameters, and if not, it prints a usage message and exits. The next 4 lines copy the 4 positional command line parameters into permanent, more understandable variables. The next 2 lines run velveth and velvetg accordingly, using options that are as flexible as possible. The final line just prints a message stating where the results are.
% ./assemble
Usage: ./assemble outputFolder kmerValue Read1 Read2
% ./assemble MRSA_J19d 51 J19d_R1.fq.gz J19d_R2.fq.gz
(lots of velveth output here)
(lots of velvetg output here)
Results are in: MRSA_J19d
It is easy to envisage modifying this script to include pre-steps like read clipping, filtering and stitching; and post-steps like gap-filling and scaffolding. The opportunities for automation are endless.
Clearly the ability to pass positional command line parameters to your script turns it from a bunch of fixed commands into a more generic, re-usable tool. To make it even more flexible, however, you need command line options. At this point, many people would give up on BASH and move to a scripting language such as Python or Perl. However, you may be surprised to know that BASH has getopts-style command line options built in!
Below is a newer version of our assembly script. It only has 3 mandatory parameters now (folder, R1, R2). The kmerValue now defaults to 51, but can be changed using the -k option. We also add a new option, -c, to set the coverage cutoff, which defaults to auto.
#!/bin/bash

# set our defaults for the options
KVALUE=51
CUTOFF=auto

# parse the options
while getopts 'c:k:' opt ; do
  case $opt in
    c) CUTOFF=$OPTARG ;;
    k) KVALUE=$OPTARG ;;
  esac
done

# skip over the processed options
shift $((OPTIND-1))

# check for mandatory positional parameters
if [ $# -lt 3 ]; then
  echo "Usage: $0 [options] outputFolder Read1 Read2"
  echo "Options: -k kmervalue (def: $KVALUE) | -c cov_cutoff (def: $CUTOFF)"
  exit 1
fi
DIR="$1"
LEFT="$2"
RIGHT="$3"

# do the assembly
echo "Assembling with K=$KVALUE and cutoff=$CUTOFF"
velveth "$DIR" "$KVALUE" -shortPaired -fmtAuto -separate "$LEFT" "$RIGHT"
velvetg "$DIR" -exp_cov auto -cov_cutoff "$CUTOFF" -very_clean yes
echo "Results are in: $DIR"
As you can see, adding command line options requires little effort, and makes the script more flexible and professional.
% ./assemble
Usage: ./assemble [options] outputFolder Read1 Read2
Options: -k kmervalue (def: 51) | -c cov_cutoff (def: auto)
% ./assemble -k 83 -c 3 MRSA_J19d J19d_R1.fq.gz J19d_R2.fq.gz
Assembling with K=83 and cutoff=3
(lots of velveth output here)
(lots of velvetg output here)
Results are in: MRSA_J19d
The example scripts in this post only do a small amount of error checking - they ensure the correct number of command line parameters was provided, but that is it. They don't check that the options and parameters were valid, e.g. that the kmer value was an integer and not too small or too big, that the read files actually existed and were readable, or whether the output directory already existed. Nor did they check whether the velveth and velvetg commands ran successfully, or catch the situation where the user presses CTRL-C in the middle of the script!
The simplest thing you can do is add the following line to the top of your script:
#!/bin/bash
set -e
This will cause the script to exit/fail if any of the commands run within the script fail (i.e. they return a non-zero exit code). More detailed error handling is beyond the scope of this article, but it is something you should consider when automating the processing of huge data sets.
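To give a flavour of what such checks look like, here is a minimal sketch of the kinds of validation described above. The variable values and file names (demo_R1.fq.gz etc.) are made up so the sketch can run stand-alone; in a real script like "assemble" they would come from getopts and the positional parameters:

```shell
#!/bin/bash
set -e

# report if the user presses CTRL-C (SIGINT) mid-run
trap 'echo "Interrupted by user" >&2; exit 130' INT

# illustrative values - in a real script these come from the command line
KVALUE=51
LEFT=demo_R1.fq.gz
RIGHT=demo_R2.fq.gz
DIR=demo_outdir

# create empty demo read files so this sketch can run stand-alone
touch "$LEFT" "$RIGHT"

# the kmer value must be an integer in a sensible range (range is illustrative)
if ! [[ "$KVALUE" =~ ^[0-9]+$ ]] || [ "$KVALUE" -lt 21 ] || [ "$KVALUE" -gt 255 ]; then
  echo "ERROR: kmer value '$KVALUE' should be an integer between 21 and 255" >&2
  exit 1
fi

# the read files must exist and be readable
for READS in "$LEFT" "$RIGHT"; do
  if [ ! -r "$READS" ]; then
    echo "ERROR: can not read file '$READS'" >&2
    exit 1
  fi
done

# refuse to overwrite an existing output folder
if [ -d "$DIR" ]; then
  echo "ERROR: output folder '$DIR' already exists" >&2
  exit 1
fi

echo "All checks passed"
```

The trap line handles the CTRL-C case, and set -e takes care of failing commands; the explicit "if" tests cover the parameter and file checks that the assemble script skipped.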
BASH scripts are still useful in today's bioinformatics world. They are a simple, portable way to automate your analyses. Smart use of command line parameters and command line options will make them even more flexible and potentially shareable with other bioinformaticians. Add in some error handling and defensive programming and you will produce quality tools without the need for higher level or more advanced scripting systems.