Friday, 9 August 2013

Minimum standards for bioinformatics command line tools

I don't consider myself a good software engineer, or a good tester, a good documenter, or even that good a programmer. But I have used, and installed (or tried to install), a LOT of bioinformatics software over the last 12 years. I've also released a lot of software, and I try to make it as painless to use as possible. From these experiences, I bring you my "Ten rules for bioinformatics software".

1. Print something if no parameters are supplied

Unless your tool is a filter which works by manipulating stdin to stdout, you should always print out something (some help text, ideally) if the user runs your tool without all the required parameters. Just exiting quietly isn't helping anyone.

% biotool
Please use the --help option to get usage information.
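For what it's worth, here is a minimal Python sketch of that check. The names `check_args` and `HINT` are made up for illustration, and sending the hint to stderr with exit code 2 is just one reasonable choice, not a rule.

```python
import sys

# Hypothetical hint text, copied from the example above.
HINT = "Please use the --help option to get usage information."


def check_args(argv):
    """Return 2 and print a hint to stderr if no arguments were given, else 0."""
    if not argv:
        print(HINT, file=sys.stderr)
        return 2
    return 0
```

A real tool would call something like `sys.exit(check_args(sys.argv[1:]))` before doing any work.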

2. Always have a "-h" or "--help" switch

The Unix tradition is for all commands to have a "-h" or "--help" switch, which when invoked, prints usage information about the command. Most languages come with a getopt() type library, so there is no excuse for not supporting this.

% biotool -h
Usage: biotool [options] <file.fq>
--rc       reverse complement
--trim nn  trim <nn> bases from 3' end first
--mask     remove vector sequence contaminant
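As a rough sketch, Python's standard `argparse` module gives you "-h" and "--help" for free. The option names below mirror the made-up biotool example above; `build_parser` is an illustrative name, not part of any real tool.

```python
import argparse


def build_parser():
    """Build a parser for the hypothetical biotool; argparse adds -h/--help itself."""
    parser = argparse.ArgumentParser(
        prog="biotool",
        description="Hypothetical read-processing tool.",
    )
    parser.add_argument("fastq", metavar="file.fq", help="input FASTQ file")
    parser.add_argument("--rc", action="store_true", help="reverse complement")
    parser.add_argument("--trim", type=int, metavar="nn", default=0,
                        help="trim <nn> bases from 3' end first")
    parser.add_argument("--mask", action="store_true",
                        help="remove vector sequence contaminant")
    return parser
```

Calling `build_parser().parse_args()` then handles `biotool -h` without any extra code.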

3. Have a "-v" or "--version" switch

Many bioinformatics tools today are used as part of larger pipelines, or put into the Galaxy toolshed. Because compatibility depends on the version of your tool being used, you should have a simple, machine-parseable way to identify which version of the tool is installed.

% biotool --version
biotool 1.3a
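One way to get this in Python, sketched with `argparse` and its built-in "version" action. The helper name `add_version_option` and the `__version__` value (borrowed from the example above) are assumptions for illustration.

```python
import argparse

__version__ = "1.3a"  # example value from the text, not a real release


def add_version_option(parser):
    """Attach -v/--version, which prints '<prog> <version>' and exits with status 0."""
    parser.add_argument(
        "-v", "--version",
        action="version",
        version="%(prog)s " + __version__,
    )
    return parser
```

Keeping the output to a single "name version" line makes it easy for a pipeline to parse.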

4. Use stderr for messages and errors

If you need to print an error message, or are just printing progress or log information, try to use stderr rather than stdout. Reserve stdout for your actual output, so that your tool can be used in Unix pipes and avoid temporary files.

% biotool reads.fq | fq2fa > clean.fa
biotool: processing reads.fq
fq2fa: converted 423421 reads
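A small Python sketch of keeping the two channels apart. The helper names `log` and `emit` are invented here; the only point is that messages go to stderr and results go to stdout.

```python
import sys


def log(message):
    """Write a progress or error message to stderr, prefixed with the tool name."""
    print(f"biotool: {message}", file=sys.stderr)


def emit(record):
    """Write one result record to stdout, the channel reserved for real output."""
    sys.stdout.write(record + "\n")
```

With this split, redirecting stdout into a file or a pipe captures only the records, while the messages still show up on the terminal.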

5. Validate your parameters

If you have command line options, do some validation or sanity checking on them before letting them through to your critical code. Many getopt() libraries support basic validation, but ultimately it is not that difficult to have a preamble with some "if not XXX { print ERROR ; exit }" clauses.

% biotool --trim -3 reads.fq 
Error: --trim must be an integer > 0
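In Python, one possible way to do this is a small type-checking function handed to `argparse`. The name `positive_int` is illustrative, and the wording of the message is my own.

```python
import argparse


def positive_int(text):
    """Convert text to an int and reject anything that is not > 0."""
    try:
        value = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError("must be an integer > 0")
    if value <= 0:
        raise argparse.ArgumentTypeError("must be an integer > 0")
    return value
```

Passing it as `parser.add_argument("--trim", type=positive_int)` should make argparse report the problem and exit before any critical code runs.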

6. Don't hard-code any paths

Often the tool you write depends on some other files, such as config files or database/model files. The easiest, but wrong and annoying, thing to do is just put the absolute path from your own machine into the code, which fails for everyone else:

% biotool --mask reads.fq 
Error: can't load /home/steven/work/biotool/data/vector.seq
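A sketch of one alternative in Python: find the data folder relative to wherever the script is installed, with an optional override. The environment variable `BIOTOOL_DATA`, the folder name `data` and the function name `data_dir` are all assumptions made up for this example.

```python
import os
from pathlib import Path


def data_dir(script_path, env=os.environ):
    """Return $BIOTOOL_DATA if set, else the 'data' folder next to the script."""
    override = env.get("BIOTOOL_DATA")  # hypothetical override variable
    if override:
        return Path(override)
    # resolve() follows symlinks, so a symlinked launcher still finds its data
    return Path(script_path).resolve().parent / "data"
```

Inside the tool you would call `data_dir(__file__) / "vector.seq"` rather than writing out a full path.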

7. Don't pollute the command-line name space

You've come up with a new tool called "BioTool". The command you want everyone to invoke is called "biotool", but it is just a master script which runs lots of other tools. Unfortunately you used lots of generic names like "fasta2fastq", "convert", "filter" and so on, and you've put them all in the same folder as the main "biotool" script. So when I install BioTool, my PATH gets filled with rubbish. Please don't do this.

% ls -1 /opt/BioTool/
convert      # whoops, clashes with ImageMagick! # hello Titus :-)
diff         # whoops, clashes with standard Unix tool!      # <face-palm>

The first solution is to prefix all your sub-tools and helper scripts with "biotool". The second solution, if they are scripts only, is to not make them executable (so they don't go in PATH) and invoke them via the interpreter (perl, python, ...) explicitly from biotool. The third solution is to put them all in a separate folder (e.g. auxiliary/, scripts/ ...) and explicitly call them (but take note of #6 above).
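The second and third solutions can be combined. Below is a hedged Python sketch that builds the command line for a helper kept in a private scripts/ folder and run through an explicit interpreter. The folder layout, the ".py" suffix and the name `helper_command` are assumptions for illustration only.

```python
import sys
from pathlib import Path


def helper_command(name, script_path, interpreter=sys.executable):
    """Return the argv list that runs helper `name` from scripts/ next to biotool."""
    # The helper is located relative to the main script, never via PATH.
    helper = Path(script_path).resolve().parent / "scripts" / f"{name}.py"
    return [interpreter, str(helper)]
```

The main script could then run `subprocess.run(helper_command("convert", __file__) + extra_args)`, so a generic name like "convert" never lands in anyone's PATH.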

8. Don't distribute bare JAR files

If your tool is written in Java and is distributed as a JAR file, please write a simple shell wrapper script to make it simple to invoke. The three lines below are all you need (in the simple case) and you will make your users much happier.

#!/bin/sh
PREFIX=$(dirname "$0")
java -Xmx500m -jar "$PREFIX/BioTool.jar" "$@"

9. Check that your dependencies are installed

I've installed BioTool, and I start running it, and all looks good. Then 2 hours later it spits out an error like "error: can't run sff2CA". This could all be avoided if biotool checked for all the external tools it needs before it started, which would save your users from associating your software with pain.

% biotool --stitch R1.fq R2.fq
This is biotool 1.3a
Loaded config
Checking for 'bwa': found /usr/bin/bwa
Checking for 'samtools': ERROR - could not find 'samtools'
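A minimal Python sketch of such an up-front check, using the standard library's `shutil.which`. The function name `check_dependencies` is made up, and the message format simply copies the example above.

```python
import shutil
import sys


def check_dependencies(tools):
    """Look up each tool on PATH, report to stderr, and return the missing ones."""
    missing = []
    for tool in tools:
        path = shutil.which(tool)
        if path:
            print(f"Checking for '{tool}': found {path}", file=sys.stderr)
        else:
            print(f"Checking for '{tool}': ERROR - could not find '{tool}'",
                  file=sys.stderr)
            missing.append(tool)
    return missing
```

Calling `check_dependencies(["bwa", "samtools"])` at startup and exiting if the list is non-empty means the failure happens in the first second, not after 2 hours.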

10. Be strict if you are still a Perl tragic like me

If you're old like me and Perl is still your native tongue, at least play it a little bit safer by starting all your scripts with the following lines:

#!/usr/bin/env perl
use strict;
use warnings;
use Fatal;

I'll shut up now :-)


  1. Sounds like you are in favor of the getopts standard for command line parameters?

  2. Hi Torsten, great post! This is one of those areas of software engineering that is almost completely determined by convention (e.g., what people expect based on their prior experience with command-line tools), so you don’t have to be a great programmer to do it right. In fact, writing command-line arguments is more like writing user documentation or tutorials. I find the best approach is to simply think through the possible scenarios that your end users will encounter, and pick what you think will be the least unexpected. Here are two additional points:

    RE #1: there is a long-standing convention in UNIX that tools should be written to communicate with each other through pipes, which is why a lot of programs will default to accepting stdin. But you are right that this is confusing. I've actually been caught confused by my own programs that have this behavior! One work-around is to print a help message when there are no arguments, but to accept the special filename ‘-’ in your input flag to indicate “read from stdin” (another UNIX convention), e.g.

    $ cat bigfile | biotool -i -

    Another workaround is to print a message to stderr indicating that the program is reading from stdin. This is what I do in SeqDB:

    $ seqdb profile
    seqdb-profile: profiling FASTQ records from ''

    It would probably be even clearer if you printed another message “use - to exit” (one of the most frustrating things for a new UNIX user is not knowing how to exit a program and get back to the shell!).

    RE #7: namespace pollution is a problem everywhere -- not just in bioinformatics -- and it's simply because more software is available now than 50 years ago when UNIX was invented and no one had laid claim to cat, head, tail, nm, more, etc. ImageMagick had some real balls to claim convert! I think the best solution here is the model that git uses, e.g. name your programs git-* then have a single wrapper called git that forwards the user to the appropriate program. This way, you only create a single entry (git) in the namespace that is likely to conflict with other software, and all of your other entries are derived from that name and unlikely to conflict with anything. It's like domains and sub-domains for the WWW. For an example of a shell wrapper to do the forwarding, see

    The other advantage of using a shell wrapper like this is you can set additional environment variables you need for the sub-programs.


  3. Error: can't load /home/steven/work/biotool/data/vector.seq


  4. I'm glad you mentioned #4, one thing I hadn't paid much attention to.

    1. One of the 'debates' on this issue is where to print the -help and -version information to: Stderr or Stdout? I use Stderr for _everything_ not related to algorithm output, but some people disagree. Of course they are wrong ;-)

  5. Nice read. Regarding #10, I'd recommend 'use autodie;' over 'use Fatal;'. According to the Fatal docs: "Fatal has been obsoleted by the new autodie pragma. Please use autodie in preference to Fatal. autodie supports lexical scoping, throws real exception objects, and provides much nicer error messages."