The Genome Factory: Minimum standards for bioinformatics command line tools

Friday, 9 August 2013

Minimum standards for bioinformatics command line tools

I don't consider myself a good software engineer, or a good tester, a good documenter, or even that good a programmer. But I have used, and installed (or tried to install), a LOT of bioinformatics software over the last 12 years. I've also released a lot of software, and I try to make it as painless to use as possible. From these experiences, I bring you my "Ten rules for bioinformatics software".

1. Print something if no parameters are supplied

Unless your tool is a filter which works by manipulating stdin to stdout, you should always print out something (some help text, ideally) if the user runs your tool without all the required parameters. Just exiting quietly isn't helping anyone.

% biotool

Please use the --help option to get usage information.
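
As a rough sketch in plain shell, the check can be a handful of lines. The helper name "require_args" is made up for illustration, and I'm assuming the hint should go to stderr so it never leaks into piped output.

```shell
# Sketch: refuse to run silently when no arguments are given.
# "require_args" is a hypothetical helper name, not part of any real tool.
require_args() {
    if [ "$#" -eq 0 ]; then
        # The hint goes to stderr so it never ends up in piped output.
        echo "Please use the --help option to get usage information." >&2
        return 1
    fi
    return 0
}
```

Near the top of the script this would be called as: require_args "$@" || exit 1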

2. Always have a "-h" or "--help" switch

The Unix tradition is for all commands to have a "-h" or "--help" switch, which when invoked, prints usage information about the command. Most languages come with a getopt() type library, so there is no excuse for not supporting this.

% biotool -h

Usage: biotool [options] <file.fq>
Options:
  --rc        reverse complement
  --trim nn   trim <nn> bases from 3' end first
  --mask      remove vector sequence contaminant
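
If you have no getopt() library handy, even a hand-rolled check is enough. The sketch below assumes two made-up helper names, "usage" and "wants_help", and prints help that was explicitly asked for to stdout, which is the usual convention.

```shell
# Print the usage text. Help that was explicitly requested goes to stdout.
usage() {
    cat <<'EOF'
Usage: biotool [options] <file.fq>
Options:
  --rc        reverse complement
  --trim nn   trim <nn> bases from 3' end first
  --mask      remove vector sequence contaminant
  -h, --help  show this help and exit
EOF
}

# Succeed if any argument is -h or --help ("wants_help" is a hypothetical name).
wants_help() {
    for arg in "$@"; do
        case "$arg" in
            -h|--help) return 0 ;;
        esac
    done
    return 1
}
```

In the main script the two fit together as: if wants_help "$@"; then usage; exit 0; fi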

3. Have a "-v" or "--version" switch

Many bioinformatics tools today are used as part of larger pipelines, or put into the Galaxy toolshed. Because compatibility depends on the version of your tool being used, you should provide a simple, machine-parseable way to identify which version of the tool is installed.

% biotool --version

biotool 1.3a
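
To show why the one-line "name version" format is handy, here is a small sketch. Both function names are hypothetical, and I'm assuming the version is always the last whitespace-separated field of the output.

```shell
# Hypothetical version string, kept in one place.
VERSION="1.3a"

# What "biotool --version" would print: the name, a space, the version.
print_version() {
    echo "biotool $VERSION"
}

# What a pipeline could do with that line: keep only the last field.
parse_version() {
    awk '{ print $NF }'
}
```

A pipeline would then run something like: biotool --version | parse_version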

4. Use stderr for messages and errors

If you need to print an error message, or are just printing progress or log information, try to use stderr rather than stdout. Reserve stdout for your actual output, so that it can be used in Unix pipes to avoid temporary files.

% biotool reads.fq | fq2fa > clean.fa

biotool: processing reads.fq

fq2fa: converted 423421 reads
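
In shell this is just a redirect. The sketch below uses two made-up functions: "msg" for logging, and "process" as a stand-in for real work (it only copies its input file to stdout).

```shell
# Log a progress or error message to stderr, prefixed with the tool name.
msg() {
    echo "biotool: $*" >&2
}

# Hypothetical stand-in for real work: copies the input file to stdout.
process() {
    msg "processing $1"   # goes to the terminal, not into the pipe
    cat "$1"              # the actual output, safe to pipe onwards
}
```

Because the log line travels on stderr, a downstream tool reading the pipe only ever sees sequence data.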

5. Validate your parameters

If you have command line options, do some validation or sanity checking on them before letting them through to your critical code. Many getopt() libraries support basic validation, but ultimately it is not that difficult to have a preamble with some "if not XXX { print ERROR ; exit }" clauses.

% biotool --trim -3 reads.fq

Error: --trim must be an integer > 0
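
A preamble like that might look as follows in shell. The function names are hypothetical, and the check uses pattern matching only, so it rejects "-3", "1.5", "0" and the empty string alike.

```shell
# Succeed only if $1 is a positive integer: digits only, and not all zeros.
is_positive_int() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
        *[1-9]*)     return 0 ;;   # digits only, at least one non-zero
        *)           return 1 ;;   # digits only, but all zeros
    esac
}

# Hypothetical validator for the --trim option.
validate_trim() {
    if ! is_positive_int "$1"; then
        echo "Error: --trim must be an integer > 0" >&2
        return 1
    fi
}
```

The script would call validate_trim "$trim" || exit 1 before any reads are touched.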

6. Don't hard-code any paths

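Often the tool you write depends on some other files, such as config files or database/model files. The easiest, but wrong and annoying, thing to do is just put the full path to them, as it exists on your own machine, straight into the code. Find them relative to wherever the tool is installed instead.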

% biotool --mask reads.fq

Error: can't load /home/steven/work/biotool/data/vector.seq

# ARRRGGGGHHH!
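
One way to avoid this, sketched in shell: work out where the script lives and look for data/ beside it. The function names and the BIOTOOL_DATA override variable are assumptions of mine, and this simple version does not follow symlinks.

```shell
# Absolute path of the directory containing the script given as $1.
script_dir() {
    # cd + pwd turns a relative path such as ./bin into an absolute one.
    ( cd "$(dirname "$1")" && pwd )
}

# Locate the data directory. BIOTOOL_DATA (a hypothetical variable) wins
# if set; otherwise use the data/ folder next to the script.
data_dir() {
    if [ -n "${BIOTOOL_DATA:-}" ]; then
        echo "$BIOTOOL_DATA"
    else
        echo "$(script_dir "$1")/data"
    fi
}
```

Inside biotool the vector file would then be found with: VECTOR="$(data_dir "$0")/vector.seq"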

7. Don't pollute the command-line name space

You've come up with a new tool called "BioTool". The command you want everyone to invoke is called "biotool", but it is just a master script which runs lots of other tools. Unfortunately you used lots of generic names like "fasta2fastq", "convert", "filter" and so on, and you've put them all in the same folder as the main "biotool" script. So when I install BioTool, my PATH gets filled with rubbish. Please don't do this.

% ls -1 /opt/BioTool/
biotool
convert # whoops, clashes with ImageMagick!
load-hash.py # hello Titus :-)
filter
diff # whoops, clashes with standard Unix tool!
test.sh # <face-palm>

The first solution is to prefix all your sub-tools and helper scripts with "biotool". The second solution, if they are scripts only, is to not make them executable (so they don't go in PATH) and invoke them via the interpreter (perl, python, ...) explicitly from biotool. The third solution is to put them all in a separate folder (e.g. auxiliary/, scripts/ ...) and explicitly call them (but take note of #6 above).
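
These three ideas combine nicely. Here is a minimal sketch, assuming the helpers are shell scripts named "biotool-<something>" that live in a private folder; "run_subtool" is a made-up name.

```shell
# Run the helper script "biotool-<name>" from a private directory.
# $1 = helper directory, $2 = sub-command name, the rest = its arguments.
run_subtool() {
    dir=$1
    name=$2
    shift 2
    helper="$dir/biotool-$name"
    if [ ! -f "$helper" ]; then
        echo "biotool: unknown sub-command '$name'" >&2
        return 1
    fi
    # Invoke via the interpreter, so the helper need not be executable
    # and never has to sit in anyone's PATH.
    sh "$helper" "$@"
}
```

Only "biotool" itself ends up in PATH; everything else is reached through it.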

8. Don't distribute bare JAR files

If your tool is written in Java and is distributed as a JAR file, please write a simple shell wrapper script to make it simple to invoke. The three lines below are all you need (in the simple case) and you will make your users much happier.

#!/bin/bash
PREFIX=$(dirname "$0")
java -Xmx500m -jar "$PREFIX/BioTool.jar" "$@"

9. Check that your dependencies are installed

I've installed BioTool, and I start running it, and all looks good. Then 2 hours later it spits out an error like "error: can't run sff2CA". This could all be avoided if biotool checked for all the external tools it needs before it commenced, which would save your users from associating your software with pain.

% biotool --stitch R1.fq R2.fq
This is biotool 1.3a
Loaded config
Checking for 'bwa': found /usr/bin/bwa
Checking for 'samtools': ERROR - could not find 'samtools'
Exiting.
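
A check like this takes only a few lines of shell. The sketch below relies on the standard "command -v" lookup; the function name "check_deps" is hypothetical, and it reports every tool before failing so the user sees the full list in one go.

```shell
# Check that every named program can be found; fail if any is missing.
check_deps() {
    missing=0
    for tool in "$@"; do
        if found=$(command -v "$tool"); then
            echo "Checking for '$tool': found $found" >&2
        else
            echo "Checking for '$tool': ERROR - could not find '$tool'" >&2
            missing=1
        fi
    done
    # Exit status is 0 only when nothing was missing.
    [ "$missing" -eq 0 ]
}
```

Run it before any real work starts, for example: check_deps bwa samtools || exit 1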

10. Be strict if you are still a Perl tragic like me

If you're old like me and Perl is still your native tongue, at least play it a little bit safer by starting all your scripts with the following lines:

#!/usr/bin/env perl
use strict;
use warnings;
use Fatal;

I'll shut up now :-)

9 comments:

Egon Willighagen, 9 August 2013 at 18:01
Sounds like you are in favor of the getopts standard for command line parameters?
Anonymous, 9 August 2013 at 23:12
Hi Torsten, great post! This is one of those areas of software engineering that is almost completely determined by convention (e.g., what people expect based on their prior experience with command-line tools), so you don’t have to be a great programmer to do it right. In fact, writing command-line arguments is more like writing user documentation or tutorials. I find the best approach is to simply think through the possible scenarios that your end users will encounter, and pick what you think will be the least unexpected. Here are two additional points:

RE #1: there is a long-standing convention in UNIX that tools should be written to communicate with each other through pipes, which is why a lot of programs will default to accepting stdin. But you are right that this is confusing. I've actually been caught confused by my own programs that have this behavior! One work-around is to print a help message when there are no arguments, but to accept the special filename ‘-’ in your input flag to indicate “read from stdin” (another UNIX convention), e.g.

$ cat bigfile | biotool -i -

Another workaround is to print a message to stderr indicating that the program is reading from stdin. This is what I do in SeqDB:

$ seqdb profile
seqdb-profile: profiling FASTQ records from ''

It would probably be even clearer if you printed another message “use - to exit” (one of the most frustrating things for a new UNIX user is not knowing how to exit a program and get back to the shell!).

RE #7: namespace pollution is a problem everywhere -- not just in bioinformatics -- and it's simply because more software is available now than 50 years ago when UNIX was invented and no one had laid claim to cat, head, tail, nm, more, etc. ImageMagick had some real balls to claim convert! I think the best solution here is the model that git uses, e.g. name your programs git-* then have a single wrapper called git that forwards the user to the appropriate program. This way, you only create a single entry (git) in the namespace that is likely to conflict with other software, and all of your other entries are derived from that name and unlikely to conflict with anything. It's like domains and sub-domains for the WWW. For an example of a shell wrapper to do the forwarding, see

https://bitbucket.org/mhowison/seqdb/src/master/scripts/seqdb.in
https://bitbucket.org/caseywdunn/agalma/src/master/agalma/agalma.in

The other advantage of using a shell wrapper like this is you can set additional environment variables you need for the sub-programs.

Best,
Mark
Jonathan Jacobs, 13 August 2013 at 05:40
Error: can't load /home/steven/work/biotool/data/vector.seq
# ARRRGGGGHHH!

LMFAO!!!
Torsten Seemann, 13 August 2013 at 07:56
It's funny coz its true :-P
Anonymous, 31 August 2013 at 04:21
I'm glad you mentioned #4, one thing I hadn't paid much attention to.
Unknown, 19 September 2013 at 09:41
Nice read. Regarding #10, I'd recommend 'use autodie;' over 'use Fatal;'. According to the Fatal docs: "Fatal has been obsoleted by the new autodie pragma. Please use autodie in preference to Fatal. autodie supports lexical scoping, throws real exception objects, and provides much nicer error messages."
PumasAi, 31 January 2024 at 22:32
Bioinformatics software encompasses a broad array of computational tools and platforms designed to analyze, interpret, and visualize biological data.