The Genome Factory: August 2013

Friday, 30 August 2013

How Spades differs from Velvet

Introduction

Those of us working in bacterial genomics are all to familiar with de novo genome assembly. One of the first accessible and practical tools for bacterial genome assembly was Velvet. My group use Velvet a lot, and wrote the popular VelvetOptimiser software.

Since then, many alternatives to Velvet have appeared, including ABYSS, SOAPdenovo, ALLPATH-LG, SGA, Ray, and many others. The motivation for some of these alternatives was to improve performance and decrease RAM usage when assembling large, polyploid organisms which Velvet was not really designed to handle.

Despite these alternatives, Velvet has still thrived due to it having a strong user community, and still giving good, usable assemblies. But there is always room for improvement and new ideas, and I believe an excellent option for bacterial assemblies currently is SPAdes. It recently ranked very well in the GAGE-B assessment and in this post I will explain its relationship to Velvet in broad terms.

What's the same?

SPAdes is a de Bruijn graph based assembler, just like most short read assemblers, including Velvet. It breaks reads into fixed-size k-mers, builds a graph, cleans the graph, then finds paths through the graph. These paths end up as contigs.

SPAdes was originally intended for assembling MDA data. This is data that comes from single-cell sequencing using the multiple displacement amplification method for tiny amounts of input DNA. This produces wildly varying genome coverage, something which existing assemblers were not able to deal with well. But SPAdes by default now works with regular data, but it is neat that it can support MDA when required.

The target data source for SPAdes is Illumina reads. Like all de Bruijn graph based assemblers, they work best with shorter, high quality reads where indels are rare. For PGM and 454 data I would look elsewhere.

What's different?

The authors would argue, and the GAGE-B assessment supports the argument, that SPAdes does a better job than Velvet and other assemblers on microbial genome data. I have not had extensive experience with it yet, but have used it enough to now recommend it to others and trust it on my own data sets (well, as much as I trust any assembler!).

But there is a good reason SPAdes does better. It is really multiple tools in one. This integrated approach makes things much simpler to incorporate in pipelines. Here are the key steps SPAdes makes, as best I understand them:

Read error correction based on k-mer frequencies using BayesHammer
De Bruijn graph assembly at multiple k-mer sizes, not just a single fixed one.
Merging of different k-mer assemblies (good for varying coverage)
Scaffolding of contigs from PE/MP reads
Repeat resolution from PE/MP data using rectangle graphs
Contig error correction based on aligning the original reads with BWA back to contigs

Just like Velvet, it can use multiple threads for some parts of the algorithm. SPAdes produces a final "contigs.fasta" and "scaffolds.fasta" file, and a detailed log file so you can reconstruct your results. I think it is using more sophisticated dynamic methods for estimating k-mer coverage and cutoffs. Of course it takes longer to run than Velvet, but it is doing a lot more than Velvet does.

Conclusion

The SPAdes software is easy to install, has a nice clean interface, and follows my minimum standards for bioinformatics software. The authors are actively developing it, and respond to bug reports and questions. The results are good, and the computational requirements are reasonable. It is well worth trying on your own microbial data. So go and download it, try it out, and email your feedback today.

Sunday, 25 August 2013

Paired-end read confusion - library, fragment or insert size?

Introduction

When you do an Illumina sequencing run, you need to choose between single-end (SE) or paired-end (PE) sequencing. When sequencing, we chop up our DNA into small fragments, and then ligate some adaptors. Then, for SE, we only sequence one end of a DNA fragment. For PE, we sequence both ends of the same fragment:

fragment                  ========================================
fragment + adaptors    ~~~========================================~~~
SE read                   --------->
PE reads                R1--------->                    <---------R2
unknown gap                         ....................

The two reads you get from PE sequencing are referred to as R1 and R2, and they come from the same piece of DNA. Usually the length of the fragment is much longer than the length of R1+R2, so there is a "gap" in between them. Although we don't know the sequence of DNA in between R1 and R2, we have still gained useful information from the knowledge that R1 and R2 are next to each other with a known orientation and distance apart.

Mind the gap

There is a lot of confusion about the gap of unknown bases. You will encounter terms like "insert size", "fragment size", "library size" and variations thereof. The term "insert" comes from a time before NGS existed, when cloning DNA in E.coli vectors was standard business.

PE reads      R1--------->                    <---------R2
fragment     ~~~========================================~~~
insert          ========================================
inner mate                ....................

The main confusion is with "insert size". The name itself suggests it is the unknown gap because it is "inserted" between R1 and R2, but this is misleading. It is more accurate to think of the insert as the piece of DNA inserted between the adaptors which enable amplification and sequencing of that piece of DNA. So the "insert" actually encompasses R1 and R2 as well as the unknown gap between them. The name for the gap itself is better named "inner mate distance" because it is self-descriptive and can vary depending on what read lengths you sequenced a DNA library with.

Overlapping reads

The Illumina MiSeq instrument has added to the confusion recently. Firstly, it can produce PE reads of length 250bp. Secondly, the Nextera preparation method is sensitive and can produce a lot of small fragments, shorter than 500bp. This results in R1 and R2 actually overlapping each other!

fragment          ~~~========================================~~~
insert               ========================================
R1                   ------------------------->                    
R2                                   <-----------------------
overlap                              ::::::::::
stitched SE read     --------------------------------------->

This can actually be a desirable outcome, as you can stitch R1 and R2 together to make a super-long SE read, with extra confidence of the middle bases from consensus of the overlapping sections of R1 and R2.

Adaptor read-through

If the distribution of fragment sizes is too low, or very wide, you can get the situation where not only do the reads overlap, but they are longer than the fragment itself! This causes R1 and R2 to read into the adaptors:

tiny fragment       ~~~~========================~~~~
insert                  ========================
R1                      -------------------------->                    
R2                   <--------------------------
read-through         !!!                        !!!

If your MiSeq is configured properly, it will automatically trim/mask any adaptor sequence. This will be obvious by your FASTQ file containing reads of different length, or by the presence of lots of Ns at the 5' end of your reads. If it is not configured properly, you will get adaptors in your reads, and these will cause all sorts of problems with downstream applications. You should remove these using a read trimming tool.

Conclusion

Paired-end reads are a neat molecular biology trick. Remember that "insert" refers to the DNA fragment between the adaptors, and not the gap between R1 and R2. Instead we refer to that as the "inner mate distance". In some cases, when reads overlap, the inner mate distance can actually be negative. If you are using MiSeq data, you need to be vigilant about checking for adaptor read-through and overlapping reads.

References

Friday, 9 August 2013

Minimum standards for bioinformatics command line tools

I don't consider myself a good software engineer, or a good tester, a good documenter, or even that good a programmer. But I have used, and (tried to) installed a LOT of bioinformatics software over the last 12 years. I've also released a lot of software, and I try to make it as painless to use as possible. From these experiences, I bring you my "Ten rules for bioinformatics software".

1. Print something if no parameters are supplied

Unless your tool is a filter which works by manipulating stdin to stdout, you should always print out something (some help text, ideally) if the user runs your tool without all the required parameters. Just exiting quietly isn't helping anyone.

% biotool

Please use the --help option to get usage information.

2. Always have a "-h" or "--help" switch

The Unix tradition is for all commands to have a "-h" or "--help" switch, which when invoked, prints usage information about the command. Most languages come with a getopt() type library, so there is no excuse for not supporting this.

% biotool -h

Usage: biotool [options] <file.fq>

Options:

--rc reverse complement

--trim nn trim <nn> bases from 3' end first

--mask remove vector sequence contaminant

3. Have a "-v" or "--version" switch

Many bioinformatics tools today are used as part of larger pipelines, or put into the Galaxy toolshed. Because compatibility is dependent on the version of your tool being used, you should have a simple, machine-parseable way to identify what version of tool you have.

% biotool --version

biotool 1.3a

4. Use stderr for messages and errors

If you need to print an error message, are just printing out progress or log information, try and use stderr rather than stdout. Try to reserve stdout for use as your output channel, so that it can be used in Unix pipes to avoid temporary files.

% biotool reads.fq | fq2fa > clean.fq

biotool: processing reads.fq

fq2fa: converted 423421 reads

5. Validate your parameters

If you have command line options, do some validation or sanity checking on them before letting them through to your critical code. Many getopt() libraries support basic validation, but ultimately it is not that difficult to have a preamble with some "if not XXX { print ERROR ; exit }" clauses.

% biotool --trim -3 reads.fq

Error: --trim must be an integer > 0

6. Don't hard-code any paths

Often the tool you write depends on some other files, such as config files or database/model files. The easiest, but wrong and annoying, thing to do is just put

% biotool --mask reads.fq

Error: can't load /home/steven/work/biotool/data/vector.seq

# ARRRGGGGHHH!

7. Don't pollute the command-line name space

You've come up with a new tool called "BioTool". The command you want everyone to invoke is called "biotool", but it is just a master script which runs lots of other tools. Unfortunately you used lots of generic names like "fasta2fastq", "convert", "filter" .. and so on, and you've put them all in the same folder at the main "biotool" script. So when I install BioTool, my PATH gets filled with rubbish. Please don't do this.

% ls -1 /opt/BioTool/
biotool
convert # whoops, clashes with ImageMagick!
load-hash.py # hello Titus :-)
filter

diff # whoops, clashes with standard Unix tool!
test.sh # <face-palm>

The first solution is to prefix all your sub-tools and helper scripts with "biotool". The second solution, if they are scripts only, is to not make them executable (so they don't go in PATH) and invoke the via the interpreter (perl, python, ...) explicitly from biotool. The third solution is too put them all in a separate folder (eg. auxiliary/, scripts/ ...) and explicitly call them (but take note of #6 above).

8. Don't distribute bare JAR files

If your tool is written in Java and is distributed as a JAR file, please write a simple shell wrapper script to make it simple to invoke. The three lines below are all you need (in the simple case) and you will make your users much happier.

#!/bin/bash
PREFIX=$(dirname $0)
java -Xmx500m -jar $PREFIX/BioTool.jar $*

9. Check that your dependencies are installed

I've installed BioTool, and I start running it, and all looks good. Then 2 hours later it spits out an error like "error: can't run sff2CA". This could all be avoided if biotool checked all the external tools it needed before it commenced, and save your users associating your software with pain.

% biotool --stitch R1.fq R2.fq
This is biotool 1.3a
Loaded config
Checking for 'bwa': found /usr/bin/bwa
Checking for 'samtools': ERROR - could not find 'samtools'
Exiting.

10. Be strict if you are still a Perl tragic like me

If you're old like me and Perl is still your native tongue, at least play it a little bit safer by starting all your scripts with the following lines:

#!/usr/bin/env perl

use strict;

use warnings;

use Fatal;

I'll shut up now :-)

Thursday, 1 August 2013

How to pronounce "de Bruijn"

For those who work in bioinformatics the phrase "de Bruijn graph" will be familiar. Although it technically refers to a mathematical construct, it is commonly used in bioinformatics to refer to the data structure used to store DNA read connections in many de novo assembly algorithms such as EULER (*), Velvet, Gossamer and lots of other implementations.

Although we all agree the de Bruijn graph is a "really neat idea", what we don't seem to agree on is how to pronounce it!

Now, the de Bruijn graph is named after the mathematician Nicolaas Govert de Bruijn. Nicolaas was from the Netherlands, so he is Dutch. The word "bruijn" essentially means the colour brown - although modern spelling would be "bruin".

Here are the various ways I have heard it pronounced, some of them in successive talks at ISMB 2013 in the same session even:

broon
broo-en, brewin'
bra-jen, brar-djen
brin
brun
bruggin, bruggen
broin, broyn

All of these are wrong. Even this BioStar post appears to be wrong...

The closest sounding word in English is "BROWN".

The closest sounding word in German is probably "BRAUN".

Here is a recording of his whole name being pronounced in Dutch.

If you want to add some authenticity, try and roll the "R" a little bit.

Hopefully this puts an end to the confusion once and for all.

(*) "Euler" is another German mathematician whose name is often mispronounced in English. Most people say "yool-er" but it should be "oiler" (audio file). So assembly software takes an "Oil-air-e-an Path through a de Brown graph" :-)