The Genome Factory: How SPAdes differs from Velvet

Friday, 30 August 2013

How SPAdes differs from Velvet

Introduction

Those of us working in bacterial genomics are all too familiar with de novo genome assembly. One of the first accessible and practical tools for bacterial genome assembly was Velvet. My group use Velvet a lot, and wrote the popular VelvetOptimiser software.

Since then, many alternatives to Velvet have appeared, including ABySS, SOAPdenovo, ALLPATHS-LG, SGA and Ray. The motivation for some of these alternatives was to improve performance and decrease RAM usage when assembling large, polyploid genomes, which Velvet was not really designed to handle.

Despite these alternatives, Velvet has continued to thrive thanks to its strong user community and because it still gives good, usable assemblies. But there is always room for improvement and new ideas, and I believe an excellent option for bacterial assemblies currently is SPAdes. It recently ranked very well in the GAGE-B assessment, and in this post I will explain its relationship to Velvet in broad terms.

What's the same?

SPAdes is a de Bruijn graph based assembler, just like most short read assemblers, including Velvet. It breaks reads into fixed-size k-mers, builds a graph, cleans the graph, then finds paths through the graph. These paths end up as contigs.
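
To make the idea concrete, here is a minimal Python sketch of those steps: splitting reads into k-mers, building the graph, and walking a path to get a contig. It is a toy illustration only, not how Velvet or SPAdes are implemented. The function names (kmers, build_de_bruijn, walk_unambiguous) and the example reads are made up for this post, and the sketch ignores reverse complements, sequencing errors and graph cleaning.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every k-mer (substring of length k) in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph.

    Nodes are (k-1)-mers. Each k-mer adds an edge from its prefix to
    its suffix, and the edge count acts as a crude coverage value.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, {})) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop forever on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Three made-up overlapping reads from the same short sequence.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)
```

Real assemblers spend most of their effort on the parts this sketch skips: removing tips and bubbles caused by errors, and deciding what to do at branches caused by repeats.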

SPAdes was originally intended for assembling MDA data, that is, data from single-cell sequencing that uses the multiple displacement amplification method to amplify tiny amounts of input DNA. This produces wildly varying genome coverage, something which existing assemblers were not able to deal with well. SPAdes now works with regular data by default, but it is neat that it can support MDA when required.

The target data source for SPAdes is Illumina reads. Like all de Bruijn graph based assemblers, it works best with shorter, high quality reads where indels are rare. For PGM and 454 data I would look elsewhere.

What's different?

The authors would argue, and the GAGE-B assessment supports the argument, that SPAdes does a better job than Velvet and other assemblers on microbial genome data. I have not had extensive experience with it yet, but have used it enough to now recommend it to others and trust it on my own data sets (well, as much as I trust any assembler!).

But there is a good reason SPAdes does better. It is really multiple tools in one, and this integrated approach makes it much simpler to incorporate into pipelines. Here are the key steps SPAdes performs, as best I understand them:

Read error correction based on k-mer frequencies using BayesHammer
De Bruijn graph assembly at multiple k-mer sizes, not just a single fixed one
Merging of different k-mer assemblies (good for varying coverage)
Scaffolding of contigs from PE/MP reads
Repeat resolution from PE/MP data using rectangle graphs
Contig error correction based on aligning the original reads with BWA back to contigs
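
The first step in that list rests on a simple observation: k-mers that contain a sequencing error tend to be rare, while k-mers from the real genome are seen many times. The Python sketch below only illustrates that counting idea. It is not BayesHammer, which does considerably more than apply a count threshold, and the function names, the min_count threshold and the example reads are all made up for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspicious_kmers(reads, k, min_count=2):
    """Return k-mers seen fewer than `min_count` times (likely errors)."""
    counts = count_kmers(reads, k)
    return {kmer for kmer, n in counts.items() if n < min_count}


# Three identical made-up reads plus one with a single substitution (A -> T).
reads = ["ACGTACGA", "ACGTACGA", "ACGTACGA", "ACGTTCGA"]
flagged = suspicious_kmers(reads, k=4, min_count=2)
print(sorted(flagged))
```

Every k-mer overlapping the substituted base is flagged, which is why a single error shows up as a short run of low-frequency k-mers. With wildly uneven coverage, as in MDA data, a fixed threshold like this breaks down, which is part of why a more careful method is needed.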

Just like Velvet, it can use multiple threads for some parts of the algorithm. SPAdes produces a final "contigs.fasta" and "scaffolds.fasta" file, and a detailed log file so you can reconstruct your results. I think it uses more sophisticated, dynamic methods than Velvet for estimating k-mer coverage and cutoffs. Of course it takes longer to run than Velvet, but it is doing a lot more than Velvet does.
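
A quick sanity check on a contigs file is to count the contigs and compute the N50. The sketch below is a generic FASTA summary in plain Python and is not part of SPAdes. The helper names and the contig names in the example are made up, and it parses a small in-memory example so that it runs without any assembly output.

```python
import io


def read_fasta_lengths(handle):
    """Return the sequence length of each record in a FASTA stream."""
    lengths = []
    current = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Hypothetical three-contig file, held in memory for the example.
example = io.StringIO(">contig_1\nACGTACGTAC\n>contig_2\nACGTAC\n>contig_3\nACGT\n")
lengths = read_fasta_lengths(example)
print(len(lengths), sum(lengths), n50(lengths))
```

For a real run, replace the io.StringIO example with an open file handle on the "contigs.fasta" or "scaffolds.fasta" file in your output directory.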

Conclusion

The SPAdes software is easy to install, has a nice clean interface, and follows my minimum standards for bioinformatics software. The authors are actively developing it, and respond to bug reports and questions. The results are good, and the computational requirements are reasonable. It is well worth trying on your own microbial data. So go and download it, try it out, and email your feedback today.

14 comments:

flashton, 30 August 2013 at 18:39
Nice post Torsten!

So you always use the --careful and --rectangles flags?

In the SPAdes manual it says that the --rectangles option is experimental so I left it out of my parameterisation (see below), would you recommend it?

Also, the --careful flag really slows things down in my hands, takes about 3x longer, similar for you?

http://bitsandbugs.org/2013/06/28/assembly-optimisation-impact-of-error-correction-and-a-new-assembler-spades/
anton.korobeynikov, 31 August 2013 at 04:18
Let me clarify the stuff a bit :) --rectangles is the proof-of-concept implementation of the rectangles algorithm straight from the paper. Actually, it's a standalone tool which one can feed with a graph from any other assembler, provided that it's in the proper format. However, it's not very efficient in either space or time.

The rectangle graph is a nice abstraction (and actually we feel it's a proper way to think about repeat resolution using the rectangles), however, it cannot be generalized straightforwardly to multiple libraries. So, we went beyond this and SPAdes 2.5 implements a brand new repeat resolution algorithm which extends the ideas of rectangles and can utilize multiple libraries (and even different "sources" of genomic distance information). The paper is in preparation now.

The last step of --careful processing can be pretty time consuming when the coverage is high, because it aligns all the reads back to the contigs, builds the positional de Bruijn graph and tries to use it to correct for mismatches and short indels.

PS: Let me add some spoiler: stay tuned for SPAdes support for PGM / Proton / PacBio stuff (and hybrid as well ;) )
anton.korobeynikov, 31 August 2013 at 08:24
> As you know, --rectangles crashes every now and then, especially when the reads are overlapping PE from MiSeq for example. I personally don't mind if it is not efficient in space/time, as long as I know it will give a higher quality assembly!
Just do not use --rectangles. In the SPAdes 2.5 world it should be thought of as deprecated :)

> I'll also re-state that --careful seems to correct errors that I would normally correct post-assembly in other ways. What I would like is the ability to run the correction stage on an EXISTING assembly from another tool eg. contigs from Newbler, and correct them with MiSeq data.
In fact you can. spades_pipeline/corrector.py is the tool in question. However, it might be user-unfriendly when run standalone. Also, note that --careful also tweaks the assembler options to be not so aggressive.
Paolo, 20 September 2013 at 17:11
Great post as usual Torsten! Ever since I started assembling bacterial genomes your posts have been very useful to me! I'm surely off topic, but I'd like to know your opinion about software for integrating data from different assemblers (I mostly use Velvet, SOAPdenovo2, SPAdes and the a5_pipeline) such as CISA or MAIA. Keep up the good work!
Unknown, 31 October 2013 at 16:43
Great post!

Dr Torsten, Spades or Mira? Which one do you prefer?
HelixCode, 7 April 2014 at 17:09
How can you prevent velvetg and velveth from consuming the entire swap memory? Yes, we do know that it is memory-hungry software... Is there any way to tell it to use only a part of the server's RAM? We have a capacity of 264 GB RAM and 48 CPUs, and still we get broken pipes :(