Friday, 30 August 2013

How Spades differs from Velvet

Introduction

Those of us working in bacterial genomics are all too familiar with de novo genome assembly. One of the first accessible and practical tools for bacterial genome assembly was Velvet. My group use Velvet a lot, and wrote the popular VelvetOptimiser software.

Since then, many alternatives to Velvet have appeared, including ABySS, SOAPdenovo, ALLPATHS-LG, SGA, Ray, and many others. The motivation for some of these alternatives was to improve performance and decrease RAM usage when assembling large, polyploid genomes, which Velvet was not really designed to handle.

Despite these alternatives, Velvet has thrived, thanks to its strong user community and the good, usable assemblies it still produces. But there is always room for improvement and new ideas, and I believe SPAdes is currently an excellent option for bacterial assemblies. It recently ranked very well in the GAGE-B assessment, and in this post I will explain its relationship to Velvet in broad terms.

What's the same?

SPAdes is a de Bruijn graph based assembler, just like most short read assemblers, including Velvet. It breaks reads into fixed-size k-mers, builds a graph, cleans the graph, then finds paths through the graph. These paths end up as contigs.
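As a point of reference, Velvet exposes exactly this workflow as two separate commands; a minimal sketch, assuming interleaved paired-end FASTQ reads and k=31 (the directory and file names are placeholders):

```shell
# Stage 1: break the reads into fixed-size k-mers (here 31-mers) and hash them
velveth asm_dir 31 -shortPaired -fastq reads_interleaved.fastq
# Stage 2: build and clean the de Bruijn graph, then emit the paths as contigs
velvetg asm_dir -exp_cov auto -cov_cutoff auto
# The contigs appear in asm_dir/contigs.fa
```

SPAdes performs the same conceptual stages, but behind a single command.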

SPAdes was originally intended for assembling MDA data. This is data that comes from single-cell sequencing, using the multiple displacement amplification method for tiny amounts of input DNA. MDA produces wildly varying genome coverage, something which existing assemblers were not able to deal with well. SPAdes now handles regular data by default, but it is neat that it can still support MDA when required.

The target data source for SPAdes is Illumina reads. Like all de Bruijn graph based assemblers, it works best with shorter, high quality reads where indels are rare. For PGM and 454 data I would look elsewhere.

What's different?

The authors would argue, and the GAGE-B assessment supports the argument, that SPAdes does a better job than Velvet and other assemblers on microbial genome data. I have not had extensive experience with it yet, but have used it enough to now recommend it to others and trust it on my own data sets (well, as much as I trust any assembler!).

But there is a good reason SPAdes does better. It is really multiple tools in one. This integrated approach makes things much simpler to incorporate in pipelines. Here are the key steps SPAdes takes, as best I understand them:

  1. Read error correction based on k-mer frequencies, using BayesHammer
  2. De Bruijn graph assembly at multiple k-mer sizes, not just a single fixed one
  3. Merging of the different k-mer assemblies (good for varying coverage)
  4. Scaffolding of contigs from PE/MP reads
  5. Repeat resolution from PE/MP data using rectangle graphs
  6. Contig error correction, based on aligning the original reads back to the contigs with BWA

Just like Velvet, it can use multiple threads for some parts of the algorithm. SPAdes produces final "contigs.fasta" and "scaffolds.fasta" files, and a detailed log file so you can reconstruct your results. I think it uses more sophisticated dynamic methods for estimating k-mer coverage and cutoffs. Of course it takes longer to run than Velvet, but it is doing a lot more than Velvet does.
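All of those steps run from a single command; a minimal sketch, assuming paired-end Illumina reads (the file names, thread count and output directory are placeholders):

```shell
# Error correction, multi-k assembly, scaffolding and contig polishing in one go
spades.py \
    -1 reads_R1.fastq.gz \
    -2 reads_R2.fastq.gz \
    --careful \
    -t 16 \
    -o spades_out
# Final output: spades_out/contigs.fasta and spades_out/scaffolds.fasta
```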

Conclusion

The SPAdes software is easy to install, has a nice clean interface, and follows my minimum standards for bioinformatics software. The authors are actively developing it, and respond to bug reports and questions. The results are good, and the computational requirements are reasonable. It is well worth trying on your own microbial data. So go and download it, try it out, and email your feedback today.

14 comments:

  1. Nice post Torsten!

    So you always use the --careful and --rectangles flags?

    In the SPAdes manual it says that the --rectangles option is experimental so I left it out of my parameterisation (see below), would you recommend it?

    Also, the --careful flag really slows things down in my hands, takes about 3x longer, similar for you?

    http://bitsandbugs.org/2013/06/28/assembly-optimisation-impact-of-error-correction-and-a-new-assembler-spades/

    ReplyDelete
    Replies
    1. I do not currently use --rectangles.

I always use the --careful flag. It seems to reduce the number of errors I need to correct afterwards with Nesoni, so I think it is mostly getting it right. I have a 64-core machine, so the BWA step goes pretty fast (I usually use 16 threads).

      Delete
  2. Let me clarify the stuff a bit :) --rectangles is the proof-of-concept implementation of the rectangles algorithm straight from the paper. Actually, it's a standalone tool which one can feed with a graph from any other assembler, provided it's in the proper format. However, it's not very efficient in either space or time.

    The rectangle graph is a nice abstraction (and actually we feel it's the proper way to think about repeat resolution using rectangles); however, it cannot be generalized straightforwardly to multiple libraries. So we went beyond this, and SPAdes 2.5 implements a brand new repeat resolution algorithm which extends the ideas of rectangles and can utilize multiple libraries (and even different "sources" of genomic distance information). The paper is in preparation now.

    The last step of --careful processing can be pretty time consuming when the coverage is high, because it aligns all the reads back to the contigs, builds the positional de Bruijn graph and tries to use it to correct for mismatches and short indels.

    PS: Let me add some spoiler: stay tuned for SPAdes support for PGM / Proton / PacBio stuff (and hybrid as well ;) )

    ReplyDelete
    Replies
    1. Anton, thanks for replying here.

      As you know, --rectangles crashes every now and then, especially when the reads are overlapping PE from MiSeq for example. I personally don't mind if it is not efficient in space/time, as long as I know it will give a higher quality assembly!

      I'll also re-state that --careful seems to correct errors that I would normally correct post-assembly in other ways. What I would like is the ability to run the correction stage on an EXISTING assembly from another tool eg. contigs from Newbler, and correct them with MiSeq data.

      Support for PGM would be great. We have lots of PGM data including PGM 6Kbp MATE PAIR (linker sequence). I'd be happy to share some if you need it. We currently use a binary-patched Newbler to support it, and it does pretty well. If Spades could combine that with MiSeq for the same sample I'd be in heaven.

      Delete
  3. > As you know, --rectangles crashes every now and then, especially when the reads are overlapping PE from MiSeq for example. I personally don't mind if it is not efficient in space/time, as long as I know it will give a higher quality assembly!

    Just do not use --rectangles. In the SPAdes 2.5 world it should be thought of as deprecated :)

    > I'll also re-state that --careful seems to correct errors that I would normally correct post-assembly in other ways. What I would like is the ability to run the correction stage on an EXISTING assembly from another tool eg. contigs from Newbler, and correct them with MiSeq data.

    In fact you can: spades_pipeline/corrector.py is the tool in question. However, it might be user-unfriendly when run standalone. Also, note that --careful also tweaks the assembler options to be less aggressive.

    ReplyDelete
    Replies
    1. Thanks for the reply Anton. The latest SPAdes is more stable, so we are getting fewer of these problems now.

      Delete
  4. Great post as usual Torsten! Ever since I started assembling bacterial genomes your posts have been very useful to me! I'm surely off topic, but I'd like to know your opinion about software for integrating data from different assemblers (I mostly use Velvet, SOAPdenovo2, SPAdes and the a5_pipeline), such as CISA or MAIA. Keep up the good work!

    ReplyDelete
    Replies
    1. I have never had much luck with hybrid assembly. I tried using Newbler with 454 + Illumina data, but found it did worse. I have not tried CISA or MAIA but I worry they have not been maintained since release and publication?

      Delete
  5. Great post!

    Dr Torsten, Spades or Mira? Which one do you prefer?

    ReplyDelete
    Replies
    1. I don't have enough experience with Mira yet, although I know some of my UK colleagues speak highly of it. It used to need all your reads to be the same length which was a bit problematic for me.

      Delete
  6. How can you prevent velvetg and velveth from consuming the entire swap memory? Yes, we do know that it is memory-hungry software... is there any way to tell it to use only part of the server's RAM? We have 264 GB of RAM and 48 CPUs, and still we get broken pipes :(

    ReplyDelete
    Replies
    1. You must be assembling a very large genome to use that much memory. You have 3 choices:

      1. use a different assembler
      Try something like Gossamer, Spades, Minia or SGA, which use more memory-efficient data structures.

      2. reduce the data
      Use khmer digital normalization to reduce your data down to 20x coverage and remove noisy and non-informative reads.
      Or simply quality-filtering your reads and using only half of them might help.


      3. get more RAM
      Not really feasible for you probably, but I was lucky to get access to a 1 TB machine a few years ago when Velvet was the only assembler available.
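      For option 2, a digital normalisation run with khmer might look something like this; a sketch only, where the file names are placeholders and the -N/-x counting-table sizes are guesses you should tune to your available RAM:

      ```shell
      # Interleave the paired reads, then normalise to ~20x median k-mer coverage
      interleave-reads.py reads_R1.fastq reads_R2.fastq > reads_interleaved.fastq
      normalize-by-median.py -p -k 20 -C 20 -N 4 -x 2e9 reads_interleaved.fastq
      # The reduced read set is written to reads_interleaved.fastq.keep
      ```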

      Delete
    2. Hello sir,
      Sorry to bother you again. Yes, my coverage is 27x and I was successfully able to run VelvetOptimiser and velveth. Now in velvetg, at the k-value of 61 (step-down size of 10), my 270 GB of RAM is becoming fully occupied. I was reading Jeremy Leipzig's blog; he suggests using the ulimit -v option before every process is run. Will it affect how my RAM is utilized? Can you suggest how to utilize the same?

      Delete
    3. ulimit just controls the limits each process you run will have. Most people have -v set to unlimited by default (no memory limit). You simply do not have enough RAM to use Velvet at k=61 on your data set. Please just try a more memory-efficient assembler: Spades, Minia, Gossamer, Abyss.
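      To make the ulimit part concrete, a minimal sketch (the 200 GB cap is just an example figure):

      ```shell
      # ulimit -v caps the virtual memory of this shell and everything it launches.
      # It takes kilobytes, so 200 GB = 200 * 1024 * 1024 kB.
      ulimit -v $((200 * 1024 * 1024))
      ulimit -v   # prints the new limit: 209715200
      ```

      Run velvetg from that same shell afterwards; if it exceeds the cap its allocations fail and it dies quickly, instead of dragging the whole machine into swap.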

      Delete