Saturday, 9 June 2012

Recent changes in the Velvet de novo ecosystem

A few weeks ago we managed to lure an old colleague, Dave Powell, back into academia to join the VBC team. One of his many talents is software engineering, so I immediately got him working on some projects related to the Velvet de novo assembler that I really wanted to do but could never find time.

One of the plans was to write a GUI for Velvet. Browsing sourceforge and the like, it seems a few people had attempted and failed. Last year I supervised a 4th year software engineering group project, who managed to get a basic prototype working, but the code was unmaintainable. So Dave and I planned it out, and he had a prototype done by the next day. He implemented it in JRuby, which can be distributed as a standard portable Java .jar file. We called it VAGUE.

However, there are few issues with the current command line Velvet that was limiting the usefulness of VAGUE to its beginner target audience:

  1. Paired-end read files have to be interleaved first, even though Illumina produce them as separate files
  2. The user needed to know what format they were in eg. fastq, fasta
  3. It only supports GZIP compression, even though a lot of sequencing centres are using BZIP2
  4. People don't know what K value to start with
We could have put these features into VAGUE, but it seemed much more sensible to put them directly into Velvet itself. So that's what Dave did! With a bit of code refactoring, Velvet can now do these things.
  • velveth now has a -separate option for paired-read files (default is -interleaved):
    velveth dir 31 -shortPaired -fastq -separate reads_R1.fq reads_R2.fq

  • velveth now has a -fmtAuto option which auto-detects file-type and compression-type. Note that each file can be of different format and compression - very elegant.
    velveth dir 31 -shortPaired -separate -fmtAuto left.fastq.gz right.fa.bz2

  • velveth now supports BZIP2 compressed files, and if you have pbzip2 (parallel bzip) installed it will use that, and if you have pigz (parallel gzip) it will also use that
For the issue of choosing K, I wrote a very simple Perl script which actually counts the K-mers in your actual reads and tells you the K-mer coverage for your target genome for all possible values of K.  It's called VelvetK and it has a --best option which we use in VAGUE to automatically choose a reasonable K-mer value to assemble with. 

I think these new features and tools will help introduce Velvet and de novo assembly to more people, and hopefully help move power to the biologists and give more time to the bioinformaticians to develop better tools.