The Genome Factory: Using Velvet with mate-pair sequences

Saturday, 8 September 2012

Introduction

Illumina sequencing instruments (HiSeq, MiSeq, Genome Analyzer) can produce three main types of reads when sequencing genomic DNA:

Single-end
Each "read" is a single sequence from one end of a DNA fragment. The fragment is usually 200-800 bp long, and the length read from it can be chosen between 50 and 250 bp.
Paired-end
Each "read" is two sequences (a pair), one from each end of the same genomic DNA fragment. The distance between the reads on the original genome sequence is equal to the length of the DNA fragment that was sequenced (usually 200-800 bp).
Mate-pair
Like paired-end reads, each "read" is two sequences, one from each end of the same DNA fragment, but the DNA fragment has been engineered by a circularization process such that the distance between the reads on the original genome sequence is much longer (say 3000-10000 bp) than the proxy DNA fragment (200-800 bp).

Single-end library ("SE")

When we got the original Illumina Genome Analyzer, all it could do was 36 bp single-end reads, and each lane gave us a massive 250 Mbp, and we had to walk 7 miles through snow in the dark to get it. Ok, that last bit is clearly false as we don't get snow in Australia and we speak metric here, but the point is that there is still plenty of legacy SE data around, and SE reads are still used in RNA-Seq sometimes. Let's imagine our data was provided as a standard FASTQ file called SE.fq:

velveth Dir 31 -short -fastq SE.fq

velvetg Dir -exp_cov auto -cov_cutoff auto

I strongly recommend enabling the -exp_cov auto and -cov_cutoff auto options. They will almost always improve the quality of your assemblies.

Paired-end library ("PE")

Paired-end reads are the standard output of most Illumina sequencers these days, currently 2x100bp for the HiSeq and 2x150bp for the GAIIx and MiSeq, but they are all migrating to 2x250bp soon. The two sequences per paired read are typically distributed in two separate files: the "1" file contains all the "left" reads and the "2" file contains all the corresponding "right" reads. Let's imagine our paired-end run gave us two files in standard FASTQ format, PE_1.fq and PE_2.fq:

velveth Dir 31 -shortPaired -separate -fastq PE_1.fq PE_2.fq

velvetg Dir -exp_cov auto -cov_cutoff auto

Previously you had to interleave the left and right files for Velvet, but we recently added support to Velvet for the -separate option which we hope is now saving time and disk space throughout the Velvetsphere!
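For readers stuck on an older Velvet without -separate, interleaving just means alternating each left record with its right mate in a single file. Here is a minimal sketch, assuming both files hold plain 4-line FASTQ records in the same read order; the function name interleave_fastq and the output file name used below are made up for illustration:

```shell
# Hypothetical helper: interleave two FASTQ files record by record.
# Assumes 4 lines per record (no wrapped sequences) and identical read order.
interleave_fastq() {
    left=$1
    right=$2
    awk -v right="$right" '
        { print }                      # copy every line of the left file
        NR % 4 == 0 {                  # a left record has just ended
            for (i = 0; i < 4; i++) {  # so emit the matching right record
                if ((getline line < right) > 0) print line
            }
        }
    ' "$left"
}
```

Something like "interleave_fastq PE_1.fq PE_2.fq > PE_interleaved.fq" followed by "velveth Dir 31 -shortPaired -fastq PE_interleaved.fq" should then behave like the -separate command above.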

Mate-pair library ("MP")

Mate-pair reads are extremely valuable in a de novo setting as they provide long-range information about the genome, and can help link contigs together into larger scaffolds. They have been used reliably for years on the 454 FLX platform, but used less often on the Illumina platform. I think the main reasons for this are the poorer reliability of the Illumina mate-pair protocol and the larger amount of DNA required compared to a PE library.

We can consider MP reads as the same as PE reads, but with a larger distance between them ("insert size"). But there is one technical difference due to the circularization procedure used in their preparation. PE reads are oriented "opp-in" (L=>.....<=R), whereas MP reads are oriented "opp-out" (L<=.....=>R). Velvet likes its paired reads to be in opp-in orientation, so we need to reverse-complement all our MP reads first, which I do here with the EMBOSS "revseq" tool.

revseq -sequence MP_1.fq -sformat1 fastq-sanger -outseq rcMP_1.fq -osformat2 fastq-sanger -notag

revseq -sequence MP_2.fq -sformat1 fastq-sanger -outseq rcMP_2.fq -osformat2 fastq-sanger -notag

velveth Dir 31 -shortPaired -separate -fastq rcMP_1.fq rcMP_2.fq

velvetg Dir -exp_cov auto -cov_cutoff auto -shortMatePaired yes

Early Illumina MP libraries are often contaminated with PE reads (the so-called shadow library) which are the result of imperfect selection of biotin-marked fragments in the circularization process. There is a special option in velvetg (not velveth) called -shortMatePaired added by Sylvain Foret which informs Velvet that a paired channel is MP but may contain PE reads, which helps it to account for them and avoid mis-scaffolding. I recommend using this no matter how pure you think your MP library is.
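If EMBOSS is not to hand, the reverse-complement step in this section can be approximated with plain awk. This is only a sketch: revcomp_fastq is a hypothetical helper, it assumes 4-line FASTQ records, and it reverses the quality string so that each quality value stays attached to its base:

```shell
# Hypothetical helper: reverse-complement every read in a FASTQ stream.
# Assumes 4-line records; characters other than ACGT/acgt (e.g. N) are kept as-is.
revcomp_fastq() {
    awk '
        function revcomp(s,    i, c, p, out) {
            out = ""
            for (i = length(s); i > 0; i--) {       # walk the read backwards
                c = substr(s, i, 1)
                p = index("ACGTacgt", c)            # position in the lookup string
                out = out (p > 0 ? substr("TGCAtgca", p, 1) : c)
            }
            return out
        }
        function reverse(s,    i, out) {
            out = ""
            for (i = length(s); i > 0; i--) out = out substr(s, i, 1)
            return out
        }
        NR % 4 == 2 { print revcomp($0); next }     # sequence line
        NR % 4 == 0 { print reverse($0); next }     # quality line
        { print }                                   # header and "+" lines
    '
}
```

It reads from standard input, so "revcomp_fastq < MP_1.fq > rcMP_1.fq" (and the same for MP_2.fq) would produce the files used in the velveth command above.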

Combining them all! (SE + PE + MP)

When de novo assembling multiple libraries in Velvet, you should order them from smallest insert size to largest insert size. For our case, this means the SE first, then the PE, then the MP. Each library must go in its own "channel" as it comes from a differently prepared DNA library. Channels are specified in Velvet with a numerical suffix on the read-type parameter (no suffix means the first channel):

velveth \
    Dir 31 \
    -short -fastq SE.fq \
    -shortPaired2 -separate -fastq PE_1.fq PE_2.fq \
    -shortPaired3 -separate -fastq rcMP_1.fq rcMP_2.fq

velvetg \
    Dir \
    -exp_cov auto -cov_cutoff auto \
    -shortMatePaired3 yes

Note that the -shortMatePaired option has been allocated to channel 3 now (the -shortPaired3 library) as that is the MP channel.
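As a reader points out in the comments, a default Velvet build only supports two channels, so the -shortPaired3 channel above needs binaries compiled with a larger CATEGORIES value. A hedged sketch of the rebuild follows; the wrapper name build_velvet and the source directory are placeholders, while CATEGORIES and MAXKMERLENGTH are compile-time settings of Velvet's Makefile:

```shell
# Hypothetical wrapper around Velvet's Makefile.
#   $1 = path to the unpacked Velvet source tree (placeholder)
#   $2 = number of channels to compile in (CATEGORIES)
#   $3 = largest K the binaries should accept (MAXKMERLENGTH)
build_velvet() {
    make -C "$1" "CATEGORIES=$2" "MAXKMERLENGTH=$3"
}
```

For the three-library run in this post, "build_velvet ./velvet_src 3 31" would be enough. If the tree has already been built with other settings, running "make clean" in it first is the safe option.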

Conclusions

It's relatively easy to get up and running with Velvet, but when your projects become more complicated, the methods in this post should help you. But if you prefer a nice GUI to take care of most of the issues discussed here, I recommend using our Velvet GUI called VAGUE (Velvet Assembler Graphical User Environment).

47 comments:

Linda, 9 September 2012 at 23:57
That's a very informative post. Do you know how you would handle SOLiD reads? Reverse complementing SOLiD reads is not the same as for Illumina. Also, can you do an assembly with mate-pair reads alone?

Is VAGUE able to handle SOLiD data?

Unknown, 10 September 2012 at 20:32
Great post Torsten. Do you have any actual stats on how much better this makes an assembly with real world data? PS. I can't wait for a decent MiSeq MP protocol, PE is rubbish for de novo assembly.

Linda, Velvet does a terrible job of SOLiD data. Your best option is to assemble a genome with Illumina or 454 or Torrent data, then overlay the SOLiD LMP data and use it to fix some errors, and to order your contigs.

Torsten Seemann, 11 September 2012 at 10:12
De novo assembly of SOLiD colour space data is fraught with problems. All assembly software I know only works in base space. The SOLiD community site has details of a pipeline to convert/assemble/merge but I don't think it works very well.

You are right that the orientation of SOLiD mate pairs is different again; I think it is (L=> R=>).

Yes, you can do an assembly of just MP reads.

Ultimately de novo assembly works best with very long SE reads. PE and MP reads are a "hack" around this. The variability in the insert/fragment size makes linking PE/MP reads a bit tricky, although Pevzner is working on "rectangle graphs" for this sort of data.

Anonymous, 22 September 2012 at 23:05
I really like the '-separate' option! Is it true that this option is not documented in the manual?

Torsten Seemann, 26 September 2012 at 19:52
That is false. The -separate option has been in the Velvet manual (Manual.pdf) since version 1.2.07.

https://github.com/tseemann/velvet/raw/master/Manual.pdf

WallyCuda, 18 December 2012 at 09:11
Hi Torsten,

Thanks for the very informative post! It was just what (I thought) I needed. I'm trying to use VelvetOptimiser to assemble a small eukaryotic genome from two Illumina HiSeq libs, one PE and one MP.

Do you know if/how I can pass the velvetg commands for the MP lib as you did? It seems that one can pass most Velvet commands via VelvetOptimiser, but there is not a lot of documentation on doing something like this... at least I've yet to find it.
Thanks
Walt

Torsten Seemann, 19 December 2012 at 08:53
Yes, you can pass through the velveth and velvetg options in this post to VelvetOptimiser with the -f and -o options. These are all described in the VOpt manual in the tarball you download. If you have any trouble, just email Simon Gladman with the command line that failed. His email is in the manual.

But first see if you can assemble without VOpt for a single K value. If you can't get that working, there is no point putting it into VOpt.

Unknown, 16 February 2013 at 00:00
Hi Torsten, thanks for your informative post
one little thing:
when using revseq you have to specify both input and output format correctly and explicitly, otherwise (e.g. with your code) you end up with a fasta file...

revseq -sequence MP.fq -sformat1 fastq-sanger -outseq rc_MP.fq -osformat2 fastq-sanger -notag

would be correct for newer Illumina data.
Cheers

Torsten Seemann, 16 February 2013 at 10:53
Harald,
Thanks for the information on EMBOSS revseq. To be honest, I don't use revseq - I use my own tool - but I wanted my post to be portable. I should have tested it properly beforehand.
Torsten.

Unknown, 28 May 2013 at 02:17
Hi,
So I am trying to assemble a low complexity genome. The simulated paired-end reads (Illumina MiSeq) which I am getting out of MetaSim show good genomic coverage. But it gives just one fasta file which I am using as an input for velveth.
My aim is to get a good contig length distribution which I am not getting. I guess I am doing something wrong in giving input parameters. Is it that velvet does not accept a single fasta file for paired read information? Do I have to input 2 different files (forward read and reverse read files)?

Thanks
Ashwani

Torsten Seemann, 28 May 2013 at 06:37
Velvet can accept paired-end reads in two ways. The first is as separate left and right files (via the -separate option). The second is via a single interleaved/shuffled file (default).

A large low complexity genome will never give a good assembly using only paired end reads because there are too many repeats which can not be resolved.

Torsten Seemann, 8 June 2013 at 16:28
I use Velvet with 150bp and even 250bp MiSeq reads without any problems. With these length reads the best K-mer value is around K=100 rather than K=31 like it used to be with very short 36bp reads. But to achieve K=100, you still need to ensure your depth is high enough to make the K-mer coverage (not read coverage) high enough (50-100).
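The distinction between read coverage and K-mer coverage can be made concrete with the relation given in the Velvet manual, Ck = C * (L - K + 1) / L, where C is read coverage and L is read length. The two helpers below are a rough sketch with made-up names (kmer_coverage, read_coverage_needed); the examples use K=99 because Velvet only works with odd K and lowers an even value by one:

```shell
# kmer_coverage READ_COV READ_LEN K
# Prints Ck = C * (L - K + 1) / L, rounded to one decimal place.
kmer_coverage() {
    awk -v c="$1" -v l="$2" -v k="$3" \
        'BEGIN { printf "%.1f\n", c * (l - k + 1) / l }'
}

# read_coverage_needed TARGET_CK READ_LEN K
# Inverts the formula: C = Ck * L / (L - K + 1).
read_coverage_needed() {
    awk -v ck="$1" -v l="$2" -v k="$3" \
        'BEGIN { printf "%.1f\n", ck * l / (l - k + 1) }'
}
```

For example, "kmer_coverage 200 150 99" prints 69.3, so 200x read coverage of 150 bp reads gives roughly 69x K-mer coverage at K=99, and "read_coverage_needed 50 150 99" prints 144.2.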

Unknown, 17 June 2013 at 23:26
Hi Torsten,

As you suggested, I tried a K-mer value of around 100 and it worked for a metagenome (say 10 genomes). MetaVelvet also gave good results. The number of reads in this case was 1 million. But when I increased the number of reads to > 4 million the assembly degraded, even with a very large K-mer value. So apart from the K-mer value, is there any other parameter which needs to be varied?

Thanks
Ashwani

Torsten Seemann, 18 June 2013 at 11:14
It is difficult to know what is going on with your data. I'm not really sure what you mean by "assembly degrades"?

As you add more reads, more "lower proportion" organisms may pass the cutoffs that Velvet is using, and you will get more contigs.

Dave Wheeler, 8 August 2013 at 11:56
Just the info I was after - thanks!

With regard to the question about longer kmer values though (i.e. ~100), won't the increased probability of encountering a sequencing error start to create problems in the assembly? There will also be fewer overlaps that fit into this big kmer value. With the exception of long repeats, I would have thought inappropriate overlaps greater than 50 bp would be rare, so not much is gained anyway. Hope that makes sense!

Torsten Seemann, 8 August 2013 at 16:03
You are right that the probability of erroneous K-mers does increase with K. You can only increase K if you also increase your read coverage, so that we get enough correct K-mers (signal) to counteract the false K-mers (noise). Read clipping and read correction are often used to drastically reduce the number of false K-mers being put into the graph, which reduces memory usage and improves graph simplification.

For bacteria we usually have lots of coverage, sometimes > 1000x, so using higher K-mers improves the assembly in practice. The error rate of Illumina is well below 1% now, so false K-mers are less likely than one might first expect (after clipping). Some studies have shown that K=24 is "mostly unique" in genomes, but "mostly" does not mean every K-mer is unique.
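To put rough numbers on how false K-mers grow with K: under the simplifying assumption (mine, not a property of real Illumina data) that every base has the same independent error rate e, the fraction of K-mers containing no error is (1 - e)^K. The helper name below is made up:

```shell
# error_free_kmer_fraction ERROR_RATE K
# Prints (1 - e)^K to two decimals, assuming independent, uniform per-base errors.
error_free_kmer_fraction() {
    awk -v e="$1" -v k="$2" 'BEGIN { printf "%.2f\n", (1 - e) ^ k }'
}
```

With a 1% error rate, "error_free_kmer_fraction 0.01 31" gives 0.73 and "error_free_kmer_fraction 0.01 99" gives 0.37, which is why high K only pays off with plenty of coverage and clean reads. At 0.1% the K=99 figure rises to 0.91.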

Thanks for taking the time to comment.

Torst

Matteo Brilli, 16 October 2013 at 21:06
Which version added the -separate option for paired-end reads? In version 1.2.03 I get an unknown option error for it.

Dan, 9 May 2014 at 09:22
Hi Torsten, great post. One thing I'd point out is that Velvet does not compile with 3 categories (for 3 libraries) by default (it defaults to 2) so to use more than 2 libraries in an assembly, you need to compile velvet with the CATEGORIES option.

Hamdi Kitapci, 20 May 2014 at 06:17
Hi Torsten,
Thanks for this informative post. Do you know of any tool designed to trim adapters (and reverse the reads if necessary) specifically for Illumina mate-pair libraries? Depending on the junction adapter(s) present in the read, MPs might have different orientations, so reversing all the reads does not seem the best option. I could write my own code for this but I wanted to check first to see if there is anything available.

Thanks
Hamdi

bioinfo, 1 August 2014 at 06:05
Hi Torsten,
Thank you for such an informative post.
I am using MetaVelvet to assemble a metagenome generated by Ion Proton; there are 23 million SE reads.
I have used the following commands:
velveth runAll_a/ 21 -short ionExpress_nofilter.fasta

velvetg runAll_a/ -exp_cov auto

meta-velvetg runAll_a/

The last line of the output is:

Final graph has 4272026 nodes and n50 of 101, max 1932, total 178452348, using 19006699/23565054 reads

I am confused, or rather surprised, by this line. Aren't there too many contigs?
Also, when I ran the python script "scriptEstimatedCovMulti.py" I got the peak as [-1] and I am not sure what that is.
The coverage is also very low.
I would like to know if the method I am adopting is correct or I need to modify some steps to get improved results.
I will be thankful if any one could help me figure out if I am on right track.

Thanks!

Unknown, 29 September 2015 at 19:32
Hi Torsten,
Thanks for this informative post.
I am now using hybrid data of paired-end and mate-pair reads for an assembly. I want to perform error correction before assembling. Should I reverse complement the mate-pair reads before I correct the errors in all of my data? Or can I correct them first and then reverse complement the mate-pair data before assembling?

dryas, 31 October 2015 at 22:30
Hi Torsten,
I was wondering how using PE and MP reads of different lengths would affect the best k-mer size? If you have, say, a best k-mer size for the PE reads that is longer than the MP read length.

#IamKathir#InSearchingOfLife, 24 November 2015 at 17:21
Hi Torsten,

I started using Velvet recently and I want to learn all the basics. I hate the official manual that I downloaded from the official Velvet page. Could you please suggest any articles or another Velvet manual?

Best,

K.Kathir

song, 15 December 2015 at 14:53
Hi Torsten,
Recently I was trying to assemble my genome with 3 libraries. How do I determine the -ins_length option if I want to assemble with all the libraries? Is it necessary to add this option to improve my assembly results? If I assemble each library separately, how could I combine the results into one?
Looking forward to your reply

saravanan selvam, 24 October 2016 at 22:54
Hello Torsten,
You are really doing wonderful work by writing here. It really helps beginners in de novo assembly. I have one question: why should all 3 libraries (SE reads, PE reads and mate-pair reads) of an organism be combined and assembled? Will combining 3 libraries improve the assembly quality? Could you answer this for me?

Amaría, 15 April 2020 at 00:43
Could you recommend an adapter removal tool for mate-pair libraries before using Velvet?

Thanks!