Introduction
Illumina sequencing instruments (HiSeq, MiSeq, Genome Analyzer) can produce three main types of reads when sequencing genomic DNA:- Single-end
 Each "read" is a single sequence from one end of a DNA fragment. The fragment is usually 200-800bp long, with the amount being read can be chosen between 50 and 250 bp.
- Paired-end
 Each "read" is two sequences (a pair) from each end of the same genomic DNA fragment (more info). The distance between the reads on the original genome sequence is equal to the length of the DNA fragment that was sequenced (usually 200-800 bp).
- Mate-pair:
 Like paired-end reads, each "read" is two sequences from each end of the same DNA fragment, but the DNA fragment has been engineered from a circularization process (more info) such that the distance between the reads on the original genome sequence is much longer (say 3000-10000 bp) than the proxy DNA fragment (200-800 bp).
Single-end library ("SE")
When we got the original Illumina Genome Analyzer, all it could do was 36 bp single-end reads, and each lane gave us a massive 250 Mbp, and we had to walk 7 miles through snow in the dark to get it. Ok, that last bit is clearly false as we don't get snow in Australia and we speak metric here, but the point is that there is still plenty of legacy SE data around, and SE reads are still used in RNA-Seq sometimes. Let's imagine our data was provided as a standard FASTQ file called SE.fq:
velveth Dir 31 -short -fastq SE.fq
velvetg Dir -exp_cov auto -cov_cutoff auto
I strongly recommend enabling the -exp_cov auto and -cov_cutoff auto options. They will almost always improve the quality of your assemblies.
I strongly recommend enabling the -exp_cov auto and -cov_cutoff auto options. They will almost always improve the quality of your assemblies.
Paired-end library ("PE")
Paired-end reads are the standard output of most Illumina sequencers these days, currently 2x100bp for the HiSeq and 2x150bp for the GAIIx and MiSeq, but they are all migrating to 2x250bp soon. The two sequences per paired read are typically distributed in two separate files, the "1" file contains all the "left" reads and the "2" file contains all the corresponding "right" reads. Let's imagine our paired-end run gave us two files in standard FASTQ format, PE_1.fq and PE_2.fq:
velveth Dir 31 -shortPaired -separate -fastq PE_1.fq PE_2.fq
velvetg Dir -exp_cov auto -cov_cutoff auto
Previously you had to interleaved the left and right files for Velvet, but we recently added support to Velvet for the -separate option which we hope is now saving time and disk space throughout the Velvetsphere!
Mate-pair library ("MP")
Mate-pair reads are extremely valuable in a de novo setting as they provide long-range information about the genome, and can help link contigs together into larger scaffolds. They have been used reliably for years on the 454 FLX platform, but used less often on the Illumina platform. I think the main reasons for this are the poorer reliability of the Illumina mate-pair protocol and the larger amount of DNA required compared to a PE library.
We can consider MP reads as the same as PE reads, but with a larger distance between them ("insert size"). But there is one technical difference due to the circularization procedure used in their preparation. PE reads are oriented "opp-in" (L=>.....<=R), whereas MP reads are oriented "opp-out" (L<=.....=>R). Velvet likes its paired reads to be in opp-in orientation, so we need to reverse-complement all our MP reads first, which I do here with the EMBOSS "revseq" tool.
revseq -sequence MP_1.fq -outseq rcMP_1.fq -notag
revseq -sequence MP_2.fq -outseq rcMP_2.fq -notag
velveth Dir 31 -shortPaired -separate -fastq rcMP_1.fq rcMP_2.fq
velvetg Dir -exp_cov auto -cov_cutoff auto -shortMatePaired yes
Early Illumina MP libraries are often contaminated with PE reads (the so-called shadow library) which are the result of imperfect selection of biotin-marked fragments in the circularization process. There is a special option in velvetg (not velveth) called -shortMatePaired added by  Sylvain Foret which informs Velvet that a paired channel is MP but may contain  PE reads, which helps it to account for them and avoid mis-scaffolding. I recommend using this no matter how pure you think your MP library is.
Combining them all! (SE + PE + MP)
When de novo assembling multiple libraries in Velvet, you should order them from smallest insert size to largest insert size. For our case, this means the SE first, then the PE, then the MP. Each library must go in its own "channel" as it comes from a differently prepared DNA library. Channels are specified in Velvet with a numerical suffix on the read-type parameter (absence means channel 0):
velveth \
  Dir 31 \
  -short -fastq SE.fq \
  -shortPaired2 -separate -fastq PE_1.fq PE_2.fq \
  -shortPaired3 -separate -fastq rcMP_1.fq rcMP_2.fq 
velvetg \
  Dir \
  -exp_cov auto -cov_cutoff auto \
  -shortMatePaired3 yes
Note that the -shortMatePaired option has been allocated to channel 3 now (the -shortPaired3 library) as that is the MP channel.
Conclusions
It's relatively to get up and running with Velvet, but when your projects become more complicated, the methods in this post should help you. But if you prefer a nice GUI to take care of most of the issues discussed here, I recommend using our Velvet GUI called VAGUE (Velvet Assembler Graphical User Environment). 
 
That's a very informative post. Do you know how would you handle SOLiD reads? Reverse complimenting SOLiD reads is not the same as that for Illumina. Also, can you do an assembly with mate=paired reads alone?
ReplyDeleteIs VAGUE able to handle SOLiD data?
Great post Torsten. Do you have any actual stats on how much better this makes an assembly with real world data? PS. I cant wait for a decent miSEQ MP protocol, PE is rubbish for de novo assembly.
ReplyDeleteLinda, Velvet does a terrible job of SOLiD data. Your best option is to assemble a genome with illumina or 454 or torrent data, then overlay the SOLiD LMP data and use it to fix some errors, and to order your contigs.
De novo assembly of SOLiD colour space data is fraught with problems. All assembly software I know only works in base space. The SOLiD community site has details of a pipeline to convert/assemble/merge but I don't think it works very well.
ReplyDeleteYou are right in the the orientation of SOLiD mate pairs is different again, I think it is (L=> R=>).
Yes, you can do an assembly of just MP reads.
Ultimately de novo assembly works best with very long SE reads. PE and MP reads are a "hack" around this. The variability in the insert/fragment size makes linking PE/MP reads a bit tricky, although Pevzner is working on "rectangle graphs" for this sort of data.
I really like the '-separate' option! Is it true that this option is not documented in the manual?
ReplyDeleteThat is false. The -separate option is in the Velvet manual (Manual.pdf) since version 1.2.07.
ReplyDeletehttps://github.com/tseemann/velvet/raw/master/Manual.pdf
Just an FYI... The link you give above for the Velvet manual points to the August 29, 2008 version, which has no mention of the -separate flag.
DeleteHi Torsten,
ReplyDeleteThanks for the very informative post! It was just what (I thought) I needed. I'm trying to use Velvetoptimiser to assemble a small eukaryotic genome from two Illumia HiSeq libs- one PE and one MP.
Do you know if/how I can pass the Velvetg cmds. for the MP lib. as you did? It seems that one can pass most Velvet cmds. via Velvetoptimiser, but there is not a lot of documentation on doing something like this... at least I've yet to find it.
Thanks
Walt
Yes, you can pass through the velveth and velvetg options in this post to VelvetOptimiser with the -f and -o options. These are all described the the VOpt manual in the tarball you download. If you have any trouble, just email Simon Gladman with the command line that failed. His email is in the manual.
ReplyDeleteBut first see if you can assemble without VOpt for a single K value. If you can't get that working, there is no point putting it into VOpt.
Thanks, Toresten, that's exactly what I've been doing- assembling w/o VO. I also just tried to use the -separate flags (I had shuffled data before), but apparently the Velvet version on our cluster (1.2.01-62-mp) is way out of date and doesn't support that, which probably explains why I was having issues passing some cmds. with VO. For example, using -shortpaired2, which won't work, but -shortpaired3 will... go figure!
DeleteThanks again!
Hi Torsten, thanks for you informative post
ReplyDeleteone little thing:
when using revseq you have to specify both input and output format correctly and explicitely, otherwise (e.g. with your code) you end up with a fasta file...
revseq -sequence MP.fq -sformat1 fastq-sanger -outseq rc_MP.fq -osformat2 fastq-sanger -notag
would be correct for newer illumina data.
Cheers
Harald,
ReplyDeleteThanks for the information on EMBOSS revseq. To be honest, I don't use revseq - I use my own tool - but I wanted my post to be portable. I should have tested it properly beforehand.
Torsten.
This comment has been removed by the author.
ReplyDeleteThis comment has been removed by the author.
ReplyDeleteHi,
ReplyDeleteSo I am trying to assemble a low complexity genome. The simulated paired end reads (illumina Miseq) which I am getting out of metasim shows a good genomic coverage. But it gives just one fasta file which I am using as an input for velveth.
My aim is to get a good contig length distribution which I am not getting. I guess I am doing something wrong in giving input parameters. Is it that velvet does not accept a single fasta file for paired read information? Do I have to input 2 different files (forward read and reverse read files)?
Thanks
Ashwani
Velvet can accept paired-end reads in two ways. The first is as separate left and right files (via the -separate option). The second is via a single interleaved/shuffled file (default).
ReplyDeleteA large low complexity genome will never give a good assembly using only paired end reads because there are too many repeats which can not be resolved.
Thanks Torsten...
DeleteHi Torsten,
DeleteIs it like velvet performs best with very short reads (say 36-75bp)?What about reads longer than 150bp, because when I vary the read length to a higher region (above 150bp), the assembly degrades.
Thanks
Ashwani
I use Velvet with 150bp and even 250bp MiSeq reads without any problems. With these length reads the best K-mer value is around K=100 rather than K=31 like it used to be with very short 36bp reads. But to achieve K=100, you still need to ensure your depth is high enough to make the K-mer coverage (not read coverage) high enough (50-100).
ReplyDeleteThanks Torsten!!
DeleteI am using velvet to assemble fungal genome (~30 million reads). I have compiled velvet with 'MAXKMERLENGTH=165' and 'openmp=1' to handle my 150bp reads. I am getting memory error ( velveth: Can't malloc 207 chars: Cannot allocate memory). Below is my script that shows the memory I requested.
Delete#PBS -q standard
#PBS -l jobtype=cluster_only
### Set the number of cpus that will be used.
#PBS -l select=1:ncpus=4:mem=8Gb:pcmem=4gb
#PBS -l pvmem=4gb
#PBS -l cput=30:0:0
#PBS -l walltime=24:0:0
source /usr/share/Modules/init/bash
module load unsupported
module load nirav/velvet/1.2.10
#Get things ready to use threaded (OMP) version of Velvet
#Set OMP_THREAD_LIMIT--will be the same as ppn above
export OMP_THREAD_LIMIT=4
#Set OMP_NUM_THREADS--will be 1 lower than ppn
NUM_THREADS=3
export OMP_NUM_THREADS=$NUM_THREADS
# Write something to record what thread limits were used.
echo Limiting Velvet to $PBS_NP threads total with $NUM_THREADS slave threads.
cd $PBS_O_WORKDIR/1st_assembly
velveth fungal 125,155,10 -fastq.gz -shortPaired -separate /rsgrps2/sjmiller/Illumina_JeeHan/160224_M03132_0042_000000000-ALYFG_Analysis_fastq_files/Parasite_S1_L001_R1_001.fastq.gz \
/rsgrps2/sjmiller/Illumina_JeeHan/160224_M03132_0042_000000000-ALYFG_Analysis_fastq_files/Parasite_S1_L001_R1_001.fastq.gz
I know I am requesting way too less memory but my question is what is appropriate memory that could handle my dataset?
Thanks,
M
This comment has been removed by the author.
DeleteHi Torsten,
ReplyDeleteAs stated by you I tried k mer value of around 100 and it worked for a metagenome (say 10 genomes). The meta velvet also gave good results. The number of reads in this case was 1 million. But as I increased the read number to > 4 million the assembly degrades, even if I use a very large k mer value. So apart from K mer is there any other parameter which needs to be varied?
Thanks
Ashwani
It is difficult to know what is going on with your data. I'm not really sure what you mean by "assembly degrades"?
ReplyDeleteAs you add more reads, more "lower proportion" organisms may pass the cutoffs that Velvet is using, and you will get more contigs.
Hi,
DeleteThanks for the reply!!
To be more specific I am aiming at a good contig length distribution. Say if I use 4 million reads, at k mer size around 200 (my read length is 220bp), I am getting a decent contig distribution but not a good one.
Now if I think of increasing the read count further, I am not sure what parameters apart from k mer size I should vary. How the k mer coverage in these scenarios affect contig length distribution?
Thanks
Ashwani
Just the info I was after - thanks!
ReplyDeleteWith regard to the question about longer kmer values though (ie ~100), won't the increased probability of encountering a sequencing error start to create problems in the assembly. There will also be less overlaps that fit into this big kmer value. With the exception of long repeats, I would have thought inappropriate overlaps greater than 50 bp would be rare, so not much is gained anyway. Hope that makes sense!
You are right that the probability of erroneous K-mers does increase with K. You can only increase K if you also increase your read coverage too, so that we get enough correct K-mers (signal) to counteract the false K-mers (noise). Read clipping and read correction are often used to drastically reduce the number of false K-mers being put into the graph, which reduces memory usage and improves graph simplification.
ReplyDeleteFor bacteria we usually have lots of coverage, sometimes > 1000x so using higher K-mers improves the assembly in practice. The error rate of Illumina is well below 1% now, so false K-mers are less likely than one might first expect (after clipping). Some studies have shown that K=24 is "mostly unique" in genomes, but mostly doesn't make them all unique.
Thanks for taking the time to comment.
Torst
Thanks, agreed - Cheers
DeleteIn which version there is the possibility of using the -separate option for paired-ends reads? in Version 1.2.03 I get an unknown option error for that.
ReplyDeleteHi Torsten, great post. One thing I'd point out is that Velvet does not compile with 3 categories (for 3 libraries) by default (it defaults to 2) so to use more than 2 libraries in an assembly, you need to compile velvet with the CATEGORIES option.
ReplyDeleteAs per manual: make 'CATEGORIES=57' (in the unlikely event you have 57 separate libraries to assemble)
DeleteHi Torsten,
ReplyDeleteThanks for the this informative post. Do you know any tool designed to trim (and reverse the reads if necessary) for adapters specifically for illumina mate-pair libraries ? Based on the junction adapter(s) present in the read MPs might have different orientation so reversing all the reads seems not the best option. I could write my own code for this but I wanted to check first to see if there is anything available.
Thanks
Hamdi
Hi Torsten,
ReplyDeleteThank you for such an informative post.
I am using metavelvet to assemble metagenome generated by Ion Proton, there sre 23 million SE reads.
I have used the following commands:
velveth runAll_a/ 21 -short ionExpress_nofilter.fasta
velvetg runAll_a/ -exp_cov auto
meta-velvetg runAll_a/
The last line of the output is:
Final graph has 4272026 nodes and n50 of 101, max 1932, total 178452348, using 19006699/23565054 reads
I am confused or rather surprised with this line. Isn't the contigs are too many.
Also, when I ran the python script "scriptEstimatedCovMulti.py" I am getting peak as[-1] and I am not sure what that is.
The coverage is also very low.
I would like to know if the method I am adopting is correct or I need to modify some steps to get improved results.
I will be thankful if any one could help me figure out if I am on right track.
Thanks!
Your result is not surprising. Velvet is not well suited to Proton data, it works best with Illumina. Also Velvet is not designed for meta-genomes, which have a mixture of species of different coverages.
DeleteYou should consider a more specific tool, eg. Meta-Velvet or Ray-Meta. The Khmer project also has a protocol for assembling meta-genomes: https://khmer-protocols.readthedocs.org/en/latest/metagenomics/
Thank you for the reply.
DeleteI have used Meta-Velvet here at the last step, but the results or the contigs which are produced after it and after velvetg are same in number.
I tried the same approach with Miseq data but it looked same.
I will try Ray-Meta.
Thanks!
Hi Torsten,
ReplyDeleteThanks for the this informative post.
I now using hybrid data of pair-end and mate-pair reads to assemble. I want to perform error correct before assemble. Should I reverse complement the mate-pair reads before I correct the error for all of my data. Or can I correct them and then reverse complement the mate-pair data before assemble?
Trim the reads first, then correct the reads, then reverse complement for the assembler last.
DeleteOk, thanks a lot!
DeleteHi Torsten,
ReplyDeleteI was wondering how using PE and MP reads of different lengths would affect the best k-mer size? If you have, say, a best k-mer size for the PE reads that is longer than the MP read length.
Most modern assemblers no longer use a single k-mer value. They combine assemblies of multiple k-mer values.
DeleteYou may want to assemble just the PE reads first into contigs, then use an external scaffolder to use the MP reads to join and order contigs.
Hi Torsten,
ReplyDeleteI started to do velvet recently, i want to know all the basic things about Velvet. i hate the official manual which one is i downloaded from the Velevt official page. Could you pls sugeest me any of the Articles and Velvet manual...
Best,
K.Kathir
Hi Torsten,
ReplyDeleterecently i was trying to assemble my genome with 3 libraries,and how to determin the -ins_length option if i want to assemble with all the libraries? Is it necessary ao add this option to improve my assembly results? if i assemble with separate library ,how could i combine these results into one ?
Looking forward to your reply
Hello Torsten,
ReplyDeleteYou are really doing a wonderful work by writing here . It really helps the beginners in de novo assembly.Here, i have one question that why all 3 libraries (SE reads, PE reads and Mate PE reads) of an organism should be combined and assembled? Combining 3 libraries will improve the assembly quality? could you answer me this?
De novo genome assembly would be easy if there were no repeats in the genome, and no errors in the reads. We can improve error rate with better technology, but we can never change the genomes. Repeats produce ambiguous hubs in the assembly graph. The only way to untangle a repeat of length L is to have reads, or "read spans" that are LONGER than L bp. That's why Mate Pair reads are used, as they span up to 40 kbp of the genome, wherease Paired End reads only reach ~600bp.
DeleteYou might like to read some of the slides I have given on this topic:
http://www.slideshare.net/torstenseemann/presentations
Could you provide a removal adapter for Mate pair library before use velvet
ReplyDeleteThanks!
It depends on which mate pair protocol and adaptors you used. Look at `nxtrim`.
DeleteAll thanks to Dr OLIHA for curing my herpes virus/hpv with his herbal medicine, i do not have much to say but with all my life i will forever be grateful to him and God Almighty for using Dr OLIHA to reach me when i thought it was all over, today i am happy with my life again after the medical doctor have confirmed my HERPES SIMPLEX VIRUS / HPV of 5 is gone,i have never in my life believed that HERPES SIMPLEX VIRUS could be cured by herbal medicine. so i want to use this means to reach other persons who have this disease by testifying the power of Dr OLIHA that all hope is not lost yet, try and contact him by any means for any kind of disease with his email: oliha.miraclemedicine@gmail.com add him on whatsapp line or call +2349038382931.
ReplyDeleteHi Admin, I hope are fine. I'm your one of the biggest fans, I like to read your blogs and I've also shared your blogs with my family members and friends. I hope in future you'll publish more blogs like your old ones. Please accept my thanks on my behalf of me and from my family and friends.
ReplyDelete1947 Housing Islamabad
Capital Smart City
Lahore Smart City
New Metro City Gujar Khan
Skip Hire Near Me
1947 Housing Islamabad
Thank you Dr lacoco for saving my relationship and brought my Lover back to me. Me and my Lover are living together happily again.. All thanks to Dr lacoco. If you have any problem contact Dr lacoco now, and I guarantee you that he will help you. Contact him via: Email: Drlacocospellhome@gmail.com WhatsApp +2347035833962
ReplyDelete