Wednesday, 21 November 2012

Cleaning Illumina FASTQ reads with Nesoni clip:


Cleaning FASTQ reads is the process of removing those bits of the reads that you don't deem good enough to be given to the next stage of your pipeline. At worst, this could mean removing the whole read, and if the reads were paired, this means some reads will become "orphan" single reads. The cleaning process is often called filtering, trimming, clipping, or pruning. A FASTQ read has three parts: a sequence ID, a sequence of bases, and a quality score for each base. The cleaning process only has access to this information.
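For reference, a FASTQ entry is four lines per read: the ID, the bases, a "+" separator, and the quality string (one character per base). A minimal made-up example:

```
@read_1
GATTTGGGGTTCAAAGCAGT
+
IIIIIIIIIIIIIIIIIIII
```

The quality string must be exactly the same length as the sequence.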


In terms of the sequence itself, we may want to discard sequences which contain ambiguous DNA symbols, most commonly "N" which means the base could not be called. Generally, reads with anything other than A,T,G,C are going to cause problems with most downstream analysis tools.

Preparing DNA for sequencing almost always involves ligating "adaptor" sequences to all the fragments of DNA. When things don't go perfectly, these adaptors can end up being sequenced, and be partially present at the start/end of your reads! This can wreak havoc with downstream analyses like read alignment and de novo assembly. This mainly seems to occur with Illumina reads, especially mate-pair libraries and some RNA-Seq protocols. You should always check and remove these adaptor sequences.


Usually when we think of cleaning FASTQ reads, we mean trimming on quality. Each base in a FASTQ read gets a Phred quality score which is an integer between 0 and 93 (as encoded in the Sanger FASTQ format) although you rarely see more than "Q45" or so. Low numbers are bad, high numbers are good. The numbers are on a logarithmic scale, and represent the probability of this base not being correct. For example, a base of Q30 is expected to be wrong about 1 in 1000 times, or we are 99.9% confident in it. This means that in the millions of Q30 bases throughout all your reads, thousands of them are expected to be wrong! When we get down to Q10, one in ten bases are dodgy. Like any analysis, the "garbage in, garbage out" rule applies, and hence removing low quality reads is often a good strategy, especially if you don't know how well the downstream tool handles errors.
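The relationship is p(error) = 10^(-Q/10). A throwaway awk calculation (not part of Nesoni) tabulates a few values:

```shell
# Phred: p(error) = 10^(-Q/10); tabulate a few common values
for Q in 10 20 30 40; do
  awk -v q="$Q" 'BEGIN { printf "Q%d -> error probability %g\n", q, 10^(-q/10) }'
done
# Q10 -> error probability 0.1
# Q20 -> error probability 0.01
# Q30 -> error probability 0.001
# Q40 -> error probability 0.0001
```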

In Illumina reads the quality tends to get monotonically worse at the 3' end of the reads. Simply removing bases below a quality threshold T from the end is a common form of quality trimming. Alternatively you could average quality across a small window (3 bases say) and threshold that. Taking that idea to the limit, you could simply remove whole reads with average quality < T. There are many variations on this theme, and later I will discuss the approach we take.


You probably think I'm crazy suggesting we can use the sequence ID to filter reads, right? In general, of course it does not make sense, however Illumina reads do contain some information in their read IDs. The most relevant is the series of colon-separated integers, some of which encode the coordinates of the cluster on the flowcell lane which generated this sequence. When we got the original Illumina Genome Analyzer (version 1!) it was clearly producing bad reads that came from the edges of the flowcell, due to poorer optical performance and other physical effects. You can still see this effect today when you open a FASTQ file, whereby lots of the reads at the start of the file are full of "N"s and have low quality - these come from the first tile in the corner of the flowcell lane. I'm not advocating using this information in reality, but it is interesting to keep in mind, and may actually be useful to someone if you were given a FASTA file where the quality information had been stripped away.
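For example, a pre-CASAVA-1.8 style Illumina read ID (a typical made-up example) packs the coordinates in like this:

```
@HWUSI-EAS100R:6:73:941:1973#0/1
```

Here "HWUSI-EAS100R" is the instrument name, 6 the flowcell lane, 73 the tile number within the lane, and 941:1973 the x:y cluster coordinates within the tile; "#0" is the index (barcode) and "/1" the member of the pair.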

What is Nesoni?

Nesoni (github) is our Swiss Army knife of NGS related tools, implemented primarily by Paul Harrison in our group.  It began as a wrapper for aligning reads and calling SNPs for the 100s of bacterial genome samples we were processing, but has now evolved into an extendible, distributable pipeline system which we hope to publish soon and actually document. I'll save the full description of Nesoni for another day, and today just focus on one of the simplest but still very useful tools it has.

The Nesoni clip: tool

If you just type "nesoni clip:" you will get this descriptive help information:

    --adaptor-clip yes
      # Do adaptor clipping?
    --adaptor-file ...
      # FASTA file to read adaptors from. 
      # Defaults to a built-in list of Illumina adaptors.
    --match 10
      # Minimum length adaptor match.
    --max-errors 1
      # Maximum errors in adaptor match. Note: slow for N > 1
    --clip-ambiguous yes
      # Clip ambiguous bases, eg N
    --quality 10
      # Quality cutoff
    --qoffset NNN
      # Quality character offset: sanger: 33, solexa: 59, illumina: 64
      # Will guess if not given.
    --length 24
      # Reads shorter than this will be discarded.
    --homopolymers no
      # Set to yes to *remove* reads containing all the same base.
    --trim-to NNN
      # Trim ends of reads so that they are no longer than NNN bases,
      # irrespective of quality.
    --trim-start 0
      # Trim NNN bases from start of each read irrespective of quality.
    --trim-end 0
      # Trim NNN bases from end of each read irrespective of quality.
    --revcom no
      # Reverse complement all reads.
      # ie convert opp-out pairs to opp-in
    --fasta no
      # Output in fasta rather than fastq format.
    --gzip yes
      # Gzip output.
    --rejects no
      # Output rejected reads in separate file.
    --out-separate no
      # Output paired reads in separate left/right files.
      # Prefix for output files.
    reads: ...
      # Files containing unpaired reads.
    interleaved: ...
      # Files containing interleaved read pairs.
    pairs: ...
      # Pair of files containing read pairs.

It contains three main sections:

  1. The first is the standard getopt-long "--option" style options. 
  2. The second is the mandatory prefix for the output files. 
  3. The third is the specification of all the input reads.


The standard Illumina paired-end run gives you two files for the left and right reads - here Sample_R1.fastq.gz and Sample_R2.fastq.gz.


By default, Nesoni will quality clip at Q10, minimum length 24, and will remove all known Illumina adaptors from TruSeq and Nextera:

nesoni clip: clipped pairs: Sample_R1.fastq.gz Sample_R2.fastq.gz

The "clipped" argument is the mandatory output file prefix. The "pairs:" clause tells it that you are feeding it separate left and right files. It can optionally take interleaved pairs with the "interleaved:" clause. By default, the output is three files - the clean pairs (interleaved), any orphan reads, and an extensive log file which explains how many reads failed the various filters (adaptors, quality, length).


You probably want your output as separate left/right files, and uncompressed, so as to be compatible with most other software. Nesoni can do this with two extra options:

nesoni clip: clipped \
  --out-separate yes \
  --gzip no \
  pairs: Sample_R1.fastq.gz Sample_R2.fastq.gz

This gives four files:


 Imagine you had five libraries, some FASTQ, some FASTA, some GZIP, some not:

A_R1.fastq.gz and A_R2.fastq.gz
B_R1.fq and B_R2.fq
C_interleaved.fasta
D_single.fa
E_proton.fq.gz

You can clip a whole bunch of mixed FASTQ files all in one command too. Just add more input file clauses like this:

nesoni clip: clipped \
  --out-separate yes \
  --gzip no \
  --length 50 \
  pairs: A_R1.fastq.gz A_R2.fastq.gz \
  pairs: B_R1.fq B_R2.fq \
  interleaved: C_interleaved.fasta \
  reads: D_single.fa \
  reads: E_proton.fq.gz 

This will process all the input files, and give a clean result combined into just four files:


Pretty neat. As a general first pass for 2 x 150bp HiSeq (RapidRun) or 2 x 250bp MiSeq runs I use the following options:

nesoni clip: clipped --length 100 --quality 20 --out-separate yes --gzip no 

I hope this gave you a flavour of the flexibility of Nesoni's clip: module!


Pros:
  • Tries to keep as much of every read as possible
  • Maintains proper left/right ordering in paired read output
  • Can output interleaved or separate left/right
  • Outputs orphan singletons and rejected reads
  • Supports gzip compression for input and output
  • Can feed it all your read files at once - multiple pair sets, singles, interleaved
  • Automatic detection of quality encoding (Sanger 33, Illumina 64, Solexa 59)
  • Extensive reporting on Illumina adaptors for "forensic" work

Cons:
  • Not a single clean binary or script, need whole Nesoni package
  • Slower than a native compiled implementation would be
  • Not packaged as a .deb or .rpm for BioLinux 
  • Not in the Galaxy Toolshed



    Sunday, 11 November 2012

    Tools to merge overlapping paired-end reads


    In very simple terms, current sequencing technology begins by breaking up long pieces of DNA into lots more short pieces of DNA. The resultant set of DNA is called a "library" and the short pieces are called "fragments". Each of  the fragments in the library are then sequenced individually and in parallel. There are two ways of sequencing a fragment - either just from one end, or from both ends of a fragment. If only one end is sequenced, you get a single read. If your technology can sequence both ends, you get a "pair" of reads for each fragment. These "paired-end" reads are standard practice on Illumina instruments like the GAIIx, HiSeq and MiSeq.

    Now, for single-end reads, you need to make sure your read length (L) is shorter than your fragment length (F) or otherwise the sequence will run out of DNA to read! Typical Illumina fragment libraries would use F ~ 450bp but this is variable. For paired-end reads, you want to make sure that F is long enough to fit two reads. This means you need F to be at least 2L. As L=100 or 150bp these days for most people, using F~450bp is fine, there is still a safety margin in the middle.

    However, some things have changed in the Illumina ecosystem this year. Firstly, read lengths are now moving to >150bp on the HiSeq (and have already been on the GAIIx), and to >250bp on the MiSeq, with possibilities of longer ones coming soon! This means that the standard library size F~450bp has become too small, and paired end reads will overlap. Secondly, the new enzymatic Nextera library preparation system produces a wide spread of F sizes compared to the previous TruSeq system. With Nextera, we see F ranging from 100bp to 900bp in the same library. So some reads will overlap, and others won't. It's starting to get messy.
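The arithmetic behind this: two reads of length L from a fragment of length F overlap by 2L - F bases whenever F < 2L. A throwaway calculation (not from any particular tool):

```shell
# overlap length = 2L - F if positive; e.g. 2x250bp reads from a 450bp fragment
awk -v L=250 -v F=450 'BEGIN { o = 2*L - F; print ((o > 0) ? o " bp overlap" : "no overlap") }'
# prints: 50 bp overlap
```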

    The whole point of paired-end reads is to get the benefit of longer reads without actually being able to sequence reads that long. A paired-end read (two reads of length L) from a fragment of length F, is a bit like a single-read of length F, except a bunch of bases in the middle of it are unknown, and how many of them there are is only roughly known (as libraries are only nominally of length F, each read will vary). This gives the reads a longer context, and this particularly helps in de novo assembly and in aligning more reads unambiguously to a reference genome. However, many software tools will get confused if you give them overlapping pairs, and if we could overlap them and turn them into longer single-end reads, many tools will produce better results, and faster.

    The tools

    Here is a list of tools which can do the overlapping procedure. I am NOT going to review them all here. I've used one tool (FLASH) to overlap some MiSeq 2x150 PE reads, and then assembled them using Velvet, and the merged reads produced a "better" assembly than with the paired reads. But that's it. I write this post to inform people of the problem, and to collate all the tools in one place to save others effort. Enjoy!

    Features to look for

    • Keeps original IDs in merged reads
    • Outputs the un-overlapped paired reads
    • Ability to strip adaptors first
    • Rescores the Phred qualities across the overlapped region
    • Parameters to control the overlap sensitivity
    • Handle .gz and .bz2 compressed files
    • Multi-threading support
    • Written in C/C++ (faster compiled) rather than Python/Perl (slower)

    Tuesday, 6 November 2012

    Sorting FASTQ files by sequence length

    The problem

    The FASTQ file format is used to store short sequence reads, most commonly those from Illumina. The reads are typically between 36 and 250 bp long, and typical data sets contain between 1M and 1B reads. Each entry in a FASTQ file contains three pieces of information: the sequence ID, the DNA sequence, and a quality string. I recently read through the documentation for the khmer package, and noticed a comment stating that digital normalisation would work better if the FASTQ reads were in order from longest to shortest.

    I googled for some FASTQ sorting information. I found some posts on sorting by the sequence itself to remove duplicates (Justin's blog), and some posts on sorting on ID to help with finding displaced pairs (BioStar), but nothing on length sorting. There were some posts related to sorting FASTA files (Joe Fass) but it read the whole file into RAM and didn't use a Schwartzian transform.

    Some solutions

    Our first attempt was to use the paste trick I blogged about previously, where in.fq contains about 10M clipped reads with lengths between 24 and 100 bp:

    cat in.fq |
      paste - - - - |
      perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;' |
      sort -n |
      cut -f2- |
      tr "\t" "\n" > out.fq

    This turns a 4-line FASTQ entry into a single tab separated line, adds a column with the length of each read, passes it to Unix sort, removes the length column, and converts it back into a FASTQ file. And it works faster than I thought it would. About 2 minutes for a 10M read file. The whole process is I/O bound, so the faster your disks the better. The sort usually uses /tmp so if that is on a different disk unit  ("spindle" for my more mature readers) to your input and/or output files, you'll get better performance. There are various chunk parameters etc you can change on sort, but it probably doesn't make too much difference.

    Unix sort breaks the input into chunks, sorts each of those, then merges all the chunks in the output (merge sort). Another possibility which avoids explicit sorting is to read through the input, and put reads into separate files based on their read length (a simplistic radix sort). We can do this because we have a finite number of read lengths. The final output simply concatenates all the bin files in order. Here is an implementation in Perl, which is about 2x faster on average:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Fatal;
    use File::Temp qw(tempdir);

    my $dir = tempdir(CLEANUP=>1);
    my @line;
    my @fh;

    while ( ! eof() ) {
      # read the four lines of the next FASTQ entry
      $line[$_] = scalar(<>) for (0..3);
      my $len = length($line[1]) - 1;   # -1 for the trailing newline
      if ( not defined $fh[$len] ) {
        open $fh[$len], '>', "$dir/$len";
      }
      print {$fh[$len]} @line;
    }

    # emit the bins in order of increasing read length
    for my $len (0 .. $#fh) {
      next unless defined $fh[$len];
      close $fh[$len];   # flush buffered writes before cat reads the file
      system("cat", "$dir/$len");
    }

    Here are some timing measurements across three files:

    10 million reads
    Bash real 1m42.407s user 1m57.863s sys 0m52.479s
    Perl real 0m51.698s user 0m35.626s sys 0m9.897s
    20 million reads
    Bash real 3m58.520s user 4m2.155s  sys 1m58.167s
    Perl real  1m39.824s user 1m12.765s sys 0m20.409s
    30 million reads
    Bash real 5m53.346s user 6m16.736s sys 2m51.483s
    Perl real  2m23.200s user 1m46.815s sys 0m30.754s

    On face value, it appears the shell version ("Bash") is behaving in an O(N log N) fashion, whereas the Perl implementation seems to be O(N) linear, which is somewhat expected given it reads all the data, writes all the data, then reads and writes it all again. More testing and thinking is needed.


    Sorting huge FASTQ files is feasible. If applications like diginorm become mainstream and sorting clipped FASTQ files makes them work better, then this is handy to know. Nothing groundbreaking here, and I may have repeated other people's work, but I hope it is useful to somebody.

    Saturday, 20 October 2012

    The joy of academic conference spam

    Once you get your name on a few papers, the influx of academic spam increases. I probably get at least two of these spams a day in my inbox (not counting the ones that are diverted by Google's spam filter). I suspect those academics higher up the chain get even more. The most common ones are requests for conference attendance, session chairing, and reviewing; closely followed by international students and graduates looking for jobs or Ph.D. placements. They are nearly always totally unrelated to what I work on (computational genomics).

    I got a classic example today:

    Dear Torsten Seeman: 

    Umm, you spelled my surname incorrectly. Sorry for misleading you with the correct spelling in my email address, websites, blog, and Twitter account.

    The four sponsoring societies of DDW invite you to submit an abstract for presentation at DDW 2013 Scientific Sessions, to be held Saturday, May 18-Tuesday, May 21, 2013 in Orlando, FL. The DDW 2013 online abstract submission site is now open and will close on Saturday, December 1, 2012 at 9 p.m. ET. 

    So it's in "Orlando, FL". Because I am Australian, I do have some idea of what else is in the world, and can deduce it is in the USA. But you are ignorant if you think the whole of the world knows all your two-letter USPS state codes! eg. CA = California or Canada? And unlike you, I know Canada is a country and not just a French speaking state of the USA.

    <snip>DDW 2013 Abstract Submission Site: 
    Poster sessions and DDW programming will start on Saturday, May 18, 2013. If your abstract is accepted, it may be scheduled for presentation on Saturday, Sunday, Monday or Tuesday. 
    DDW Administration

    Hmm, you still haven't explained what the hell "DDW" is. Your email domain is giving me a hint, but that was the last line of the email. OK, I'll click on the link to find out - I can't help myself. "Digestive Disease Week" eh? Thanks for giving me the shits.

    [Report Spam] clicked.

    Monday, 8 October 2012

    Building a bioinformatics server on a budget in 2012


    As more labs get into the "next generation sequencing" game, they are finding themselves unwittingly entering into the "high performance computing" game too. Some lucky labs will be able to access a local or national HPC service, or collaborate with those who do, or pay for access to one like Amazon EC2. But ultimately any bioinformatician will want (and need) a "general server" to experiment, develop, and test on. This is particularly true in terms of large, fast storage which is usually in short supply or costly in HPC environments.


    With the small lab in mind, here I outline how, for about A$2500/US$2600, you can build your own server with 6/12 cores, 64GB RAM, and 12TB RAID storage from affordable commodity parts:

    • CPU
      • Intel CORE i7 3930K - LGA 2011 
      • $580
      • The Intel CPUs are about 40% faster for the same clock rate than the equivalent AMD. This is a 3.2 GHz 6 core CPU (12 threads) and can go to 3.8 GHz when only using some of the cores. It has good RAM bandwidth too.
    • CPU Cooler
    • Motherboard
      • Gigabyte GA-X79-UP4 Intel Mainboard - LGA 2011 
      • $269
      • This motherboard was chosen as it takes the new LGA2011 CPU, has 8 DIMM slots, and 8 SATA ports, including 6.0 Gbps SATA3 ones which we will want for the SSD boot disk described later. We will save money by using Linux software RAID.
    • RAM
    • Root disk
      • OCZ Agility 3 Series 120GB 2.5" SSD Sata 6Gbs 
      • $170 (2 drives)
      • For the boot/root disk, we will use 120GB SSD instead of mechanical hard disk, for speed. Two disks in RAID1 (mirror) configuration. This would store the operating system, and any important databases eg. BLAST indices, SQL etc.
    • Data disk
      • Seagate Barracuda 3TB 
      • $924 (6 drives)
      • In bioinformatics you can never have enough disk space! Here I suggest using six 3TB 7200rpm drives in RAID6 arrangement for a total of 12TB storage. Reading will be fast, but writing a bit slower. We need to compromise on a budget.
    • Case
      • Antec P280 
      • $135
      • To hold all this stuff, this case is great value. It has good ventilation, which is crucial for keeping all the disks and the CPU cool.
    • Power supply
      • Antec TruePower New Series 650W 
      • $120
      • The case doesn't come with a power supply. I've chosen this one because it has 9 SATA power connectors for the disks, and some unneeded cable sets can be removed in a modular fashion.
    • Operating system
      • Ubuntu Server
      • $0
      • Ubuntu Server edition is pre-tuned for server settings as opposed to interactive desktop use. Because it is Debian based, you have access to far more packaged bioinformatics applications than CentOS, including those in the BioLinux project.
    • Miscellaneous
      • Extra 120mm cooling fans (for the hard disk zones)
      • Extra SATA3 cables (ones with locking clips)
      • CPU thermal paste (the grey stuff)
      • PCIe graphics card (for the screen, any old one will do)


    For about A$2500/US$2600 and a little bit of Linux know-how, you can build your own "high-end" server. It is missing some of the features of commercial servers from Dell etc (eg. hardware RAID controller, IPMI remote management, hot-swap disks) but it is much more affordable.

    Friday, 21 September 2012

    Problematic FASTQ output from Ion TorrentSuite 3.0

    Yesterday we got an email from a Nesoni user who said that the SHRiMP aligner was failing on his FASTQ data. Then again today we got another similar report from our trusted colleague Tim Stinear with the same issue. The evidence was mounting for a bug either in Nesoni or in the FASTQ file, rather than user error. Tim had recently upgraded his PGM Sequencer to Torrent Suite v3.0 (point zero release alarm bells!), and Nesoni saved the shrimp_log.txt file and it contained this:

    note: quality value format set to PHRED+33
    done r/hr r/core-hr
    The qv-offset might be set incorrectly! Currenty qvs are interpreted as PHRED+33 and a qv of 62 was observed. To disable this error, etiher set the offset correctly or disable this check (see README).

    Wow! The PGM has improved dramatically, calling some bases at Q62, that's better than 1 in 1 million error rate... Here's a breakdown of the Q values in the file:
    Symbol ASCII Q+33 Frequency
    + 43 10 1105774
    , 44 11 347753
    - 45 12 1167099
    . 46 13 673276
    / 47 14 137220
    0 48 15 225893
    1 49 16 1621714
    2 50 17 1731775
    3 51 18 2726736
    4 52 19 4280447
    5 53 20 4272951
    6 54 21 2556953
    7 55 22 7783535
    8 56 23 5153631
    9 57 24 2362016
    : 58 25 2406869
    ; 59 26 2517450
    < 60 27 5762153
    = 61 28 4334469
    > 62 29 7109066
    ? 63 30 11113780
    @ 64 31 13934507
    A 65 32 9227417
    B 66 33 12758868
    C 67 34 8228985
    D 68 35 9935410
    E 69 36 1459950
    F 70 37 2358692
    G 71 38 682190
    H 72 39 158322
    I 73 40 168311
    J 74 41 269
    K 75 42 199121
    L 76 43 83457
    M 77 44 4971
    N 78 45 464143
    O 79 46 8657
    P 80 47 0
    Q 81 48 0
    R 82 49 0
    S 83 50 0
    T 84 51 0
    U 85 52 0
    V 86 53 0
    W 87 54 0
    X 88 55 0
    Y 89 56 0
    Z 90 57 0
    [ 91 58 0
    \ 92 59 0
    ] 93 60 0
    ^ 94 61 0
    _ 95 62 746
    Ok, there really are 746 bases with Q62. Some grep work shows me they are all occurring alone in 746 reads, all in position 1 in the read, like this:

    The table also shows quite a few >Q41 bases, which aren't typically seen in FASTQ files. These ones are probably ok (?) but the Q62 ones surely must be some artifact or bug in the version 3.0 of Torrent Suite.
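A breakdown like the table above can be produced with a quick awk tally over the quality lines (a sketch of the idea; the original command is not shown, and reads.fq is a placeholder filename):

```shell
# count each quality character in a FASTQ file (every 4th line is the quality string)
awk 'NR % 4 == 0 { for (i = 1; i <= length($0); i++) n[substr($0, i, 1)]++ }
     END { for (c in n) print c, n[c] }' reads.fq | sort
```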
    In the meantime, our solution has been this:
    sed 's/^_/H/' < dodgy.fastq > fixed.fastq
    I'd be interested to see if others have encountered this problem.

    Saturday, 8 September 2012

    Using Velvet with mate-pair sequences


    Illumina sequencing instruments (HiSeq, MiSeq, Genome Analyzer) can produce three main types of reads when sequencing genomic DNA:
    1. Single-end
      Each "read" is a single sequence from one end of a DNA fragment. The fragment is usually 200-800bp long, and the read length can be chosen between 50 and 250 bp.
    2. Paired-end
      Each "read" is two sequences (a pair) from each end of the same genomic DNA fragment (more info). The distance between the reads on the original genome sequence is equal to the length of the DNA fragment that was sequenced (usually 200-800 bp).
    3. Mate-pair:
      Like paired-end reads, each "read" is two sequences from each end of the same DNA fragment, but the DNA fragment has been engineered from a circularization process (more info) such that the distance between the reads on the original genome sequence is much longer (say 3000-10000 bp) than the proxy DNA fragment (200-800 bp).

    Single-end library ("SE")

    When we got the original Illumina Genome Analyzer, all it could do was 36 bp single-end reads, and each lane gave us a massive 250 Mbp, and we had to walk 7 miles through snow in the dark to get it. Ok, that last bit is clearly false as we don't get snow in Australia and we speak metric here, but the point is that there is still plenty of legacy SE data around, and SE reads are still used in RNA-Seq sometimes. Let's imagine our data was provided as a standard FASTQ file called SE.fq:

    velveth Dir 31 -short -fastq SE.fq
    velvetg Dir -exp_cov auto -cov_cutoff auto

    I strongly recommend enabling the -exp_cov auto and -cov_cutoff auto options. They will almost always improve the quality of your assemblies.

    Paired-end library ("PE")

    Paired-end reads are the standard output of most Illumina sequencers these days, currently 2x100bp for the HiSeq and 2x150bp for the GAIIx and MiSeq, but they are all migrating to 2x250bp soon. The two sequences per paired read are typically distributed in two separate files, the "1" file contains all the "left" reads and the "2" file contains all the corresponding "right" reads. Let's imagine our paired-end run gave us two files in standard FASTQ format, PE_1.fq and PE_2.fq:

    velveth Dir 31 -shortPaired -separate -fastq PE_1.fq PE_2.fq
    velvetg Dir -exp_cov auto -cov_cutoff auto

    Previously you had to interleave the left and right files for Velvet, but we recently added support to Velvet for the -separate option which we hope is now saving time and disk space throughout the Velvetsphere!

    Mate-pair library ("MP")

    Mate-pair reads are extremely valuable in a de novo setting as they provide long-range information about the genome, and can help link contigs together into larger scaffolds. They have been used reliably for years on the 454 FLX platform, but used less often on the Illumina platform. I think the main reasons for this are the poorer reliability of the Illumina mate-pair protocol and the larger amount of DNA required compared to a PE library.

    We can consider MP reads as the same as PE reads, but with a larger distance between them ("insert size"). But there is one technical difference due to the circularization procedure used in their preparation. PE reads are oriented "opp-in" (L=>.....<=R), whereas MP reads are oriented "opp-out" (L<=.....=>R). Velvet likes its paired reads to be in opp-in orientation, so we need to reverse-complement all our MP reads first, which I do here with the EMBOSS "revseq" tool.

    revseq -sequence MP_1.fq -outseq rcMP_1.fq -notag
    revseq -sequence MP_2.fq -outseq rcMP_2.fq -notag
    velveth Dir 31 -shortPaired -separate -fastq rcMP_1.fq rcMP_2.fq
    velvetg Dir -exp_cov auto -cov_cutoff auto -shortMatePaired yes

    Early Illumina MP libraries are often contaminated with PE reads (the so-called shadow library) which are the result of imperfect selection of biotin-marked fragments in the circularization process. There is a special option in velvetg (not velveth) called -shortMatePaired added by Sylvain Foret which informs Velvet that a paired channel is MP but may contain PE reads, which helps it to account for them and avoid mis-scaffolding. I recommend using this no matter how pure you think your MP library is.

    Combining them all! (SE + PE + MP)

    When de novo assembling multiple libraries in Velvet, you should order them from smallest insert size to largest insert size. For our case, this means the SE first, then the PE, then the MP. Each library must go in its own "channel" as it comes from a differently prepared DNA library. Channels are specified in Velvet with a numerical suffix on the read-type parameter (absence means channel 0):

    velveth \
      Dir 31 \
      -short -fastq SE.fq \
      -shortPaired2 -separate -fastq PE_1.fq PE_2.fq \
      -shortPaired3 -separate -fastq rcMP_1.fq rcMP_2.fq 

    velvetg \
      Dir \
      -exp_cov auto -cov_cutoff auto \
      -shortMatePaired3 yes

    Note that the -shortMatePaired option has been allocated to channel 3 now (the -shortPaired3 library) as that is the MP channel.


    It's relatively easy to get up and running with Velvet, but when your projects become more complicated, the methods in this post should help you. But if you prefer a nice GUI to take care of most of the issues discussed here, I recommend using our Velvet GUI called VAGUE (Velvet Assembler Graphical User Environment). 

    Tuesday, 24 July 2012

    Navigating microbial genomes on the NCBI FTP site

    If you are a bioinformatician working in microbial genomics, then you should know this URL:

    If you click on the URL, there is a big list of folders, and it does look like a mess. But for those of us in microbial genomics there are a few key folders you should know about, and probably even have mirrored on your own servers:
    Most of my work is in bacterial genomics, so I'll discuss the contents of the first four folders only. I'll leave the last two to an experienced mycogenomicist.

    1. Bacteria

    This directory contains a folder for each completed bacterial genome. That is, the genome has been finished to a single DNA sequence per replicon (usually just one chromosome) and is fully annotated. There are currently around 1000 completed bacterial genomes, of which I've been involved in about 10.

    Let's have a look at one. I chose Dickeya dadantii because it's a lovely sounding alliteration for a plant pathogen:

    NC_014500.asn 15.9 MB 13/06/2012 12:11:00
    NC_014500.faa 1.7 MB 13/06/2012 12:11:00
    NC_014500.ffn 4.5 MB 19/11/2011 11:00:00
    NC_014500.fna 4.8 MB 29/09/2010 10:00:00
    NC_014500.frn 49.1 kB 29/09/2010 10:00:00
    NC_014500.gbk 16.7 MB 13/06/2012 12:11:00
    NC_014500.gff 1.8 MB 03/04/2012 03:41:00
    NC_014500.ptt 407 kB 10/03/2012 13:18:00
    NC_014500.rnt 7.1 kB 29/09/2010 10:00:00
    NC_014500.rpt 281 B 25/04/2011 10:00:00
    NC_014500.val 7.0 MB 13/06/2012 12:11:00

    You can see a bunch of files, all with the same prefix (NC_014500) and a bunch of different suffixes or file extensions (gbk, gff) - some of which should be familiar to you. The NC_014500 is the RefSeq accession ID for the single chromosome of Dickeya dadantii. The most important files are:
    • fna : FASTA file of the chromosomal sequence (think "n" = nucleotide)
    • gbk : Genbank file containing meta-data, sequence, and annotations
    • gff : GFF3 file containing annotations only (coordinates relative to the .fna file)
    • faa : FASTA file of the translated coding regions (proteins) annotated in the .gbk/.gff (think "aa" = amino acids)
    In terms of usefulness, the .gbk file contains (nearly) all the information that the other files contain - the .faa and .fna files are easily generated from the .gbk using BioPerl etc. If you want to get the .gbk files for all the finished genomes, you can download the tarball NCBI provides:

    2. Bacteria_DRAFT

    This directory contains folders for each draft bacterial genome. That is, the genome has been de novo assembled into contigs/scaffolds (eg. using Newbler for 454 data) but has not been, and probably never will be, finished. They are usually annotated, either by the submitter or automatically by NCBI, but sometimes there may be only sequences. There are currently about 2600 draft genomes.

    Here's the contents of the Thiocapsa marina str. 5811 genome folder - it's a purple sulphur coccus from the Mediterranean Coast if you are interested.

    NZ_AFWV00000000.asn        13.5 kB 03/04/2012 03:19:00
    NZ_AFWV00000000.contig.asn.tgz 1.7 MB 21/07/2012 02:13:00
    NZ_AFWV00000000.contig.faa.tgz 1.0 MB 21/07/2012 02:13:00
    NZ_AFWV00000000.contig.ffn.tgz 1.5 MB 21/07/2012 02:13:00
    NZ_AFWV00000000.contig.fna.tgz 1.6 MB 21/07/2012 02:13:00
    NZ_AFWV00000000.contig.frn.tgz 4.1 kB 21/07/2012 02:13:00
    NZ_AFWV00000000.contig.gbk.tgz 4.6 MB 21/07/2012 02:13:00
    NZ_AFWV00000000.contig.gbs.tgz 4.2 kB 21/07/2012 02:13:00
    NZ_AFWV00000000.contig.gff.tgz 393 kB 21/07/2012 02:13:00
    NZ_AFWV00000000.contig.ptt.tgz 119 kB 21/07/2012 02:13:00
    NZ_AFWV00000000.contig.rnt.tgz 1.5 kB 21/07/2012 02:13:00
    NZ_AFWV00000000.contig.rpt.tgz 2.5 kB 21/07/2012 02:13:00
    NZ_AFWV00000000.contig.val.tgz 1.6 MB 21/07/2012 02:13:00
    NZ_AFWV00000000.gbk         4.7 kB 03/04/2012 03:19:00
    NZ_AFWV00000000.rpt         257 B 03/04/2012 03:19:00
    NZ_AFWV00000000.val         6.0 kB 03/04/2012 03:19:00

    This folder looks a bit different to the finished genomes. It has a .gbk file, but you will notice it is quite small (4700 bytes), and if you look at it, you can see it has no sequence or annotation, only some meta-data and a reference to "WGS NZ_AFWV01000001-NZ_AFWV01000062". This means that this genome record consists of 62 other records; one for each contig in the assembly. These are stored in the compressed tar file NZ_AFWV00000000.contig.gbk.tgz as follows:

    % tar ztf NZ_AFWV00000000.contig.gbk.tgz

    So, in summary, instead of getting a nice neat single .gbk or .faa file for each replicon as you do for the completed genomes, you get a tarball of files for each assembly, with each file representing a contig in the draft genome. Any extra chromosomes or plasmids will be mixed in the bag of contigs.
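    To see what that looks like in practice, here is a sketch of unpacking a draft genome's per-contig GenBank files and merging them into one multi-record file. Two dummy contig stubs stand in for the real NZ_AFWV archive contents (the member filenames inside the real tarball are an assumption), so the commands are runnable anywhere:

    ```shell
    # Two dummy contig GenBank stubs stand in for the real per-contig files
    printf 'LOCUS       contig1\n//\n' > NZ_AFWV01000001.gbk
    printf 'LOCUS       contig2\n//\n' > NZ_AFWV01000002.gbk
    tar czf NZ_AFWV00000000.contig.gbk.tgz NZ_AFWV01*.gbk
    rm NZ_AFWV01*.gbk

    # list the archive members, then extract and concatenate into one file
    tar ztf NZ_AFWV00000000.contig.gbk.tgz
    tar zxf NZ_AFWV00000000.contig.gbk.tgz
    cat NZ_AFWV01*.gbk > draft_all.gbk
    grep -c '^LOCUS' draft_all.gbk   # → 2
    ```

    Most GenBank parsers happily read such a concatenated multi-record file, so this is a reasonable way to treat a draft assembly as a single unit.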

    3. Plasmids

    The Plasmids folder is not well known; frankly, it seems a bit hidden away. It contains ~3000 completed plasmid sequences. Confusingly, ~1000 of these are duplicates from the Bacteria folder (because those plasmids were sequenced along with their parent chromosome), while the other ~2000 are novel. Even more annoying is that the folder structure is different:

    faa/ 21/07/2012 19:39:00
    fna/ 21/07/2012 19:40:00
    gbk/ 21/07/2012 19:41:00
    plasmids.all.faa.tar.gz 43.2 MB 23/07/2012 19:43:00
    plasmids.all.fna.tar.gz 75.1 MB 23/07/2012 19:43:00
    plasmids.all.gbk.tar.gz 199 MB 23/07/2012 19:43:00

    Now we have a folder for each file extension, each of which contains ~3000 files. So the files for a particular plasmid are spread across multiple folders. Fortunately NCBI provides compressed tar files of the whole archive to download directly: plasmids.all.gbk.tar.gz

    4. Viruses

    Some of you may be wondering why I am including Viruses in this story. Well, some viruses infect Bacteria too - they are called bacteriophage. There are ~3000 folders in the Viruses division, but not all of them are bacteriophage. A simple grep for "phage" suggests ~600 are bacterial viruses.  The folder structure is the same as for the finished Bacteria genomes.
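    That rough count can be reproduced with a one-liner; the folder names below are dummies standing in for a real listing of the Viruses directory:

    ```shell
    # grep -c -i counts lines containing "phage", case-insensitively
    printf '%s\n' \
        'Enterobacteria_phage_lambda' \
        'Human_herpesvirus_1' \
        'Mycobacterium_phage_L5' \
        'Vaccinia_virus' | grep -c -i 'phage'   # → 2
    ```

    Against a local mirror of the FTP site it would simply be `ls Viruses | grep -ci phage`.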

    It is important to realise that most of these virus sequences are natively dsDNA and will also appear integrated into the chromosomal DNA of many of the entries in Bacteria and Bacteria_DRAFT. 

    Saturday, 9 June 2012

    Recent changes in the Velvet de novo ecosystem

    A few weeks ago we managed to lure an old colleague, Dave Powell, back into academia to join the VBC team. One of his many talents is software engineering, so I immediately got him working on some projects related to the Velvet de novo assembler that I really wanted to do but could never find time.

    One of the plans was to write a GUI for Velvet. Browsing sourceforge and the like, it seems a few people had attempted and failed. Last year I supervised a 4th year software engineering group project, who managed to get a basic prototype working, but the code was unmaintainable. So Dave and I planned it out, and he had a prototype done by the next day. He implemented it in JRuby, which can be distributed as a standard portable Java .jar file. We called it VAGUE.

    However, there are a few issues with the current command-line Velvet that were limiting the usefulness of VAGUE to its beginner target audience:

    1. Paired-end read files have to be interleaved first, even though Illumina produce them as separate files
    2. The user needs to know what format the files are in, eg. fastq or fasta
    3. It only supports GZIP compression, even though a lot of sequencing centres are using BZIP2
    4. People don't know what K value to start with
    We could have put these features into VAGUE, but it seemed much more sensible to put them directly into Velvet itself. So that's what Dave did! With a bit of code refactoring, Velvet can now do these things.
    • velveth now has a -separate option for paired-read files (default is -interleaved):
      velveth dir 31 -shortPaired -fastq -separate reads_R1.fq reads_R2.fq

    • velveth now has a -fmtAuto option which auto-detects file-type and compression-type. Note that each file can be of different format and compression - very elegant.
      velveth dir 31 -shortPaired -separate -fmtAuto left.fastq.gz right.fa.bz2

    • velveth now supports BZIP2 compressed files, and if you have pbzip2 (parallel bzip) installed it will use that, and if you have pigz (parallel gzip) it will also use that
    For the issue of choosing K, I wrote a very simple Perl script which counts the k-mers in your reads and tells you the k-mer coverage of your target genome for all possible values of K. It's called VelvetK, and it has a --best option which we use in VAGUE to automatically choose a reasonable k-mer value to assemble with.
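    The arithmetic behind this can be sketched in a one-liner. This is just the standard k-mer coverage formula Ck = C × (L − k + 1) / L, not the actual velvetk code, and the read length L=100 and base coverage C=50 are assumed example values:

    ```shell
    # each read of length L contributes L-k+1 k-mers, so k-mer coverage
    # falls off linearly as k increases towards the read length
    awk 'BEGIN { L=100; C=50; for (k=31; k<=91; k+=20)
        printf "k=%d  kmer-coverage=%.1f\n", k, C*(L-k+1)/L }'
    # → k=31 gives 35.0, k=51 gives 25.0, k=71 gives 15.0, k=91 gives 5.0
    ```

    This is why a K close to the read length gives very thin k-mer coverage even when base coverage looks generous.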

    I think these new features and tools will help introduce Velvet and de novo assembly to more people, and hopefully help move power to the biologists and give more time to the bioinformaticians to develop better tools.

    Sunday, 20 May 2012

    Cool use of Unix paste with NGS sequences

    While browsing today I came across a post where Uwe Appelt provided a couple of lines of Unix shell wizardry to solve some problem. What attracted my attention was the following:
    paste - - - - < in.fq | filter | tr "\t" "\n" > out.fq
    Now, I've done a reasonable amount of shell one-liners in my life, but I'd never seen this before.  I've used the paste command a couple of times, but clearly its potential power did not sink in! Here is the man page description for paste:
    Write lines consisting of the sequentially corresponding lines from each FILE, separated by TABs, to standard output. With no FILE, or when FILE is -, read standard input.
    So what's happening here? Well, in Unix, the "-" character means to use STDIN instead of a filename. Here Uwe is providing paste with four filenames, each of which is the same stdin filehandle. So lines 1..4 of in.fq are put onto one line (with tab separators), lines 5..8 on the next line, and so on. Now our stream has the four lines of each FASTQ entry on a single line, which makes it much more amenable to Unix line-based manipulation, represented by filter in my example. Once that's all done, we need to put it back into the standard 4-line FASTQ format, which is as simple as converting the tabs "\t" back to newlines "\n" with the tr command.

    Example 1: FASTQ to FASTA

    A common thing to do is convert FASTQ to FASTA, and we don't always have our favourite tool or script to do this when we aren't on our own servers:

    paste - - - - < in.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > out.fa

    1. paste converts the input FASTQ into a 4-column file
    2. cut command extracts out just column 1 (the ID) and column 2 (the sequence)
    3. sed replaces the FASTQ ID prefix "@" with the FASTA ID prefix ">"
    4. tr converts the 2 columns back into 2 lines

    And because the shell command above uses a pipe connecting four commands (paste, cut, sed, tr) the operating system will run them all in parallel, which will make it run faster assuming your disk I/O can keep up. 
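    You can sanity-check the conversion on a single in-memory record, with printf standing in for in.fq:

    ```shell
    # one FASTQ record in, one FASTA record out
    printf '@read1\nACGT\n+\nIIII\n' \
        | paste - - - - | cut -f 1,2 | sed 's/^@/>/' | tr '\t' '\n'
    # → >read1
    #   ACGT
    ```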

    Example 2: Removing redundant FASTQ ID in line 3

    The third line in the FASTQ format is somewhat redundant - it is usually a duplicate of the first line, except with "+" instead of "@" to denote that a quality string is coming next rather than an ID. Most parsers ignore it, and happily accept a blank ID after the "+", which saves a fair chunk of disk space. If you have legacy files with the redundant IDs and want to convert them, here's how we can do it with our new paste trick:

    paste -d ' ' - - - - < in.fq | sed 's/ +[^ ]*/ +/' | tr " " "\n" > out.fq
    1. paste converts the input FASTQ into a 4-column file, but using SPACE instead of TAB as the separator character
    2. sed finds and replaces the "+DUPE_ID" line with just a "+"
    3. tr converts the 4 columns back into 4 lines
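    Again, a single record verifies the trick. Note that because SPACE is the delimiter, this only works when the read IDs themselves contain no spaces:

    ```shell
    # the third line "+read1" collapses to a bare "+"
    printf '@read1\nACGT\n+read1\nIIII\n' \
        | paste -d ' ' - - - - | sed 's/ +[^ ]*/ +/' | tr ' ' '\n'
    # → @read1
    #   ACGT
    #   +
    #   IIII
    ```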

    That's it for today, hope you learnt something, because I certainly did.

    Saturday, 28 April 2012

    Prokka - rapid prokaryotic annotation

    Prokka is a software tool I have written to annotate bacterial, archaeal and viral genomes. It is based on years of experience annotating bacterial genomes, both automatically and via manual curation. 

    Its main design considerations were to be:
    • fast
      • supports multi-threading
      • hierarchical search databases
    • simple to use
      • no compulsory parameters
      • bundled databases
    • clean
      • standards-compliant output files
      • pipeline-friendly interface
    • thorough
      • finds tRNA, rRNA, CDS, sig_peptide, tandem repeats, ncRNA
      • includes /gene and /EC_number where possible, not just /product
      • traceable annotation sources via /inference tags
    • useful
      • produce files close-to-ready for submission to Genbank
      • complete log file

    The first release is a monolithic, but followable Perl script. It only uses core Perl modules, but has quite a few external tool dependencies, some of which I can't bundle due to licence restrictions. Eventually I hope to have a public web-server version, and a version of it in the Galaxy Toolshed. 

    It currently takes about 10 minutes on a quad-core Intel i7 for a typical 4 Mbp genome.

    You can download it from here and read the manual here.

    Tuesday, 20 March 2012

    LEGO Ion PGM sequencer

    A while back I came across a cool LEGO model of a HiSeq 2000 sequencing instrument created by Dawei Lin at the UC Davis Genome Centre.

    My good colleague Tim Stinear was one of the first Australian labs to purchase an Ion Torrent PGM sequencing instrument, which looks like this:

    Now, my oldest child Oskar just turned 6 years old, and he was ready for LEGO. Being a former LEGO nerd, I went into nostalgia overdrive and "invested" in some generic LEGO sets for his and Zoe's Xmas presents, subsequently supplemented by some Ebay bulk lots. So now I had an excuse to lie on the rug all day playing with LEGO.

    And here is my first attempt at a LEGO model of a PGM:

    Not perfect or quite to proportions, but not a bad effort I reckon!