Wednesday, 21 November 2012

Cleaning Illumina FASTQ reads with Nesoni clip:


Cleaning FASTQ reads is the process of removing those bits of the reads that you don't deem good enough to be given to the next stage of your pipeline. At worst, this could mean removing the whole read, and if the reads were paired, this means some reads will become "orphan" single reads. The cleaning process is often called filtering, trimming, clipping, or pruning. A FASTQ read has three parts: a sequence of bases, and a quality score for each base, and a sequence ID. The cleaning process only has access to this information.


In terms of the sequence itself, we may want to discard sequences which contain ambiguous DNA symbols, most commonly "N" which means the base could not be called. Generally, reads with anything other than A,T,G,C are going to cause problems with most downstream analysis tools.

Preparing DNA for sequencing almost always involves ligating "adaptor" sequences to all the fragments of DNA. When things don't go perfectly, these adaptors can end up being sequenced, and be partially present at the start/end of your reads! This can wreak havoc with downstream analyses like read alignment and de novo assembly. This mainly seems to occur with Illumina reads, especially mate-pair libraries and some RNA-Seq protocols. You should always check and remove these adaptor sequences.


Usually when we think of cleaning FASTQ reads, we mean trimming on quality. Each base in a FASTQ read gets a Phred quality score which is an integer between 0 and 93 (as encoded in the Sanger FASTQ format) although you rarely see more than "Q45" or so. Low numbers are bad, high numbers are good. The numbers are on a logarithmic scale, and represent the probability of this base not being correct. For example, a base of Q30 is expected to be wrong about 1 in 1000 times, or we are 99.9% confident in it. This means that in the millions of Q30 bases throughout all your reads, thousands of them are expected to be wrong! When we get down to Q10, one in ten bases are dodgy. Like any analysis, the "garbage in, garbage out" rule applies, and hence removing low quality reads is often a good strategy, especially if you don't know how well the downstream tool handles errors.

In Illumina reads the quality tends to get monotonically worse at the 3' end of the reads. Simply removing bases below a quality threshold T from the end is a common form of quality trimming. Alternatively you could average quality across a small window (3 bases say) and threshold that. Taking that idea to the limit, you could simply remove whole reads with average quality < T. There are many variations on this theme, and later I will discuss the approach we take.


You probably think I'm crazy suggesting we can use the sequence ID to filter reads, right? In general, of course it does not make sense, however Illumina reads do contain some information in their read IDs. The most relevant are the series of colon-separated integers which some encode the coordinates of the cluster on the flowcell lane which generated this sequence. When we got the original Illumina Genome Analyzer (version 1!) it was clearly producing bad reads that came from the edges of the flowcell, due to poorer optical performance and other physical affects. You can still see this effect today when you open a FASTQ file, whereby lots of the reads at the start of the file are full of "N"s and have low quality - these come from the first tile in the corner of the flowcell lane. I'm not advocating using this information in reality, but it is interesting to keep in mind, and may actually be useful to someone if you were given a FASTA file where the quality information had been stripped away.

What is Nesoni?

Nesoni (github) is our Swiss Army knife of NGS related tools, implemented primarily by Paul Harrison in our group.  It began as a wrapper for aligning reads and calling SNPs for the 100s of bacterial genome samples we were processing, but has now evolved into an extendible, distributable pipeline system which we hope to publish soon and actually document. I'll save the full description of Nesoni for another day, and today just focus on one of the simplest but still very useful tools it has.

The Nesoni clip: tool

If you just type "nesoni clip:" you will get this descriptive help information:

    --adaptor-clip yes
      # Do adaptor clipping?
    --adaptor-file ...
      # FASTA file to read adaptors from. 
      # Defaults to a built-in list of Illumina adaptors.
    --match 10
      # Minimum length adaptor match.
    --max-errors 1
      # Maximum errors in adaptor match. Note: slow for N > 1
    --clip-ambiguous yes
      # Clip ambiguous bases, eg N
    --quality 10
      # Quality cutoff
    --qoffset NNN
      # Quality character offset: sanger: 33, solexa: 59, illumina: 64
      # Will guess if not given.
    --length 24
      # Reads shorter than this will be discarded.
    --homopolymers no
      # Set to yes to *remove* reads containing all the same base.
    --trim-to NNN
      # Trim ends of reads so that they are no longer than NNN bases,
      # irrespective of quality.
    --trim-start 0
      # Trim NNN bases from start of each read irrespective of quality.
    --trim-end 0
      # Trim NNN bases from end of each read irrespective of quality.
    --revcom no
      # Reverse complement all reads.
      # ie convert opp-out pairs to opp-in
    --fasta no
      # Output in fasta rather than fastq format.
    --gzip yes
      # Gzip output.
    --rejects no
      # Output rejected reads in separate file.
    --out-separate no
      # Output paired reads in separate left/right files.
      # Prefix for output files.
    reads: ...
      # Files containing unpaired reads.
    interleaved: ...
      # Files containing interleaved read pairs.
    pairs: ...
      # Pair of files containing read pairs.

It contains three main sections:

  1. The first is the standard getoptlong "--option" style options. 
  2. The second is the mandatory prefix for the the output files. 
  3. The third is the specification of all the input reads.


The standard Illumina paired-end run gives you two files for the left and right reads:


By default, Nesoni will quality clip at Q10, minimum length 24, and will remove all known Illumina adaptors from TruSeq amd Nextera:

nesoni clip: clipped pairs: Sample_R1.fastq.gz Sample_R2.fastq.gz

The "clipped" is the output file prefix which is mandatory. The "pairs:" clause tells it that you are feeding it separate left and right files. It can optionally take interleaved pairs with the "interleaved:" clause. By default, the output is three files - the clean pairs (interleaved), any orphan reads, and an extensive log file which explains how many reads failed the various filters (adaptors, quality, length)


You probably want your output as separate left/right files, and uncompressed, as to be compatible with most other software. Nesoni can do this with two extra options:

nesoni clip: clipped \
  --out-separate yes \
  --gzip no \
  pairs: Sample_R1.fastq.gz Sample_R2.fastq.gz

This gives four files:


 Imagine you had five libraries, some FASTQ, some FASTA, some GZIP, some not:

A_R1.fastq.gz and A_R2.fastq.gz
B_R1.fq and B_R2.fq

You can clip a whole bunch of mixed FASTQ files all in one command too. Just add more input file clauses like this:

nesoni clip: clipped \
  --out-separate yes \
  --gzip no \
  --length 50 \
  pairs: A_R1.fastq.gz A_R2.fastq.gz \
  pairs: B_R1.fq B_R2.fq \
  interleaved: C_interleaved.fasta \
  reads: D_single.fa \
  reads: E_proton.fq.gz 

This will process all the input files, and give a clean result combined into just four files:


Pretty neat. As a general first pass for 2 x 150bp HiSeq(RapidRun) or 2 x 250bp MiSeq I use the following options:

nesoni clip: clipped --length 100 --quality 20 --out-separate yes --gzip no 

I hope this gave you a flavour of the flexibility of Nesoni's clip: module!



  • Tries to keep as much of every read as possible
  • Maintains proper left/right ordering in paired read output
  • Can output interleaved or separate left/right
  • Outputs orphan singletons and rejected reads
  • Supports gzip compression for input and output
  • Can feed it all your read files at once - multiple pair sets, singles, interleaved
  • Automatic detection of quality encoding (Sanger 33, Illumina 64, Solexa 59)
  • Extensive reporting on Illumina adaptors for "forensic" work


  • Not a single clean binary or script, need whole Nesoni package
  • Slower than a native compiled implementation would be
  • Not packaged as a .deb or .rpm for BioLinux 
  • Not in the Galaxy Toolshed


    Related software

    Sunday, 11 November 2012

    Tools to merge overlapping paired-end reads


    In very simple terms, current sequencing technology begins by breaking up long pieces of DNA into lots more short pieces of DNA. The resultant set of DNA is called a "library" and the short pieces are called "fragments". Each of  the fragments in the library are then sequenced individually and in parallel. There are two ways of sequencing a fragment - either just from one end, or from both ends of a fragment. If only one end is sequenced, you get a single read. If your technology can sequence both ends, you get a "pair" of reads for each fragment. These "paired-end" reads are standard practice on Illumina instruments like the GAIIx, HiSeq and MiSeq.

    Now, for single-end reads, you need to make sure your read length (L) is shorter than your fragment length (F) or otherwise the sequence will run out of DNA to read! Typical Illumina fragment libraries would use F ~ 450bp but this is variable. For paired-end reads, you want to make sure that F is long enough to fit two reads. This means you need F to be at least 2L. As L=100 or 150bp these days for most people, using F~450bp is fine, there is a still a safety margin in the middle.

    However, some things have changed in the Illumina ecosystem this year. Firstly, read lengths are now moving to >150bp on the HiSeq (and have already been on the GAIIx), and to >250bp on the MiSeq, with possibilities of longer ones coming soon! This means that the standard library size F~450bp has become too small, and paired end reads will overlap. Secondly, the new enyzmatic Nextera library preparation system produces a wide spread of F sizes compared to the previous TruSeq system. With Nextera, we see F ranging from 100bp to 900bp in the same library. So some reads will overlap, and others won't. It's starting to get messy.

    The whole point of paired-end reads is to get the benefit of longer reads without actually being able to sequence reads that long. A paired-end read (two reads of length L) from a fragment of length F, is a bit like a single-read of length F, except a bunch of bases in the middle of it are unknown, and how many of them there are is only roughly known (as libraries are only nominally of length F, each read will vary). This gives the reads a longer context, and this particularly helps in de novo assembly and in aligning more reads unambiguously to a reference genome. However, many software tools will get confused if you give them overlapping pairs, and if we could overlap them and turn them into longer single-end reads, many tools will produce better results, and faster.

    The tools

    Here is a list of tools which can do the overlapping procedure. I am NOT going to review them all here. I've used one tool (FLASH) to overlap some MiSeq 2x150 PE reads, and then assembled them using Velvet, and the merged reads produced a "better" assembly than with the paired reads. But that's it. I write this post to inform people of the problem, and to collate all the tools in one place to save others effort. Enjoy!

    Features to look for

    • Keeps original IDs in merged reads
    • Outputs the un-overlapped paired reads
    • Ability to strip adaptors first
    • Rescores the Phred qualities across the overlapped region
    • Parameters to control the overlap sensitivity
    • Handle .gz and .bz2 compressed files
    • Multi-threading support
    • Written in C/C++ (faster compiled) rather than Python/Perl (slower)

    Tuesday, 6 November 2012

    Sorting FASTQ files by sequence length

    The problem

    The FASTQ file format is used to store short sequence reads, most commonly those from Illumina. The reads are typically between 36 and 250 bp long, and typical data sets contain between 1M to 1B reads. Each entry in a FASTQ file contains three pieces of information: the sequence ID, the DNA sequence, and a quality string. I recently read through the documentation for the khmer package, and noticed a comment stating that digital normalisation would work better if the FASTQ reads were in order from longest to shortest.

    I googled for some FASTQ sorting information. I found some posts on sorting by the sequence itself to remove duplicates (Justin's blog), and some posts on sorting on ID to help with finding displaced pairs (BioStar), but nothing on length sorting. There were some posts related to sorting FASTA files (Joe Fass) but it read the whole file into RAM and didn't use a Schwartzian transform.

    Some solutions

    Our first attempt was to use the paste trick I blogged about previously, where in.fq contains about 10M clipped reads with lengths between 24 and 100 bp:

    cat in.fq 
    | paste - - - - 
    | perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;'  
    | sort -n 
    | cut -f2- 
    | tr "\t" "\n"
    > out.fq

    This turns a 4-line FASTQ entry into a single tab separated line, adds a column with the length of each read, passes it to Unix sort, removes the length column, and converts it back into a FASTQ file. And it works faster than I thought it would. About 2 minutes for a 10M read file. The whole process is I/O bound, so the faster your disks the better. The sort usually uses /tmp so if that is on a different disk unit  ("spindle" for my more mature readers) to your input and/or output files, you'll get better performance. There are various chunk parameters etc you can change on sort, but it probably doesn't make too much difference.

    Unix sort breaks the input into chunks, sorts each of those, then merges all the chunks in the output (merge sort). Another possibility which avoids explicit sorting is to read through the input, and put reads into separate files based on their read length (a simplistic radix sort). We can do this because we have a finite number of read lengths. The final output simply concatenates all the bin files in order. Here is an implementation in Perl, which is about 2x faster on average:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Fatal;
    use File::Temp qw(tempdir);

    my $dir = tempdir(CLEANUP=>1);
    my @line;
    my @fh;

    while ( ! eof () ) {
      $line[$_] = scalar(<>) for (0..3);
      my $len = length($line[1]) - 1;
      if ( not defined $fh[$len] ) {
        open $fh[$len], '>', "$dir/$len";
      print {$fh[$len]} @line;

    for my $len (0 .. $#fh) {
      next unless defined $fh[$len];
      system("cat", "$dir/$len");


    Here are some timing measurements across three files:

    10 million reads
    Bash real 1m42.407s user 1m57.863s sys 0m52.479s
    Perl real 0m51.698s user 0m35.626s sys 0m9.897s
    20 million reads
    Bash real 3m58.520s user 4m2.155s  sys 1m58.167s
    Perl real  1m39.824s user 1m12.765s sys 0m20.409s
    30 million reads
    Bash real 5m53.346s user 6m16.736s sys 2m51.483s
    Perl real  2m23.200s user 1m46.815s sys 0m30.754s

    On face value, it appears the shell version ("Bash") is behaving in an O(N.logN) fashion, whereas the Perl implementation seems to be O(N) linear, which is somewhat expected given it reads all the data, writes all the data, then writes it all again. More testing and thinking is needed.


    Sorting huge FASTQ files is feasible. If applications like diginorm become mainstream and sorting clipped FASTQ files make it work better, than this is handy to know. Nothing groundbreaking here, and I may have repeated other people's work, but I hope it is useful to somebody.