Introduction
Cleaning FASTQ reads is the process of removing those bits of the reads that you don't deem good enough to be given to the next stage of your pipeline. At worst, this could mean removing the whole read, and if the reads were paired, this means some reads will become "orphan" single reads. The cleaning process is often called filtering, trimming, clipping, or pruning. A FASTQ read has three parts: a sequence of bases, and a quality score for each base, and a sequence ID. The cleaning process only has access to this information.
Sequence
In terms of the sequence itself, we may want to discard sequences which contain
ambiguous DNA symbols, most commonly "N" which means the base could not be called. Generally, reads with anything other than A,T,G,C are going to cause problems with most downstream analysis tools.
Preparing DNA for sequencing almost always involves ligating "adaptor" sequences to all the fragments of DNA. When things don't go perfectly, these adaptors can end up being sequenced, and be partially present at the start/end of your reads! This can wreak havoc with downstream analyses like read alignment and de novo assembly. This mainly seems to occur with Illumina reads, especially mate-pair libraries and some RNA-Seq protocols. You should always check and remove these adaptor sequences.
Quality
Usually when we think of cleaning FASTQ reads, we mean trimming on quality. Each base in a FASTQ read gets a Phred quality score which is an integer between 0 and 93 (as encoded in the Sanger FASTQ format) although you rarely see more than "Q45" or so. Low numbers are bad, high numbers are good. The numbers are on a logarithmic scale, and represent the probability of this base not being correct. For example, a base of Q30 is expected to be wrong about 1 in 1000 times, or we are 99.9% confident in it. This means that in the millions of Q30 bases throughout all your reads, thousands of them are expected to be wrong! When we get down to Q10, one in ten bases are dodgy. Like any analysis, the "garbage in, garbage out" rule applies, and hence removing low quality reads is often a good strategy, especially if you don't know how well the downstream tool handles errors.
In Illumina reads the quality tends to get monotonically worse at the 3' end of the reads. Simply removing bases below a quality threshold T from the end is a common form of quality trimming. Alternatively you could average quality across a small window (3 bases say) and threshold that. Taking that idea to the limit, you could simply remove whole reads with average quality < T. There are many variations on this theme, and later I will discuss the approach we take.
In Illumina reads the quality tends to get monotonically worse at the 3' end of the reads. Simply removing bases below a quality threshold T from the end is a common form of quality trimming. Alternatively you could average quality across a small window (3 bases say) and threshold that. Taking that idea to the limit, you could simply remove whole reads with average quality < T. There are many variations on this theme, and later I will discuss the approach we take.
ID
You probably think I'm crazy suggesting we can use the sequence ID to filter reads, right? In general, of course it does not make sense, however Illumina reads do contain some information in their read IDs. The most relevant are the series of colon-separated integers which some encode the coordinates of the cluster on the flowcell lane which generated this sequence. When we got the original Illumina Genome Analyzer (version 1!) it was clearly producing bad reads that came from the edges of the flowcell, due to poorer optical performance and other physical affects. You can still see this effect today when you open a FASTQ file, whereby lots of the reads at the start of the file are full of "N"s and have low quality - these come from the first tile in the corner of the flowcell lane. I'm not advocating using this information in reality, but it is interesting to keep in mind, and may actually be useful to someone if you were given a FASTA file where the quality information had been stripped away.
What is Nesoni?
Nesoni (github) is our Swiss Army knife of NGS related tools, implemented primarily by Paul Harrison in our group. It began as a wrapper for aligning reads and calling SNPs for the 100s of bacterial genome samples we were processing, but has now evolved into an extendible, distributable pipeline system which we hope to publish soon and actually document. I'll save the full description of Nesoni for another day, and today just focus on one of the simplest but still very useful tools it has.
The Nesoni clip: tool
If you just type "nesoni clip:" you will get this descriptive help information:--adaptor-clip yes
# Do adaptor clipping?
--adaptor-file ...
# FASTA file to read adaptors from.
# Defaults to a built-in list of Illumina adaptors.
--match 10
# Minimum length adaptor match.
--max-errors 1
# Maximum errors in adaptor match. Note: slow for N > 1
--clip-ambiguous yes
# Clip ambiguous bases, eg N
--quality 10
# Quality cutoff
--qoffset NNN
# Quality character offset: sanger: 33, solexa: 59, illumina: 64
# Will guess if not given.
--length 24
# Reads shorter than this will be discarded.
--homopolymers no
# Set to yes to *remove* reads containing all the same base.
--trim-to NNN
# Trim ends of reads so that they are no longer than NNN bases,
# irrespective of quality.
--trim-start 0
# Trim NNN bases from start of each read irrespective of quality.
--trim-end 0
# Trim NNN bases from end of each read irrespective of quality.
--revcom no
# Reverse complement all reads.
# ie convert opp-out pairs to opp-in
--fasta no
# Output in fasta rather than fastq format.
--gzip yes
# Gzip output.
--rejects no
# Output rejected reads in separate file.
--out-separate no
# Output paired reads in separate left/right files.
<prefix>
# Prefix for output files.
reads: ...
# Files containing unpaired reads.
interleaved: ...
# Files containing interleaved read pairs.
pairs: ...
# Pair of files containing read pairs.
It contains three main sections:
- The first is the standard getoptlong "--option" style options.
- The second is the mandatory prefix for the the output files.
- The third is the specification of all the input reads.
Usage
The standard Illumina paired-end run gives you two files for the left and right reads:
Sample_R1.fastq.gz
Sample_R2.fastq.gz
By default, Nesoni will quality clip at Q10, minimum length 24, and will remove all known Illumina adaptors from TruSeq amd Nextera:
nesoni clip: clipped pairs: Sample_R1.fastq.gz Sample_R2.fastq.gz
The "clipped" is the output file prefix which is mandatory. The "pairs:" clause tells it that you are feeding it separate left and right files. It can optionally take interleaved pairs with the "interleaved:" clause. By default, the output is three files - the clean pairs (interleaved), any orphan reads, and an extensive log file which explains how many reads failed the various filters (adaptors, quality, length)
clipped_paired.fq.gz
clipped_single.fq.gz
clipped_log.txt
Sample_R2.fastq.gz
By default, Nesoni will quality clip at Q10, minimum length 24, and will remove all known Illumina adaptors from TruSeq amd Nextera:
nesoni clip: clipped pairs: Sample_R1.fastq.gz Sample_R2.fastq.gz
The "clipped" is the output file prefix which is mandatory. The "pairs:" clause tells it that you are feeding it separate left and right files. It can optionally take interleaved pairs with the "interleaved:" clause. By default, the output is three files - the clean pairs (interleaved), any orphan reads, and an extensive log file which explains how many reads failed the various filters (adaptors, quality, length)
clipped_paired.fq.gz
clipped_single.fq.gz
clipped_log.txt
You probably want your output as separate left/right files, and uncompressed, as to be compatible with most other software. Nesoni can do this with two extra options:
nesoni clip: clipped \
--out-separate yes \
--gzip no \
pairs: Sample_R1.fastq.gz Sample_R2.fastq.gz
This gives four files:
clipped_R1.fq
clipped_R2.fq
clipped_single.fq
clipped_log.txt
--out-separate yes \
--gzip no \
pairs: Sample_R1.fastq.gz Sample_R2.fastq.gz
This gives four files:
clipped_R1.fq
clipped_R2.fq
clipped_single.fq
clipped_log.txt
Imagine you had five libraries, some FASTQ, some FASTA, some GZIP, some not:
A_R1.fastq.gz and A_R2.fastq.gz
B_R1.fq and B_R2.fq
C_interleaved.fasta
D_single.fa
E_proton.fq.gz
You can clip a whole bunch of mixed FASTQ files all in one command too. Just add more input file clauses like this:
nesoni clip: clipped \
--out-separate yes \
--gzip no \
--length 50 \
pairs: A_R1.fastq.gz A_R2.fastq.gz \
pairs: B_R1.fq B_R2.fq \
interleaved: C_interleaved.fasta \
reads: D_single.fa \
reads: E_proton.fq.gz
This will process all the input files, and give a clean result combined into just four files:
clipped_R1.fq
clipped_R2.fq
clipped_single.fq
clipped_log.txt
clipped_R1.fq
clipped_R2.fq
clipped_single.fq
clipped_log.txt
nesoni clip: clipped --length 100 --quality 20 --out-separate yes --gzip no
I hope this gave you a flavour of the flexibility of Nesoni's clip: module!
Conclusions
Advantages
- Tries to keep as much of every read as possible
- Maintains proper left/right ordering in paired read output
- Can output interleaved or separate left/right
- Outputs orphan singletons and rejected reads
- Supports gzip compression for input and output
- Can feed it all your read files at once - multiple pair sets, singles, interleaved
- Automatic detection of quality encoding (Sanger 33, Illumina 64, Solexa 59)
- Extensive reporting on Illumina adaptors for "forensic" work
Disadvantages
- Not a single clean binary or script, need whole Nesoni package
- Slower than a native compiled implementation would be
- Not packaged as a .deb or .rpm for BioLinux
- Not in the Galaxy Toolshed