Tuesday 6 November 2012

Sorting FASTQ files by sequence length


The problem


The FASTQ file format is used to store short sequence reads, most commonly those from Illumina. The reads are typically between 36 and 250 bp long, and typical data sets contain between 1M and 1B reads. Each entry in a FASTQ file spans four lines and carries three pieces of information: the sequence ID, the DNA sequence, and a quality string (an example is shown below). I recently read through the documentation for the khmer package, and noticed a comment stating that digital normalisation would work better if the FASTQ reads were ordered from longest to shortest.
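
To make this concrete, here is a single made-up four-line FASTQ entry (the ID, bases and quality values are invented for illustration); the read is 36 bp, so the quality string is also 36 characters:

@HWI-ST123:42:FC706VJ:1:1101:1241:1986 1:N:0:ATCACG
CCAGTTTACGCAATGGTTGGATCACGAAATTGACGT
+
CCCFFFFFHHHHHJJJJJJJJJJJJIJJJJJIJJJJ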

I googled for some FASTQ sorting information. I found some posts on sorting by the sequence itself to remove duplicates (Justin's blog), and some posts on sorting on ID to help with finding displaced pairs (BioStar), but nothing on length sorting. There were some posts related to sorting FASTA files (Joe Fass), but that script read the whole file into RAM and didn't use a Schwartzian transform.

Some solutions

Our first attempt was to use the paste trick I blogged about previously, where in.fq contains about 10M clipped reads with lengths between 24 and 100 bp:

cat in.fq \
| paste - - - - \
| perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;' \
| sort -n \
| cut -f2- \
| tr "\t" "\n" \
> out.fq

This turns each 4-line FASTQ entry into a single tab-separated line, adds a column with the length of each read, passes it to Unix sort, removes the length column, and converts it back into a FASTQ file. It works faster than I thought it would: about 2 minutes for a 10M read file. The whole process is I/O bound, so the faster your disks the better. The sort usually uses /tmp, so if that is on a different disk unit ("spindle" for my more mature readers) to your input and/or output files, you'll get better performance. There are various buffer and temporary-directory parameters you can change on sort (see below), but they probably don't make too much difference.
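
For example, with GNU sort you can send the temporary files to a different spindle with -T, enlarge the in-memory buffer with -S, and (on recent enough coreutils) use several threads with --parallel; the directory and numbers below are just illustrative placeholders:

cat in.fq \
| paste - - - - \
| perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;' \
| sort -n -T /scratch -S 2G --parallel=4 \
| cut -f2- \
| tr "\t" "\n" \
> out.fq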

Unix sort breaks the input into chunks, sorts each of those, then merges all the sorted chunks on output (a merge sort). Another possibility which avoids explicit sorting is to read through the input and put reads into separate files based on their read length (a simplistic radix sort). We can do this because we have a small, finite number of possible read lengths. The final output simply concatenates all the bin files in ascending length order. Here is an implementation in Perl, which is about 2x faster on average (a usage example follows the script):

#!/usr/bin/env perl
use strict;
use warnings;
use Fatal qw(open close);
use File::Temp qw(tempdir);

# one temporary bin file per read length, cleaned up on exit
my $dir = tempdir(CLEANUP=>1);
my @line;
my @fh;

while ( ! eof() ) {
  # read the four lines of the next FASTQ entry
  $line[$_] = scalar(<>) for (0..3);
  # read length, excluding the trailing newline
  my $len = length($line[1]) - 1;
  # open a bin file the first time we see this length
  if ( not defined $fh[$len] ) {
    open $fh[$len], '>', "$dir/$len";
  }
  print {$fh[$len]} @line;
}

# flush and close each bin, then concatenate them to stdout
# in ascending length order
for my $len (0 .. $#fh) {
  next unless defined $fh[$len];
  close $fh[$len];
  system("cat", "$dir/$len");
}
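
If you save this as, say, sort-fastq-by-len.pl (the name is just illustrative), it reads FASTQ from the files named on the command line (or from stdin) and writes the length-sorted FASTQ to stdout:

chmod +x sort-fastq-by-len.pl
./sort-fastq-by-len.pl in.fq > out.fq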

Results

Here are some timing measurements across three files:

10 million reads
  Bash  real 1m42.407s   user 1m57.863s   sys 0m52.479s
  Perl  real 0m51.698s   user 0m35.626s   sys 0m9.897s

20 million reads
  Bash  real 3m58.520s   user 4m2.155s    sys 1m58.167s
  Perl  real 1m39.824s   user 1m12.765s   sys 0m20.409s

30 million reads
  Bash  real 5m53.346s   user 6m16.736s   sys 2m51.483s
  Perl  real 2m23.200s   user 1m46.815s   sys 0m30.754s

At face value, the shell version ("Bash") appears to behave in an O(N log N) fashion, whereas the Perl implementation looks roughly linear, O(N), which is expected given it reads all the data, writes it all out to the bins, then reads and writes it all once more during the final concatenation. More testing and thinking is needed.

Conclusion

Sorting huge FASTQ files is feasible. If applications like diginorm become mainstream and sorting clipped FASTQ files makes them work better, then this is handy to know. Nothing groundbreaking here, and I may have repeated other people's work, but I hope it is useful to somebody.

7 comments:

  1. Robert Edgar's UCLUST/USEARCH tools use the mergesort algorithm to sort large fasta files.

  2. Hmm, my comment went away?

    If you're going to tack it on to diginorm or abundance trimming, why not sort the reads as you go through them? See https://github.com/ged-lab/khmer/blob/master/sandbox/filter-abund-output-by-length.py

  3. Titus, not sure what happened to your comment. I've been away sick since I posted this. But thanks for the pointer! We're all still getting our head around the khmer toolset. Your recent 'handbook' is very helpful.

  4. Agder,
    Yes, you are correct, the recent usearch does have
    "usearch -sortbylength in.fa -output out.fa", but I have three problems with it:

    1. The free version is limited to 32-bit (which is about 3GB of RAM on my Linux), which is too small for large files. I have 21M reads and it stops at 7M.

    2. It only supports FASTA, not FASTQ

    3. It crashes with an odd error on any reads.fa file I give it:
    usearch -sortbylength in.fa -output out.fa
    usearch_i86linux32 v6.0.152, 4.0Gb RAM
    http://drive5.com/usearch
    00:00 1.9Mb Reading in.fa, 3.0Gb
    00:10 3.0Gb 7803063 (7.8M) seqs
    ---Fatal error---
    Invalid byte 0x02x in FASTA file (null)

  5. Great article for everyone to read especially for computer programmers to learn so much and try to avoid such nasty things.
