The Genome Factory: Compressing FASTQ reads by splitting into homogeneous streams

Thursday, 10 November 2011

Today I took a FASTQ file with 3.5M reads, which was Read 1 from a paired-end Illumina 100 bp run; it was about 883 MB in size. As many have shown before me, GZIP compresses it to about 1/4 of the original size, and BZIP2 to about 1/5. (All sizes below are in KB.)

883252 R1.fastq
233296 R1.fastq.gz
182056 R1.fastq.bz2

I then split the read file into three separate files: (1) the ID line, with the mandatory '@' removed; (2) the sequence line, uppercased for consistency; and (3) the quality line, unchanged. I ignored the third line of each FASTQ entry, as it is redundant. This knocked about 1% off the total size.

189588 id.txt
341756 seq.txt
341756 qual.txt
873100 TOTAL
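
The split itself is simple enough to sketch in a few lines of Python. This is only a minimal illustration, not the script I actually ran: it assumes strict four-line FASTQ records (no wrapped sequence lines), and the function name split_fastq and the demo reads are made up for the example.

```python
import io


def split_fastq(fastq, ids, seqs, quals):
    """Split 4-line FASTQ records into three text streams.

    Assumes every record is exactly four lines (no line wrapping).
    Returns the number of records written.
    """
    count = 0
    while True:
        header = fastq.readline()
        if not header:
            break  # end of input
        seq = fastq.readline()
        fastq.readline()  # third line ('+...') is redundant, so drop it
        qual = fastq.readline()
        if not header.startswith("@"):
            raise ValueError("record %d does not start with '@'" % (count + 1))
        ids.write(header[1:].rstrip("\n") + "\n")    # ID without the '@'
        seqs.write(seq.rstrip("\n").upper() + "\n")  # sequence, uppercased
        quals.write(qual.rstrip("\n") + "\n")        # quality, unchanged
        count += 1
    return count


if __name__ == "__main__":
    # Tiny made-up input, just to show the three output streams.
    demo = io.StringIO("@read1/1\nacgtn\n+\nIIII#\n@read2/1\nGGCCA\n+read2/1\nHHHH!\n")
    id_out, seq_out, qual_out = io.StringIO(), io.StringIO(), io.StringIO()
    n = split_fastq(demo, id_out, seq_out, qual_out)
    print(n, "records")
    print(id_out.getvalue(), seq_out.getvalue(), qual_out.getvalue(), sep="")
```

For real files you would pass open file handles (for example open("R1.fastq") and three files opened for writing) instead of the in-memory StringIO objects.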

Next, I compressed each of the three streams (ID, sequence, quality) independently with GZIP. The idea is that a dictionary-based compression scheme should work better on homogeneous data streams than on the same data interleaved in one stream. As you can see, this does improve things, by about 11%, but the result is still not as good as BZIP2 without de-interleaving.

20608 id.txt.gz
84096 qual.txt.gz
102040 seq.txt.gz
206644 TOTAL (was 233296 combined)
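
If you want to try the same comparison without touching the command line, here is a rough sketch using Python's standard gzip and bz2 modules. It is an assumption-laden stand-in for what I did: the function name compare_codecs is invented, the demo reads are random synthetic data (so the numbers will not resemble a real Illumina run), and gzip.compress defaults to level 9 whereas the gzip command defaults to level 6.

```python
import bz2
import gzip
import random


def compare_codecs(fastq, streams):
    """Return {codec: (interleaved_size, summed_split_size)} in bytes.

    fastq   -- the whole FASTQ file as bytes
    streams -- dict mapping a stream name (id, seq, qual) to its bytes
    """
    codecs = {"gzip": gzip.compress, "bzip2": bz2.compress}
    sizes = {}
    for name, compress in codecs.items():
        whole = len(compress(fastq))
        split = sum(len(compress(data)) for data in streams.values())
        sizes[name] = (whole, split)
    return sizes


if __name__ == "__main__":
    # Synthetic reads: made-up IDs, random bases, random qualities.
    rng = random.Random(0)
    ids, seqs, quals, records = [], [], [], []
    for n in range(2000):
        read_id = "demo:1:%d:%d/1" % (n // 100, n)
        seq = "".join(rng.choice("ACGT") for _ in range(100))
        qual = "".join(rng.choice("FGHI#") for _ in range(100))
        ids.append(read_id)
        seqs.append(seq)
        quals.append(qual)
        records.append("@%s\n%s\n+\n%s\n" % (read_id, seq, qual))

    fastq = "".join(records).encode("ascii")
    streams = {
        "id": ("\n".join(ids) + "\n").encode("ascii"),
        "seq": ("\n".join(seqs) + "\n").encode("ascii"),
        "qual": ("\n".join(quals) + "\n").encode("ascii"),
    }
    for codec, (whole, split) in compare_codecs(fastq, streams).items():
        print("%-6s interleaved=%d split=%d" % (codec, whole, split))
```

Everything is held in memory here, which is fine for a toy example but not for an 883 MB file; for that, compress the files on disk as above.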

If we use BZIP2 to compress the de-interleaved streams, it does only about 3% better than it did on the single interleaved stream. This is testament to BZIP2's ability to cope with heterogeneous data streams better than GZIP.

16560 id.txt.bz2
66812 qual.txt.bz2
93564 seq.txt.bz2
176936 TOTAL (was 182056 combined)
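
For anyone who wants to check the arithmetic, the relative savings can be recomputed from the TOTAL lines in the two listings. This is just a small worked calculation; percent_saved is a helper name I made up for it.

```python
def percent_saved(before_kb, after_kb):
    """Percentage reduction going from before_kb to after_kb."""
    return 100.0 * (before_kb - after_kb) / before_kb


# Sizes in KB, copied from the TOTAL lines of the listings.
gzip_saving = percent_saved(233296, 206644)   # single stream vs three streams
bzip2_saving = percent_saved(182056, 176936)  # single stream vs three streams

print("gzip:  %.1f%% smaller when split" % gzip_saving)
print("bzip2: %.1f%% smaller when split" % bzip2_saving)
```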

So in summary, we've re-learnt that BZIP2 is better than GZIP, and that they are both doing quite well adapting to the three interleaved data types in a FASTQ file.

1 comment:

Torsten Seemann, 16 November 2011 at 11:37
The total for the de-interleaved bzip2 method is 4.04 bits per nucleotide in the original file. The DNA is 2.14 bits of that, the qualities are 1.53, and the IDs are 0.37. I think sorting the reads lexicographically would probably help both gzip (LZ77) and bzip2 (BWT+).