Today I took a FASTQ file with 3.5M reads, which was Read 1 from a paired-end Illumina 100 bp run. It was about 883 MB in size (the listings below are in kilobytes). As many have shown before me, GZIP compresses it to about 1/4 of the original size, and BZIP2 to about 1/5.
- 883252 R1.fastq
- 233296 R1.fastq.gz
- 182056 R1.fastq.bz2
I then split the read file into 3 separate files: (1) the ID line, but with the mandatory '@' removed, (2) the sequence line, but uppercased for consistency, and (3) the quality line unchanged. I ignored the 3rd line of each FASTQ entry (the '+' line), as it is redundant. This knocked about 1% off the total size.
- 189588 id.txt
- 341756 seq.txt
- 341756 qual.txt
- 873100 TOTAL
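The split described above can be sketched in Python roughly as follows. This is a minimal illustration, not the script used for the numbers in this post: it assumes strict 4-line FASTQ records (no wrapped sequences), and the names `split_fastq` and `write_streams` are made up for the example.

```python
def split_fastq(lines):
    """Yield (id, sequence, quality) for each 4-line FASTQ record.

    Assumes one record is exactly 4 lines: '@id', sequence, '+...', quality.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, None)
        plus = next(it, None)  # the redundant 3rd line, read and discarded
        qual = next(it, None)
        if qual is None or not header.startswith("@"):
            raise ValueError("truncated or malformed FASTQ record: " + header)
        # Drop the mandatory '@', uppercase the sequence, keep quality as-is.
        yield header[1:], seq.rstrip("\r\n").upper(), qual.rstrip("\r\n")


def write_streams(fastq_path, id_path="id.txt", seq_path="seq.txt", qual_path="qual.txt"):
    """Write the three de-interleaved streams, one value per line."""
    with open(fastq_path) as fq, \
         open(id_path, "w") as ids, \
         open(seq_path, "w") as seqs, \
         open(qual_path, "w") as quals:
        for rid, seq, qual in split_fastq(fq):
            ids.write(rid + "\n")
            seqs.write(seq + "\n")
            quals.write(qual + "\n")
```

Calling `write_streams("R1.fastq")` would produce the three files listed above. The sequence and quality files come out the same size because every base has exactly one quality character.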
Next, I compressed each of the three streams (ID, sequence, quality) independently with GZIP. The idea is that a dictionary-based compression scheme like GZIP should work better on homogeneous data streams than on the same data interleaved in one stream. As you can see, this does improve things by about 11%, but it is still not as good as BZIP2 without de-interleaving.
- 20608 id.txt.gz
- 84096 qual.txt.gz
- 102040 seq.txt.gz
- 206744 TOTAL (was 233296 combined)
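For anyone wanting to repeat the comparison on their own reads, here is a rough sketch using Python's standard `gzip` and `bz2` modules. The function name `compare_layouts` is hypothetical, it holds everything in memory (fine for a sample, not for an 883 MB file), and library defaults may not match the command-line tools exactly, so the sizes will differ somewhat from the listings here.

```python
import bz2
import gzip


def compare_layouts(records, compress=gzip.compress):
    """Compressed sizes (bytes) of one interleaved stream vs. three split streams.

    `records` is a list of (id, sequence, quality) string tuples.
    `compress` is any bytes -> bytes function, e.g. gzip.compress or bz2.compress.
    """
    # Rebuild the interleaved FASTQ text, including the '@' and '+' lines.
    interleaved = "".join(
        f"@{rid}\n{seq}\n+\n{qual}\n" for rid, seq, qual in records
    )
    # One homogeneous stream per field, one value per line.
    streams = {
        "id": "".join(rid + "\n" for rid, _, _ in records),
        "seq": "".join(seq + "\n" for _, seq, _ in records),
        "qual": "".join(qual + "\n" for _, _, qual in records),
    }
    sizes = {
        name: len(compress(text.encode("ascii")))
        for name, text in streams.items()
    }
    sizes["split_total"] = sizes["id"] + sizes["seq"] + sizes["qual"]
    sizes["interleaved"] = len(compress(interleaved.encode("ascii")))
    return sizes
```

Passing `compress=bz2.compress` runs the same comparison with BZIP2. On tiny inputs the per-file overhead of each compressed stream can outweigh any gain from splitting, so try it on at least a few thousand reads.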
If we use BZIP2 to compress the de-interleaved streams, it does only about 3% better than when it was a single stream. This is a testament to BZIP2's ability to cope with heterogeneous data streams better than GZIP.
- 16560 id.txt.bz2
- 66812 qual.txt.bz2
- 93564 seq.txt.bz2
- 176936 TOTAL (was 182056 combined)
So in summary, we've re-learnt that BZIP2 compresses better than GZIP, and that both do quite well at adapting to the three interleaved data types in a FASTQ file.