Sunday, May 20, 2012

Cool use of Unix paste with NGS sequences

While browsing SeqAnswers.com  today I came across a post where Uwe Appelt provided a couple of lines of Unix shell wizadry to solve some problem. What attracted my attention was the following:
paste - - - - < in.fq | filter | tr "\t" "\n" > out.fq
Now, I've done a reasonable amount of shell one-liners in my life, but I'd never seen this before.  I've used the paste command a couple of times, but clearly its potential power did not sink in! Here is the man page description for paste:
Write lines consisting of the sequentially corresponding lines from each FILE, separated by TABs, to standard output. With no FILE, or when FILE is -, read standard input.
So what's happening here? Well, in Unix, the "-" character means to use STDIN instead of a filename. Here Uwe is providing paste with four filenames, each of which is the same stdin filehandle. So lines 1..4 of input.fq are put onto one line (with tab separator), and lines 5..8 on the next line and so on. Now, our stream has the four lines of FASTQ entry on a single line, which makes it much more amenable to Unix line-based manipulation, represented by filter in my example. Once that's all done, we need to put it back into the standard 4-line FASTQ format, which is as simple as converting the tabs "\t" back to newlines "\n" with the tr command.

Example 1: FASTQ to FASTQ

A common thing to do is convert FASTQ to FASTA, and we don't always have our favourite tool or script to to this when we aren't on our own servers:

paste - - - - < in.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > out.fa

  1. paste converts the input FASTQ into a 4-column file
  2. cut command extracts out just column 1 (the ID) and column 2 (the sequence)
  3. sed replaces the FASTQ ID prefix "@" with the FASTA ID prefix ">"
  4. tr conversts the 2 columns back into 2 lines

And because the shell command above uses a pipe connecting four commands (paste, cut, sed, tr) the operating system will run them all in parallel, which will make it run faster assuming your disk I/O can keep up. 

Example 2: Removing redundant FASTQ ID in line 3

The third line in the FASTQ format is somewhat redundant - it is usually a duplicate of the first line, except with "+" instead of "@" to denote that a quality string is coming next rather than an ID. Most parsers ignore it, and happily accept a blank ID after the "+", which saves a fair chunk of disk space. If you have legacy files with the redundant IDs and want to conver them, here's how we can do it with our new paste trick:

paste -d ' ' - - - - | sed 's/ +[^ ]*/ +/' | tr " " "\n"
  1. paste converts the input FASTQ into a 4-column file, but using SPACE instead of TAB as the separator character
  2. sed finds and replaces the "+DUPE_ID" line with just a "+"
  3. tr conversts the 4 columns back into 4 lines

That's it for today, hope you learnt something, because I certainly did.

Saturday, April 28, 2012

Prokka - rapid prokaryotic annotation

Prokka is a software tool I have written to annotate bacterial, archaeal and viral genomes. It is based on years of experience annotating bacterial genomes, both automatically and via manual curation. 

It's main design considerations were to be:
  • fast
    • supports multi-threading
    • hierarchical search databases
  • simple to use
    • no compulsory parameters
    • bundled databases
  • clean
    • standards-compliant output files
    • pipeline-friendly interface
  • thorough
    • finds tRNA, rRNA, CDS, sig_peptide, tandem repeats, ncRNA
    • includes /gene and /EC_number where possible, not just /product
    • traceable annotation sources via /inference tags
  • useful
    • produce files ready for submission to Genbank
    • complete log file


The first release is a monolithic, but followable Perl script. It only uses core Perl modules, but has quite a few external tool dependencies, some of which I can't bundle due to licence restrictions. Eventually I hope to have a public web-server version, and a version of it in the Galaxy Toolshed. 

It currently takes about 10 minutes on a quad Intel i7 for a typical 4 Mbp genome.

You can download it from here and read the manual here.

Tuesday, March 20, 2012

LEGO Ion PGM sequencer

A while back I came across a cool LEGO model of a Hi-Seq 2000 sequencing instrument created by Dawei Lin at the UC Davis Genome Centre.

My good colleague Tim Stinear was one of the first Australian labs to purchase an Ion Torrent PGM sequencing instrument, which looks like this:


Now, my oldest child Oskar just turned 6 years old, and he was ready for LEGO. Being a former LEGO nerd, my nostolgia went into overdrive and I "invested" in some generic LEGO sets for his and Zoe's  Xmas presents, subsequently supplemented by some Ebay bulk lots. So now I had an excuse to lie on the rug all day playing with LEGO.

And here is my first attempt at a LEGO model of a PGM:



Not perfect or quite to proportions, but not a bad effort I reckon!


Thursday, November 10, 2011

Compressing FASTQ reads by splitting into homogeneous streams


Today I took FASTQ file with 3.5M reads, which was Read1 from a paired-end Illumina 100bp run - it was about 883Mb in size. As many have shown before me, GZIP compresses to about 1/4 the size, and BZIP2 about 1/5.
  • 883252 R1.fastq
  • 233296 R1.fastq.gz
  • 182056 R1.fastq.bz2
I then split the read file into 3 separate files: (1) The ID line, but with the mandatory '@' removed, (2) the sequence line, but uppercased for consistency, and (3) the quality line unchanged. It ignored the 3rd line of each FASTQ entry, as it is redundant. This knocked 1% off the total size.
  • 189588 id.txt
  • 341756 seq.txt
  • 341756 qual.txt
  • 873100 TOTAL
Now, I compressed each of the three streams (ID, Sequence, Quality) independently with GZIP. The idea is that these dictionary-based compression schemes will work better on more homogeneous data streams, than when they are interleaved in one stream. As you can see this does improve things by about 15%, but still not as good as BZIP2 without de-interleaving.
  •  20608 id.txt.gz
  •  84096 qual.txt.gz
  • 102040 seq.txt.gz
  • 206644 TOTAL (was 233296 combined)
If we use BZIP2 to compress the interleaved stream, it does only 5% better than when it was a single stream. This is testament to BZIP2's ability to cope with heterogeneous data streams better than GZIP.
  •  16560 id.txt.bz2
  •  66812 qual.txt.bz2
  •  93564 seq.txt.bz2
  • 176936 TOTAL (was 182056 combined)
So in summary, we've re-learnt that BZIP2 is better than GZIP, and that they are both doing quite well adapting to the three interleaved data types in a FASTQ file.


Friday, September 30, 2011

Counting sequences with Unix tools

Every command line bioinformatician has a suite of specific utility tools in their $PATH for doing core operations on common data files, such as counting the number of sequences in a .fasta file. Sometimes however, you end up on someone else's server where those tools are no longer available. It's in this situation when a good working knowledge of the standard Unix tools can be valuable. Here's a short list of the most useful ones for chaining together:

  • cat - show
  • grep - filter
  • sed - modify
  • wc - count
  • cut - extract columns
  • sort - sort
  • uniq - identify duplicates
  • head - extract start
  • tail - extract end
  • expr - arithmetic
For example, the classic example is to count the number of sequences in a .fasta file:

% grep '>' in.fasta | wc -l

Because each sequence has a ">" character for its ID line, the grep selects out all the ID lines, which the number of is equal to the number of sequences. The wc command counts its input, and -l means to count lines, rather than characters or words.

Most of us would be using modern GNU/Unix systems, where the grep command has a -c option which counts matching lines rather than printing them out one by one, removing the need for wc:

% grep -c '>' in.fasta

Please ensure you use quote characters around the ">", otherwise the shell will think you want to send the output of grep to in.fasta, which will result in in.fasta being truncated to a zero length file. Not ideal.

More commonly today is the .fastq file, used for storing millions of short reads from high-throughput sequencing machines. An example of one entry is shown below:

@HWUSI-EAS-100R_0002:7:1:2596:12829#ACAGTG/1
TCAAAAATCAGCCGTCACCGAGTATTACCTGAATCACGGCAAATGGCCGGAAAACAC
+HWUSI-EAS-100R_0002:7:1:2596:12829#ACAGTG/1
ffffffffffdfffedaddbaa\ba\Yb`]_a`a```_^`_]YT\Q`]]TT]^BBBB

At first glance, the simplest way to count entries in a .fastq file would be to just extend what we did with .fasta files, but use the "@" character for the ID line matching.

% grep -c '@' in.fasta

Unfortunately, this doesn't always work, as "@" is also a valid symbol in the encoded quality string! No worries, let's use the "+" character on the second ID line. Arrggh, it's also a valid quality symbol! WTF? At this stage you start muttering expletives about moron file format designers, but then calm down when you realise it could have been much worse ie. XML.

Technically, the sequence and quality parts of a .fastq entry can span multiple lines, just like you see 60 column wrapped .fasta files. However, most vendors stick to 4 lines per entry, putting the sequence and quality strings on a single line each. This means we can count entries by dividing the number of lines in the file by four:

% LINES=`cat in.fasta | wc -l`
% READS=`expr $LINES / 4`
% echo $READS

The above is traditional Bourne shell, but most people are using modern-ish shells like BASH, where this can be written more concisely as:

% expr $(cat in.fasta | wc -l) / 4

That's it for now. More posts to follow.


Wednesday, September 21, 2011

The easy way to close a genome

I don't know what all the fuss is about. If molecular biologists simply learnt some basic Unix command line tools, they could rid themselves of messy PCRs, oligo design, optical maps, primer walking etc.

% echo ">ClosedGenome" > closed.fasta
% grep -v '>' 454AllContigs.fna | sed 's/[^AGTC]//gi' >> closed.fasta
% mail -s "Genbank genome submission" genomes@ncbi.nlm.nih.gov < closed.fasta

Too easy! :-P