The Genome Factory: Counting sequences with Unix tools

Every command line bioinformatician has a suite of specific utility tools in their $PATH for doing core operations on common data files, such as counting the number of sequences in a .fasta file. Sometimes however, you end up on someone else's server where those tools are no longer available. It's in this situation when a good working knowledge of the standard Unix tools can be valuable. Here's a short list of the most useful ones for chaining together:

cat - show
grep - filter
sed - modify
wc - count
cut - extract columns
sort - sort
uniq - identify duplicates
head - extract start
tail - extract end
expr - arithmetic

For example, the classic example is to count the number of sequences in a .fasta file:

% grep '>' in.fasta | wc -l

Because each sequence has a ">" character for its ID line, the grep selects out all the ID lines, which the number of is equal to the number of sequences. The wc command counts its input, and -l means to count lines, rather than characters or words.

Most of us would be using modern GNU/Unix systems, where the grep command has a -c option which counts matching lines rather than printing them out one by one, removing the need for wc:

% grep -c '>' in.fasta

Please ensure you use quote characters around the ">", otherwise the shell will think you want to send the output of grep to in.fasta, which will result in in.fasta being truncated to a zero length file. Not ideal.

More commonly today is the .fastq file, used for storing millions of short reads from high-throughput sequencing machines. An example of one entry is shown below:

@HWUSI-EAS-100R_0002:7:1:2596:12829#ACAGTG/1

TCAAAAATCAGCCGTCACCGAGTATTACCTGAATCACGGCAAATGGCCGGAAAACAC

+HWUSI-EAS-100R_0002:7:1:2596:12829#ACAGTG/1

ffffffffffdfffedaddbaa\ba\Yb`]_a`a```_^`_]YT\Q`]]TT]^BBBB

At first glance, the simplest way to count entries in a .fastq file would be to just extend what we did with .fasta files, but use the "@" character for the ID line matching.

% grep -c '@' in.fastq

Unfortunately, this doesn't always work, as "@" is also a valid symbol in the encoded quality string! No worries, let's use the "+" character on the second ID line. Arrggh, it's also a valid quality symbol! WTF? At this stage you start muttering expletives about moron file format designers, but then calm down when you realise it could have been much worse ie. XML.

Technically, the sequence and quality parts of a .fastq entry can span multiple lines, just like you see 60 column wrapped .fasta files. However, most vendors stick to 4 lines per entry, putting the sequence and quality strings on a single line each. This means we can count entries by dividing the number of lines in the file by four:

% LINES=`cat in.fasta | wc -l`

% READS=`expr $LINES / 4`

% echo $READS

The above is traditional Bourne shell, but most people are using modern-ish shells like BASH, where this can be written more concisely as:

% expr $(cat in.fasta | wc -l) / 4

That's it for now. More posts to follow.

4 comments:

Sajan Raju25 September 2015 at 18:56
Thumbs up for the blog.. It really helped.
btw is it in.fastQ instead of in.fasta in the line "% grep -c '@' in.fasta" ?
Unknown23 June 2018 at 07:04
fine, if you've more unix command for manipuling NGS data, please you post it.
Anonymous12 July 2020 at 00:29
Hi, I find reading this article a joy. It is extremely helpful and interesting and very much looking forward to reading more of your work.. view

Friday, 30 September 2011

Counting sequences with Unix tools

4 comments: