- cat - show
- grep - filter
- sed - modify
- wc - count
- cut - extract columns
- sort - sort
- uniq - identify duplicates
- head - extract start
- tail - extract end
- expr - arithmetic
For example, the classic example is to count the number of sequences in a .fasta file:
% grep '>' in.fasta | wc -l
Because each sequence has a ">" character for its ID line, the grep selects out all the ID lines, which the number of is equal to the number of sequences. The wc command counts its input, and -l means to count lines, rather than characters or words.
Most of us would be using modern GNU/Unix systems, where the grep command has a -c option which counts matching lines rather than printing them out one by one, removing the need for wc:
% grep -c '>' in.fasta
Please ensure you use quote characters around the ">", otherwise the shell will think you want to send the output of grep to in.fasta, which will result in in.fasta being truncated to a zero length file. Not ideal.
More commonly today is the .fastq file, used for storing millions of short reads from high-throughput sequencing machines. An example of one entry is shown below:
@HWUSI-EAS-100R_0002:7:1:2596:12829#ACAGTG/1
TCAAAAATCAGCCGTCACCGAGTATTACCTGAATCACGGCAAATGGCCGGAAAACAC
+HWUSI-EAS-100R_0002:7:1:2596:12829#ACAGTG/1
ffffffffffdfffedaddbaa\ba\Yb`]_a`a```_^`_]YT\Q`]]TT]^BBBB
At first glance, the simplest way to count entries in a .fastq file would be to just extend what we did with .fasta files, but use the "@" character for the ID line matching.
% grep -c '@' in.fastq
Unfortunately, this doesn't always work, as "@" is also a valid symbol in the encoded quality string! No worries, let's use the "+" character on the second ID line. Arrggh, it's also a valid quality symbol! WTF? At this stage you start muttering expletives about moron file format designers, but then calm down when you realise it could have been much worse ie. XML.
Technically, the sequence and quality parts of a .fastq entry can span multiple lines, just like you see 60 column wrapped .fasta files. However, most vendors stick to 4 lines per entry, putting the sequence and quality strings on a single line each. This means we can count entries by dividing the number of lines in the file by four:
% LINES=`cat in.fasta | wc -l`
% READS=`expr $LINES / 4`
% echo $READS
The above is traditional Bourne shell, but most people are using modern-ish shells like BASH, where this can be written more concisely as:
% expr $(cat in.fasta | wc -l) / 4
That's it for now. More posts to follow.