The Genome Factory: Cool use of Unix paste with NGS sequences

Sunday, 20 May 2012

While browsing SeqAnswers.com today I came across a post where Uwe Appelt provided a couple of lines of Unix shell wizardry to solve some problem. What attracted my attention was the following:

paste - - - - < in.fq | filter | tr "\t" "\n" > out.fq

Now, I've written a reasonable number of shell one-liners in my life, but I'd never seen this before. I've used the paste command a couple of times, but clearly its potential power did not sink in! Here is the man page description for paste:

Write lines consisting of the sequentially corresponding lines from each FILE, separated by TABs, to standard output. With no FILE, or when FILE is -, read standard input.

So what's happening here? Well, many Unix commands, paste included, treat the "-" character as meaning "use STDIN instead of a filename". Here Uwe is providing paste with four filenames, each of which is the same stdin filehandle, so paste takes the next line of input for each "-" in turn. Lines 1..4 of in.fq are put onto one line (with tab separators), lines 5..8 on the next line, and so on. This relies on every FASTQ entry occupying exactly four lines. Now our stream has the four lines of each FASTQ entry on a single line, which makes it much more amenable to Unix line-based manipulation, represented by filter in my example. Once that's all done, we need to put it back into the standard 4-line FASTQ format, which is as simple as converting the tabs "\t" back to newlines "\n" with the tr command.
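To make the grouping concrete, here is a minimal sketch on a made-up two-record file. The file names (demo.fq, demo.tsv, kept.fq), the reads, and the awk condition standing in for filter are all invented for illustration; they are not from Uwe's post.

```shell
# Build a tiny two-record FASTQ file (hypothetical reads, for illustration only)
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' '@read2' 'TTGA' '+' 'HHHH' > demo.fq

# Each "-" takes the next line of stdin in turn, so every four input
# lines come out as one tab-separated line: ID, sequence, "+", quality
paste - - - - < demo.fq > demo.tsv
cat demo.tsv

# Converting the tabs back to newlines restores the original file exactly
tr "\t" "\n" < demo.tsv | cmp - demo.fq && echo "round trip OK"

# A stand-in for "filter": keep only records whose sequence (column 2)
# starts with A, then go back to 4-line FASTQ
paste - - - - < demo.fq | awk -F '\t' '$2 ~ /^A/' | tr "\t" "\n" > kept.fq
```

With this input, demo.tsv has one line per record, and kept.fq contains only the four lines of read1.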

Example 1: FASTQ to FASTA

A common thing to do is convert FASTQ to FASTA, and we don't always have our favourite tool or script to do this when we aren't on our own servers:

paste - - - - < in.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > out.fa

paste converts the input FASTQ into a 4-column file
cut extracts just column 1 (the ID) and column 2 (the sequence)
sed replaces the FASTQ ID prefix "@" with the FASTA ID prefix ">"
tr converts the 2 columns back into 2 lines

And because the shell command above is a pipeline connecting four commands (paste, cut, sed, tr), the operating system runs them all concurrently, each one processing data as soon as the previous one produces it. On a multi-core machine this can make it run faster, assuming your disk I/O can keep up.
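As a quick check of Example 1, the sketch below runs the same pipeline on a made-up two-record input. The file names and reads are invented for illustration, and the first header deliberately contains a space to show that the tab separator leaves it alone.

```shell
# Hypothetical two-record input; the first header contains a space on purpose
printf '%s\n' '@read1 lane=1' 'ACGT' '+' 'IIII' '@read2' 'TTGA' '+' 'HHHH' > demo.fq

# Same pipeline as Example 1: join 4 lines, keep ID and sequence, swap @ for >
paste - - - - < demo.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > demo.fa

cat demo.fa
# >read1 lane=1
# ACGT
# >read2
# TTGA
```

Because cut drops the quality column before sed runs, a quality string that happens to start with "@" is never touched by the substitution.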

Example 2: Removing redundant FASTQ ID in line 3

The third line in the FASTQ format is somewhat redundant - it is usually a duplicate of the first line, except with "+" instead of "@" to denote that a quality string is coming next rather than an ID. Most parsers ignore it, and happily accept a blank ID after the "+", which saves a fair chunk of disk space. If you have legacy files with the redundant IDs and want to convert them, here's how we can do it with our new paste trick:

paste -d ' ' - - - - | sed 's/ +[^ ]*/ +/' | tr " " "\n"

paste converts the input FASTQ into a 4-column file, but using SPACE instead of TAB as the separator character (so this only works if the ID lines themselves contain no spaces)
sed finds and replaces the "+DUPE_ID" line with just a "+"
tr converts the 4 columns back into 4 lines
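Many FASTQ headers do contain spaces, in which case the SPACE separator would split the ID across lines. A possible variant, sketched here as an alternative rather than as the original method, keeps the default TAB separator and uses awk to overwrite column 3. The file names and the record are invented for illustration.

```shell
# Hypothetical record whose header contains a space and is duplicated on line 3
printf '%s\n' '@read1 lane=1' 'ACGT' '+read1 lane=1' 'IIII' > dupe.fq

# Join with the default TAB separator, replace column 3 with a bare "+",
# then split the columns back into lines
paste - - - - < dupe.fq \
  | awk 'BEGIN { FS = OFS = "\t" } { $3 = "+"; print }' \
  | tr "\t" "\n" > slim.fq

cat slim.fq
# @read1 lane=1
# ACGT
# +
# IIII
```

Setting both FS and OFS to TAB means awk rebuilds each line with the same separator it split on, so the header and quality string pass through unchanged.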

That's it for today, hope you learnt something, because I certainly did.

9 comments:

Michael Hoffman, 31 July 2012 at 09:15
In Example 1, why not just convert the input FASTQ into a 2-column file in the first place?
Daniel Standage, 15 August 2012 at 20:48
Nice! I love finding little Linux/UNIX gems like this! http://gremlin2.soic.indiana.edu/blog/
Torsten Seemann, 17 August 2012 at 11:22
Daniel, thanks for the comment and link to your blog - some cool content there too. You're in my Google Reader list now.
Unknown, 9 July 2013 at 23:20
I am enjoying reading this blog of yours. I am a pure biologist by nature, or by degree should I say? Ha ha ha.
I am looking forward to applying various programming languages to the study of gene expression, alongside the usual wet lab skills. Now that I have learnt C, C++, Linux (LPI 101) and Perl, I am switching to NGS workflows. I too have started a blog, as it helps me fix mistakes and learn.
Keep up the good job
Torsten Seemann, 10 July 2013 at 08:15
Alok

Thanks for the feedback! A good biologist who can do computing/programming as well is an awesome combination when doing NGS studies.

I recommend everyone write a blog. It forces you to understand things properly. Often we "think" we understand something, but really we don't. It is also good for practising scientific writing and clear communication. The only downside is that it does take time, but I try to think of it as an investment.

All the best for your future projects!
