Sunday, 20 May 2012

Cool use of Unix paste with NGS sequences

While browsing SeqAnswers.com today I came across a post where Uwe Appelt provided a couple of lines of Unix shell wizardry to solve some problem. What attracted my attention was the following:
paste - - - - < in.fq | filter | tr "\t" "\n" > out.fq
Now, I've written a reasonable number of shell one-liners in my life, but I'd never seen this before. I've used the paste command a couple of times, but clearly its potential power did not sink in! Here is the man page description for paste:
Write lines consisting of the sequentially corresponding lines from each FILE, separated by TABs, to standard output. With no FILE, or when FILE is -, read standard input.
So what's happening here? Well, by convention many Unix commands treat "-" as meaning "use STDIN instead of a named file". Here Uwe is providing paste with four filenames, each of which is the same stdin filehandle, so paste takes one line from stdin for each column in turn. Lines 1..4 of in.fq are put onto one line (with tab separators), lines 5..8 on the next line, and so on. Now our stream has the four lines of each FASTQ entry on a single line, which makes it much more amenable to Unix line-based manipulation, represented by filter in my example. Once that's all done, we need to put it back into the standard 4-line FASTQ format, which is as simple as converting the tabs "\t" back to newlines "\n" with the tr command.
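
To make this concrete, here is a minimal sketch. The file demo.fq and its two reads are made up for illustration, and awk with a cutoff of 6 bases is just one arbitrary stand-in for filter:

```shell
# Build a tiny two-record FASTQ file (hypothetical data, for illustration only)
printf '%s\n' '@read1' 'ACGTACGT' '+' 'IIIIIIII' \
              '@read2' 'ACGT'     '+' 'IIII' > demo.fq

# Show the intermediate form: one record per line, four TAB-separated columns
paste - - - - < demo.fq

# Stand-in for "filter": keep records whose sequence (column 2) has at least 6 bases,
# then turn the TABs back into newlines to restore the 4-line FASTQ layout
paste - - - - < demo.fq | awk -F '\t' 'length($2) >= 6' | tr "\t" "\n" > demo.filtered.fq
cat demo.filtered.fq
```

Only read1 passes the length test, and demo.filtered.fq is once again an ordinary 4-line FASTQ record.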

Example 1: FASTQ to FASTA



A common thing to do is convert FASTQ to FASTA, and we don't always have our favourite tool or script to do this when we aren't on our own servers:

paste - - - - < in.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > out.fa

  1. paste converts the input FASTQ into a 4-column file
  2. cut command extracts out just column 1 (the ID) and column 2 (the sequence)
  3. sed replaces the FASTQ ID prefix "@" with the FASTA ID prefix ">"
  4. tr converts the 2 columns back into 2 lines

Because the shell command above is a pipeline of four commands (paste, cut, sed, tr), the operating system runs them concurrently as separate processes, so on a multi-core machine it can finish faster than running each step in turn, assuming your disk I/O can keep up.
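
Here is a small worked run of the conversion, as a sketch on made-up data (demo2.fq and its reads are hypothetical). The second read has a quality string that starts with "@", which is legal in FASTQ. Because each record sits on a single line when sed sees it, s/^@/>/ only ever touches the ID:

```shell
# Hypothetical two-record input; note the quality string '@IIII' in the second record
printf '%s\n' '@read1 sample' 'ACGTACGT' '+' 'IIIIIIII' \
              '@read2 sample' 'TTGCA'    '+' '@IIII' > demo2.fq

# Same pipeline as above: join 4 lines, keep ID and sequence, fix the prefix, split again
paste - - - - < demo2.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > demo2.fa
cat demo2.fa

# Sanity check: there should be one FASTA header for every four FASTQ lines
fq_records=$(( $(wc -l < demo2.fq) / 4 ))
fa_records=$(grep -c '^>' demo2.fa)
echo "FASTQ records: $fq_records, FASTA records: $fa_records"
```

Both counts should agree; if they don't, the input probably isn't strict 4-line FASTQ (for example, it has wrapped sequence lines).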

Example 2: Removing redundant FASTQ ID in line 3

The third line in the FASTQ format is somewhat redundant - it is usually a duplicate of the first line, except with "+" instead of "@" to denote that a quality string is coming next rather than an ID. Most parsers ignore it, and happily accept a blank ID after the "+", which saves a fair chunk of disk space. If you have legacy files with the redundant IDs and want to convert them, here's how we can do it with our new paste trick:

paste -d ' ' - - - - < in.fq | sed 's/ +[^ ]*/ +/' | tr " " "\n" > out.fq

  1. paste converts the input FASTQ into a 4-column file, but using SPACE instead of TAB as the separator character (this assumes the IDs themselves contain no spaces)
  2. sed finds and replaces the "+DUPE_ID" line with just a "+"
  3. tr converts the 4 columns back into 4 lines
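
Many FASTQ IDs do contain a space (newer Illumina headers, for example), and then the final tr " " "\n" would split the ID across two lines. Here is a minimal sketch of a TAB-based variant that swaps sed for awk; the file demo3.fq and its reads are made up for illustration:

```shell
# Hypothetical input: IDs contain a space, and the second quality string starts with '+'
printf '%s\n' '@read1 1:N:0' 'ACGT' '+read1 1:N:0' 'IIII' \
              '@read2 1:N:0' 'TTGA' '+read2 1:N:0' '+III' > demo3.fq

# Join on TABs, overwrite column 3 with a bare "+", then split back into 4 lines
paste - - - - < demo3.fq \
  | awk 'BEGIN { FS = OFS = "\t" } { $3 = "+"; print }' \
  | tr "\t" "\n" > demo3.slim.fq
cat demo3.slim.fq
```

Addressing column 3 directly means neither the spaces in the ID nor a quality string that begins with "+" can be mistaken for the duplicate ID.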

That's it for today, hope you learnt something, because I certainly did.

9 comments:

  1. In Example 1, why not just convert the input FASTQ into a 2-column file in the first place?

    1. Michael, The example was meant to be illustrative. I assume people are very familiar with FASTQ and FASTA, but not familiar with 'paste'. Can you provide a command line which does it the way you suggest?

    2. maybe

      sed -n '1~4s/^@/>/p;2~4p' input.fastq > output.fasta

      is an alternative

  2. Daniel, thanks for the comment and link to your blog - some cool content there too. You're in my Google Reader list now.

  3. I am enjoying reading this blog of yours... I am a pure biologist by nature, or by degree should I say? Ha ha ha...
    I am looking forward to applying various programming languages to studying gene expression, along with the usual wet lab skills. Now that I have learnt C, C++, Linux (LPI 101) and Perl, I am switching to NGS workflows. I too have started a blog, as it helps me correct mistakes and learn.
    Keep up the good job

  4. Alok

    Thanks for the feedback! A good biologist who can do computing/programming as well is an awesome combination when doing NGS studies.

    I recommend everyone write a blog. It forces you to understand things properly. Often we "think" we understand something, but really we don't. It is also good for practising scientific writing and clear communication. The only downside is that it does take time, but I try to think of it as an investment.

    All the best for your future projects!
