Saturday, 10 September 2016

Manipulating big TSV files in the Unix terminal

Introduction

Bioinformatics is full of plain text, machine readable file formats. One of my favourites is the "tab separated values" format (TSV). It's like the popular "comma separated values" format (CSV) but uses a tab instead of a comma. The main advantage of TSV over CSV is that you don't need to escape or quote your values when they have a comma in them, and many Unix tools for manipulating columnar data default to using tab as the field separator.

Some TSV files are huge - not just in rows, but also in columns. An important new TSV file for those who work in comparative genomics is the new NCBI assembly_summary.txt file. Here is the first 3 lines from the archaea summary file:

% head -n 3 assembly_summary.txt

# See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq
GCF_000302455.1 PRJNA224116 SAMN02471940 AMPO00000000.1 representative genome 1204725 2162 Methanobacterium formicicum DSM 3637 strain=DSM 3637 latest Contig Major Full 2012/10/02 ASM30245v1 Deparment of Genetics, University of Seville, Spain GCA_000302455.1 identical ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000302455.1_ASM30245v1

It is not very clear what is what. At this point some people would load this into Excel, R or Pandas, and those are perfectly fine. But if you need to write a portable pipeline script or are on an unfamiliar server you will want to stick with common Unix tools.

The first problem is that the first line is not the column headers, but a pointer to some documentation. Let's get rid of the first line:

% tail -n +2 assembly_summary.txt | head -n 3

# assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq
GCF_000302455.1 PRJNA224116 SAMN02471940 AMPO00000000.1 representative genome 1204725 2162 Methanobacterium formicicum DSM 3637 strain=DSM 3637 latest Contig Major Full 2012/10/02 ASM30245v1 Deparment of Genetics, University of Seville, Spain GCA_000302455.1 identical ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000302455.1_ASM30245v1
GCF_000762265.1 PRJNA224116 SAMN03085433 representative genome 2162 2162 Methanobacterium formicicum strain=BRM9 latest Complete Genome Major Full 2014/10/02 ASM76226v1 PGgRc GCA_000762265.1 identical ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000762265.1_ASM76226v1

Viewing

It's still confusing because the rows are wider than the terminal and are wrapping. We can turn wrapping off in the less file viewer woth the -S option, and even use the left and right cursor keys to pan left and right!

% tail -n +2 assembly_summary.txt | less -S

# assembly_accession bioproject biosample wgs_master refseq_category t
GCF_000302455.1 PRJNA224116 SAMN02471940 AMPO00000000.1 representative
GCF_000762265.1 PRJNA224116 SAMN03085433 representative genome 2162 2
...

The horizontal scrolling within the terminal is very useful, but the columns still don't line up. Of course, there is already a Unix tool called column to help! This will examine all the data and convert each tab into a different number of spaces to make everything line up like a spreadsheet.

% tail -n +2 assembly_summary.txt | column -t | less -S

# assembly_accession bioproject biosample
GCF_000302455.1 PRJNA224116 SAMN02471940 AMPO00000000.1
GCF_000762265.1 PRJNA224116 SAMN03085433 representative
...

Manipulating

Okay, so this is very useful for looking at wide TSV files, but what if I want to select particular columns with cut and sort? How do I know which column to cut? Well, we can use some standard Unix tools to do this magic:

% tail -n +2 assembly_summary.txt | head -n 1 | tr "\t" "\n" | nl

     1  # assembly_accession
     2  bioproject
     3  biosample
     4  wgs_master
     5  refseq_category
     6  taxid
     7  species_taxid
     8  organism_name
     9  infraspecific_name
    10  isolate
    11  version_status
    12  assembly_level
    13  release_type
    14  genome_rep
    15  seq_rel_date
    16  asm_name
    17  submitter
    18  gbrs_paired_asm
    19  paired_asm_comp
    20  ftp_path
    21  excluded_from_refseq

This is how it works:

  1. tail -n +2 skips over the first junk line
  2. head -n 1 keeps the first line only (which is the TSV header line)
  3. tr converts the tab characters to newline characters to turn columns into rows
  4. nl is a little known Unix command which adds line numbers
if you have zero-based columns, you can use nl -v 0 instead - type man nl for more details.

You can then cut and sort with ease! For example, this one-liner lists the top 10 Archaeal genome assemblies available in Genbank:

% tail -n +3 assembly_summary.txt | cut -f8 | sort | uniq -c | sort -nr | head -n 10

     56 Methanosarcina mazei
     46 Sulfolobus acidocaldarius
      6 Metallosphaera sedula
      4 Sulfolobus solfataricus
      3 Methanobacterium formicicum
      3 Haloferax mediterranei ATCC 33500
      3 Halalkalicoccus jeotgali B3
      2 Sulfolobus solfataricus 98/2
      2 Natronorubrum tibetense GA33
      2 Natronobacterium gregoryi SP2

Conclusion

If you need to do more advanced things in the shell, here are 3 excellent command line tools specifically designed to manipulate TSV files:

If this post has stopped at least one person resorting to Excel unnecessarily, then it was worth it.

Monday, 11 April 2016

What bacterial genome assemblers are people using?

Introduction

As of April 2016, there are about 70,000 genome assemblies in Genbank (draft and complete), with the majority being bacterial genomes. For genomes that have been submitted in NGS era, the COMMENT section of the Genbank file header has machine readable information about the sequencing technology, depth of coverage, and software used.

For example, the entry for Enterococcus faecium OC2A-1 contains this:

##Genome-Assembly-Data-START##
Finishing Goal           :: High-Quality Draft
Current Finishing Status :: High-Quality Draft
Assembly Method          :: Velvet v. 1.1.06
Genome Coverage          :: 104x
Sequencing Technology    :: Illumina
##Genome-Assembly-Data-END##

Method

I decided to parse this header for all the bacterial .gbff.gz (GenBank File Format, aka .gbk) files available at NCBI FTP to see what genome assembly software is being used for bacterial genomes. Now, like any user provided information, there is a lot of junk in this field, so I wrote some curated regexps to categorise them into cleaner bins. If more than one method was listed, I binned into Hybrid/Mixed. If if it was too minor or probably wrong I binned as Could not parse.

Results

CountAssembler Software
23725Not provided
9883AllPaths
5325Newbler
3783Velvet
3585CLC Genomics Workbench
3347Spades
2610IDBA
2477Celera Assembler
2082ABYSS
1815CLC NGS Cell
1782SOAPdenovo
1370Could not parse
1119HGAP
870MaSuRCA
853MIRA
793A5-MiSeq
308Ray
149Phred/Phrap/Consed
132Geneious
110SeqMan
109HGAP3
98Edena
69Hybrid/Mixed
59DNAstar
55Platanus
53NextGene
20Arachne
19DISCOVAR
9VelvetOptimiser
5Falcon
4Megahit
66618Total

Discussion

I was a little surprised to see ALLPATHS top the list due to its particular requirements for DNA library construction (overlapping PE + long mate pair), but the Broad Institute does do a lot of sequencing. A lot of people are using Velvet and Spades, but equal many using CLC Workbench or the NGS Cell product.

The most disturbing and funniest entries in the Could not parse division are listed below.

in-house software v. 10/18/2012
Unknown program v. before 2013-07-02
Direct Sequencing
DNASTAR SeqMan NGen v. 4.0.0
GS Reference Mapper v. September 2013
Trimmomatic v. 0.32;
Ion Torrent PGM
Artimis v. 10.1 
artimist v. 10.1
De Bruijn graph v. Apr-2011
BCFtools Consensus
BLASTN v. actual
BOWTIE v. Version 2.1.0
BWA v. 0.5.1
BioNumerics v. 6.6
ELAND alignment algorithm
Galaxy v. May 2012
de Bruijn graphs v. Mar-2013
MAQ v. 0.7.1
MATLAB v. R2013a

At the top we have in-house software (with a version number!). The Direct Sequencing could be a single perfect read of full chromosome from a really lucky Oxford Nanopore user. Is there anything Artimist (aka Artemis) cannot do? I need to upgrade my version of Trimmomatic and "actual" BLASTN too.

Conclusion

My main concern is the number of read aligners listed. There are some draft genomes myself and others have encountered where it appears the submitters have just aligned the reads to a close reference and submitted the consensus sequence as the assembly. These "genomes" sometimes cause problems in population studies, and I'd rather the reads be available instead.