Monday, 9 September 2019

25 reasons assemblies don't make it into RefSeq

Introduction

When you submit a genome assembly, or NCBI assembles the reads you submitted, it ends up in GenBank. If the assembly is of sufficient quality, it is annotated with PGAP and added to RefSeq. This means the same assembly can exist twice: in GenBank with your original annotation (accession prefix GCA_), and in RefSeq with the PGAP annotation and a different accession number (prefix GCF_).

There are many reasons why a GenBank assembly can be denied admission to RefSeq. Here I outline how to find out whether the genome you are working with is potentially bad, and why.

Commands

wget https://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt

pip3 install csvkit

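# csvcut: read tab-delimited input (-t), skip the first comment line (-K 1),
#   and keep only the excluded_from_refseq column (-c).
# tail: drop the header row. tr: put each ";"-separated reason on its own line.
# sed: trim one leading/trailing space. grep: drop empty values, written as "".
# sort | uniq -c | sort -nr: count each reason and rank by frequency.
csvcut -t -K 1 -c 'excluded_from_refseq' assembly_summary_genbank.txt \
  | tail -n +2 | tr ";" "\n" \
    | sed -e 's/^ //' -e 's/ $//' | grep -v '""' \
      | sort | uniq -c | sort -nr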
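
The pipeline above summarises the whole file. To check a single genome, something like the following sketch may help. The function name exclusion_reasons is my own invention, and it assumes that the accession is in the first column and that the header line starts with "#" and names a column called excluded_from_refseq, as in the file downloaded above.

```shell
# Hypothetical helper: print the excluded_from_refseq value for one assembly.
# Usage: exclusion_reasons ACCESSION [SUMMARY_FILE]
exclusion_reasons() {
  awk -F '\t' -v acc="$1" '
    # Comment/header lines start with "#"; find the column by its name.
    /^#/ {
      for (i = 1; i <= NF; i++) if ($i == "excluded_from_refseq") col = i
      next
    }
    # Assumes the accession is the first column; print its reasons and stop.
    col && $1 == acc { print $col; exit }
  ' "${2:-assembly_summary_genbank.txt}"
}
```

Call it as exclusion_reasons followed by the GCA_ accession you are interested in. An empty result means either that no reason is recorded for that assembly or that the accession is not in the file.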

Results

 198215 derived from surveillance project
  32538 derived from metagenome
  16820 derived from environmental source
  10389 metagenome
   4684 low contig N50
   2187 partial
   1822 many frameshifted proteins
   1568 derived from single cell
    856 genome length too large
    820 genome length too small
    752 high contig L50
    562 low quality sequence
    357 contaminated
    352 missing tRNA genes
    309 abnormal gene to sequence ratio
    274 validation errors
    194 missing ribosomal protein genes
    114 missing rRNA genes
    104 untrustworthy as type
     77 unverified source organism
     38 misassembled
     12 mixed culture
      6 chimeric
      4 low gene count
      2 hybrid

Conclusion

Stick to using genomes from RefSeq wherever possible. The main exception is public health microbiology: there, the "derived from surveillance project" reason probably means the assembly is part of GenomeTrakr, so it might still be of importance.
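
If surveillance genomes matter to you, one possible way to pull them out is to list the assemblies whose only exclusion reason is the surveillance one. This is a rough sketch, not a vetted workflow: the function name only_excluded_for is hypothetical, and it makes the same assumptions about the file layout as the commands above (accession in the first column, a "#" header line naming excluded_from_refseq).

```shell
# Hypothetical helper: print accessions whose ONLY exclusion reason is REASON.
# Usage: only_excluded_for REASON [SUMMARY_FILE]
only_excluded_for() {
  awk -F '\t' -v reason="$1" '
    # Comment/header lines start with "#"; find the column by its name.
    /^#/ {
      for (i = 1; i <= NF; i++) if ($i == "excluded_from_refseq") col = i
      next
    }
    col {
      v = $col
      gsub(/^ +| +$/, "", v)        # trim surrounding spaces
      if (v == reason) print $1     # exact match, so multi-reason rows are skipped
    }
  ' "${2:-assembly_summary_genbank.txt}"
}
```

For example, only_excluded_for "derived from surveillance project" prints one accession per line. Assemblies that carry a second reason as well (say, "low contig N50") are deliberately left out, since those may have genuine quality problems.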