Tuesday, 24 July 2012

Navigating microbial genomes on the NCBI FTP site

If you are a bioinformatician working in microbial genomics, then you should know this URL:

ftp://ftp.ncbi.nih.gov/genomes/

If you click on the URL, you get a big list of folders, and it does look like a mess. But for those of us in microbial genomics there are a few key folders you should know about, and probably even mirror on your own servers:
  1. ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
  2. ftp://ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT/
  3. ftp://ftp.ncbi.nih.gov/genomes/Plasmids/
  4. ftp://ftp.ncbi.nih.gov/genomes/Viruses/
  5. ftp://ftp.ncbi.nih.gov/genomes/Fungi/
  6. ftp://ftp.ncbi.nih.gov/genomes/Fungi_DRAFT/
Most of my work is in bacterial genomics, so I'll discuss the contents of the first four folders only. I'll leave the last two to an experienced mycogenomicist.

1. Bacteria

This directory contains a folder for each completed bacterial genome. That is, the genome has been finished to a single DNA sequence per replicon (usually just one chromosome) and is fully annotated. There are currently around 1000 completed bacterial genomes, of which I've been involved in about 10.

Let's have a look at one. I chose Dickeya dadantii because it's a lovely sounding alliteration for a plant pathogen: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Dickeya_dadantii_3937_uid52537/

NC_014500.asn 15.9 MB 13/06/2012 12:11:00
NC_014500.faa 1.7 MB 13/06/2012 12:11:00
NC_014500.ffn 4.5 MB 19/11/2011 11:00:00
NC_014500.fna 4.8 MB 29/09/2010 10:00:00
NC_014500.frn 49.1 kB 29/09/2010 10:00:00
NC_014500.gbk 16.7 MB 13/06/2012 12:11:00
NC_014500.gff 1.8 MB 03/04/2012 03:41:00
NC_014500.ptt 407 kB 10/03/2012 13:18:00
NC_014500.rnt 7.1 kB 29/09/2010 10:00:00
NC_014500.rpt 281 B 25/04/2011 10:00:00
NC_014500.val 7.0 MB 13/06/2012 12:11:00

You can see a bunch of files, all with the same prefix (NC_014500) and a range of different suffixes or file extensions (gbk, gff) - some of which should be familiar to you. NC_014500 is the RefSeq accession ID for the single chromosome of Dickeya dadantii. The most important files are:
  • fna : FASTA file of the chromosomal sequence (think "n" = nucleotide)
  • gbk : GenBank file containing meta-data, sequence, and annotations
  • gff : GFF3 file containing annotations only (coordinates relative to the .fna file)
  • faa : FASTA file of the translated coding regions (proteins) annotated in the .gbk/.gff (think "aa" = amino acids)
In terms of usefulness, the .gbk file contains (nearly) all the information that the other files contain - the .faa and .fna files are easily generated from the .gbk using BioPerl etc. If you want to get the .gbk files for all the finished genomes, you can download the tarball NCBI provides: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.gbk.tar.gz
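To give a feel for how much of the other files is recoverable from the .gbk, here is a minimal Python sketch that pulls the nucleotide sequence (the .fna content) and the protein translations (the .faa content) out of a GenBank flat file using only the standard library. It is a sketch under assumptions, not a robust parser: BioPerl or Biopython handle the format properly. The function names and the tiny DEMO_0001 record are made up for illustration.

```python
import re

# Hypothetical miniature GenBank record, just enough to exercise the functions.
DEMO_GBK = """LOCUS       DEMO_0001                 12 bp    DNA     linear   BCT 01-JAN-2012
ACCESSION   DEMO_0001
FEATURES             Location/Qualifiers
     gene            1..12
                     /locus_tag="DEMO_g1"
     CDS             1..12
                     /locus_tag="DEMO_g1"
                     /translation="MKR"
ORIGIN
        1 atgaaacgct aa
//
"""


def gbk_to_fna(gbk_text):
    """Return (accession, sequence) for a single GenBank record."""
    accession = re.search(r"^ACCESSION\s+(\S+)", gbk_text, re.M).group(1)
    # The sequence sits between the ORIGIN line and the closing "//".
    origin = gbk_text.split("\nORIGIN", 1)[1].split("\n//", 1)[0]
    # ORIGIN lines carry position numbers and spaces; keep letters only.
    return accession, re.sub(r"[^A-Za-z]", "", origin).upper()


def gbk_to_faa(gbk_text):
    """Return [(locus_tag, protein), ...] from CDS /translation qualifiers."""
    proteins = []
    # Feature keys start after exactly 5 spaces; qualifiers are indented further.
    for chunk in re.split(r"\n     CDS\s+", gbk_text)[1:]:
        # Cut the chunk at the next feature key or at the ORIGIN block.
        chunk = re.split(r"\n(?: {5}\S|ORIGIN)", chunk, maxsplit=1)[0]
        tag = re.search(r'/locus_tag="([^"]+)"', chunk)
        translation = re.search(r'/translation="([^"]+)"', chunk)
        if translation:  # pseudogenes may have no translation
            protein = re.sub(r"\s+", "", translation.group(1))
            proteins.append((tag.group(1) if tag else "unknown", protein))
    return proteins


def to_fasta(header, seq, width=60):
    """Format one FASTA entry, wrapping the sequence at `width` columns."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


accession, sequence = gbk_to_fna(DEMO_GBK)
print(to_fasta(accession, sequence), end="")
for locus_tag, protein in gbk_to_faa(DEMO_GBK):
    print(to_fasta(locus_tag, protein), end="")
```

The regular expressions rely on the fixed column layout of GenBank flat files, so anything unusual (multiple records in one file, joined locations you need to re-translate yourself) is a reason to reach for a real parser instead.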

2. Bacteria_DRAFT

This directory contains a folder for each draft bacterial genome. That is, the genome has been de novo assembled into contigs/scaffolds (e.g. using Newbler for 454 data) but has not been, and probably never will be, finished. They are usually annotated, either by the submitter or automatically by NCBI, but sometimes there may be only sequences. There are about 2600 draft genomes currently.

Here are the contents of the Thiocapsa marina str. 5811 genome folder - it's a purple sulphur coccus from the Mediterranean coast, if you are interested.

NZ_AFWV00000000.asn        13.5 kB 03/04/2012 03:19:00
NZ_AFWV00000000.contig.asn.tgz 1.7 MB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.faa.tgz 1.0 MB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.ffn.tgz 1.5 MB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.fna.tgz 1.6 MB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.frn.tgz 4.1 kB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.gbk.tgz 4.6 MB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.gbs.tgz 4.2 kB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.gff.tgz 393 kB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.ptt.tgz 119 kB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.rnt.tgz 1.5 kB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.rpt.tgz 2.5 kB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.val.tgz 1.6 MB 21/07/2012 02:13:00
NZ_AFWV00000000.gbk         4.7 kB 03/04/2012 03:19:00
NZ_AFWV00000000.rpt         257 B 03/04/2012 03:19:00
NZ_AFWV00000000.val         6.0 kB 03/04/2012 03:19:00

This folder looks a bit different to the finished genomes. It has a .gbk file, but you will notice it is quite small (4700 bytes), and if you look at it, you can see it has no sequence or annotation, only some meta-data and a reference to "WGS  NZ_AFWV01000001-NZ_AFWV01000062". This means that this genome record consists of 62 other records, one for each contig in the assembly. These are stored in the compressed tar file NZ_AFWV00000000.contig.gbk.tgz as follows:

% tar ztf NZ_AFWV00000000.contig.gbk.tgz
NZ_AFWV01000001.gbk
NZ_AFWV01000002.gbk
NZ_AFWV01000003.gbk
...
NZ_AFWV01000061.gbk
NZ_AFWV01000062.gbk
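
The WGS range in the master record and the tar listing above describe the same 62 contigs. As a small sketch, the range can be expanded into the individual accessions in Python. The helper name expand_wgs_range is made up here, and it assumes both ends of the range share the same prefix and digit width.

```python
import re


def expand_wgs_range(wgs_range):
    """Expand 'PREFIX0001-PREFIX0005' style ranges into a list of accessions."""
    first, last = wgs_range.split("-")
    # Split each end into a non-numeric prefix and a trailing run of digits.
    prefix_a, digits_a = re.fullmatch(r"(.*?)(\d+)", first).groups()
    prefix_b, digits_b = re.fullmatch(r"(.*?)(\d+)", last).groups()
    if prefix_a != prefix_b or len(digits_a) != len(digits_b):
        raise ValueError("range ends do not share a prefix and width")
    width = len(digits_a)
    # Zero-pad back to the original width so leading zeros are preserved.
    return [
        f"{prefix_a}{number:0{width}d}"
        for number in range(int(digits_a), int(digits_b) + 1)
    ]


contigs = expand_wgs_range("NZ_AFWV01000001-NZ_AFWV01000062")
print(len(contigs), contigs[0], contigs[-1])
```

Adding ".gbk" to each accession should give the member names you expect to find inside the tarball, which makes for an easy completeness check after a download.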

So, in summary, instead of getting a nice neat single .gbk or .faa file for each replicon as you do for the completed genomes, you get a tarball of files for each assembly, with each file representing a contig in the draft genome. Any extra chromosomes or plasmids will be mixed in the bag of contigs.
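
If you would rather have one file per draft genome, the per-contig files in a tarball can be concatenated. The sketch below uses Python's standard tarfile module. The function names are made up, and the demo builds a tiny in-memory tarball with hypothetical NZ_DEMO contig names instead of touching the NCBI server.

```python
import io
import tarfile


def concat_members(fileobj, suffix=".fna"):
    """Concatenate all members ending in `suffix` from a gzipped tarball."""
    parts = []
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        # Sort by name so contigs come out in accession order.
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            if member.isfile() and member.name.endswith(suffix):
                parts.append(tar.extractfile(member).read().decode("ascii"))
    return "".join(parts)


def make_demo_tarball():
    """Build a small .tgz in memory with two made-up contig FASTA files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        demo = [("NZ_DEMO01000002.fna", "GGCC"), ("NZ_DEMO01000001.fna", "ATAT")]
        for name, seq in demo:
            data = f">{name[:-4]}\n{seq}\n".encode("ascii")
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


combined = concat_members(make_demo_tarball())
print(combined, end="")
```

For a real download you would pass an open file instead, e.g. open("NZ_AFWV00000000.contig.fna.tgz", "rb"). Remember that the result is still a bag of contigs: nothing in it tells you which contigs belong to a plasmid.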

3. Plasmids

The Plasmids folder is not known to many people; frankly, it seems a bit hidden away. It contains ~3000 completed plasmid sequences. Confusingly, ~1000 of these are duplicated from the Bacteria folder (as the plasmid was sequenced with its parent), while the other ~2000 are novel. Even more annoying is that the folder structure is different:

faa/ 21/07/2012 19:39:00
fna/ 21/07/2012 19:40:00
gbk/ 21/07/2012 19:41:00
...
plasmids.all.faa.tar.gz 43.2 MB 23/07/2012 19:43:00
plasmids.all.fna.tar.gz 75.1 MB 23/07/2012 19:43:00
plasmids.all.gbk.tar.gz 199 MB 23/07/2012 19:43:00
...

Now we have a folder for each file extension, each of which contains ~3000 files. So the files for a particular plasmid are spread out over multiple folders. Fortunately they provide compressed tar files of the whole archive to download directly: plasmids.all.gbk.tar.gz
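
If you mirror the per-extension folders and want to get back to a "one entry per plasmid" view, regrouping the paths is straightforward. This Python sketch assumes each file is named accession.extension inside its extension folder, which is an assumption about the layout rather than something stated above. The NC_DEMO accessions and the function name are invented.

```python
import posixpath
from collections import defaultdict


def group_by_accession(paths):
    """Map accession -> {extension: path} from paths like 'gbk/NC_X.gbk'."""
    grouped = defaultdict(dict)
    for path in paths:
        # Assumed naming: <extension folder>/<accession>.<extension>
        accession, ext = posixpath.splitext(posixpath.basename(path))
        grouped[accession][ext.lstrip(".")] = path
    return dict(grouped)


demo_paths = [
    "faa/NC_DEMO1.faa",
    "fna/NC_DEMO1.fna",
    "gbk/NC_DEMO1.gbk",
    "gbk/NC_DEMO2.gbk",
]
by_plasmid = group_by_accession(demo_paths)
for accession, files in sorted(by_plasmid.items()):
    print(accession, sorted(files))
```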

4. Viruses

Some of you may be wondering why I am including Viruses in this story. Well, some viruses infect bacteria too - they are called bacteriophages. There are ~3000 folders in the Viruses division, but not all of them are bacteriophages. A simple grep for "phage" suggests ~600 are bacterial viruses. The folder structure is the same as for the finished Bacteria genomes.
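
The "simple grep" can be reproduced on a directory listing with a few lines of Python. The folder names below are invented for illustration, and the match is made case-insensitive, which plain grep is not by default. Like grep, it will miss any bacterial virus whose folder name does not contain "phage".

```python
def looks_like_phage(folder_name):
    """Crude name-based filter: True if the folder name mentions 'phage'."""
    return "phage" in folder_name.lower()


# Hypothetical folder names standing in for a real Viruses/ listing.
demo_folders = [
    "Enterobacteria_phage_demo",
    "Demo_virus_A",
    "Bacillus_Phage_demo",
    "Demo_bacteriophage_B",
]
phage_folders = [name for name in demo_folders if looks_like_phage(name)]
print(len(phage_folders), "of", len(demo_folders), "folders match")
```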

It is important to realise that most of these virus sequences are natively dsDNA and will also appear integrated into the chromosomal DNA of many of the entries in Bacteria and Bacteria_DRAFT. 



22 comments:

  1. Regarding point 1: what does it mean if the folder for a completed genome has multiple prefixes?

  2. Sequer, that is a good question - I forgot to mention that.

    For closed genomes, each unique prefix corresponds to one replicon in that organism, i.e. one of several chromosomes, or a plasmid.

  3. Hi,
    Is it possible to download only the genomes I want? Say I just need the complete list of Bacillus folders. I tried to download through
    FileZilla, but was unable to connect.

  4. Hi Torsten,

    Do you know why the ftp://ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT/ server does not have a all.fna.tar.gz file?

    I'm trying to test a new bit of software for designing degenerate primers and the larger diversity of organisms in the draft folder would make a better test than the curated/finished genomes.

    Thanks

  5. I don't know why the DRAFT folder is set out differently to the finished genomes. One reason for the lack of all.*.tar.gz files could be that there are 4x as many draft genomes and the files are just too big for people to reliably download.

    You could use an FTP client that allows wildcards (eg. ncftp) so you can do "mget */*.fna" in the folder.

  6. This comment has been removed by the author.

  7. Hello,
    I'm using the Salmonella enterica directory, but it contains many sub-directories with incomprehensible names which variously contain the .gbk etc. files. All the sub-folders no doubt represent the hundreds of Salmonella enterica serovars, but I don't understand how I'm supposed to navigate to the correct folder - I can't work out what the system is here.
    Do you have any idea? I would be very grateful for any light you could shed!

    Replies
    1. There should be one folder per strain/genome. What folder do you mean?

    2. The trouble is that the folder names are incomprehensible (at least to me) e.g. GCF_000006945, and that many of the folders are empty.

      For example, I am looking for Salmonella Typhimurium strain SL1344 (refseq: NC_016810), but I have no idea in which of the 500 or so folders all named "GCF_000****" to look in... Do you know what these folder names mean?
      (Here is the link to the directory I am talking about in case it helps... ftp://ftp.ncbi.nih.gov/genomes/ASSEMBLY_BACTERIA/Salmonella_enterica/ )

    3. My blog post never referred to any such ASSEMBLY_BACTERIA folder?

      The files for the 1 chromosome and 2 plasmids for SL1344 are exactly where they should be:

      ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Salmonella_enterica_serovar_Typhimurium_SL1344_uid86645/

  8. Hello, do you have any idea why there are multiple gbk files for the same strain in the same folder? For example, the Acetobacter_pasteurianus_386B_uid214433 folder has many gbk files in it. I am writing software that downloads specific bacterial genomes and parses them to store the data in a database.

    Replies
    1. There is a separate file for each replicon in that species. eg. different chromosomes and plasmids.

    2. How can I create a .gbk file with all the replicons from one bacteria? I only got .gb files, but it loses annotation info.

    3. Where did you get ".gb" files? They sound like they are probably Genbank files?

  9. Hi. Keep up the great work, this blog has been really helpful. I'm new to the whole bioinformatics thing, so I am stumbling around in the dark a bit here. Could you tell me what the ASSEMBLY_BACTERIA folder is about ?

    Replies
    1. That folder is a new part of Genbank. I don't understand it fully yet. When I figure it out I will write a blog post.

  10. Do any of you have an idea how to download the genome sequences for the pathogenic bacteria only? Are there any resources?

    Replies
    1. Not that I know of. All bacteria are probably pathogenic, depending on the environment they are placed in, e.g. some hurt plants but not animals. Some are fine on our skin, but bad in our bloodstream.

    2. http://www.absa.org/riskgroups/index.html

      This website contains all the pathogenic bacteria if a search is made by bacteria. You can parse the results to extract the names of the pathogenic bacteria and download them from Genbank

  11. This comment has been removed by a blog administrator.

  12. This may sound silly, but I could not navigate to NCBI using the command lftp ftp.ncbi.nlm.nih.gov in Terminal. Why?
    I tried this in a course and it worked!

    Replies
    1. I need more information to help you.

      Do you have "lftp" installed?
      Does "ncftp" work?
      Does FileZilla work?
