The Genome Factory: Navigating microbial genomes on the NCBI FTP site

Tuesday, 24 July 2012

If you are a bioinformatician working in microbial genomics, then you should know this URL:

ftp://ftp.ncbi.nih.gov/genomes/

If you click on the URL, there is a big list of folders, and it does look like a mess. But for those of us in microbial genomics there are a few key folders you should know about, and probably even have mirrored on your own servers:

Most of my work is in bacterial genomics, so I'll discuss the contents of the first four folders only. I'll leave the last two to an experienced mycogenomicist.

1. Bacteria

This directory contains a folder for each completed bacterial genome. That is, the genome has been finished to a single DNA sequence per replicon (usually just one chromosome) and is fully annotated. There are currently around 1000 completed bacterial genomes, of which I've been involved in about 10.

Let's have a look at one. I chose Dickeya dadantii because it's a lovely sounding alliteration for a plant pathogen: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Dickeya_dadantii_3937_uid52537/

NC_014500.asn   15.9 MB   13/06/2012 12:11:00
NC_014500.faa   1.7 MB    13/06/2012 12:11:00
NC_014500.ffn   4.5 MB    19/11/2011 11:00:00
NC_014500.fna   4.8 MB    29/09/2010 10:00:00
NC_014500.frn   49.1 kB   29/09/2010 10:00:00
NC_014500.gbk   16.7 MB   13/06/2012 12:11:00
NC_014500.gff   1.8 MB    03/04/2012 03:41:00
NC_014500.ptt   407 kB    10/03/2012 13:18:00
NC_014500.rnt   7.1 kB    29/09/2010 10:00:00
NC_014500.rpt   281 B     25/04/2011 10:00:00
NC_014500.val   7.0 MB    13/06/2012 12:11:00

You can see a bunch of files, all with the same prefix (NC_014500) and a bunch of different suffixes or file extensions (gbk, gff), some of which should be familiar to you. NC_014500 is the RefSeq accession ID for the single chromosome of Dickeya dadantii. The most important files are:

fna : FASTA file of the chromosomal sequence (think "n" = nucleotide)
gbk : GenBank file containing meta-data, sequence, and annotations
gff : GFF3 file containing annotations only (coordinates relative to the .fna file)
faa : FASTA file of the translated coding regions (proteins) annotated in the .gbk/.gff (think "aa" = amino acids)
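
As a rough sketch of how these names fit together, the Python snippet below builds the FTP URL for each of the four key files of one replicon. The base URL, folder name and accession come from the listing above; the helper name replicon_url and the KEY_EXTENSIONS table are made up for illustration, and the code assumes every file follows the accession-dot-extension pattern seen here.

```python
# Base URL and layout are taken from the post; the names below are hypothetical.
BASE_URL = "ftp://ftp.ncbi.nih.gov/genomes/Bacteria"

# The four key extensions, described as in the list above.
KEY_EXTENSIONS = {
    "fna": "FASTA nucleotide sequence of the replicon",
    "gbk": "GenBank record: meta-data, sequence and annotations",
    "gff": "GFF3 annotations only",
    "faa": "FASTA protein translations of the annotated coding regions",
}


def replicon_url(folder, accession, ext):
    """Return the FTP URL of one file belonging to one replicon."""
    if ext not in KEY_EXTENSIONS:
        raise ValueError(f"unexpected extension: {ext!r}")
    return f"{BASE_URL}/{folder}/{accession}.{ext}"


if __name__ == "__main__":
    for ext, description in KEY_EXTENSIONS.items():
        url = replicon_url("Dickeya_dadantii_3937_uid52537", "NC_014500", ext)
        print(f"{url}  ({description})")
```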

In terms of usefulness, the .gbk file contains (nearly) all the information that the other files contain - the .faa and .fna files are easily generated from the .gbk using BioPerl etc. If you want to get the .gbk files for all the finished genomes, you can download the tarball NCBI provides: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.gbk.tar.gz
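
The post mentions BioPerl for this conversion. As a dependency-free illustration of the same idea, here is a minimal Python sketch that pulls the ORIGIN sequence out of GenBank flat-file text and writes it as FASTA, which is roughly what the .fna file holds. It is not BioPerl and not a full GenBank parser: the function name genbank_to_fasta is hypothetical, it uses the LOCUS name as the FASTA header (an assumption; NCBI's own .fna headers carry more detail), and it ignores annotations entirely, so it cannot produce the .faa file.

```python
def genbank_to_fasta(gbk_text, width=60):
    """Convert the ORIGIN block of each GenBank record in gbk_text to FASTA."""
    records = []
    name = None
    seq_parts = []
    in_origin = False
    for line in gbk_text.splitlines():
        if line.startswith("LOCUS"):
            # Second field of the LOCUS line is the locus name.
            name = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            # End of record: emit the collected sequence, wrapped to `width`.
            seq = "".join(seq_parts).upper()
            wrapped = [seq[i:i + width] for i in range(0, len(seq), width)]
            records.append("\n".join([f">{name}"] + wrapped))
            name, seq_parts, in_origin = None, [], False
        elif in_origin:
            # Sequence lines look like "       61 acgtacgtac gtacgt..." so
            # drop the leading coordinate and keep the base blocks.
            seq_parts.extend(line.split()[1:])
    return "\n".join(records) + "\n"


# Usage (assuming NC_014500.gbk has been downloaded to the current folder):
#     with open("NC_014500.gbk") as fh:
#         fasta_text = genbank_to_fasta(fh.read())
```

For real work a proper parser such as BioPerl is the safer choice, since it also handles the feature table needed to extract protein translations.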

2. Bacteria_DRAFT

This directory contains folders for each draft bacterial genome. That is, the genome has been de novo assembled into contigs/scaffolds (e.g. using Newbler for 454 data) but has not been, and probably never will be, finished. They are usually annotated, either by the submitter or automatically by NCBI, but sometimes there may be only sequences. There are currently about 2600 draft genomes.

Here are the contents of the Thiocapsa marina str. 5811 genome folder; it's a purple sulphur coccus from the Mediterranean coast, if you are interested.

NZ_AFWV00000000.asn              13.5 kB   03/04/2012 03:19:00
NZ_AFWV00000000.contig.asn.tgz   1.7 MB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.faa.tgz   1.0 MB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.ffn.tgz   1.5 MB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.fna.tgz   1.6 MB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.frn.tgz   4.1 kB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.gbk.tgz   4.6 MB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.gbs.tgz   4.2 kB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.gff.tgz   393 kB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.ptt.tgz   119 kB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.rnt.tgz   1.5 kB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.rpt.tgz   2.5 kB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.val.tgz   1.6 MB    21/07/2012 02:13:00
NZ_AFWV00000000.gbk              4.7 kB    03/04/2012 03:19:00
NZ_AFWV00000000.rpt              257 B     03/04/2012 03:19:00
NZ_AFWV00000000.val              6.0 kB    03/04/2012 03:19:00

This folder looks a bit different to the finished genomes. It has a .gbk file, but you will notice it is quite small (4.7 kB), and if you look at it, you can see it has no sequence or annotation, only some meta-data and a reference to "WGS NZ_AFWV01000001-NZ_AFWV01000062". This means that this genome record consists of 62 other records, one for each contig in the assembly. These are stored in the compressed tar file NZ_AFWV00000000.contig.gbk.tgz as follows:

% tar ztf NZ_AFWV00000000.contig.gbk.tgz
NZ_AFWV01000001.gbk
NZ_AFWV01000002.gbk
NZ_AFWV01000003.gbk
...
NZ_AFWV01000061.gbk
NZ_AFWV01000062.gbk
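
The WGS range in the master record and the file names in the tarball follow the same numbering, so the range can be expanded into per-contig accessions. The Python sketch below does that for the range quoted above. The function name expand_wgs_range is hypothetical, and the code assumes both ends of the range share a prefix and end in a zero-padded number, which holds for this example but is not checked against any NCBI specification.

```python
import re


def expand_wgs_range(wgs_range):
    """Expand 'PREFIXnnnn-PREFIXmmmm' into the list of accessions it covers."""
    first, last = wgs_range.split("-")
    m_first = re.fullmatch(r"(.*?)(\d+)", first)
    m_last = re.fullmatch(r"(.*?)(\d+)", last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError(f"cannot interpret WGS range: {wgs_range!r}")
    prefix = m_first.group(1)
    width = len(m_first.group(2))  # keep the original zero padding
    start, stop = int(m_first.group(2)), int(m_last.group(2))
    return [f"{prefix}{n:0{width}d}" for n in range(start, stop + 1)]


if __name__ == "__main__":
    contigs = expand_wgs_range("NZ_AFWV01000001-NZ_AFWV01000062")
    print(len(contigs), contigs[0], contigs[-1])
```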

So, in summary, instead of getting a nice neat single .gbk or .faa file for each replicon as you do for the completed genomes, you get a tarball of files for each assembly, with each file representing a contig in the draft genome. Any extra chromosomes or plasmids will be mixed in the bag of contigs.
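
One way to work with such a tarball without unpacking it to disk is Python's standard tarfile module. The sketch below reads every .gbk member into memory; the function name read_contigs is hypothetical, and it assumes the archive is small enough to hold in memory (true for the 4.6 MB example above, not necessarily in general).

```python
import tarfile


def read_contigs(fileobj):
    """Return {member name: text} for every .gbk file in a gzipped tarball."""
    contigs = {}
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".gbk"):
                data = tar.extractfile(member).read()
                contigs[member.name] = data.decode("ascii", errors="replace")
    return contigs


# Usage (assuming the tarball has been downloaded to the current folder):
#     with open("NZ_AFWV00000000.contig.gbk.tgz", "rb") as fh:
#         contigs = read_contigs(fh)
#     print(len(contigs), "contig records")
```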

3. Plasmids

The Plasmids folder is not known to many people; frankly, it seems a bit hidden away. It contains ~3000 completed plasmid sequences. Confusingly, ~1000 of these are duplicated from the Bacteria folder (as the plasmid was sequenced with its parent), while the other ~2000 are novel. Even more annoying is that the folder structure is different:

faa/                                 21/07/2012 19:39:00
fna/                                 21/07/2012 19:40:00
gbk/                                 21/07/2012 19:41:00
...
plasmids.all.faa.tar.gz   43.2 MB    23/07/2012 19:43:00
plasmids.all.fna.tar.gz   75.1 MB    23/07/2012 19:43:00
plasmids.all.gbk.tar.gz   199 MB     23/07/2012 19:43:00
...

Now we have a folder for each file extension, each of which contains ~3000 files. So the files for a particular plasmid are spread out over multiple folders. Fortunately they provide compressed tar files of the whole archive to download directly: plasmids.all.gbk.tar.gz
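
If you do mirror the per-extension folders, regrouping the files by plasmid is straightforward. The Python sketch below is one way to do it, assuming each file is named by accession plus extension inside its folder (the post does not show the file names, so that is a guess). The accessions in the example are placeholders, not real plasmids, and group_by_accession is a hypothetical helper name.

```python
import posixpath
from collections import defaultdict


def group_by_accession(paths):
    """Map accession -> {extension: path} for paths like 'gbk/ACCESSION.gbk'."""
    grouped = defaultdict(dict)
    for path in paths:
        stem, ext = posixpath.splitext(posixpath.basename(path))
        grouped[stem][ext.lstrip(".")] = path
    return dict(grouped)


if __name__ == "__main__":
    # Placeholder accessions, used only to show the regrouping.
    example = [
        "faa/NC_000001.faa",
        "fna/NC_000001.fna",
        "gbk/NC_000001.gbk",
        "gbk/NC_000002.gbk",
    ]
    for accession, files in group_by_accession(example).items():
        print(accession, sorted(files))
```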

4. Viruses

Some of you may be wondering why I am including Viruses in this story. Well, some viruses infect Bacteria too - they are called bacteriophage. There are ~3000 folders in the Viruses division, but not all of them are bacteriophage. A simple grep for "phage" suggests ~600 are bacterial viruses. The folder structure is the same as for the finished Bacteria genomes.
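
The "simple grep" is just a substring match on the folder names. A Python equivalent is sketched below; like a plain grep it is case-sensitive, and it will miss any bacterial virus whose folder name does not contain the word "phage". The folder names in the example are illustrative rather than copied from the FTP site, and phage_folders is a hypothetical helper name.

```python
def phage_folders(folder_names):
    """Return the folder names containing the substring 'phage'."""
    return [name for name in folder_names if "phage" in name]


if __name__ == "__main__":
    # Illustrative names only; a real run would use the Viruses directory listing.
    names = [
        "Enterobacteria_phage_lambda",
        "Tobacco_mosaic_virus",
        "Bacillus_phage_phi29",
    ]
    print(phage_folders(names))
```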

It is important to realise that most of these virus sequences are natively dsDNA and will also appear integrated into the chromosomal DNA of many of the entries in Bacteria and Bacteria_DRAFT.

22 comments:

Anonymous, 7 January 2013 at 18:33
Regarding point 1: if a folder contains files with more than one prefix, what does that mean for a completed genome?

Torsten Seemann, 8 January 2013 at 15:27
Sequer, that is a good question - I forgot to mention that.

For closed genomes, each unique prefix corresponds to one replicon in that organism, i.e. the organism has more than one chromosome, or has plasmids.

Arun Prasanna, 13 January 2013 at 01:02
Hi,
Is it possible to download only the desired genomes? Say I just need the complete list of Bacillus folders. I tried to download through FileZilla, but was unable to connect.

83years, 13 February 2013 at 00:51
Hi Torsten,

Do you know why the ftp://ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT/ server does not have an all.fna.tar.gz file?

I'm trying to test a new bit of software for designing degenerate primers and the larger diversity of organisms in the draft folder would make a better test than the curated/finished genomes.

Thanks

Torsten Seemann, 22 February 2013 at 19:57
I don't know why the DRAFT folder is set out differently to the finished genomes. One reason for the lack of all.*.tar.gz files could be that there are 4x as many draft genomes and the files are just too big for people to reliably download.

You could use an FTP client that allows wildcards (e.g. ncftp) so you can do "mget */*.fna" in the folder.

Siân, 26 November 2013 at 21:37
Hello,
I'm using the Salmonella enterica directory, but it contains many sub-directories with incomprehensible names which variously contain the .gbk etc. files. All the sub-folders no doubt represent the hundreds of Salmonella enterica serovars, but I don't understand how I'm supposed to navigate to the correct folder; I can't work out what the system is here.
Do you have any idea? I would be very grateful for any light you could shed!

Unknown, 10 March 2014 at 18:15
Hello, do you have any idea why there are multiple gbk files for the same strain in the same folder? For example, the Acetobacter_pasteurianus_386B_uid214433 folder has many gbk files in it. I am writing software that downloads specific bacterial genomes and parses them to store the data in a database.

Siddharth, 25 July 2014 at 20:27
Hi. Keep up the great work, this blog has been really helpful. I'm new to the whole bioinformatics thing, so I am stumbling around in the dark a bit here. Could you tell me what the ASSEMBLY_BACTERIA folder is about?

Palc, 9 February 2015 at 22:39
Do any of you have an idea how to download the genome sequences for the pathogenic bacteria only? Are there any resources?

Fauzan Ahmad, 13 November 2015 at 20:54
This may sound silly, but I could not navigate to NCBI using the command lftp ftp.ncbi.nlm.nih.gov on Terminal. Why?
I tried this in a course and it worked!