If you click on the URL, there is a big list of folders, and it does look like a mess. But for those of us in microbial genomics there are a few key folders you should know about, and probably even have mirrored on your own servers:
Most of my work is in bacterial genomics, so I'll discuss the contents of the first four folders only. I'll leave the last two to an experienced mycogenomicist.
This directory contains a folder for each completed bacterial genome. That is, the genome has been finished to a single DNA sequence per replicon (usually just one chromosome) and is fully annotated. There are currently around 1000 completed bacterial genomes, of which I've been involved in about 10.
Let's have a look at one. I chose Dickeya dadantii because it's a lovely sounding alliteration for a plant pathogen: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Dickeya_dadantii_3937_uid52537/
NC_014500.asn 15.9 MB 13/06/2012 12:11:00
NC_014500.faa 1.7 MB 13/06/2012 12:11:00
NC_014500.ffn 4.5 MB 19/11/2011 11:00:00
NC_014500.fna 4.8 MB 29/09/2010 10:00:00
NC_014500.frn 49.1 kB 29/09/2010 10:00:00
NC_014500.gbk 16.7 MB 13/06/2012 12:11:00
NC_014500.gff 1.8 MB 03/04/2012 03:41:00
NC_014500.ptt 407 kB 10/03/2012 13:18:00
NC_014500.rnt 7.1 kB 29/09/2010 10:00:00
NC_014500.rpt 281 B 25/04/2011 10:00:00
NC_014500.val 7.0 MB 13/06/2012 12:11:00
You can see a bunch of files, all with the same prefix (NC_014500) and a range of different suffixes or file extensions (gbk, gff), some of which should be familiar to you. NC_014500 is the RefSeq accession ID for the single chromosome of Dickeya dadantii. The most important files are:
- fna : FASTA file of the chromosomal sequence (think "n" = nucleotide)
- gbk : GenBank file containing meta-data, sequence, and annotations
- gff : GFF3 file containing annotations only (coordinates relative to the .fna file)
- faa : FASTA file of the translated coding regions (proteins) annotated in the .gbk/.gff (think "aa" = amino acids)
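If you have not handled these files in code before, here is a minimal sketch of reading a FASTA file (.fna or .faa) into a dictionary. The function name `read_fasta` and the two tiny protein records are made up for illustration; a real .faa from the FTP site has much longer header lines, and for serious work an established parser is probably a better choice.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {sequence ID: sequence}."""
    records: Dict[str, str] = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token after '>'
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so append to the current record
            records[current_id] += line
    return records


# Made-up example standing in for the contents of a real .faa file
example = """>prot1 hypothetical protein
MKT
LLV
>prot2 another protein
MAA
"""
proteins = read_fasta(example.splitlines())
for name, seq in proteins.items():
    print(name, len(seq))
```

To use it on a downloaded file, pass an open file handle instead of `example.splitlines()`.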
This directory contains a folder for each draft bacterial genome. That is, the genome has been de novo assembled into contigs/scaffolds (e.g. using Newbler for 454 data) but has not been, and probably never will be, finished. They are usually annotated, either by the submitter or automatically by NCBI, but sometimes there may be only sequences. There are about 2600 draft genomes currently.
Here are the contents of the Thiocapsa marina str. 5811 genome folder - it's a purple sulphur coccus from the Mediterranean Coast if you are interested.
NZ_AFWV00000000.asn 13.5 kB 03/04/2012 03:19:00
NZ_AFWV00000000.contig.asn.tgz 1.7 MB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.faa.tgz 1.0 MB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.ffn.tgz 1.5 MB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.fna.tgz 1.6 MB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.frn.tgz 4.1 kB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.gbk.tgz 4.6 MB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.gbs.tgz 4.2 kB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.gff.tgz 393 kB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.ptt.tgz 119 kB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.rnt.tgz 1.5 kB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.rpt.tgz 2.5 kB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.val.tgz 1.6 MB 21/07/2012 02:13:00
NZ_AFWV00000000.gbk 4.7 kB 03/04/2012 03:19:00
NZ_AFWV00000000.rpt 257 B 03/04/2012 03:19:00
NZ_AFWV00000000.val 6.0 kB 03/04/2012 03:19:00
This folder looks a bit different to the finished genomes. It has a .gbk file, but you will notice it is quite small (4700 bytes), and if you look at it, you can see it has no sequence or annotation, only some meta-data and a reference to "WGS NZ_AFWV01000001-NZ_AFWV01000062". This means that this genome record consists of 62 other records, one for each contig in the assembly. These are stored in the compressed tar file NZ_AFWV00000000.contig.gbk.tgz as follows:
% tar ztf NZ_AFWV00000000.contig.gbk.tgz
So, in summary, instead of getting a nice neat single .gbk or .faa file for each replicon as you do for the completed genomes, you get a tarball of files for each assembly, with each file representing a contig in the draft genome. Any extra chromosomes or plasmids will be mixed in the bag of contigs.
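If you would rather work with one file per draft genome, the bag of contigs can be listed and stitched together with Python's standard `tarfile` module. This is only a sketch: the helper names and the two demo contig files (`NZ_DEMO...`) are invented so the example runs without downloading anything, and it assumes each tarball member is a plain text file.

```python
import io
import tarfile
from typing import List


def contig_names(tgz_bytes: bytes) -> List[str]:
    """Return the sorted names of regular files inside a gzipped tarball."""
    with tarfile.open(fileobj=io.BytesIO(tgz_bytes), mode="r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


def merge_contigs(tgz_bytes: bytes) -> str:
    """Concatenate every file in the tarball, in name order, into one string."""
    parts = []
    with tarfile.open(fileobj=io.BytesIO(tgz_bytes), mode="r:gz") as tar:
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            if member.isfile():
                parts.append(tar.extractfile(member).read().decode("ascii"))
    return "".join(parts)


def make_demo_tarball() -> bytes:
    """Build a tiny in-memory .tgz with two made-up contig FASTA files."""
    demo_files = [
        ("NZ_DEMO01000001.fna", ">contig1\nACGT\n"),
        ("NZ_DEMO01000002.fna", ">contig2\nGGCC\n"),
    ]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, text in demo_files:
            data = text.encode("ascii")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


demo = make_demo_tarball()
print(contig_names(demo))   # equivalent in spirit to `tar ztf`
print(merge_contigs(demo))  # one multi-FASTA string for the whole assembly
```

For a real download, read the .tgz from disk with `open(path, "rb").read()` and pass the bytes in; remember that plasmid and chromosome contigs will still be mixed together in the result.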
The plasmids folder is not known to many people; frankly, it seems a bit hidden away. It contains ~3000 completed plasmid sequences. Confusingly, ~1000 of these are duplicated from the Bacteria folder (as the plasmid was sequenced with its parent), while the other ~2000 are novel. Even more annoying is that the folder structure is different:
faa/ 21/07/2012 19:39:00
fna/ 21/07/2012 19:40:00
gbk/ 21/07/2012 19:41:00
plasmids.all.faa.tar.gz 43.2 MB 23/07/2012 19:43:00
plasmids.all.fna.tar.gz 75.1 MB 23/07/2012 19:43:00
plasmids.all.gbk.tar.gz 199 MB 23/07/2012 19:43:00
Now we have a folder for each file extension, each of which contains ~3000 files. So the files for a particular plasmid are spread out over multiple folders. Fortunately they provide compressed tar files of the whole archive to download directly: plasmids.all.gbk.tar.gz
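To get back to a per-plasmid view, you can regroup the files by accession. The sketch below assumes each file is named `<accession>.<extension>` inside its extension folder, which is my assumption about the layout rather than something documented; the accessions `NC_DEMO1` and `NC_DEMO2` are made up.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable


def group_by_accession(paths: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Map accession -> {extension: path} for paths like 'fna/ACC.fna'."""
    grouped: Dict[str, Dict[str, str]] = defaultdict(dict)
    for p in map(PurePosixPath, paths):
        extension = p.suffix.lstrip(".")  # 'fna', 'faa', 'gbk', ...
        grouped[p.stem][extension] = str(p)
    return dict(grouped)


# Hypothetical listing; real folders hold thousands of files each
listing = [
    "faa/NC_DEMO1.faa",
    "fna/NC_DEMO1.fna",
    "gbk/NC_DEMO1.gbk",
    "fna/NC_DEMO2.fna",
]
plasmids = group_by_accession(listing)
for accession, files in sorted(plasmids.items()):
    print(accession, sorted(files))
```

A plasmid with no annotated proteins may have no .faa entry at all, so it is worth checking which extensions are present for each accession before relying on them.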
Some of you may be wondering why I am including Viruses in this story. Well, some viruses infect Bacteria too - they are called bacteriophage. There are ~3000 folders in the Viruses division, but not all of them are bacteriophage. A simple grep for "phage" suggests ~600 are bacterial viruses. The folder structure is the same as for the finished Bacteria genomes.
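The "simple grep" amounts to counting folder names that contain the word "phage". A rough Python equivalent is sketched below; the folder names are invented placeholders, and unlike a plain grep this version ignores case. Note that a name-based count like this will miss phage whose folder names do not contain the word at all.

```python
from typing import Iterable


def count_phage_folders(folder_names: Iterable[str]) -> int:
    """Count folder names containing 'phage', ignoring case."""
    return sum("phage" in name.lower() for name in folder_names)


# Hypothetical folder names, not taken from the real Viruses listing
folders = [
    "Example_phage_A",
    "Example_virus_B",
    "Another_Phage_C",
    "Example_virus_D",
]
n_phage = count_phage_folders(folders)
print(n_phage, "of", len(folders), "folders look like bacteriophage")
```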
It is important to realise that most of these virus sequences are natively dsDNA and will also appear integrated into the chromosomal DNA of many of the entries in Bacteria and Bacteria_DRAFT.