The Genome Factory: Prokka - rapid prokaryotic annotation

Saturday, 28 April 2012

Prokka - rapid prokaryotic annotation

Prokka is a software tool I have written to annotate bacterial, archaeal and viral genomes. It is based on years of experience annotating bacterial genomes, both automatically and via manual curation.

Its main design considerations were to be:

fast
  supports multi-threading
  hierarchical search databases

simple to use
  no compulsory parameters
  bundled databases

clean
  standards-compliant output files
  pipeline-friendly interface

thorough
  finds tRNA, rRNA, CDS, sig_peptide, tandem repeats, ncRNA
  includes /gene and /EC_number where possible, not just /product
  traceable annotation sources via /inference tags

useful
  produces files close to ready for submission to GenBank
  complete log file

The first release is a monolithic but easy-to-follow Perl script. It uses only core Perl modules, but has quite a few external tool dependencies, some of which I can't bundle due to licence restrictions. Eventually I hope to have a public web-server version, and a version of it in the Galaxy Toolshed.

It currently takes about 10 minutes on a quad-core Intel i7 for a typical 4 Mbp genome.
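
For pipeline use, a run can be scripted. The following is only a sketch in Python that builds the command line without running anything. It assumes the `--outdir`, `--prefix` and `--cpus` options found in later Prokka releases (check `prokka --help` for your version), and the file and directory names are hypothetical.

```python
import shlex


def build_prokka_command(contigs, outdir=None, prefix=None, cpus=None):
    """Build an argument list for running Prokka on a FASTA file of contigs.

    Only the input file is required, mirroring the "no compulsory
    parameters" goal. The option names are assumptions based on later
    Prokka releases; confirm them with `prokka --help`.
    """
    cmd = ["prokka"]
    if outdir is not None:
        cmd += ["--outdir", str(outdir)]
    if prefix is not None:
        cmd += ["--prefix", str(prefix)]
    if cpus is not None:
        cmd += ["--cpus", str(int(cpus))]
    cmd.append(str(contigs))
    return cmd


if __name__ == "__main__":
    # Hypothetical file names; four threads to match a quad-core machine.
    cmd = build_prokka_command("contigs.fa", outdir="annotation",
                               prefix="mygenome", cpus=4)
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Returning a list rather than a single string means the command can be passed straight to `subprocess.run` without shell quoting problems.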

You can download it from here and read the manual here.

19 comments:

Alê, 8 June 2013 at 04:47
one of the best alternatives for bacterial genome annotation
Torsten Seemann, 8 June 2013 at 08:45
Thank you for the positive feedback.
Unknown, 18 June 2013 at 04:00
Thank you for your great tool! I often use it for my purposes. Could you tell me whether there is any tool for finding and correcting frameshifts?
Torsten Seemann, 18 June 2013 at 11:02
There are two types of frame-shifts: those that really are broken genes in the organism (so-called "pseudo genes"), and those that are due to sequencing and assembly artefacts ("faux pseudo genes").

It is not always simple to distinguish between the two! The most common cause of false/fake/faux frame shifts is homopolymer sequencing errors from using Ion Torrent or 454 reads. You should re-sequence with Illumina and use those reads to correct all the mistakes. Or you could manually examine each frame shift and check for long homopolymers (usually > 5 A or T bases) and make a decision.

Some bacteria that I've worked on (Leptospira borgpetersenii, Mycobacterium ulcerans) have 100s of REAL pseudo-genes, so we had to be careful when deciding. Sometimes we validated with PCR/Sanger where it was important.
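
The manual homopolymer check described above could be partly automated. The following is a minimal Python sketch, not part of Prokka; the function name and the default threshold of six bases (from the "> 5 A or T bases" rule of thumb) are illustrative assumptions.

```python
import re


def find_homopolymers(seq, bases="AT", min_len=6):
    """Return (start, end, base) for each run of one base at least min_len long.

    Coordinates are 0-based and end-exclusive. The default min_len=6
    mirrors the "> 5 A or T bases" heuristic; it is a rule of thumb,
    not a fixed threshold.
    """
    # One base from `bases`, followed by at least (min_len - 1) copies of itself.
    pattern = re.compile("([%s])\\1{%d,}" % (re.escape(bases), min_len - 1))
    return [(m.start(), m.end(), m.group(1))
            for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    # Hypothetical sequence around a suspected frame shift.
    region = "GCGTTTTTTTACGAAAAAGC"
    print(find_homopolymers(region))  # the run of seven T's is reported
```

A frame shift that falls inside or next to a reported run is a candidate sequencing artefact and still needs a manual decision, or validation with other reads.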

Unknown, 18 June 2013 at 19:27
Thanks for the reply and explanations!
What do you think about this tool? Is it useful for automatic validation of an annotated bacterial genome?
Microbial Genome Submission Check Tool
ftp://ftp.ncbi.nih.gov/genomes/TOOLS/subcheck/README.standalone
Torsten Seemann, 19 June 2013 at 15:41
When you submit a bacterial genome to GenBank, it goes through a lot of validation checks. From what I understand, some of those checks are done by the "tbl2asn" command (the output of which Prokka puts into the .err output file). The "subcheck" tool seems to do a bunch of checks, some the same as tbl2asn, but a lot of others too. GenBank themselves probably use subcheck along with other pipelines, but I'm not exactly sure. Prokka used to produce quite compliant annotations, but in December 2012 GenBank made their checking much stricter, and the annotations are no longer that compliant. I need to do more work here. But yes, 'subcheck' is a good tool to run BEFORE you submit to GenBank to avoid long delays in getting your genome through.
Unknown, 25 June 2013 at 13:52
Sorry for the late response. Thanks a lot for the answer!
Unknown, 25 October 2013 at 08:29
Hello, thank you for the great tool. I have a question about the genus-specific databases that are used within Prokka. I have been working with E. coli, and have found that some strains are given a gene ID within the GenBank file, while others are not, even though they are highly related. The example I have used is the intimin gene within EHEC. The gene ID is given as eaeA in 6 of the 7 strains that I took a quick look at; however, the strain in which it is not called has intimin as the product and is >99% related to one of the strains that was given the gene identifier. Have you seen this before, and are there any suggestions? Thanks
Lizzy Wilbanks, 31 July 2014 at 04:26
Thanks for this great tool! So useful!! One thing that might be a nice addition for future releases would be providing more of the information from minced about the CRISPR regions - maybe as a separate output file? I've been re-running this to get the locations of the direct repeats and spacer sequences.
Unknown, 25 November 2015 at 09:18
Is there a way to detect transposable elements with Prokka?