Saturday, 28 April 2012

Prokka - rapid prokaryotic annotation

Prokka is a software tool I have written to annotate bacterial, archaeal and viral genomes. It is based on years of experience annotating bacterial genomes, both automatically and via manual curation. 

Its main design considerations were to be:
  • fast
    • supports multi-threading
    • hierarchical search databases
  • simple to use
    • no compulsory parameters
    • bundled databases
  • clean
    • standards-compliant output files
    • pipeline-friendly interface
  • thorough
    • finds tRNA, rRNA, CDS, sig_peptide, tandem repeats, ncRNA
    • includes /gene and /EC_number where possible, not just /product
    • traceable annotation sources via /inference tags
  • useful
    • produces files close-to-ready for submission to Genbank
    • complete log file

The first release is a monolithic but followable Perl script. It only uses core Perl modules, but has quite a few external tool dependencies, some of which I can't bundle due to licence restrictions. Eventually I hope to have a public web-server version, and a version of it in the Galaxy Toolshed.

It currently takes about 10 minutes on a quad Intel i7 for a typical 4 Mbp genome.

You can download it from here and read the manual here.


  1. one of the best alternatives for bacterial genome annotation

  2. Thank you for the positive feedback.

    1. Hi Torsten,

      What do you think about including phiSPY for prophage detection, and ISsaga for insertion sequences, in Prokka?

      I'm the ISsaga developer, and if you are interested we can discuss this possibility.

    2. My understanding is that these are web-based tools, not open source command line?

      Also, I thought the ISFinder database was not open either. My colleagues have requested access to the raw data many times but have got no reply or been rejected.

    3. I've created a script to download what is available on their website...

  3. Thank you for your great tool! I often use it for my purposes. Could you tell me whether there is any tool for finding and correcting frameshifts?

  4. There are two types of frame-shifts: those that really are broken genes in the organism (so called "pseudo genes"), and those that are due to sequencing and assembly artefacts ("faux pseudo genes").

    It is not always simple to distinguish between the two! The most common cause of false/fake/faux frame shifts is homopolymer sequencing errors from using Ion Torrent or 454 reads. You should re-sequence with Illumina and use it to correct all the mistakes. Or you could manually examine each frame shift and check for long homopolymers (usually > 5 A or T bases) and make a decision.

    Some bacteria that I've worked on (Leptospira borgpetersenii, Mycobacterium ulcerans) have 100s of REAL pseudo-genes, so we had to be careful when deciding. Sometimes we validated with PCR/Sanger where it was important.
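
The manual check described above (looking for long homopolymers near a suspected frame shift) can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the function name `find_homopolymers`, the `min_len=6` default (taken from "> 5 A or T bases") and the 0-based, end-exclusive coordinates are choices made for illustration and are not part of Prokka.

```python
import re


def find_homopolymers(seq, min_len=6, bases="AT"):
    """Return (start, end, base, length) for each run of a single base.

    Only runs of at least min_len identical bases drawn from `bases`
    are reported. Coordinates are 0-based and end-exclusive.
    All names and defaults here are illustrative assumptions.
    """
    seq = seq.upper()
    # e.g. ([AT])\1{5,} : one base followed by 5 or more copies of itself
    pattern = re.compile(r"([{0}])\1{{{1},}}".format(bases, min_len - 1))
    return [
        (m.start(), m.end(), m.group(1), m.end() - m.start())
        for m in pattern.finditer(seq)
    ]


if __name__ == "__main__":
    # Toy sequence: a run of 7 A's (reported) and a run of 5 T's (too short)
    example = "GC" + "A" * 7 + "GC" + "T" * 5 + "GC"
    for start, end, base, length in find_homopolymers(example):
        print(start, end, base, length)
```

Runs found this way only flag places worth a closer look. Deciding whether a frame shift is real still needs read evidence or PCR/Sanger validation, as described above.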

  5. Thanks for the reply and explanations!
    What do you think about this tool? Is it useful for automatic validation of an annotated bacterial genome?
    Microbial Genome Submission Check Tool

  6. When you submit a bacterial genome to Genbank, it goes through a lot of validation checks. From what I understand, some of those checks are done in the "tbl2asn" command (whose output Prokka puts into the .err output file). The "subcheck" tool seems to do a bunch of checks, some the same as tbl2asn, but a lot of others too. Genbank themselves probably use subcheck along with other pipelines, but I'm not exactly sure. Prokka used to produce quite compliant annotations, but in December 2012 they made their checking much stricter, and the annotations are no longer that compliant. I need to do more work here. But yes, 'subcheck' is a good tool to run BEFORE you submit to Genbank to avoid long delays in getting your genome through.

  7. Sorry for the late response. Thanks a lot for the answer!

  8. Hello, thank you for the great tool. I have a question about the genus-specific databases that are used within PROKKA. I have been working with E. coli, and have found that some strains are given a gene ID within the genbank file, while others are not, even though they are highly related. The example I have used is the intimin gene within EHEC. The gene ID is given as eaeA in 6 of the 7 strains that I took a quick look at; however, the strain it is not called in has intimin as the product and is >99% related to one of the strains that was given the gene identifier. Have you seen this before and are there any suggestions? Thanks

    1. Adam - the genus databases include a simple dump of all the proteins from closed bacterial genomes in RefSeq. Sometimes they work great at domain specific annotations, other times they do badly. The genus databases are still second priority compared to the primary swissprot based database. This could be where the discrepancy comes from.

      The best solution for you is to use the --proteins option. The proteins here will have TOP priority, even before swissprot, i.e. 1: --proteins, 2: swissprot, 3: genus, 4: all the HMMs.

      In the prokka/db/ folder there is a folder called "trusted" with a file called "EcoCyc-16.5" ... I reckon you might want to add "--proteins /path/to/prokka/db/trusted/EcoCyc-16.5" to your Prokka command line?
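
The priority order above can be pictured as a first-hit-wins lookup. The sketch below is a toy illustration in Python, not Prokka's actual Perl code: plain dictionaries stand in for the real similarity searches, and the names `annotate`, `databases`, the `cds_*` identifiers and the example products are all hypothetical. The "hypothetical protein" fallback label is likewise an assumption made for the example.

```python
def annotate(query_id, databases, fallback="hypothetical protein"):
    """Return (source, product) from the first database that has a hit.

    `databases` is a list of (name, hits) pairs in priority order, where
    `hits` maps a query ID to a product name. In a real pipeline each
    lookup would be a sequence search; a dict is used here for clarity.
    """
    for name, hits in databases:
        if query_id in hits:
            return name, hits[query_id]
    return None, fallback


# Priority order as described above: user proteins first, HMMs last.
databases = [
    ("--proteins", {"cds_1": "intimin"}),
    ("swissprot", {"cds_1": "Intimin", "cds_2": "DNA gyrase subunit A"}),
    ("genus", {"cds_2": "gyrase", "cds_3": "putative adhesin"}),
    ("HMMs", {}),
]

if __name__ == "__main__":
    for query in ("cds_1", "cds_2", "cds_3", "cds_4"):
        print(query, annotate(query, databases))
```

Because the search stops at the first database with a hit, an entry supplied through the top-priority source overrides whatever the lower-priority databases would have said for the same gene.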

  9. Thanks for this great tool! So useful!! One thing that might be a nice addition for future releases would be providing more of the information from minced about the CRISPR regions - maybe as a separate output file? I've been re-running this to get the locations of the direct repeats and spacer sequences.

    1. Lizzy can you give me some more info about what you need here?

  13. Is there a way we can detect transposable elements with PROKKA?