Sunday, 6 October 2013

Understanding SNPs and INDELs in microbial genomes


Variants are differences between two genomes. Here I describe two important types of nucleotide-level variants (SNPs and INDELs) and how they affect microbial genomes.


A SNP is a single nucleotide polymorphism (pronounced "snip"). This is when a single base differs between two genomes, and the DNA around that base is otherwise unchanged.


In coding-dense genomes like microbes, most SNPs will be within protein coding regions. Thus the SNP will change a codon, and potentially change the amino acid it codes for. If the amino acid coded for does not change, it is called a synonymous SNP (as the codon is a 'synonym' for the amino acid). If it does change, it is called a non-synonymous SNP.

                         ..|                 .|.  
                       SNP(T=>C)          SNP(C=>A)
                         ..|                 .|.
Genome 1 |  AA |  M   K   V   D   D   Q   H   S   P   *
Genome 2 |  AA |  M   K   V   D   D   Q   H   Y   P   *
                          |                   |
                         SYN               NON-SYN

A non-synonymous SNP can drastically alter the function of a protein because sometimes a single amino acid difference can modify the structure/shape of a protein. It could even affect the RNA transcript itself, causing it to be translated at lower efficiency or not at all. SNPs in promoter regions (-35, -10) and the ribosome binding site (RBS) can have similar effects.

A good rule of thumb is that a SNP in the 3rd position of a codon is often synonymous, due to the particular pattern of degeneracy in the genetic code. If two SNPs occur right next to each other, the variant is sometimes called a multiple nucleotide polymorphism (MNP).
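Classifying a SNP as synonymous or non-synonymous is just a codon-table lookup. Here is a minimal Python sketch; the codons are hypothetical choices consistent with the diagram above (which only shows amino acids), and a real tool would use the full 64-codon table, e.g. from Biopython:

```python
# Partial codon table: just the four codons used in this example.
CODON_TABLE = {
    "GTT": "V", "GTC": "V",   # 3rd-position change leaves V unchanged
    "TCT": "S", "TAT": "Y",   # 2nd-position change turns S into Y
}

def classify_snp(codon_ref, codon_alt):
    """Return 'SYN' if both codons encode the same amino acid, else 'NON-SYN'."""
    if codon_ref == codon_alt:
        raise ValueError("codons are identical, so there is no SNP")
    return "SYN" if CODON_TABLE[codon_ref] == CODON_TABLE[codon_alt] else "NON-SYN"

print(classify_snp("GTT", "GTC"))  # SNP(T=>C) at position 3 -> SYN
print(classify_snp("TCT", "TAT"))  # SNP(C=>A) at position 2 -> NON-SYN
```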


An INDEL (INsertion/DELetion) is where a base has been deleted from, or inserted into, one genome relative to another. It is a symmetrical relationship, as a deletion in one genome corresponds to an insertion in the other. I reckon it should be called a deletion/insertion polymorphism (DIP) too, so we can all snack on SNPs and DIPs :-)


While a SNP will either change a protein slightly or not at all, an INDEL will nearly always have a drastic effect on a protein. Because codons are groups of 3 nucleotides, removing or adding 1 nucleotide shifts the reading frame of everything downstream; this is called a frame-shift mutation. It usually results in the protein being either extended or truncated.

Genome 1 | DNA | ATG AAA GTC GAT GAC CAG CAT TAC CCA TGA
Genome 1 |  AA |  M   K   V   D   D   Q   H   Y   P   *

Genome 2 | DNA | ATG AAA GTC ATG ACC AGC ATT ACC CAT GA? ??? ??? ???
Genome 2 |  AA |  M   K   V   M   T   S   I   T   H   X   X   X   X
                                            STOP Loss & read-through

In the case above, the protein was extended into a new reading frame, giving it a different C-terminus than normal; translation will eventually hit another stop codon just by chance. In the case below, a premature STOP codon is introduced instead, and we end up with a shorter reading frame.

Genome 3 |  AA |  M   K   V   E    *   P   A   L   P   M
                         STOP Gain & truncation

Because the terminator sequence is no longer where it needs to be, these genes may not even be transcribed or translated properly. In that case they are called pseudo-genes.

If multiple adjacent bases are deleted (or inserted) together, the variant is sometimes called a micro-deletion (or micro-insertion), or generically a micro-INDEL. A micro-INDEL of length 3 (or any multiple of 3) occasionally persists in bacterial evolution, as it keeps the protein translation in frame.
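The frame-shift effect can be demonstrated in a few lines of Python. The sequences are hypothetical, chosen to match the Genome 2 DNA row in the diagram above, and the codon table is only partial:

```python
# Partial codon table covering just this example.
CODON_AA = {
    "ATG": "M", "AAA": "K", "GTC": "V", "GAT": "D", "GAC": "D",
    "CAG": "Q", "CAT": "H", "TAC": "Y", "CCA": "P", "TGA": "*",
    "ACC": "T", "AGC": "S", "ATT": "I",
}

def translate(dna):
    """Translate codon by codon; stop at a stop codon or an incomplete codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_AA[dna[i:i+3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)

genome1 = "ATGAAAGTCGATGACCAGCATTACCCATGA"
genome2 = genome1[:9] + genome1[10:]    # delete one base => frame-shift

print(translate(genome1))   # MKVDDQHYP*  (normal stop)
print(translate(genome2))   # MKVMTSITH   (stop lost, reads into a new frame)
```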

Structural Variation

SNPs and INDELs are about low-level genomic variation. It is also possible to look at structural variants which affect the genome at larger scales. Events like gene duplications, tandem repeats, transposon insertions, inversions, and other chromosomal rearrangements are all important to consider, but this post will leave those issues for another day.


SNPs and INDELs are small differences between genomes. They are important drivers of bacterial evolution, by modifying how or whether genes are transcribed and translated. In my next post I will introduce my new tool Snippy for discovering these differences efficiently.

Friday, 30 August 2013

How SPAdes differs from Velvet


Those of us working in bacterial genomics are all too familiar with de novo genome assembly. One of the first accessible and practical tools for bacterial genome assembly was Velvet. My group uses Velvet a lot, and wrote the popular VelvetOptimiser software.

Since then, many alternatives to Velvet have appeared, including ABySS, SOAPdenovo, ALLPATHS-LG, SGA, Ray, and many others. The motivation for some of these alternatives was to improve performance and decrease RAM usage when assembling large eukaryotic genomes, which Velvet was not really designed to handle.

Despite these alternatives, Velvet has thrived, due to its strong user community and the good, usable assemblies it still produces. But there is always room for improvement and new ideas, and I believe an excellent option for bacterial assemblies currently is SPAdes. It recently ranked very well in the GAGE-B assessment, and in this post I will explain its relationship to Velvet in broad terms.

What's the same?

SPAdes is a de Bruijn graph based assembler, just like most short read assemblers, including Velvet. It breaks reads into fixed-size k-mers, builds a graph, cleans the graph, then finds paths through the graph. These paths end up as contigs.
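That read-to-contig path can be sketched in miniature. The toy reads below are made up, and there is no error cleaning; real assemblers like Velvet and SPAdes add tip clipping, bubble popping, and coverage tracking on top of this core structure:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i+k]
            graph[kmer[:-1]].append(kmer[1:])   # edge: k-mer prefix -> suffix
    return graph

reads = ["ATGGCG", "GGCGTT"]    # two overlapping reads from the same locus
g = de_bruijn(reads, k=4)
# This toy graph is a single non-branching path, so walking it recovers
# the contig: ATG -> TGG -> GGC -> GCG -> CGT -> GTT  =>  "ATGGCGTT"
```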

SPAdes was originally intended for assembling MDA data: data that comes from single-cell sequencing, which uses the multiple displacement amplification method on tiny amounts of input DNA. MDA produces wildly varying genome coverage, something existing assemblers were not able to deal with well. SPAdes now works with regular data by default, but it is neat that it can still support MDA data when required.

The target data source for SPAdes is Illumina reads. Like all de Bruijn graph based assemblers, it works best with shorter, high-quality reads where indels are rare. For PGM and 454 data I would look elsewhere.

What's different?

The authors would argue, and the GAGE-B assessment supports the argument, that SPAdes does a better job than Velvet and other assemblers on microbial genome data. I have not had extensive experience with it yet, but have used it enough to now recommend it to others and trust it on my own data sets (well, as much as I trust any assembler!).

But there is a good reason SPAdes does better: it is really multiple tools in one. This integrated approach makes it much simpler to incorporate into pipelines. Here are the key steps SPAdes takes, as best I understand them:

  1. Read error correction based on k-mer frequencies using BayesHammer
  2. De Bruijn graph assembly at multiple k-mer sizes, not just a single fixed one
  3. Merging of different k-mer assemblies (good for varying coverage)
  4. Scaffolding of contigs from PE/MP reads
  5. Repeat resolution from PE/MP data using rectangle graphs
  6. Contig error correction based on aligning the original reads back to the contigs with BWA

Just like Velvet, it can use multiple threads for some parts of the algorithm. SPAdes produces a final "contigs.fasta" and "scaffolds.fasta" file, and a detailed log file so you can reconstruct your results. I think it is using more sophisticated dynamic methods for estimating k-mer coverage and cutoffs. Of course it takes longer to run than Velvet, but it is doing a lot more than Velvet does.
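To make step 1 concrete: k-mers that occur only once across all reads are usually sequencing errors, while genuine genomic k-mers occur roughly coverage-many times. This toy sketch shows the principle only; it is not BayesHammer's actual (Bayesian clustering) algorithm:

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i+k]] += 1
    return counts

# Three identical reads plus one read carrying a single error (G->A).
reads = ["ATGGCGT", "ATGGCGT", "ATGGCGT", "ATGACGT"]
counts = kmer_counts(reads, k=4)
solid = {km for km, n in counts.items() if n >= 2}   # trusted k-mers
weak  = {km for km, n in counts.items() if n == 1}   # likely errors
print(sorted(weak))   # the 4 k-mers unique to the erroneous read
```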


The SPAdes software is easy to install, has a nice clean interface, and follows my minimum standards for bioinformatics software. The authors are actively developing it, and respond to bug reports and questions. The results are good, and the computational requirements are reasonable. It is well worth trying on your own microbial data. So go and download it, try it out, and email your feedback today.

Sunday, 25 August 2013

Paired-end read confusion - library, fragment or insert size?


When you do an Illumina sequencing run, you need to choose between single-end (SE) or paired-end (PE) sequencing. When sequencing, we chop up our DNA into small fragments, and then ligate some adaptors. Then, for SE, we only sequence one end of a DNA fragment. For PE, we sequence both ends of the same fragment:

fragment                  ========================================
fragment + adaptors    ~~~========================================~~~
SE read                   --------->
PE reads                R1--------->                    <---------R2
unknown gap                         ....................

The two reads you get from PE sequencing are referred to as R1 and R2, and they come from the same piece of DNA. Usually the fragment is much longer than R1+R2 combined, so there is a "gap" in between them. Although we don't know the sequence of the DNA between R1 and R2, we have still gained useful information: we know R1 and R2 are next to each other, in a known orientation, and an approximately known distance apart.

Mind the gap

There is a lot of confusion about the gap of unknown bases. You will encounter terms like "insert size", "fragment size", "library size" and variations thereof. The term "insert" comes from a time before NGS existed, when cloning DNA into E. coli vectors was standard business.

PE reads      R1--------->                    <---------R2
fragment     ~~~========================================~~~
insert          ========================================
inner mate                ....................

The main confusion is with "insert size". The name itself suggests it is the unknown gap, because it is "inserted" between R1 and R2, but this is misleading. It is more accurate to think of the insert as the piece of DNA inserted between the adaptors, which enable amplification and sequencing of that piece of DNA. So the "insert" actually encompasses R1 and R2 as well as the unknown gap between them. The gap itself is better called the "inner mate distance", because that name is self-descriptive, and its value varies depending on what read lengths you sequenced the DNA library with.
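The relationships between these terms, as defined above, are simple arithmetic:

```python
# inner mate distance = insert size - len(R1) - len(R2)
# (negative when R1 and R2 overlap each other)
def inner_mate_distance(insert_size, r1_len, r2_len):
    return insert_size - r1_len - r2_len

print(inner_mate_distance(500, 100, 100))   # 300: a real gap between R1 and R2
print(inner_mate_distance(400, 250, 250))   # -100: the reads overlap
```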

Overlapping reads

The Illumina MiSeq instrument has added to the confusion recently. Firstly, it can produce PE reads of length 250bp. Secondly, the Nextera preparation method is sensitive and can produce a lot of small fragments, shorter than 500bp. This results in R1 and R2 actually overlapping each other!

fragment          ~~~========================================~~~
insert               ========================================
R1                   ------------------------->                    
R2                                   <-----------------------
overlap                              ::::::::::
stitched SE read     --------------------------------------->

This can actually be a desirable outcome, as you can stitch R1 and R2 together to make a super-long SE read, with extra confidence in the middle bases thanks to the consensus of the overlapping sections of R1 and R2.
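A naive stitcher just looks for the longest suffix of R1 matching a prefix of R2 (with R2 already reverse-complemented onto the same strand as R1). The reads below are made up; real stitching tools also use base qualities and tolerate mismatches:

```python
def stitch(r1, r2, min_overlap=5):
    """Merge r1 and r2 if they share a sufficiently long exact overlap."""
    for olen in range(min(len(r1), len(r2)), min_overlap - 1, -1):
        if r1[-olen:] == r2[:olen]:
            return r1 + r2[olen:]
    return None   # no usable overlap found

r1 = "ACGTACGTACGTAA"
r2 = "ACGTAACCGGTTAA"          # begins with the last 6 bases of R1
print(stitch(r1, r2))          # ACGTACGTACGTAACCGGTTAA
```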

Adaptor read-through

If the fragment size distribution is centred too low, or is very wide, you can get the situation where not only do the reads overlap, but they are longer than the fragment itself! This causes R1 and R2 to read into the adaptors:

tiny fragment       ~~~~========================~~~~
insert                  ========================
R1                      -------------------------->                    
R2                   <--------------------------
read-through         !!!                        !!!

If your MiSeq is configured properly, it will automatically trim/mask any adaptor sequence. This will be obvious from your FASTQ file containing reads of different lengths, or from the presence of lots of Ns at the 3' end of your reads. If it is not configured properly, you will get adaptors in your reads, and these will cause all sorts of problems with downstream applications. You should remove them using a read trimming tool.
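A 3' adaptor trimmer can be as simple as this sketch. The adaptor shown is the start of the common Illumina adaptor sequence; real trimmers (e.g. cutadapt, Trimmomatic) also catch partial and mismatched adaptor hits at the read end:

```python
ADAPTOR = "AGATCGGAAGAGC"   # start of the common Illumina adaptor

def trim_adaptor(read, adaptor=ADAPTOR):
    """Keep only the bases before the first exact adaptor match, if any."""
    pos = read.find(adaptor)
    return read if pos == -1 else read[:pos]

print(trim_adaptor("ACGTACGTAGATCGGAAGAGCTTTT"))   # ACGTACGT (adaptor removed)
print(trim_adaptor("ACGTACGTACGT"))                # unchanged: no adaptor found
```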


Paired-end reads are a neat molecular biology trick. Remember that "insert" refers to the DNA fragment between the adaptors, and not the gap between R1 and R2. Instead we refer to that as the "inner mate distance". In some cases, when reads overlap, the inner mate distance can actually be negative. If you are using MiSeq data, you need to be vigilant about checking for adaptor read-through and overlapping reads.


Friday, 9 August 2013

Minimum standards for bioinformatics command line tools

I don't consider myself a good software engineer, a good tester, a good documenter, or even that good a programmer. But I have used, and (tried to) install, a LOT of bioinformatics software over the last 12 years. I've also released a lot of software, and I try to make it as painless to use as possible. From these experiences, I bring you my "Ten rules for bioinformatics software".

1. Print something if no parameters are supplied

Unless your tool is a filter which works by manipulating stdin to stdout, you should always print out something (some help text, ideally) if the user runs your tool without all the required parameters. Just exiting quietly isn't helping anyone.

% biotool
Please use the --help option to get usage information.

2. Always have a "-h" or "--help" switch

The Unix tradition is for all commands to have a "-h" or "--help" switch, which when invoked, prints usage information about the command. Most languages come with a getopt() type library, so there is no excuse for not supporting this.

% biotool -h
Usage: biotool [options] <file.fq>
--rc       reverse complement
--trim nn  trim <nn> bases from 3' end first
--mask     remove vector sequence contaminant

3. Have a "-v" or "--version" switch

Many bioinformatics tools today are used as part of larger pipelines, or put into the Galaxy toolshed. Because compatibility depends on the version of your tool being used, you should have a simple, machine-parseable way to identify which version of the tool you have.

% biotool --version
biotool 1.3a
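As an illustration, in Python the argparse library gives you rules 1-3 almost for free. The biotool options below are hypothetical, mirroring the usage examples above:

```python
#!/usr/bin/env python3
import argparse
import sys

def build_parser():
    # argparse supplies -h/--help automatically (rule 2);
    # --version takes one extra line (rule 3).
    parser = argparse.ArgumentParser(
        prog="biotool", description="example bioinformatics tool")
    parser.add_argument("--version", action="version", version="biotool 1.3a")
    parser.add_argument("--rc", action="store_true", help="reverse complement")
    parser.add_argument("fastq", help="input FASTQ file")
    return parser

def main(argv):
    parser = build_parser()
    if not argv:                       # rule 1: never exit silently
        parser.print_help(sys.stderr)
        raise SystemExit(1)
    return parser.parse_args(argv)

args = main(["--rc", "reads.fq"])
print(args.rc, args.fastq)             # True reads.fq
```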

4. Use stderr for messages and errors

If you need to print an error message, or are just printing out progress or log information, use stderr rather than stdout. Try to reserve stdout for your output channel, so that it can be used in Unix pipes to avoid temporary files.

% biotool reads.fq | fq2fa > clean.fq
biotool: processing reads.fq
fq2fa: converted 423421 reads

5. Validate your parameters

If you have command line options, do some validation or sanity checking on them before letting them through to your critical code. Many getopt() libraries support basic validation, but ultimately it is not that difficult to have a preamble with some "if not XXX { print ERROR ; exit }" clauses.

% biotool --trim -3 reads.fq 
Error: --trim must be an integer > 0
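As a sketch, the validation behind that error message only needs a few lines (validate_trim is a hypothetical helper):

```python
import sys

def validate_trim(value):
    """Accept only a positive integer for --trim, or die with a clear error."""
    try:
        n = int(value)
    except ValueError:
        n = -1
    if n <= 0:
        print("Error: --trim must be an integer > 0", file=sys.stderr)
        raise SystemExit(1)
    return n

print(validate_trim("5"))   # 5
```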

6. Don't hard-code any paths

Often the tool you write depends on some other files, such as config files or database/model files. The easiest, but wrong and annoying, thing to do is to hard-code an absolute path to those files, one which only exists on your own computer:

% biotool --mask reads.fq 
Error: can't load /home/steven/work/biotool/data/vector.seq
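A portable alternative is to locate the data files relative to the installed script itself. A Python sketch (the data/vector.seq layout is hypothetical; Perl's FindBin does the same job):

```python
import os

# Resolve paths relative to wherever this script was installed,
# instead of hard-coding /home/steven/... into the code.
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
DATA_DIR = os.path.join(SCRIPT_DIR, "data")
VECTOR_FILE = os.path.join(DATA_DIR, "vector.seq")   # hypothetical data file
```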

7. Don't pollute the command-line name space

You've come up with a new tool called "BioTool". The command you want everyone to invoke is called "biotool", but it is just a master script which runs lots of other tools. Unfortunately you used lots of generic names like "fasta2fastq", "convert", "filter" .. and so on, and you've put them all in the same folder as the main "biotool" script. So when I install BioTool, my PATH gets filled with rubbish. Please don't do this.

% ls -1 /opt/BioTool/
convert      # whoops, clashes with ImageMagick! # hello Titus :-)
diff         # whoops, clashes with standard Unix tool!      # <face-palm>

The first solution is to prefix all your sub-tools and helper scripts with "biotool". The second solution, if they are scripts only, is to not make them executable (so they don't go in PATH) and invoke them via the interpreter (perl, python, ...) explicitly from biotool. The third solution is to put them all in a separate folder (eg. auxiliary/, scripts/ ...) and explicitly call them (but take note of #6 above).

8. Don't distribute bare JAR files

If your tool is written in Java and is distributed as a JAR file, please write a simple shell wrapper script to make it simple to invoke. The three lines below are all you need (in the simple case) and you will make your users much happier.

#!/bin/bash
PREFIX=$(dirname "$0")
java -Xmx500m -jar "$PREFIX/BioTool.jar" "$@"

9. Check that your dependencies are installed

I've installed BioTool, and I start running it, and all looks good. Then 2 hours later it spits out an error like "error: can't run sff2CA". This could all be avoided if biotool checked for all the external tools it needs before it commenced, and it would save your users from associating your software with pain.

% biotool --stitch R1.fq R2.fq
This is biotool 1.3a
Loaded config
Checking for 'bwa': found /usr/bin/bwa
Checking for 'samtools': ERROR - could not find 'samtools'
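A dependency check like that takes only a few lines at start-up; here is a sketch in Python using shutil.which() (the tool names are whatever your pipeline actually calls):

```python
import shutil
import sys

def check_dependencies(tools):
    """Report each required external tool; return the list of missing ones."""
    missing = []
    for tool in tools:
        path = shutil.which(tool)
        if path:
            print(f"Checking for '{tool}': found {path}", file=sys.stderr)
        else:
            print(f"Checking for '{tool}': ERROR - could not find '{tool}'",
                  file=sys.stderr)
            missing.append(tool)
    return missing

missing = check_dependencies(["no-such-tool-in-path"])   # deliberately absent
# A real tool would now bail out early: sys.exit(1) if missing else carry on.
```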

10. Be strict if you are still a Perl tragic like me

If you're old like me and Perl is still your native tongue, at least play it a little bit safer by starting all your scripts with the following lines:

#!/usr/bin/env perl
use strict;
use warnings;
use Fatal;

I'll shut up now :-)

Thursday, 1 August 2013

How to pronounce "de Bruijn"

For those who work in bioinformatics the phrase "de Bruijn graph" will be familiar. Although it technically refers to a mathematical construct, it is commonly used in bioinformatics to refer to the data structure used to store DNA read connections in many de novo assembly algorithms such as EULER (*), Velvet, Gossamer and lots of other implementations.

Although we all agree the de Bruijn graph is a "really neat idea", what we don't seem to agree on is how to pronounce it!

Now, the de Bruijn graph is named after the mathematician Nicolaas Govert de Bruijn. Nicolaas was from the Netherlands, so he is Dutch. The word "bruijn" essentially means the colour brown - although modern spelling would be "bruin".

Here are the various ways I have heard it pronounced, some of them in successive talks at ISMB 2013 in the same session even:

  • broon
  • broo-en, brewin' 
  • bra-jen, brar-djen
  • brin
  • brun
  • bruggin, bruggen
  • broin, broyn

All of these are wrong. Even this BioStar post appears to be wrong...

The closest sounding word in English is "BROWN".
The closest sounding word in German is probably "BRAUN".

Here is a recording of his whole name being pronounced in Dutch. 
If you want to add some authenticity, try and roll the "R" a little bit.

Hopefully this puts an end to the confusion once and for all.

(*) "Euler" is another German mathematician whose name is often mispronounced in English. Most people say "yool-er" but it should be "oiler" (audio file). So assembly software takes an "Oil-air-e-an Path through a de Brown graph" :-)

Tuesday, 25 June 2013

Extending a Windows XP partition in VirtualBox

The Problem

My desktop computer runs Ubuntu Linux (on a Mac Mini!), but sometimes I am forced to use Microsoft Word to edit a document that LibreOffice/OpenOffice Writer can't handle, or use iTunes (ech.) to fix the kids' iPad. To do this I use the excellent VirtualBox software to run a Windows XP virtual machine within Ubuntu. I originally created a 10GB virtual disk image, but soon realised that wasn't enough space if I wanted to also back up the 64GB iPad completely. Fortunately, VDI disk images can be increased in size, but my Googling turned up a variety of methods which were all a bit confusing, so this post summarises what I had to do for my exact situation, which I didn't come across anywhere else.

My Situation

  • Aim: Grow my disk image storage from 10GB to 80GB
  • Partition type: NTFS
  • Host operating system: Ubuntu 13.04
  • Virtual operating system: Windows XP SP3

How to do it

  1. Locate the disk image: On Ubuntu, VirtualBox stores your machine data in "$HOME/VirtualBox VMs" by default. In there find the appropriate .VDI file. Mine was called "WinXP.vdi".
  2. Expand the VDI image: 
    On Ubuntu you use the "VBoxManage" command to do this. You do not need to "sudo" or be root as the files are all owned by you. This is the command I used to increase it to ~80GB:
    VBoxManage modifyhd "$HOME/VirtualBox VMs/WinXP.vdi" 81920
  3. Extend the NTFS partition: Although you've extended the virtual disk image, Windows XP still thinks it only has 10GB, because the original partition on the disk was 10GB. We need to extend the NTFS partition into the new space we made. Because my only partition holds both the operating system and all my data, I couldn't do this from within Windows XP itself; you need to boot something else to modify the partition. I downloaded the free GParted Live CD ISO image and attached it to my virtual machine using the Settings and Storage menu in VirtualBox. I also made sure the virtual CD-ROM was before the VDI image in the boot order.
    After starting the machine, the GParted interface eventually booted. I could see my 80GB disk with a 10GB NTFS partition on it. I chose the partition, and clicked "Resize" and typed in the full size of the disk (81920 MB). After clicking extend, it only took a few seconds to complete.
  4. Reboot Windows:
    I shut down the machine, removed the ISO image from the Storage menu, and restarted it. Windows XP started up "chkdsk" as it noticed something had changed :-) I had no errors or issues, and examination of drive C: showed it was now 80GB in size. Done!


Virtual machines and virtual disk images are awesome. I'm glad to be finished with meddling with multi-boot and repartitioning real disks. I hope this post was helpful to you.

Saturday, 30 March 2013

Bacterial genome annotation systems


You get a bacterial isolate. You sequence it. You manage to get some contigs after mucking around with some de novo assembly software. Now what? Annotation of course! Your FASTA file is teeming with lifeless chunks of bacterial DNA yearning to be adorned with insightfully labelled features, so it can get some more attention from you, and maybe even be reunited with some old friends in Genbank/ENA. If this sounds familiar, then this blog post is for you.

What is genome annotation?

Genome annotation is the process of identifying features of interest on a genome sequence. Some of the features relevant to bacterial genomes are protein coding genes, non-coding RNAs, and operons. Features can have all sorts of useful information associated with them in addition to their genomic location and feature type. For example, a protein-coding gene annotation could include items such as the predicted protein product, whether it has a signal peptide, a gene abbreviation and an enzyme classification number. The accuracy and richness of a genome annotation is important, and sometimes critical, to downstream biological interpretation.

In the old days, a basic ORF finder would be run over the contigs. Then the truly dedicated curators would comb over the ORFs, trim them back to good-looking start codons, delete spurious-looking ORFs, and so on. Later, gene prediction software and BLASTX helped bootstrap this process further. Now there are various "automatic annotation" systems which do a reasonably good job. Manual refinement of the automatic annotation can then be done using curation applications.

Below I list the tools I am aware of for performing and curating bacterial genome annotation. If I've missed any please let me know and I will add them.

Web submission systems

Standalone systems

Curation systems

  • Manatee (web interface + database backend)
  • Artemis (local app, can be combined with Chado SQL backend)
  • Apollo (local app, can be combined with Chado SQL backend)
  • Wasabi (my old non-public awful CGI/Perl/make mess that we still use internally)


Beware of systems claiming to do "microbial" annotation. Most of them are only designed for annotating bacteria. They will perform poorly on viruses, fungi and other microbes.