Sunday 14 October 2018

A Unix one-liner to call bacterial variants

Introduction

Variant finding is the generic term for finding differences between two genome sequences. These differences can take many forms, such as SNPs and small INDELs, large changes in DNA content caused by mobile elements, and structural changes like chromosomal inversions.

The genomes we want to compare could either be assemblies (complete or draft) or just sequencing reads (FASTQ files). The bulk of microbial variant finding tools focus on small differences (< 20 bp), and work by comparing a FASTQ sample to a assembled genome, typically called the "reference". A common use case is to sequence your isolate of interest, and see how it differs to the type strain in Genbank.

Calling SNPs with a one-liner

Let us assume we have a reference genome in FASTA format in the REF variable, the paired Illumina FASTQ files in R1 and R2, and the number of CPU cores you want to use in CPUS. Then, the follow "one-liner" will generate a VCF file with no intermediate files.

CPUS=4
REF=ref.fa
R1=R1.fastq.gz
R2=R2.fastq.gz

minimap2 -a -x sr -t "$CPUS" "$REF" "$R1" "$R2" \
 | samtools sort -l 0 --threads "$CPUS" \
 | bcftools mpileup -Ou -B --min-MQ 60 -f "$REF" - \
 | bcftools call -Ou -v -m - \
 | bcftools norm -Ou -f "$REF" -d all - \
 | bcftools filter -Ov -e 'QUAL<40 || DP<10 || GT!="1/1"' 
 > variants.vcf

The reads are aligned to the reference, and sorted by coordinate. Instead of saving the BAM file, we pipe it directly to a series of BCF tool steps. Note the use of -l 0 and -Ou to keep the piped data in an uncompressed form, to avoid repeated compression/decompression steps. The --min-MQ 60 ensures only uniquely mapped reads are used. The final filter step removes low quality variant calls, heterzygous calls (this is haploid bacteria), and any regions with less than 10 supporting reads.

Here is a summary of the results of this one-liner using some data for Pasteurella multocida. You can download with these links: REF (.fa.gz), R1 (.fastq.gz) and R2 (.fastq.gz).

bcftools stats variants.vcf | grep '^SN' | cut -f3-

number of samples:      1
number of records:      36618
number of no-ALTs:      0
number of SNPs: 36408
number of MNPs: 0
number of indels:       210
number of others:       0
number of multiallelic sites:   0
number of multiallelic SNP sites:       0

The one-liner found ~36,000 SNPs and 210 INDELs.

Existing software

This is not exhaustive list; just those I have encountered that are useful for bacterial data.

Full pipelines

Variant caller only

Conclusion

Variant calling in bacteria is both "easy" and "hard". If your sequencing data is high quality, not contaminated, and came from a pure colony, then finding most of variants will be relatively easy. The problems crop up when alignments aren't filtered correctly, reads are a mix of isolates, and the reference is too divergent from the isolate.

I have made many mistakes in SNP calling over the years. My own pipeline Snippy is now at version 4.3 and I'm still not completely happy with its performance, but it is "good enough" for its primary task which is to generate core-genome SNP phylogenies and find differences between mutants and wildtype in mutagenesis experiments.

Should you use the one-liner presented above? Probably not - the various available SNP pipelines have done a lot of tuning and offer extra useful features. However it is an excellent learning opportunity to examine each of the steps involved and the parameters that affect the output results.

It's always a trade-off between false positives and false negatives!

Further reading