Variant finding is the generic term for finding differences between two genome sequences. These differences can take many forms, such as SNPs and small INDELs, large changes in DNA content caused by mobile elements, and structural changes like chromosomal inversions.
The genomes we want to compare could either be assemblies (complete or draft) or just sequencing reads (FASTQ files). The bulk of microbial variant finding tools focus on small differences (< 20 bp), and work by comparing a FASTQ sample to a assembled genome, typically called the "reference". A common use case is to sequence your isolate of interest, and see how it differs to the type strain in Genbank.
Calling SNPs with a one-liner
Let us assume we have a reference genome in FASTA format in the REF variable, the paired Illumina FASTQ files in R1 and R2, and the number of CPU cores you want to use in CPUS. Then, the follow "one-liner" will generate a VCF file with no intermediate files.
CPUS=4 REF=ref.fa R1=R1.fastq.gz R2=R2.fastq.gz minimap2 -a -x sr -t "$CPUS" "$REF" "$R1" "$R2" \ | samtools sort -l 0 --threads "$CPUS" \ | bcftools mpileup -Ou -B --min-MQ 60 -f "$REF" - \ | bcftools call -Ou -v -m - \ | bcftools norm -Ou -f "$REF" -d all - \ | bcftools filter -Ov -e 'QUAL<40 || DP<10 || GT!="1/1"' > variants.vcf
The reads are aligned to the reference, and sorted by coordinate. Instead of saving the BAM file, we pipe it directly to a series of BCF tool steps. Note the use of -l 0 and -Ou to keep the piped data in an uncompressed form, to avoid repeated compression/decompression steps. The --min-MQ 60 ensures only uniquely mapped reads are used. The final filter step removes low quality variant calls, heterzygous calls (this is haploid bacteria), and any regions with less than 10 supporting reads.
bcftools stats variants.vcf | grep '^SN' | cut -f3- number of samples: 1 number of records: 36618 number of no-ALTs: 0 number of SNPs: 36408 number of MNPs: 0 number of indels: 210 number of others: 0 number of multiallelic sites: 0 number of multiallelic SNP sites: 0
The one-liner found ~36,000 SNPs and 210 INDELs.
This is not exhaustive list; just those I have encountered that are useful for bacterial data.
- Snippy - this is my pipeline
- NASP - Northern Arizona SNP Pipeline
- CFSAN SNP Pipeline - used at CFSAN, FDA
- LYVE-Set - used at CDC
- Breseq - Barrick Lab
- SPANDX - Derek Saravoich and Erin Price
Variant caller only
- VarScan - Dan Koboldt
- GATK - The Broad Institute
- Freebayes - Erik Garrison
- Lofreq - Andreas Wilm, Niranjan Nagarajan
- SAMtools/BCFtools - The originals!
Variant calling in bacteria is both "easy" and "hard". If your sequencing data is high quality, not contaminated, and came from a pure colony, then finding most of variants will be relatively easy. The problems crop up when alignments aren't filtered correctly, reads are a mix of isolates, and the reference is too divergent from the isolate.
I have made many mistakes in SNP calling over the years. My own pipeline Snippy is now at version 4.3 and I'm still not completely happy with its performance, but it is "good enough" for its primary task which is to generate core-genome SNP phylogenies and find differences between mutants and wildtype in mutagenesis experiments.
Should you use the one-liner presented above? Probably not - the various available SNP pipelines have done a lot of tuning and offer extra useful features. However it is an excellent learning opportunity to examine each of the steps involved and the parameters that affect the output results.
It's always a trade-off between false positives and false negatives!