When you set up an Illumina sequencing run, you need to choose between single-end (SE) and paired-end (PE) sequencing. During library preparation, we chop up our DNA into small fragments, and then ligate adaptors to both ends. Then, for SE, we only sequence one end of each DNA fragment. For PE, we sequence both ends of the same fragment:
    fragment                  ========================================
    fragment + adaptors    ~~~========================================~~~
    SE read                   --------->
    PE reads                R1--------->                    <---------R2
    unknown gap                         ....................
The two reads you get from PE sequencing are referred to as R1 and R2, and they come from the same piece of DNA. Usually the fragment is much longer than R1 and R2 combined, so there is a "gap" in between them. Although we don't know the sequence of DNA in between R1 and R2, we have still gained useful information: R1 and R2 come from the same fragment, in a known orientation and an approximately known distance apart.
Mind the gap
There is a lot of confusion about the gap of unknown bases. You will encounter terms like "insert size", "fragment size", "library size" and variations thereof. The term "insert" comes from a time before NGS existed, when cloning DNA in E. coli vectors was standard business.
    PE reads                R1--------->                    <---------R2
    fragment               ~~~========================================~~~
    insert                    ========================================
    inner mate                          ....................
The main confusion is with "insert size". The name itself suggests it is the unknown gap because it is "inserted" between R1 and R2, but this is misleading. It is more accurate to think of the insert as the piece of DNA inserted between the adaptors, which enable amplification and sequencing of that piece of DNA. So the "insert" actually encompasses R1 and R2 as well as the unknown gap between them. The gap itself is better called the "inner mate distance", because the name is self-descriptive. Unlike the insert size, it varies with the read lengths used to sequence a DNA library.
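To make the arithmetic concrete, here is a minimal Python sketch. The function name and the example sizes are made up for illustration, and it assumes the insert size is known exactly, which in practice it rarely is:

```python
def inner_mate_distance(insert_size: int, r1_len: int, r2_len: int) -> int:
    """Bases of the insert that neither read covers.

    The insert includes R1, R2 and the gap, so the gap is what is
    left after subtracting both read lengths.
    """
    return insert_size - (r1_len + r2_len)


# Same hypothetical 500 bp insert, sequenced with two different read lengths.
print(inner_mate_distance(500, 100, 100))  # 2 x 100 bp reads
print(inner_mate_distance(500, 150, 150))  # 2 x 150 bp reads
```

The insert size is a property of the library, while the inner mate distance shrinks as the reads get longer.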
The Illumina MiSeq instrument has added to the confusion recently. Firstly, it can produce PE reads of length 250 bp, so R1 and R2 together cover up to 500 bp. Secondly, the Nextera preparation method is sensitive and can produce a lot of small fragments, shorter than 500 bp. This results in R1 and R2 actually overlapping each other!
    fragment               ~~~========================================~~~
    insert                    ========================================
    R1                        ------------------------->
    R2                                        <-----------------------
    overlap                                   ::::::::::
    stitched SE read          --------------------------------------->
This can actually be a desirable outcome, as you can stitch R1 and R2 together to make a super-long SE read, with extra confidence of the middle bases from consensus of the overlapping sections of R1 and R2.
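As a rough illustration of stitching, the Python sketch below merges a pair when the end of R1 exactly matches the start of the reverse-complemented R2. The function names and the `min_overlap` default are assumptions for this example. It does not build a quality-aware consensus or tolerate mismatches, both of which a dedicated read-merging tool would handle.

```python
from typing import Optional


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (A/C/G/T/N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def stitch(r1: str, r2: str, min_overlap: int = 10) -> Optional[str]:
    """Merge R1 and R2 into one long read, or return None if no overlap is found.

    R2 is read from the opposite strand, so it is reverse-complemented first.
    The longest exact suffix/prefix match of at least min_overlap bases wins.
    """
    r2_rc = reverse_complement(r2)
    for olap in range(min(len(r1), len(r2_rc)), min_overlap - 1, -1):
        if r1[-olap:] == r2_rc[:olap]:
            return r1 + r2_rc[olap:]
    return None


# Simulate a 40 bp insert sequenced with reads that overlap by 10 bp.
insert = "ACGTTGCAAGCTTAGGCTAACCGGTATCGATTGCACGTAA"
r1 = insert[:26]
r2 = reverse_complement(insert[16:])
print(stitch(r1, r2))
```

Exact matching keeps the sketch short, but real reads contain sequencing errors, so treat this as a picture of the idea and not a replacement for a proper merging tool.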
If the fragment size distribution sits too low, or is very wide, some fragments will be shorter than a single read. Then not only do the reads overlap, but they are longer than the fragment itself! This causes R1 and R2 to read into the adaptors:
    tiny fragment         ~~~~========================~~~~
    insert                    ========================
    R1                        -------------------------->
    R2                     <--------------------------
    read-through           !!!                        !!!
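The three situations drawn so far (gap, overlap, read-through) follow from comparing the insert size with the read length. Here is a small Python sketch, assuming both reads have the same length; the function name and labels are invented for this example:

```python
def pair_layout(insert_size: int, read_len: int) -> str:
    """Classify how a read pair sits on its insert.

    Assumes R1 and R2 are both read_len bases long.
    """
    if insert_size < read_len:
        # Each read runs off the end of the insert into the adaptor.
        return "read-through"
    if insert_size < 2 * read_len:
        # The reads meet in the middle and share some bases.
        return "overlap"
    # The reads do not overlap; the gap may be zero bases long.
    return "gap"


for size in (600, 400, 200):
    print(size, pair_layout(size, 250))
```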
If your MiSeq is configured properly, it will automatically trim/mask any adaptor sequence. This will be obvious by your FASTQ file containing reads of different lengths, or by the presence of lots of Ns at the 3' end of your reads. If it is not configured properly, you will get adaptors in your reads, and these will cause all sorts of problems with downstream applications. You should remove these using a read trimming tool.
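To show what trimming read-through involves, here is a deliberately simple Python sketch. The function name, the `min_match` default and the adaptor string in the example are illustrative assumptions, so substitute the adaptor sequence for your own library kit. It only handles exact matches, so use a real trimming tool for actual data.

```python
def trim_adaptor(read: str, adaptor: str, min_match: int = 5) -> str:
    """Remove adaptor sequence from the 3' end of a read (exact matches only)."""
    # Case 1: the whole adaptor appears in the read; drop it and everything after it.
    idx = read.find(adaptor)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends partway through the adaptor; look for the
    # longest adaptor prefix (at least min_match bases) at the read's end.
    for k in range(min(len(adaptor), len(read)) - 1, min_match - 1, -1):
        if read.endswith(adaptor[:k]):
            return read[:-k]
    return read


adaptor = "AGATCGGAAGAGC"  # illustrative value, not necessarily your kit's adaptor
print(trim_adaptor("ACGTACGTAC" + adaptor + "TTT", adaptor))  # full adaptor present
print(trim_adaptor("ACGTACGTAC" + adaptor[:7], adaptor))      # partial adaptor at the end
print(trim_adaptor("ACGTACGTACGT", adaptor))                  # no adaptor
```

Because the adaptor always follows the insert, the match is searched for towards the 3' end and everything downstream of it is discarded.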
Paired-end reads are a neat molecular biology trick. Remember that "insert" refers to the DNA fragment between the adaptors, and not the gap between R1 and R2. Instead we refer to that as the "inner mate distance". In some cases, when reads overlap, the inner mate distance can actually be negative. If you are using MiSeq data, you need to be vigilant about checking for adaptor read-through and overlapping reads.