DNA sequencing
DNA (short for deoxyribonucleic acid) is a long string-like molecule made up of subunits called nucleotides (you may also see these called bases). DNA acts as a ‘code’: in the form of genes, it contains the biological information an organism needs to develop and operate. A gene is a length of DNA that encodes a specific protein.
The information in DNA is encoded by the sequence of four chemical nucleotides: adenine (A), thymine (T), cytosine (C), and guanine (G). In a DNA molecule, nucleotides are arranged as two long strands forming a spiral known as a ‘double helix’. Each nucleotide forms a pair with the nucleotide on the opposite strand. Nucleotides pair in a specific way: A always pairs with T, and C always pairs with G.
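Because each base has exactly one partner, the bases on one strand determine the bases on the other. The short Python sketch below illustrates this pairing rule; the names `PAIRS` and `complement` are made up for the example and are not part of any standard library.

```python
# Pairing rules from the text: A pairs with T, and C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return, position by position, the base that pairs with each base of `strand`."""
    return "".join(PAIRS[base] for base in strand.upper())


if __name__ == "__main__":
    print(complement("ATGCCG"))  # TACGGC
```

Applying `complement` twice returns the original strand, which reflects the fact that the two strands carry the same information.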
DNA sequencing is the process of determining the sequence of nucleotides in a piece of DNA. Establishing the DNA sequence of an organism is key to understanding the function of genes.
Bacterial genome sequencing
The sequence of all the DNA found in a given organism is termed the ‘genome sequence’. Genomic data is data related to the content, structure, and function of an organism’s genome. The study of genomic data has transformed our understanding of how bacteria evolve and function, and of how clinically relevant characteristics such as antimicrobial resistance arise and spread.
The first bacterial genome sequence was determined in 1995 by Robert Fleischmann and colleagues, for the bacterium Haemophilus influenzae. To do this, they used the Sanger sequencing method. They discovered that the complete genome sequence for H. influenzae consists of around 1,700 genes encoded by over 1.8 million bases.
Sanger sequencing
Sanger sequencing was developed by Frederick Sanger and colleagues in 1977 and commercialised in the 1980s.
Sanger sequencing involves breaking the DNA of a genome into many smaller pieces, each of around 500 to 1,000 bases, sequencing those pieces, then aligning the overlapping regions in order to assemble the entire DNA sequence. Fragments of DNA up to ~900 bases long can be sequenced as one ‘read’, with 99.9% accuracy [1].
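The idea of joining reads by their overlapping regions can be illustrated with a toy example. The Python sketch below is a simplified greedy approach, not the method used by real assembly software. It assumes error-free reads that all come from the same strand, and the function names and the `min_overlap` parameter are invented for the illustration.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest end of `left` that matches the start of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble(reads: list[str], min_overlap: int = 3) -> str:
    """Repeatedly merge the two pieces with the longest overlap until one remains."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            raise ValueError("reads do not overlap enough to be joined")
        # Join the two pieces, writing the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs[0]


if __name__ == "__main__":
    # Three short overlapping 'reads' taken from the sequence ATGGCGTACGTTAGC.
    reads = ["CGTTAGC", "ATGGCGTA", "CGTACGTT"]
    print(assemble(reads))  # ATGGCGTACGTTAGC
```

Real reads are hundreds of bases long and contain occasional errors and repeated regions, so practical assembly tools use considerably more careful strategies than this greedy merge.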
Sanger sequencing was the most widely used method of DNA sequencing until the 2010s, when Next-Generation sequencing methods began to be used for large-scale genome analyses due to their increased speed and efficiency and lower costs. Sanger sequencing is still frequently used for small individual pieces of DNA, or for validation of Next-Generation sequencing where high accuracy is required.
How does Sanger sequencing work?
Sanger sequencing uses a chain-termination reaction involving fluorescently labelled nucleotides. The following ‘ingredients’ are required:
- The ‘template DNA’ – the double-stranded DNA molecule to be sequenced. A DNA molecule longer than ~900 bases is first cut into smaller fragments, each of which is sequenced separately.
- A DNA primer – a short piece of single-stranded DNA that binds to the template DNA and is required as a starting point for DNA polymerase.
- DNA polymerase – a protein that carries out the DNA synthesis reaction.
- The four nucleotides (A, T, C, and G), which are known as deoxynucleotide triphosphates or dNTPs.
- Small amounts of chain-terminating nucleotides, known as dideoxynucleotide triphosphates or ddNTPs. The ddNTPs are labelled with a fluorescent dye, each in a different colour.
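To see why chain-terminating nucleotides are useful, it can help to model the reaction in a few lines of Python. The sketch below is a deliberately idealised illustration: it assumes that synthesis stops exactly once at every possible position, and that the resulting fragments can be sorted by length. The dye colours and function names are invented for the example.

```python
import random

# Illustrative colours only; the text says just that each ddNTP has its own colour.
DYE = {"A": "green", "T": "red", "C": "blue", "G": "yellow"}


def terminated_fragments(new_strand: str) -> list[str]:
    """Every possible product: synthesis stopped by a ddNTP at each position in turn."""
    return [new_strand[:end] for end in range(1, len(new_strand) + 1)]


def read_sequence(fragments: list[str]) -> str:
    """Order the fragments by length and read the labelled final base of each."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))


def colour_trace(fragments: list[str]) -> list[str]:
    """The dye colour seen for each fragment, from shortest to longest."""
    return [DYE[fragment[-1]] for fragment in sorted(fragments, key=len)]


if __name__ == "__main__":
    fragments = terminated_fragments("GATTACA")
    random.shuffle(fragments)        # the fragments are mixed together in the reaction
    print(read_sequence(fragments))  # GATTACA
    print(colour_trace(fragments))
```

Each fragment ends in a single labelled ddNTP, so the colour of a fragment reveals the base at its final position; reading the colours from the shortest fragment to the longest spells out the newly synthesised strand.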