Genotyping and gene expression analysis: Sanger to high throughput sequencing

Sequencing Technologies

Sanger Sequencing

Figure: Sanger sequencing. Image credit: Estevezj, CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0, via Wikimedia Commons

Sanger Sequencing, also known as the chain termination method, was developed by Frederick Sanger in 1977. It is a method of DNA sequencing based on the selective incorporation of chain-terminating dideoxynucleotides by DNA polymerase during in vitro DNA replication.

Principle: DNA polymerase incorporates dideoxynucleotides (ddNTPs), which lack the 3’-hydroxyl group needed to attach the next nucleotide, so the chain terminates wherever a ddNTP is incorporated.
Process: In the original method, the DNA sample is divided into four separate sequencing reactions, each containing the four normal dNTPs plus one of the four ddNTPs. Each reaction produces fragments of varying lengths that all end with that ddNTP, and separating the fragments by size reveals the sequence.
Read Length: About 500 to 900 bases per read.
Accuracy: Very high per-base accuracy, but low throughput compared to newer methods.
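
The chain termination idea above can be illustrated with a small simulation. This is an idealized sketch, not a model of real chemistry: it assumes every position of the newly synthesized strand yields exactly one terminated fragment, and it ignores the primer. The function names are made up for this example.

```python
def chain_termination_fragments(new_strand):
    """Return {ddNTP: [fragment lengths]} for the four reactions.

    A fragment of length n ends with the ddNTP matching new_strand[n - 1],
    so each reaction collects the positions of one base.
    """
    reactions = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        reactions[base].append(length)
    return reactions


def read_gel(reactions):
    """Read the sequence by ordering all fragments from shortest to longest."""
    bands = [(length, base)
             for base, lengths in reactions.items()
             for length in lengths]
    return "".join(base for _, base in sorted(bands))


reactions = chain_termination_fragments("GATTACA")
print(reactions["A"])       # fragment lengths seen in the ddATP reaction
print(read_gel(reactions))  # GATTACA
```

Sorting by fragment length plays the role of gel or capillary electrophoresis: the shortest fragment identifies the first base, the next shortest the second, and so on.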

Frederick Sanger developed the chain termination method for DNA sequencing

It uses modified nucleotides (ddNTPs) to terminate the sequencing reaction

Radioactive labeling was originally used to detect the sequence

It was later automated with fluorescent labels and capillary electrophoresis

Automated Sanger sequencing made the Human Genome Project possible

Why High Throughput Sequencing

The Human Genome Project took about 13 years, roughly $3 billion, and the work of many labs
Its cost and duration catalyzed the development of high throughput sequencing technologies
These technologies greatly increased speed and reduced cost
Illumina became the dominant platform

Illumina Sequencing

Illumina Sequencing is the most widely used form of next-generation sequencing (NGS). It is a massively parallel sequencing technology that offers high-throughput sequencing of DNA and RNA samples.

Principle: Uses sequencing by synthesis (SBS), where fluorescently labeled nucleotides are incorporated into a growing complementary strand, and each incorporation is detected by imaging its fluorescence.
Process: DNA is fragmented and adapters are ligated to fragments. The fragments are then attached to a solid surface and amplified to form clusters. Sequencing is performed in a flow cell, and millions of fragments are sequenced in parallel.
Read Length: Typically generates short reads ranging from 50 to 300 bases.
Throughput: Provides a high-throughput option, capable of sequencing millions of fragments simultaneously, making it suitable for large-scale genomic projects.
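
As a rough illustration of how fluorescence becomes sequence, the sketch below calls one base per cycle for a single cluster by picking the brightest channel. It assumes a four-channel readout with one intensity per base; the intensity values are invented, and real base callers also correct for effects such as channel cross-talk and phasing.

```python
CHANNELS = "ACGT"  # assumed channel order for this sketch


def call_bases(cycles):
    """Call one base per cycle from (A, C, G, T) intensity tuples."""
    calls = []
    for intensities in cycles:
        brightest = max(range(len(CHANNELS)), key=lambda i: intensities[i])
        calls.append(CHANNELS[brightest])
    return "".join(calls)


# Hypothetical intensities for one cluster over three cycles.
cluster_signal = [
    (0.9, 0.1, 0.0, 0.1),  # cycle 1: A channel brightest
    (0.1, 0.0, 0.8, 0.2),  # cycle 2: G channel brightest
    (0.0, 0.1, 0.1, 0.9),  # cycle 3: T channel brightest
]
read = call_bases(cluster_signal)
print(read)  # AGT
```

On a real flow cell the same calculation runs for millions of clusters in every cycle, which is where the throughput comes from.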

DNA fragments attach to the flow cell, and bridge amplification creates clusters of identical copies

Sequencing by synthesis uses reversible terminators, so one base is read per cycle

Paired-end sequencing reads each fragment from both ends, giving two reads per fragment

Limitations: detection of structural variants, de novo assembly of complex genomes, and sequencing of repetitive regions
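
Paired-end sequencing, mentioned above, can be sketched as follows. The sketch assumes the common convention that read 1 comes from the start of the fragment and read 2 is read from the opposite strand, so it is the reverse complement of the fragment's far end. The fragment sequence is made up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment, read_length):
    """Return (read1, read2) sequenced inward from the two fragment ends."""
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment)[:read_length]
    return read1, read2


fragment = "ACGTTGCAAGGCTTAACCGG"  # hypothetical 20-base fragment
read1, read2 = paired_end_reads(fragment, 6)
print(read1)  # ACGTTG
print(read2)  # CCGGTT
```

Because both reads come from one fragment of roughly known size, an aligner can use the distance and orientation between them, which helps in repetitive regions where a single short read would be ambiguous.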

Long Read Sequencing

Long Read Sequencing, often associated with technologies like PacBio Sequencing and Oxford Nanopore, provides the ability to sequence much longer DNA fragments.

Principle: Directly sequences single molecules of DNA, enabling the detection of long stretches of nucleotides in a single read without the need for assembly of short reads.
Process: PacBio uses single-molecule real-time (SMRT) sequencing, where a DNA polymerase incorporates labeled nucleotides along a DNA template strand. In Circular Consensus Sequencing (CCS), the same circularized molecule is read many times and the passes are combined into a high-accuracy HiFi read. Nanopore sequencing uses protein nanopores through which individual DNA molecules are threaded, and changes in ionic current are measured as nucleotides pass through.
Read Length: Capable of generating read lengths in the kilobase to megabase range.
Application: Particularly useful for genome assembly, identification of structural variants, and sequencing of areas with high GC content or repeats.
Long-read platforms can also detect DNA modifications (e.g., methylation) directly from native DNA during sequencing; nanopore sequencing does so without synthesizing a complementary strand.
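
The benefit of reading the same molecule several times can be shown with a per-position majority vote. This is a deliberately simplified sketch: it assumes the passes are already aligned and of equal length, whereas a real consensus caller has to align the passes and uses a statistical model rather than simple voting. The sequences are invented.

```python
from collections import Counter


def consensus(passes):
    """Return the majority base at each column of pre-aligned passes."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("passes must all have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )


# Five hypothetical passes over the same molecule, each with at most one error.
passes = [
    "ACGTTAGC",
    "ACGTAAGC",  # error at position 5
    "ACCTTAGC",  # error at position 3
    "ACGTTAGC",
    "ACGTTAGG",  # error at position 8
]
hifi_like = consensus(passes)
print(hifi_like)  # ACGTTAGC
```

Random errors rarely hit the same position in most passes, so the consensus is more accurate than any single pass.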

Example: Highly polymorphic MHC gene HLA-A

Short-read sequencing struggles in complex regions, where reads are too short to be placed uniquely

Long reads span repetitive regions and structural variants far better

Long-read sequencing is currently more expensive per base than short-read sequencing

PacBio and Oxford Nanopore are main long read sequencing technologies

Limitations: cost, throughput, accuracy (Oxford Nanopore’s newer duplex approach reads both strands of a DNA molecule to improve accuracy), and the amount and quality of input DNA required

Other Sequencing Technologies

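Multiplex Sequencing: Simultaneous sequencing of multiple samples in a single run, with each sample tagged by its own barcode (index) so that reads can be assigned back to their sample afterwards. (e.g., Illumina’s multiplexing, PacBio)
Virtual Long Reads: Short reads derived from the same long DNA molecule carry the same barcode, so they can be grouped computationally to recover long-range information (e.g., MGI’s stLFR)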
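
Assigning multiplexed reads back to their samples (demultiplexing) can be sketched as below. For simplicity the sketch assumes an inline barcode at the start of each read and exact matching; on Illumina instruments the index is typically read separately, and production tools usually tolerate mismatches. The barcodes, sample names, and reads are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_length):
    """Group reads by sample using the barcode at the start of each read."""
    bins = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_length], read[barcode_length:]
        sample = barcode_to_sample.get(barcode, "undetermined")
        bins[sample].append(insert)
    return dict(bins)


barcodes = {"ACGT": "sample_A", "TTAG": "sample_B"}  # hypothetical barcodes
reads = ["ACGTGGGCCC", "TTAGAAATTT", "ACGTCCCAAA", "GGGGTTTTAA"]
by_sample = demultiplex(reads, barcodes, barcode_length=4)
print(by_sample)
```

The same grouping step underlies virtual long reads: there the shared barcode marks reads that came from one long molecule rather than from one sample.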

Comparison Table

Feature	Sanger Sequencing	Illumina Sequencing	Long Read Sequencing
Method	Chain termination	Sequencing by synthesis	Single-molecule
Read Length	500-900 bases	50-300 bases	Kilobase to megabase
Throughput	Low	High	Moderate to High
Accuracy	High	High	Lower per raw read; improved by consensus (e.g., HiFi)
Suitability	Small-scale projects	Large-scale genomics	Complex genome assembly
Complexity	Low	Moderate	High
Cost	Low per read, high per base	Low per base	Higher per read/base
Infrastructure	Basic lab equipment	Specialized equipment	Specialized equipment

Sample Prep for Sequencing

Key steps: sample preparation, adding adaptors, sequencing, data analysis
Adaptors allow sequencing primers to bind
PCR replicates DNA fragments to increase signal
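
Because adaptors flank every fragment, a read that is longer than its insert runs into adaptor sequence, which is usually trimmed before analysis. The sketch below shows one simple way to do this with exact matching. The adaptor string is used only as an example, and dedicated trimming tools also handle sequencing errors, which this sketch does not.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a full adapter, or a partial adapter at the 3' end, from a read."""
    # Case 1: the whole adapter appears somewhere in the read.
    index = read.find(adapter)
    if index != -1:
        return read[:index]
    # Case 2: the read ends partway into the adapter; try longest overlap first.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


ADAPTER = "AGATCGGAAGAGC"  # example adapter sequence, an assumption here
insert = "ACGTACGTTTGA"

print(trim_adapter(insert + ADAPTER, ADAPTER))      # ACGTACGTTTGA
print(trim_adapter(insert + ADAPTER[:5], ADAPTER))  # ACGTACGTTTGA
print(trim_adapter(insert, ADAPTER))                # ACGTACGTTTGA
```

The `min_overlap` threshold is a trade-off: a small value removes more adaptor remnants but also trims genuine bases that happen to match the start of the adaptor.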

RNA Sequencing

Sequencing RNA provides information on gene expression
Reverse transcription converts RNA to cDNA for sequencing
Strategies for dealing with RNA degradation
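
To turn RNA-seq read counts into comparable expression values, counts are usually normalized for gene length and sequencing depth. The sketch below computes transcripts per million (TPM), one common normalization. The gene names, counts, and lengths are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts to transcripts per million (TPM)."""
    # Step 1: reads per kilobase corrects for gene length.
    per_kb = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    total = sum(per_kb.values())
    if total == 0:
        raise ValueError("no reads assigned to any gene")
    # Step 2: scale so the values sum to one million across genes.
    return {gene: rate / total * 1_000_000 for gene, rate in per_kb.items()}


counts = {"geneA": 100, "geneB": 200, "geneC": 300}      # hypothetical counts
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 1000}  # lengths in bases
expression = tpm(counts, lengths)
print(expression)
```

In this example geneB has twice the reads of geneA but is also twice as long, so the two end up with the same TPM.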

Additional Notes

Read Length: the number of bases in a single read produced by the sequencing machine
Read Depth: the number of reads that cover a given position, usually counted after removing PCR duplicates (reads that are copies of the same original fragment)
Coverage: the average number of reads that align to, or “cover”, a given nucleotide in the reference genome
Deletion: when the sample has a deletion relative to the reference, reads aligned across it show a gap in the alignment
Adapter Contamination: Adapter contamination occurs when adapter sequence is sequenced along with the DNA fragment. This typically happens when the fragment is shorter than the read length, so the read runs through the insert into the adapter, or when adapters ligate to each other (adapter dimers). It can be detected by searching the reads for the adapter sequence and is corrected by trimming.
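
Read depth and coverage, as defined above, can be computed from aligned reads as in the sketch below. It assumes each alignment is a (start position, sequence) pair with 0-based starts and no gaps, and it treats reads with the same start and sequence as PCR duplicates. Real duplicate marking works on mapping coordinates (both ends for paired reads), so this is only an approximation.

```python
def deduplicate(alignments):
    """Keep one copy of reads sharing the same start and sequence."""
    seen = set()
    unique = []
    for alignment in alignments:
        if alignment not in seen:
            seen.add(alignment)
            unique.append(alignment)
    return unique


def per_base_depth(alignments, reference_length):
    """Count how many reads cover each reference position."""
    depth = [0] * reference_length
    for start, seq in alignments:
        for pos in range(start, min(start + len(seq), reference_length)):
            depth[pos] += 1
    return depth


reference_length = 10
alignments = [(0, "ACGT"), (2, "GTAC"), (2, "GTAC"), (5, "CGTAA")]
unique = deduplicate(alignments)
depth = per_base_depth(unique, reference_length)
mean_coverage = sum(depth) / reference_length
print(depth)          # [1, 1, 2, 2, 1, 2, 1, 1, 1, 1]
print(mean_coverage)  # 1.3
```

The mean coverage equals the total number of aligned bases divided by the reference length, here 13 bases over 10 positions.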