Genotyping and gene expression analysis: Sanger to high throughput sequencing

Sequencing Technologies

Sanger Sequencing

Figure: Sanger Sequencing. Credit: Estevezj, CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0, via Wikimedia Commons

Sanger Sequencing, also known as the chain termination method, was developed by Frederick Sanger in 1977. It is a method of DNA sequencing based on the selective incorporation of chain-terminating dideoxynucleotides by DNA polymerase during in vitro DNA replication.

  • Principle: Incorporation of dideoxynucleotides (ddNTPs), which lack the 3’-hydroxyl group, resulting in chain termination once incorporated.
  • Process: The DNA sample is divided into four separate sequencing reactions, each containing one of the four ddNTPs. DNA fragments of varying lengths are produced, each ending with the incorporated ddNTP.
  • Read Length: Produces read lengths of about 500 to 900 bases.
  • Accuracy: Sanger sequencing is known for its high accuracy, but it has a low throughput compared to newer methods.
  • Frederick Sanger developed the chain termination method for DNA sequencing
  • Modified nucleotides (ddNTPs) terminate the sequencing reaction
  • Fragments were originally detected by radioactive labeling
  • The method was later automated using fluorescent labels
  • It made the sequencing for the Human Genome Project possible
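
The chain termination principle above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: the function names, the termination probability, and the number of molecules are made up for illustration, and the input string stands for the newly synthesized strand rather than the template.

```python
import random

def chain_termination_fragments(strand, ddntp, n_molecules=2000,
                                p_terminate=0.05, seed=0):
    """Simulate one of the four reactions: return the set of fragment
    lengths that end in the given ddNTP (hypothetical toy model)."""
    rng = random.Random(seed)
    lengths = set()
    for _ in range(n_molecules):
        # Extend one molecule base by base; at each matching base there is
        # a small chance that a ddNTP is incorporated and extension stops.
        for position, base in enumerate(strand, start=1):
            if base == ddntp and rng.random() < p_terminate:
                lengths.add(position)
                break
    return lengths

def read_sequence(strand):
    """Run the four reactions and read the bases off in order of
    fragment length, as on a sequencing gel."""
    base_at_length = {}
    for ddntp in "ACGT":
        for length in chain_termination_fragments(strand, ddntp):
            base_at_length[length] = ddntp
    return "".join(base_at_length[n] for n in sorted(base_at_length))

if __name__ == "__main__":
    print(read_sequence("ATGCGTACGTTAGC"))
```

Sorting the fragments by length plays the role of gel or capillary electrophoresis: the shortest fragment reveals the first base, the next shortest the second, and so on.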

Why High Throughput Sequencing

  • The Human Genome Project took 13 years, about $3 billion, and many labs
  • It catalyzed the development of high throughput sequencing technologies
  • These technologies greatly increased speed and reduced cost
  • Illumina became the dominant platform

Illumina Sequencing

Illumina Sequencing, the most widely used form of next-generation sequencing (NGS), is a massively parallel sequencing technology that offers high-throughput sequencing of DNA and RNA samples.

  • Principle: Uses sequencing by synthesis (SBS), where fluorescently-labeled nucleotides are added to the DNA strand, and their incorporation is detected by their fluorescence.
  • Process: DNA is fragmented and adapters are ligated to fragments. The fragments are then attached to a solid surface and amplified to form clusters. Sequencing is performed in a flow cell, and millions of fragments are sequenced in parallel.
  • Read Length: Typically generates short reads ranging from 50 to 300 bases.
  • Throughput: Provides a high-throughput option, capable of sequencing millions of fragments simultaneously, making it suitable for large-scale genomic projects.
  • DNA fragments attach to the flow cell, and bridge amplification creates clusters
  • Sequencing by synthesis uses reversible terminators
  • Paired-end sequencing reads each fragment from both ends
  • Limitations: alignment of structural variants, de novo assembly of complex genomes, and sequencing of repetitive regions
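
Paired-end sequencing can be sketched in a few lines. This is a minimal illustration, assuming the fragment is given as its forward-strand sequence and that read 2 is reported as the reverse complement of the fragment's far end; the function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))

def paired_end_reads(fragment, read_length):
    """Read a fragment from both ends.

    Read 1 is the first `read_length` bases of the forward strand.
    Read 2 is the first `read_length` bases of the opposite strand,
    i.e. the reverse complement of the fragment's other end.
    """
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment)[:read_length]
    return read1, read2

if __name__ == "__main__":
    fragment = "ACGTTGCAAGGCTTAACCGG"
    print(paired_end_reads(fragment, 6))  # ('ACGTTG', 'CCGGTT')
```

Because both reads come from the same fragment, an aligner can use the expected distance between them as extra evidence when placing the pair.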

Long Read Sequencing

Long Read Sequencing, often associated with technologies like PacBio Sequencing and Oxford Nanopore, provides the ability to sequence much longer DNA fragments.

  • Principle: Directly sequences single molecules of DNA, enabling the detection of long stretches of nucleotides in a single read without the need for assembly of short reads.
  • Process: PacBio uses single-molecule real-time (SMRT) sequencing, in which a DNA polymerase incorporates fluorescently labeled nucleotides while copying a DNA template strand; with Circular Consensus Sequencing (CCS), the same circular template is read repeatedly to produce high-accuracy HiFi reads. Nanopore sequencing threads individual DNA molecules through protein nanopores and measures changes in ionic current as the nucleotides pass through.
  • Read Length: Capable of generating read lengths in the kilobase to megabase range.
  • Application: Particularly useful for genome assembly, identification of structural variants, and sequencing of areas with high GC content or repeats.
  • Modifications: Can detect DNA modifications (e.g., methylation) directly from the native DNA molecule; with Nanopore there is no need to synthesize a complementary strand.
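
The idea behind consensus reads can be shown with a simple majority vote. This is a deliberately simplified sketch: it assumes the passes over the same molecule are already aligned, have equal length, and contain only substitution errors, whereas real consensus callers handle insertions and deletions and use probabilistic models. The function name is hypothetical.

```python
from collections import Counter

def majority_consensus(passes):
    """Build a consensus by taking the most common base in each column
    of several equal-length reads of the same molecule."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("all passes must have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )

if __name__ == "__main__":
    # Five noisy passes over the same molecule; errors fall at
    # different positions, so the vote recovers the true sequence.
    passes = [
        "ACGTACGT",
        "ACCTACGT",
        "ACGTACGA",
        "ACGTTCGT",
        "ACGTACGT",
    ]
    print(majority_consensus(passes))  # ACGTACGT
```

Random errors rarely hit the same position in most passes, which is why reading one molecule several times raises accuracy.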

Example: Highly polymorphic MHC gene HLA-A

  • Short read sequencing has limitations in complex regions
  • Long reads span repetitive regions and structural variants better
  • Currently more expensive than short read sequencing
  • PacBio and Oxford Nanopore are the main long read sequencing technologies
  • Limitations: cost, throughput, accuracy (newer Nanopore technology reads both strands of a DNA molecule to improve accuracy), and the amount and quality of input DNA required
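
Why read length matters in repetitive regions can be seen with exact string matching. This is a toy sketch with a made-up reference sequence; real aligners tolerate mismatches and score ambiguous placements rather than listing exact hits.

```python
def find_all(reference, read):
    """Return every start position at which `read` matches `reference`
    exactly (overlapping matches included)."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions

if __name__ == "__main__":
    # Hypothetical reference: unique flanks around an AG repeat.
    reference = "TTGACC" + "AGAGAGAGAG" + "CTTGCA"

    short_read = "AGAG"               # lies entirely inside the repeat
    long_read = "ACCAGAGAGAGAGCTT"    # spans the repeat and both flanks

    print(find_all(reference, short_read))  # [6, 8, 10, 12]
    print(find_all(reference, long_read))   # [3]
```

The short read fits in several places and cannot be assigned to one of them, while the long read is anchored by the unique sequence on either side of the repeat.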

Other Sequencing Technologies

  • Multiplex Sequencing: Simultaneous sequencing of multiple samples in a single run, with each sample tagged by its own barcode sequence (e.g., on Illumina and PacBio platforms)
  • Virtual Long Reads: Massive barcoding of short reads so that reads from the same long fragment can be grouped into a long read (e.g., MGI’s stLFR)
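
Sorting multiplexed reads back into samples (demultiplexing) can be sketched as follows. This sketch assumes the barcode sits at the start of each read and must match exactly; on real instruments the barcode is often read separately and mismatches may be tolerated. The sample names and barcodes are made up.

```python
def demultiplex(reads, barcodes):
    """Assign reads to samples by their leading barcode.

    `barcodes` maps barcode sequence -> sample name; all barcodes are
    assumed to have the same length. The barcode-length prefix is
    stripped from every read, including unassigned ones.
    """
    barcode_length = len(next(iter(barcodes)))
    samples = {name: [] for name in barcodes.values()}
    samples["undetermined"] = []
    for read in reads:
        sample = barcodes.get(read[:barcode_length], "undetermined")
        samples[sample].append(read[barcode_length:])
    return samples

if __name__ == "__main__":
    barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
    reads = ["ACGTGGGAAA", "TGCATTTCCC", "ACGTCCCGGG", "NNNNAAAA"]
    for sample, sample_reads in demultiplex(reads, barcodes).items():
        print(sample, sample_reads)
```

Reads whose prefix matches no known barcode are collected under "undetermined" rather than discarded, so they can be inspected later.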

Comparison Table

Feature         Sanger Sequencing     Illumina Sequencing      Long Read Sequencing
--------------  --------------------  -----------------------  --------------------------
Method          Chain termination     Sequencing by synthesis  Single-molecule
Read Length     500-900 bases         50-300 bases             Kilobase to megabase
Throughput      Low                   High                     Moderate to High
Accuracy        High                  High                     Lower than Sanger/Illumina
Suitability     Small-scale projects  Large-scale genomics     Complex genome assembly
Complexity      Low                   Moderate                 High
Cost            Low per read          Low per base             Higher per read/base
Infrastructure  Basic lab equipment   Specialized equipment    Specialized equipment

Sample Prep for Sequencing

  • Key steps: sample preparation, adding adapters, sequencing, data analysis
  • Adapters allow sequencing primers to bind
  • PCR amplifies DNA fragments to increase signal

RNA Sequencing

  • Sequencing RNA provides information on gene expression
  • Reverse transcription converts RNA to cDNA for sequencing
  • Strategies for dealing with RNA degradation
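
Reverse transcription can be illustrated at the sequence level. This is a minimal sketch with a hypothetical function name: it only models base pairing, with the first-strand cDNA written 5' to 3', and ignores primers, enzymes, and second-strand synthesis.

```python
def reverse_transcribe(rna):
    """Return the first-strand cDNA (5'->3') complementary to an RNA
    sequence given 5'->3'. U in RNA pairs with A in the cDNA, and A in
    RNA pairs with T in the cDNA."""
    pairing = {"A": "T", "U": "A", "G": "C", "C": "G"}
    return "".join(pairing[base] for base in reversed(rna))

if __name__ == "__main__":
    print(reverse_transcribe("AUGGCCUAA"))  # TTAGGCCAT
```

The cDNA is DNA, so it can go through the same library preparation and sequencing steps as any other DNA sample.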

Additional Notes

  • Read Length: the length of a DNA sequence that is read by a sequencing machine

  • Read Depth: the number of times a given position in a DNA sequence is read by a sequencing machine. PCR duplicates (identical reads that arise from amplifying the same fragment) are usually removed before counting

  • Coverage: the average number of reads that align to, or “cover”, a given nucleotide in the reference genome during the sequencing process

  • Deletion: When the sample has a deletion relative to the reference, the reads aligned to that region show a gap in the alignment

  • Adapter Contamination: Adapter contamination occurs when adapter sequence is sequenced along with the DNA fragment. This typically happens when the fragment is shorter than the read length, so the read runs on into the adapter, or when adapters ligate to each other during library preparation. It can be detected by looking for the adapter sequence in the reads, and the adapter portion can then be trimmed off.
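
Several of the terms above can be made concrete with a small example. This is a sketch under stated assumptions: alignments are given as hypothetical (start position, sequence) pairs with no gaps, duplicates are defined as exact copies of such a pair, and the adapter is trimmed only on an exact match. The adapter string used is the commonly cited start of the Illumina adapter sequence; check the kit documentation for the adapter that applies to a given library.

```python
def remove_duplicates(alignments):
    """Keep one copy of each identical (start, sequence) alignment."""
    seen = set()
    unique = []
    for alignment in alignments:
        if alignment not in seen:
            seen.add(alignment)
            unique.append(alignment)
    return unique

def depth_per_position(alignments, genome_length):
    """Count how many aligned reads cover each reference position."""
    depth = [0] * genome_length
    for start, sequence in alignments:
        for position in range(start, min(start + len(sequence), genome_length)):
            depth[position] += 1
    return depth

def mean_coverage(depth):
    """Average depth over all reference positions."""
    return sum(depth) / len(depth)

def trim_adapter(read, adapter):
    """Cut the read at the first exact occurrence of the adapter."""
    index = read.find(adapter)
    return read if index == -1 else read[:index]

if __name__ == "__main__":
    alignments = [(0, "ACGTA"), (0, "ACGTA"), (3, "TACGG"), (5, "CGGTT")]
    unique = remove_duplicates(alignments)
    depth = depth_per_position(unique, genome_length=10)
    print(depth)                 # [1, 1, 1, 2, 2, 2, 2, 2, 1, 1]
    print(mean_coverage(depth))  # 1.5

    adapter = "AGATCGGAAGAGC"    # assumed adapter prefix, see lead-in
    print(trim_adapter("ACGTACGT" + adapter, adapter))  # ACGTACGT
```

A deletion in the sample would show up here as a run of positions with depth 0 (or reads with a gap) inside an otherwise covered region.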