Historical Approaches in Human Disease Genetics
Some history
- Before 1990: Studying genetic diseases was expensive, time-consuming, and labor-intensive. Smaller labs could only do segregation analysis, collecting non-genetic data from families to infer genetic models.
- 1990s:
  - Linkage Analysis: Genotyping small numbers of variants spread thinly across the genome in disease families to find regions of the genome associated with a disease.
  - Candidate Gene Studies: Testing pre-selected variants in a gene of interest for association with disease.
- 2005 onwards: Whole-genome genotyping and sequencing became more affordable and accessible, leading to genome-wide association studies (GWAS).
Genotyping vs. Sequencing: Genotyping involves testing for specific variants, while sequencing involves reading the entire genome.
Segregation Analysis
Genetics with no genetic data
Involves studying family trees (pedigrees) to determine inheritance patterns of diseases without genetic data.
Example: Cystic Fibrosis was identified as a recessive trait by Dorothy Hansine Andersen using segregation analysis.
Multiple models explaining the inheritance mode of a disease
Complex Segregation Analysis (CSA)
Build a statistical model of genetic risk, and average over all possible (unknown) genotypes present in the family.
Need to have lots of families to get a good estimate of the genetic model.
The CSA likelihood measures how likely the pattern of disease observed across all families is under a given genetic model.
Use the log likelihood because the likelihoods themselves are usually very small numbers.
The best model is the one with the least negative log likelihood (the log likelihood is always negative).
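As a rough illustration of comparing models by log likelihood, the sketch below scores two candidate segregation ratios against made-up sibship data. It is a deliberately simplified stand-in for CSA: it assumes every family has the same parental mating type, ignores ascertainment bias, and does not average over unknown genotypes. The function names and the family counts are hypothetical and do not come from any package.

```python
import math

def family_log_likelihood(n_affected, n_children, seg_ratio):
    """Binomial log likelihood of one sibship given a segregation ratio."""
    n_unaffected = n_children - n_affected
    coeff = math.comb(n_children, n_affected)
    return (math.log(coeff)
            + n_affected * math.log(seg_ratio)
            + n_unaffected * math.log(1.0 - seg_ratio))

def total_log_likelihood(families, seg_ratio):
    """Families are independent, so their log likelihoods add up."""
    return sum(family_log_likelihood(a, n, seg_ratio) for a, n in families)

def best_model(families, models):
    """Return the model name with the least negative log likelihood."""
    scores = {name: total_log_likelihood(families, ratio)
              for name, ratio in models.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical data: (affected children, total children) per family.
families = [(1, 4), (0, 3), (2, 5), (1, 3), (1, 6)]
# Hypothetical candidate models, expressed only as segregation ratios.
models = {"recessive-like (1/4)": 0.25, "dominant-like (1/2)": 0.50}

best, scores = best_model(families, models)
for name, ll in scores.items():
    print(f"{name}: log likelihood = {ll:.2f}")
print("Best model:", best)
```

Both totals come out negative, and the model whose total is closest to zero is preferred. With only five families the difference is small, which is why many families are needed.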
Unified Model
A segregation analysis tool, implemented in the package SAGE.
It uses:
- A major locus, an allele frequency, and penetrance
- A polygenic component
- An environmental component
All parameters are estimated from the data.
A good Example: Parameterizing BRCA mutations in familial breast cancer.
A bad Example: Segregation analysis can make attending medical school look like a recessive trait.
Genetic Linkage in Families
Linkage maps
Made with Linkage Markers - highly polymorphic markers spaced throughout the genome.
Earliest days: fewer than 200 variants
Now: about 3000 variants
Parametric Linkage and LOD score
Parametric linkage uses the same likelihood-based modeling as CSA:
- Specify a genetic model
- Test every site in the genome for how consistent the marker data are with the genetic model
The LOD score is the base-10 log of the likelihood ratio of the data under the alternative hypothesis (linkage) to the null hypothesis (no linkage).
Larger LOD score means more evidence for linkage.
A LOD score of 3 (odds of 1000:1 in favor of linkage) is considered significant evidence for linkage (>4 is safer).
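In the simplest setting, where each meiosis can be scored directly as recombinant or non-recombinant between the marker and the disease locus (phase known), the LOD score has a closed form: it compares the likelihood at recombination fraction θ with the likelihood under no linkage (θ = 0.5). The sketch below assumes that simplified setting. Real parametric linkage software has to sum over unknown genotypes and phases, and the function names here are illustrative.

```python
import math

def lod_score(recombinants, meioses, theta):
    """Two-point LOD for phase-known meioses at recombination fraction theta."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    non_recombinants = meioses - recombinants
    # Likelihood under linkage at theta versus no linkage (theta = 0.5).
    linked = theta ** recombinants * (1.0 - theta) ** non_recombinants
    unlinked = 0.5 ** meioses
    if linked == 0.0:
        return float("-inf")
    return math.log10(linked / unlinked)

def max_lod(recombinants, meioses, steps=50):
    """Grid search over theta in [0, 0.5]; returns (best LOD, best theta)."""
    thetas = [0.5 * i / steps for i in range(steps + 1)]
    return max((lod_score(recombinants, meioses, t), t) for t in thetas)

# Hypothetical counts: 1 recombinant out of 10 informative meioses.
lod, theta = max_lod(1, 10)
print(f"max LOD = {lod:.2f} at theta = {theta:.2f}")
```

With these made-up counts the maximum LOD stays below 3, so a single small family of this size would not reach the usual significance threshold on its own. LOD scores from independent families can be added together.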
Non-parametric linkage
A wrong parametric model can lead to poor results.
Non-parametric linkage tests whether affected family members are more closely related to each other at the site of the disease gene than expected by chance.
Identical by Descent (IBD)
- IBD 0: Siblings share no alleles by descent at a locus.
- IBD 1: Siblings share one allele by descent.
- IBD 2: Siblings share both alleles by descent.
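In the absence of disease risk, siblings have a 25/50/25 ratio of IBD 0/1/2. For affected sibling pairs at the disease locus, sharing increases (idealized rare, fully penetrant cases):
- Dominant disease: 0/50/50
- Recessive disease: 0/0/100
Almost all genetic models produce this excess sharing.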
The power of this method is lower, but it is more robust because it does not need a correct genetic model.
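A minimal sketch of the idea: enumerate the null IBD distribution for a sibling pair, then compare observed IBD counts from affected sibling pairs against it with a chi-square goodness-of-fit test. This is one simple way to test for excess sharing, not necessarily the statistic a given linkage package uses. The function names and the observed counts are hypothetical.

```python
import math
from itertools import product

def null_ibd_distribution():
    """Enumerate IBD 0/1/2 proportions for two siblings with no disease effect."""
    counts = [0, 0, 0]
    # Each sibling inherits paternal allele 0 or 1 and maternal allele 0 or 1.
    for pat1, mat1, pat2, mat2 in product((0, 1), repeat=4):
        shared = int(pat1 == pat2) + int(mat1 == mat2)
        counts[shared] += 1
    total = sum(counts)
    return [c / total for c in counts]

def ibd_sharing_test(observed):
    """Chi-square test of observed IBD 0/1/2 counts against the null ratio."""
    expected_props = null_ibd_distribution()
    n = sum(observed)
    stat = sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed, expected_props))
    # With 2 degrees of freedom, the chi-square tail probability is exp(-x/2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value

print(null_ibd_distribution())
# Hypothetical counts of affected sibling pairs with IBD 0, 1, 2 at one marker.
stat, p_value = ibd_sharing_test([10, 50, 40])
print(f"chi-square = {stat:.1f}, p = {p_value:.2g}")
```

No genetic model enters the calculation, which is the source of the robustness. The price is that the test only asks whether sharing deviates from the null, so it uses less information than a correctly specified parametric model would.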
Example: Continuing the BRCA example, linkage analysis was used to map the gene to chromosome 17.
Rise and Fall of Linkage Studies
- Successes: Cystic Fibrosis, Huntington’s Disease, and BRCA1
- Some successes in complex diseases: inflammatory bowel disease
- Need large effect sizes
- Problems with heterogeneity
Candidate Gene Studies
Sequencing and genotyping were more affordable
- Pick a gene that is thought to be associated with a disease
- Collect a few hundred cases and controls
- Genotype variants in the gene
- Test for association
Advantages:
- Not too expensive
- Results for a gene
- Power is high for odds ratios > 1.5
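To make the power claim concrete, here is a rough power calculation for an allelic case-control test, using a normal approximation to the difference in allele frequencies. The control allele frequency of 0.3 and the 300 cases and 300 controls are assumptions chosen for illustration, and the function names are hypothetical. A real study would use a dedicated power calculator.

```python
from statistics import NormalDist

def case_allele_frequency(control_freq, odds_ratio):
    """Allele frequency in cases implied by the control frequency and allelic OR."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)

def allelic_test_power(control_freq, odds_ratio, n_cases, n_controls, alpha=0.05):
    """Approximate power of a two-sided test comparing allele frequencies."""
    norm = NormalDist()
    p0 = control_freq
    p1 = case_allele_frequency(control_freq, odds_ratio)
    # Each person contributes two alleles.
    se = (p1 * (1 - p1) / (2 * n_cases) + p0 * (1 - p0) / (2 * n_controls)) ** 0.5
    shift = abs(p1 - p0) / se
    z_crit = norm.inv_cdf(1 - alpha / 2)
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

for odds_ratio in (1.1, 1.2, 1.5, 2.0):
    power = allelic_test_power(0.3, odds_ratio, n_cases=300, n_controls=300)
    print(f"OR = {odds_ratio}: power = {power:.2f}")
```

Under these assumptions a few hundred cases and controls give high power at an odds ratio of 1.5 but much less at 1.2, so the advantage holds only if the true effect is large.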
Good Example: The Pro12Ala variant in PPARγ (PPARG) was found to be associated with type 2 diabetes.
Bad Example: Multiple candidate genes for depression failed to replicate in larger studies.
Problems:
- Power: Probability of getting a positive result given that the hypothesis is true. It depends on:
  - Effect size
  - Sample size
  - Allele frequency
- Type 1 Error: Probability of getting a positive result given that the hypothesis is false.
- Prior: Probability of the hypothesis being true.
  - In candidate gene studies, it is the proportion of genes that are actually increasing risk.
| Pick Gene | Hypothesis | Result |
|---|---|---|
| Pick a gene and variant | Variant increases risk (Prior) | p < 0.05: True Positive (Power) |
| | | p > 0.05: False Negative (1 - Power) |
| | Variant does not increase risk (1 - Prior) | p < 0.05: False Positive (Type 1 Error) |
| | | p > 0.05: True Negative |
We know:
- The threshold for significance
- The sample size
- Allele frequency
We don’t know:
- Prior
- Effect size
With a large sample size and a small p-value threshold, a significant result is likely to be a true positive even when the prior is small.
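The interplay of prior, power, and type 1 error can be sketched with Bayes' rule: the probability that a significant result reflects a real effect is the true-positive mass divided by all positive mass. The numbers below (a prior of 1%, power of 80%) are assumptions for illustration only, and the function name is hypothetical.

```python
def prob_true_given_positive(prior, power, alpha):
    """Probability the hypothesis is true given a significant result."""
    true_positive = prior * power            # gene matters and test detects it
    false_positive = (1.0 - prior) * alpha   # gene does not matter, test fires anyway
    return true_positive / (true_positive + false_positive)

# Assumed values: 1 in 100 candidate genes truly increases risk, power is 80%.
for alpha in (0.05, 1e-4, 5e-8):
    ppv = prob_true_given_positive(prior=0.01, power=0.8, alpha=alpha)
    print(f"threshold {alpha:g}: P(true | significant) = {ppv:.3f}")
```

At a threshold of 0.05 most significant results under these assumptions are false positives, while a much stricter threshold makes nearly all of them true. Keeping power high at a strict threshold is what requires the large sample size.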