Historical Approaches in Human Disease Genetics

Some history

Before 1990: Studying genetic diseases was expensive, time-consuming, and labor-intensive. Smaller labs could only do segregation analysis, collecting non-genetic data from families to infer genetic models.
1990s:
- Linkage Analysis: Genotyping small numbers of variants spread thinly across the genome in disease families to find regions of the genome linked to a disease.
- Candidate Gene Studies: Testing pre-selected variants in a gene of interest for association with disease.
2005 onwards: Whole-genome genotyping and sequencing became more affordable and accessible, leading to genome-wide association studies (GWAS).

Genotyping vs. Sequencing: Genotyping tests a predefined set of known variants, while sequencing reads every base of a region (up to the entire genome) and can therefore also find previously unknown variants.

Segregation Analysis

Genetics with no genetic data

Involves studying family trees (pedigrees) to determine inheritance patterns of diseases without genetic data.

Example: Cystic fibrosis was identified as a recessive trait by Dorothy Hansine Andersen using segregation analysis.

Multiple models explaining the inheritance mode of a disease

Complex Segregation Analysis (CSA)

Build a statistical model of genetic risk, and average over all possible (unknown) genotypes present in the family.

Need to have lots of families to get a good estimate of the genetic model.

The CSA likelihood measures how probable the pattern of disease observed across all families is under a given genetic model.

Use the log-likelihood, because the likelihoods themselves are usually very small numbers.

The best model is the one with the least negative log-likelihood (the log-likelihood is always negative).
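A minimal sketch of the likelihood idea, under strong simplifying assumptions that are not part of the notes: a single biallelic locus, nuclear families only, a fixed allele frequency and fixed penetrances (real CSA estimates these from the data), and no correction for how families were ascertained. The families, the allele frequency of 0.02, and the two penetrance vectors are made up for illustration.

```python
import math
from itertools import product

def genotype_priors(q):
    """Hardy-Weinberg probabilities of carrying 0, 1, or 2 risk alleles."""
    return [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]

def child_genotype_probs(gf, gm):
    """Mendelian probabilities of a child carrying 0, 1, or 2 risk alleles."""
    tf, tm = gf / 2, gm / 2  # chance that each parent transmits a risk allele
    return [(1 - tf) * (1 - tm), tf * (1 - tm) + (1 - tf) * tm, tf * tm]

def phenotype_prob(affected, g, penetrance):
    """Probability of the observed phenotype given g risk alleles."""
    return penetrance[g] if affected else 1 - penetrance[g]

def family_log_likelihood(family, q, penetrance):
    """Log-likelihood of one family, averaging over unknown genotypes."""
    father, mother, children = family
    prior = genotype_priors(q)
    likelihood = 0.0
    # Sum over every possible pair of (unobserved) parental genotypes.
    for gf, gm in product(range(3), repeat=2):
        term = (prior[gf] * prior[gm]
                * phenotype_prob(father, gf, penetrance)
                * phenotype_prob(mother, gm, penetrance))
        child_probs = child_genotype_probs(gf, gm)
        # Children are independent given the parental genotypes.
        for affected in children:
            term *= sum(child_probs[gc] * phenotype_prob(affected, gc, penetrance)
                        for gc in range(3))
        likelihood += term
    return math.log(likelihood)

def total_log_likelihood(families, q, penetrance):
    """Families are independent, so their log-likelihoods add."""
    return sum(family_log_likelihood(f, q, penetrance) for f in families)

# Hypothetical data: (father affected, mother affected, [children affected]).
families = [
    (False, False, [True, True, False]),
    (False, False, [True, False, True, False]),
    (False, False, [True, True]),
]
# Hypothetical penetrances for 0, 1, and 2 risk alleles.
models = {
    "recessive": [0.001, 0.001, 0.999],
    "dominant": [0.001, 0.999, 0.999],
}
scores = {name: total_log_likelihood(families, 0.02, pen)
          for name, pen in models.items()}
best_model = max(scores, key=scores.get)  # least negative log-likelihood
print(scores, best_model)
```

With unaffected parents and several affected children in every family, the recessive model gets the less negative score here. Three families is far too few for a real analysis, which is why many families are needed.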

Unified Model

A segregation analysis model, implemented in the package SAGE.

It uses:
- A major locus, an allele frequency, and penetrance
- A polygenic component
- An environmental component

All parameters are estimated from the data.

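A good example: Parameterizing BRCA mutations in familial breast cancer.
A bad example: Segregation analysis has been used to "show" that attending medical school is a recessive trait, because anything that clusters in families can fit a genetic model.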

Genetic Linkage in Families

Linkage maps

Made with Linkage Markers - highly polymorphic markers spaced throughout the genome.

Earliest days: fewer than 200 variants
Now: about 3000 variants

Parametric Linkage and LOD score

Parametric linkage uses the same likelihood-based modeling as CSA.

It requires a specified genetic model.
Every site in the genome is tested for how consistent the marker data are with that genetic model.

The LOD score is the base-10 logarithm of a likelihood ratio: the likelihood of the data under the alternative hypothesis (linkage) divided by the likelihood under the null hypothesis (no linkage).

Larger LOD score means more evidence for linkage.
A LOD score of 3 (a likelihood ratio of 1000:1 in favor of linkage) is conventionally considered significant evidence for linkage; a threshold above 4 is safer.
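As a worked sketch of the LOD score, consider the simplest textbook case, which is an assumption of this example rather than something the notes specify: phase-known, fully informative meioses, where each meiosis is either a recombinant or a non-recombinant between marker and disease locus. Under linkage the recombination fraction is theta < 0.5; under no linkage it is 0.5. The counts (2 recombinants in 20 meioses) are hypothetical.

```python
import math

def lod_score(recombinants, meioses, theta):
    """Two-point LOD score for phase-known, fully informative meioses.

    LOD(theta) = log10( L(theta) / L(0.5) ), where
    L(theta) = theta^recombinants * (1 - theta)^(meioses - recombinants).
    """
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    non_recombinants = meioses - recombinants
    log_l_linked = (recombinants * math.log10(theta)
                    + non_recombinants * math.log10(1 - theta))
    log_l_unlinked = meioses * math.log10(0.5)
    return log_l_linked - log_l_unlinked

# Evaluate the LOD score on a grid of recombination fractions.
thetas = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
lods = {theta: lod_score(2, 20, theta) for theta in thetas}
best_theta = max(lods, key=lods.get)
print(best_theta, round(lods[best_theta], 2))
```

On this grid the maximum is at theta = 0.1 with a LOD of about 3.2, just above the conventional threshold of 3. At theta = 0.5 the two hypotheses coincide and the LOD is 0.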

Non-parametric linkage

A wrong parametric model can lead to poor results.

Non-parametric linkage tests whether affected family members are more closely related to each other at the site of the disease gene than expected by chance.

Identical by Descent (IBD)

IBD 0: Siblings share no alleles by descent at a locus.
IBD 1: Siblings share one allele by descent.
IBD 2: Siblings share both alleles by descent.

In the absence of disease risk, siblings have a 25/50/25 ratio of IBD 0/1/2.

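For affected sibling pairs under a rare, fully penetrant model (same IBD 0/1/2 order):

Dominant disease: 0/50/50
Recessive disease: 0/0/100

Almost all genetic models produce this kind of excess sharing.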
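The null expectation and a simple test for excess sharing can be sketched as follows. The enumeration follows directly from Mendelian transmission; the chi-square goodness-of-fit test is one simple choice for illustration (not necessarily the test used in practice), and the observed counts are hypothetical.

```python
from collections import Counter
from itertools import product

def null_ibd_distribution():
    """Proportion of sibling pairs sharing 0, 1, or 2 alleles IBD by chance."""
    counts = Counter()
    # Each sibling receives paternal allele 0 or 1 and maternal allele 0 or 1,
    # giving 16 equally likely transmission patterns for the pair.
    for pat_a, mat_a, pat_b, mat_b in product((0, 1), repeat=4):
        shared = int(pat_a == pat_b) + int(mat_a == mat_b)
        counts[shared] += 1
    total = sum(counts.values())
    return [counts[k] / total for k in (0, 1, 2)]

def ibd_chi_square(observed):
    """Chi-square statistic of observed IBD 0/1/2 counts against the null."""
    n = sum(observed)
    return sum((obs - n * p) ** 2 / (n * p)
               for obs, p in zip(observed, null_ibd_distribution()))

# Critical value of the chi-square distribution with 2 degrees of freedom
# at the 0.05 level.
CHI2_CRITICAL_2DF_05 = 5.991

observed_pairs = [10, 45, 45]  # hypothetical IBD 0/1/2 counts in 100 affected pairs
statistic = ibd_chi_square(observed_pairs)
print(null_ibd_distribution(), statistic, statistic > CHI2_CRITICAL_2DF_05)
```

No inheritance model enters the calculation, which is what makes the approach robust: only the departure from 25/50/25 is tested.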

The power of this method is lower, but it is more robust because it does not depend on specifying a correct genetic model.

Example: Continuing the BRCA example, linkage analysis was used to map BRCA1 to chromosome 17.

Rise and Fall of Linkage Studies

Successes: Cystic Fibrosis, Huntington’s Disease, and BRCA1
Some successes in complex diseases: inflammatory bowel disease
Need large effect sizes
Problems with heterogeneity

Candidate Gene Studies

Sequencing and genotyping became more affordable

Pick a gene that is thought to be associated with a disease
Collect a few hundred cases and controls
Genotype variants in the gene
Test for association
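The last step can be sketched as an allelic test on a 2x2 table of allele counts (risk allele vs. other allele, in cases vs. controls). This is one common choice, assumed here for illustration; the counts are hypothetical. The p-value uses the fact that a chi-square variable with 1 degree of freedom is a squared standard normal.

```python
import math

def allelic_association(case_risk, case_other, control_risk, control_other):
    """Odds ratio, chi-square statistic (1 df), and p-value for a 2x2 allele table."""
    a, b, c, d = case_risk, case_other, control_risk, control_other
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value

# Hypothetical study: 200 cases and 200 controls, so 400 alleles per group.
odds_ratio, chi2, p_value = allelic_association(120, 280, 90, 310)
print(round(odds_ratio, 2), round(chi2, 2), p_value < 0.05)
```

Here the odds ratio is about 1.48 and the test passes p < 0.05, but a single nominally significant result like this says little on its own.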

Advantages:

Not too expensive
Results are specific to one gene
Power is high for odds ratios > 1.5

Good Example: The Pro12Ala variant in PPARγ (PPARG) was found to be associated with type 2 diabetes.

Bad Example: Multiple candidate genes for depression failed to replicate in larger studies.

Problems:

Power: Probability of getting a positive result given that the hypothesis is True. It depends on:
- Effect size
- Sample size
- Allele frequency
Type 1 Error: Probability of getting a positive result given that the hypothesis is False.
Prior: Probability of the hypothesis being True.
- In candidate gene studies, it is the proportion of candidate genes that actually increase risk.
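How power depends on effect size, sample size, and allele frequency can be sketched with a normal approximation for comparing allele frequencies between cases and controls. The approximation, the function names, and the numbers are assumptions for illustration, not a method given in the notes; the tiny contribution from the opposite tail is ignored.

```python
import math
from statistics import NormalDist

def case_allele_frequency(control_freq, odds_ratio):
    """Risk-allele frequency in cases implied by an allelic odds ratio."""
    return odds_ratio * control_freq / (1 - control_freq + odds_ratio * control_freq)

def approximate_power(n_cases, n_controls, control_freq, odds_ratio, alpha=0.05):
    """Approximate power of a two-sided test comparing allele frequencies."""
    p_control = control_freq
    p_case = case_allele_frequency(control_freq, odds_ratio)
    # Each person contributes two alleles.
    se = math.sqrt(p_case * (1 - p_case) / (2 * n_cases)
                   + p_control * (1 - p_control) / (2 * n_controls))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p_case - p_control) / se - z_crit)

# Hypothetical designs with 300 cases and 300 controls.
power_large_effect = approximate_power(300, 300, 0.2, 1.5)
power_small_effect = approximate_power(300, 300, 0.2, 1.1)
power_rare_allele = approximate_power(300, 300, 0.02, 1.5)
power_more_samples = approximate_power(3000, 3000, 0.2, 1.1)
print(power_large_effect, power_small_effect, power_rare_allele, power_more_samples)
```

Under these assumptions a few hundred cases and controls give good power for an odds ratio of 1.5 at a common allele, but poor power for an odds ratio of 1.1 or for a rare allele.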

Pick Gene	Hypothesis	Result
Pick a gene and variant	Variant increases risk (Prior)	p < 0.05 True Positive (Power)
	Variant increases risk (Prior)	p > 0.05 False Negative (1 - Power)
	Variant does not increase risk (1 - Prior)	p < 0.05 False Positive ($\alpha$)
	Variant does not increase risk (1 - Prior)	p > 0.05 True Negative (1 - $\alpha$)

$$\Pr(\text{True Positive} \mid \text{Positive}) = \frac{\text{prior} \times \text{power}}{\text{prior} \times \text{power} + (1 - \text{prior}) \times \alpha}$$

We know:

The $\alpha$ threshold for significance
The sample size
The allele frequency

We don’t know:

Prior
Effect size

With a large sample size (high power) and a small p-value threshold (small $\alpha$), a positive result is likely to be a true positive even when the prior is small.
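The formula for the probability that a positive result is a true positive is easy to evaluate numerically. The values below (a prior of 1%, power of 80%, and the two thresholds) are hypothetical and chosen only to show the effect of a stricter $\alpha$.

```python
def prob_true_given_positive(prior, power, alpha):
    """Probability that a significant result reflects a real effect."""
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)

# Same prior and power, two different significance thresholds.
lenient = prob_true_given_positive(prior=0.01, power=0.8, alpha=0.05)
strict = prob_true_given_positive(prior=0.01, power=0.8, alpha=5e-8)
print(round(lenient, 3), round(strict, 6))
```

With these made-up numbers, only about 14% of results at p < 0.05 are true positives, which fits the pattern of candidate gene findings failing to replicate. At the much stricter threshold almost all positives are true, provided the sample is large enough to keep power high.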