Overview of rare and common disease genetic architecture

Why study genetic diseases?

Some treatments are empirically effective, but their mechanism is unknown
Contemporary medicine
- Treat diseases by the causes
Genetics
- Better understanding of disease, diagnosis, treatment and screening
Knowing the variant might lead to treatment

Monogenic diseases

Caused by variants in one or a few genes
Inherited in Mendelian patterns (dominant, recessive, X-linked)
Often individually very rare, but collectively common

Genetic diagnosis

Genetic diagnosis	Clinical diagnosis
Identify the cause of disease	Identify the disease
Genetic diagnosis can lead to clinical diagnosis	Clinical diagnosis can suggest which genes to test

Many conditions are named after the gene that is mutated/defective

Family screening

To identify other family members who are at risk of developing the disease

Prenatal screening

To identify parents who are at risk of having an affected child

A healthy embryo could be selected using pre-implantation genetic diagnosis (PGD)

Inform decisions about future pregnancies

Recurrence risk counselling

To inform parents of the risk of having another affected child

If the disease is “de novo” (the variant arose newly in the child rather than being inherited), the risk of recurrence is low

Pompe disease

Process:

Mutation of GAA gene -> Deficiency of acid alpha-glucosidase -> Impaired glycogen breakdown -> Glycogen build-up -> Damage to muscle and other cells

Treatment:

Enzyme replacement therapy (ERT)
Supportive therapy
- mechanical ventilation
- feeding tube
- physical therapy

Future treatment

DNA editing
Antisense oligonucleotides (ASOs)

Understanding the biology of disease

Genetics can implicate new pathogenic pathways

GWAS (genome-wide association studies) can act as an unbiased screen to identify new genes and pathways involved in disease

Example: Crohn’s disease. The T300A variant in the ATG16L1 gene reduces autophagy in response to bacteria.

Discovering new drug targets

Examples:

PCSK9: identified through linkage studies; gain-of-function mutations cause hypercholesterolemia.

Drugs that target PCSK9 to lower cholesterol were approved by the FDA in 2015.

BCL11A: identified through GWAS; relevant to beta-thalassemia and sickle cell anemia.

Ongoing clinical trials for gene editing to treat these diseases.

Long time from discovery to treatment

Genetic information in drug development

Similarity between the phenotype associated with the variant and the actual indication

Blood glucose variant for type 2 diabetes drug

Source of the genetic variant

Monogenic vs polygenic

Confidence in how we could affect the gene

Coding vs non-coding

Predicting disease onset with genetics

Using genome-wide polygenic risk scores (PRS) to predict disease onset

Personalised medicine

Help to tailor the right medicine and right dosage to the right patient

Adverse events (i.e. side effects)
Drug efficacy (i.e. effectiveness)
Drug dosage (i.e. amount of drug to give)

Example: Warfarin (an anticoagulant used to prevent blood clots)
Too much: bleeding
Too little: clotting

Genetic architecture

Aspect	Rare, Monogenic Diseases	Common, Complex Diseases
Mode of Inheritance	Dominant, recessive, X-linked, etc. De novo (new mutations) or inherited	Often polygenic and multifactorial with no clear pattern of inheritance
Heritability	High; nearly all variation in disease risk can be genetic	Variable; a lower proportion of disease risk is genetic
Genetic Heterogeneity	Allelic (different mutations in the same gene) and Locus (mutations in different genes) can both occur	More common due to many genes and environmental factors involved
Effect Size/Frequency Spectrum	Often rare variants of large effect	Can involve major loci (common variants of large effect), common variants of small effect, and a “polygenic tail” (many genes each contributing a small amount)
Penetrance	Typically high; most mutation carriers develop the disease	Variable; many carriers of risk alleles do not develop the disease
Rate of Sporadic Phenocopies	Relatively low, as most cases are genetic	Higher, due to the complex interaction of genetic and environmental factors
Genetic Model	Typically follows Mendelian patterns of inheritance	Often involves additive effects, but can also include dominant patterns, gene-by-environment interactions, and epistasis (gene-gene interactions)

Rare diseases

Mode of inheritance

Domaina, Kashmiri and SUM1, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons

Dominant: one copy of the variant is sufficient to cause disease
- One parent is affected and one parent is unaffected, 50% chance of passing on the variant
Recessive: two copies of the variant are required to cause disease
- Both parents are unaffected carriers, 25% chance that a child inherits both copies of the variant and is affected
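
The 50% and 25% figures above can be checked by enumerating the equally likely allele combinations from two parents (a Punnett square). The sketch below is only an illustration: the allele labels `A` (wild type) and `a` (disease variant) and the function name are assumptions, not notation from these notes.

```python
from collections import Counter
from itertools import product


def offspring_genotypes(parent1, parent2):
    """Return the probability of each offspring genotype at one autosomal locus.

    Each parent is a two-character string of alleles, e.g. "Aa".
    Every allele of a parent is assumed to be transmitted with probability 1/2.
    """
    counts = Counter("".join(sorted(a + b)) for a, b in product(parent1, parent2))
    total = sum(counts.values())
    return {genotype: n / total for genotype, n in counts.items()}


# Dominant disease: affected heterozygous parent x unaffected parent.
dominant_cross = offspring_genotypes("Aa", "AA")
# Recessive disease: both parents are unaffected carriers.
recessive_cross = offspring_genotypes("Aa", "Aa")

print("Dominant, child carries the variant:", dominant_cross["Aa"])
print("Recessive, child affected (aa):", recessive_cross["aa"])
print("Recessive, child is a carrier (Aa):", recessive_cross["Aa"])
```

For the recessive cross, each carrier parent passes on the variant half of the time, so a child receives two copies with probability 1/2 x 1/2 = 1/4.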

Homozygous vs. compound heterozygous

Recessive disease can be caused by two alleles carrying the same variant (homozygous) or two alleles carrying different variants in the same gene (compound heterozygous)

X-linked dominant: one copy of the variant on the X chromosome is sufficient to cause disease
- Affected fathers pass on the variant to all daughters but no sons
- Affected mothers pass on the variant to 50% of sons and daughters

Domaina, Angelito7 and SUM1 Derivative work: SUM1, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons

X-linked recessive:
- Affected fathers pass on the variant to all daughters who are unaffected carriers and no sons
- Carrier mothers pass on the variant to 50% of sons who are affected and 50% of daughters who are unaffected carriers

Domaina, Kashmiri and SUM1 Derivative work: SUM1, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons

De Novo: new mutations that arise in the child
- Neither parent is affected, but the child is affected
- Can be dominant or recessive

Inherited vs. sporadic diseases

Aspect	Inherited Disease	Sporadic Disease
Description	Disease that occurs in family members across multiple generations	Disease that arises in an individual without a clear family history
Inheritance	Can be dominant or recessive	Often de novo mutations; can occur without any known inheritance
Penetrance	May be complete (all individuals with the mutation express the phenotype) or incomplete (not all individuals express the phenotype)	Not necessarily applicable, but if genetic, may influence the chance of disease
Genetic Factors	Usually has a clear genetic basis	May not be genetic; if genetic, it influences the likelihood rather than directly causing the disease

Identify causal genes

For some diseases, a single gene is sufficient to cause disease (e.g. neurofibromatosis)
For other diseases, any of multiple genes can cause disease (e.g. cardiomyopathies)
Single causal gene may cause multiple diseases (i.e. mutations in the same gene can cause different diseases)
Multiple variants could be on the same gene
Very small proportion of disease is caused by any single variant

Disease Mechanism

Gain vs. loss of function

Mechanism	Loss-of-Function	Gain-of-Function
Protein Function	Absent or non-functional protein	Heightened activity of a protein, such as an overactive kinase
Genetic Variants	Variants cause RNA degradation or disrupt a critical protein domain	Variants lead to increased function or activity of the protein
Prevalence	Commonly observed in various genetic disorders	Rarer and often harder to detect than loss-of-function mutations

Dominant negative

A variant in one allele can affect the protein product of the wild-type allele

Early vs. late onset

Age of onset may be influenced by the type of variant (e.g. Huntington’s disease, where longer CAG repeat expansions are associated with earlier onset).

Complex diseases

Heritability is the proportion of phenotypic variance due to genetic factors

The Additive Genetic Model for a Single Variant

This model assumes that the effect of a single genetic variant on a phenotype is additive, meaning that each additional copy of the variant has the same effect on the trait.

The equation for the additive genetic model for a single variant is:

$$y_i = \beta_j \times g_{ij} + e_i$$

Where:

$y_i$ is the phenotypic value for individual $i$.
$\beta_j$ is the effect size of variant $j$.
$g_{ij}$ is the genotype dosage for individual $i$ at variant $j$ (0, 1, or 2).
$e_i$ is the residual error for individual $i$.

Scenario: Researchers have identified a single nucleotide polymorphism (SNP) in the human genome that is associated with height. They wish to determine how this SNP affects height across a population.

Application: Each individual’s height ($y_i$) is modeled based on their genotype at this SNP ($g_{ij}$). The effect size ($\beta_j$) indicates how much the SNP affects height. For instance, they may find that each copy of the minor allele increases height by 0.5 cm. The model accounts for the SNP’s effect while acknowledging that other factors (captured in $e_i$) also influence height.
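
As a rough illustration of the single-variant model, the sketch below simulates heights under $y_i = \beta_j \times g_{ij} + e_i$ and recovers $\beta_j$ by least squares. Only the 0.5 cm effect comes from the example above; the minor allele frequency, sample size, residual standard deviation and function name are assumptions chosen for the simulation, and height is treated as a deviation from a baseline.

```python
import random


def estimate_beta(dosages, phenotypes):
    """Least-squares slope of phenotype on genotype dosage: cov(g, y) / var(g)."""
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    cov_gy = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosages, phenotypes))
    var_g = sum((g - mean_g) ** 2 for g in dosages)
    return cov_gy / var_g


random.seed(1)
TRUE_BETA = 0.5      # cm per copy of the minor allele (from the example above)
MAF = 0.3            # assumed minor allele frequency
N = 20000            # assumed number of individuals
RESIDUAL_SD = 6.0    # assumed spread (cm) of everything captured in e_i

# Genotype dosage: number of minor alleles (0, 1 or 2), two independent draws.
g = [(random.random() < MAF) + (random.random() < MAF) for _ in range(N)]
# Phenotype: additive SNP effect plus residual noise.
y = [TRUE_BETA * gi + random.gauss(0.0, RESIDUAL_SD) for gi in g]

beta_hat = estimate_beta(g, y)
print(f"estimated effect per allele: {beta_hat:.2f} cm")
```

Because the residual is large relative to the SNP effect, the estimate is noisy in small samples, which is one reason association studies of small effects need many individuals.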

The Additive Polygenic Model

This model extends the single variant model to include multiple genetic variants. It assumes that the phenotype is the result of additive effects from several genetic variants.

The equation for the additive polygenic model is:

$$y_i = \sum_{j=1}^{K} \beta_j \times g_{ij} + E_i$$

Where:

$y_i$ is the phenotypic value for individual $i$.
$\beta_j$ is the effect size of variant $j$.
$g_{ij}$ is the genotype dosage for individual $i$ at variant $j$.
$K$ is the total number of genetic variants considered.
$E_i$ is the non-genetic residual error for individual $i$.

Scenario: Further research shows that height is influenced by many genetic variants across the genome, not just one SNP.

Application: A polygenic model sums the effects of many SNPs to predict individual height ($y_i$). Each SNP’s effect is weighted by its effect size ($\beta_j$), and the genotype dosage ($g_{ij}$) of each SNP is considered. This model is typical in genome-wide association studies where researchers assess the contribution of many genetic factors to variations in height across a population.
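
The genetic part of the polygenic model, $\sum_j \beta_j \times g_{ij}$, is a weighted sum of dosages. A minimal sketch, with made-up effect sizes and dosages for three hypothetical variants:

```python
def polygenic_score(betas, dosages):
    """Sum of effect size times genotype dosage over all variants."""
    if len(betas) != len(dosages):
        raise ValueError("need one dosage per effect size")
    return sum(beta * g for beta, g in zip(betas, dosages))


# Hypothetical effect sizes (cm per allele) for K = 3 variants.
betas = [0.5, -0.2, 0.1]
# Hypothetical dosages (0, 1 or 2 copies) for one individual.
individual = [2, 1, 0]

score = polygenic_score(betas, individual)  # 0.5*2 + (-0.2)*1 + 0.1*0
print(f"genetic component of the phenotype: {score:.2f} cm")
```

The residual $E_i$ is not included here, so the score is the predicted genetic contribution rather than the observed phenotype.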

Defining the Additive Heritability of a Trait

Additive heritability quantifies how much of the variance in a trait can be attributed to genetic factors.

The model used to define the additive heritability of a trait is:

$$y_i = A_i + E_i$$

Where:

$y_i$ is the phenotypic value for individual $i$.
$A_i$ is the additive genetic component for individual $i$.
$E_i$ is the environmental and non-genetic residual error for individual $i$.

Additive heritability, denoted as $h^2$, can be defined in two ways that are equivalent when $A$ and $E$ are uncorrelated:

$h^2$ is the square of the correlation coefficient between the genetic component $A$ and the trait $y$.
$h^2$ is the ratio of the variance of $A$ to the variance of $y$.

Scatterplots are used to visualize the relationship between $A$ and $y$, and the heritability is often computed as the proportion of the variance explained by the genetic factors.

Scenario: After acknowledging that height is affected by numerous genetic factors, researchers want to quantify how much of the variation in height within a population is due to genetic differences, as opposed to environmental factors like nutrition.

Application: Heritability estimation is used here. The total variance in height in the population ($y_i$) is decomposed into a genetic component ($A_i$), which is the sum of the effects of all genetic variants, and an environmental component ($E_i$). By comparing the variation due to genetics with the total variation, researchers can estimate the heritability ($h^2$) of height. A high heritability suggests that genetic factors play a significant role in determining height, while a lower heritability indicates that environmental factors have a greater impact.
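
The two definitions of $h^2$ can be compared on simulated data. The sketch below assumes that $A$ and $E$ are independent and normally distributed, and it uses a simulated heritability of 0.8 purely for illustration; in real data $A$ is not observed directly, so this is a check of the definitions rather than an estimation method.

```python
import random


def variance(values):
    """Sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def correlation(xs, ys):
    """Pearson correlation coefficient."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(2)
N = 20000        # assumed number of individuals
TRUE_H2 = 0.8    # assumed heritability used to simulate the data

# Independent genetic and environmental components with total variance 1.
A = [random.gauss(0.0, TRUE_H2 ** 0.5) for _ in range(N)]
E = [random.gauss(0.0, (1.0 - TRUE_H2) ** 0.5) for _ in range(N)]
y = [a + e for a, e in zip(A, E)]

h2_from_variance = variance(A) / variance(y)
h2_from_correlation = correlation(A, y) ** 2

print(f"var(A) / var(y) = {h2_from_variance:.3f}")
print(f"cor(A, y)^2     = {h2_from_correlation:.3f}")
```

The two numbers agree only approximately in a finite sample, because the sample covariance between $A$ and $E$ is not exactly zero.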

From continuous to binary traits

Some traits are binary (i.e. affected vs. unaffected) rather than continuous (e.g. height). The same models can be used to study binary traits, but the interpretation of the results is different.

Liability Threshold Model

Everyone has a continuous “disease liability” ($y$), and you have the disease if this goes above a threshold $T$:

$$y_i > T$$
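
If liability is additionally assumed to follow a standard normal distribution (a common convention, but an assumption that these notes do not state), the threshold $T$ follows from the disease prevalence: $T$ is the point with a fraction equal to the prevalence lying above it. A minimal sketch, using an illustrative prevalence of 1%:

```python
from statistics import NormalDist


def liability_threshold(prevalence):
    """Threshold T such that P(liability > T) equals the prevalence.

    Assumes liability is standard normal (mean 0, variance 1).
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - prevalence)


T = liability_threshold(0.01)  # illustrative prevalence of 1%
print(f"threshold for 1% prevalence: {T:.3f} standard deviations")
```

Rarer diseases correspond to higher thresholds, so only individuals in the extreme upper tail of liability are affected.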

Logistic Risk Model

Individuals are assigned a continuous score ($y$) that sets their log odds of disease, $a + b \times y_i$. Your risk of disease depends on $y$:

$$\Pr(\text{Disease}) = \frac{1}{1 + \exp(-a - b \times y_i)}$$

This equation represents the logistic function used to model the probability of disease presence.
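
A minimal sketch of the logistic function above. The values of `a` and `b` used in the example are arbitrary illustrations, not estimates from any study.

```python
import math


def disease_probability(y, a, b):
    """Pr(Disease) = 1 / (1 + exp(-a - b * y))."""
    return 1.0 / (1.0 + math.exp(-(a + b * y)))


# Illustrative parameters: a sets the baseline risk, b scales the score.
a, b = -3.0, 1.0
for score in (-1.0, 0.0, 1.0, 2.0):
    print(f"y = {score:+.1f} -> risk = {disease_probability(score, a, b):.3f}")
```

Unlike the threshold model, every individual here has a probability of disease strictly between 0 and 1, and that probability rises smoothly with $y$ when $b > 0$.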

Calculating Heritability from Twin Studies

Using monozygotic (MZ) twins, which are genetically identical, we can estimate heritability of a trait. Assuming that the correlation between MZ twins is purely due to genetics, the correlation can be described by the formula:

$$\text{cor}(y_{mz1}, y_{mz2}) = h^2$$

This makes it seem straightforward to estimate heritability.

However, we must consider that twins also share environmental factors. To account for this, we can include dizygotic (DZ) twins in our analysis, who share about half of their DNA and are assumed to share similar environmental conditions as MZ twins.

If we assume that the genetic and environmental components are independent, and write $e^2$ for the proportion of variance due to the environment that twins share, the correlation for MZ twins becomes:

$$\text{cor}(y_{mz1}, y_{mz2}) = h^2 + e^2$$

Under the assumption that MZ and DZ twins experience the same level of environmental similarity (the “equal environment assumption”), we can then write the correlation for DZ twins as:

$$\text{cor}(y_{dz1}, y_{dz2}) = \frac{h^2}{2} + e^2$$

Subtracting the second equation from the first removes the shared environment term:

$$\text{cor}(y_{mz1}, y_{mz2}) - \text{cor}(y_{dz1}, y_{dz2}) = \left(h^2 + e^2\right) - \left(\frac{h^2}{2} + e^2\right) = \frac{h^2}{2}$$

so heritability is twice the difference between the MZ and DZ correlations:

$$h^2 = 2 \left[\text{cor}(y_{mz1}, y_{mz2}) - \text{cor}(y_{dz1}, y_{dz2})\right]$$

Thus, with adequate data from MZ and DZ twins, we can determine the heritability of traits with relative ease.
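
A minimal sketch of this calculation, solving the MZ and DZ correlation equations for $h^2$ and $e^2$. The twin correlations in the example are made-up numbers, and the function name is an assumption.

```python
def twin_estimates(r_mz, r_dz):
    """Solve r_mz = h2 + e2 and r_dz = h2 / 2 + e2 for (h2, e2).

    e2 is the shared environment term as defined in the equations above.
    """
    h2 = 2.0 * (r_mz - r_dz)
    e2 = r_mz - h2
    return h2, e2


# Hypothetical twin correlations for some trait.
h2, e2 = twin_estimates(r_mz=0.8, r_dz=0.5)
print(f"h^2 = {h2:.2f}, e^2 = {e2:.2f}")
```

The result is only as reliable as the equal environment assumption: if MZ twins share more of their environment than DZ twins, the difference in correlations overstates $h^2$.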

Common modifiers of rare diseases

Increasing or decreasing the expression of a variant may affect disease penetrance, severity or expression

Natural Selection and Genetic Variants

The segregation of genetic variants within a population, including their frequency, is influenced by natural selection. A simplified representation of this concept is:

$$s_{\text{variant}} \propto K \times s_{\text{disease}} \times \beta_{\text{variant}}$$

$s_{\text{variant}}$: Selective coefficient of the variant, indicating the strength of selection against that variant.
$K$: Prevalence of the disease within the population.
$s_{\text{disease}}$: Selective coefficient of the disease, representing the reduction in the number of offspring produced by individuals with the disease compared to those without.
$\beta_{\text{variant}}$: Effect size of the variant on disease risk.

Key Insights:

Natural selection tends to decrease the frequency of variants with large effect sizes on diseases, hence these variants are less common and have lower allele frequencies.
Variants with large effects are often found in diseases that are either less common (e.g., certain autoimmune diseases with a low $K$) or those that manifest later in life (e.g., Alzheimer’s disease, macular degeneration with a low $s_{\text{disease}}$), where the selection pressure against them is relatively weaker.

Variant and Effect

Major loci (common variants of large effect)
- Large effect size, high frequency
- Very important
- Not all traits have major loci (most don’t)
- Example: APOE gene and Alzheimer’s disease
Common variants of small effect
- Small effect size
- Often non-coding
Rare variants of large effect
- Low frequency, large effect size
- Hard to find, need whole exome or genome sequencing
- Do not contribute much to heritability
- Example: Type 2 diabetes
Polygenic tail (many genes each contributing a small amount)
- Small effect size
- Almost all traits have a polygenic tail
- Example: Height, 12111 variants discovered

Common modifiers of rare diseases

Regulatory variants can increase or decrease the expression of a variant, affecting the disease penetrance, severity or expression

Individual polygenic background can also affect the risk and expression of a variant

Common and rare diseases can have similar symptoms

Sometimes mutations in the same gene can cause rare or common diseases

There isn’t a clear boundary between monogenic and polygenic diseases

Allele Frequency and Penetrance

Allele Frequency

The proportion of all copies of a locus in a population that carry a given allele

Allele: Major allele, minor allele, ancestral allele, variant allele
Population: allele frequency may differ between populations

Determine allele frequency

$$\text{Allele frequency} = \frac{\text{Number of copies of the allele}}{\text{Total number of alleles}}$$

For example:

A||A A||A A||A A||G A||G A||A A||G G||G G||G

Allele frequency for A: 11/18 = 0.61
Allele frequency for G: 7/18 = 0.39

Genotype frequency for AA: 4/9 = 0.44
Genotype frequency for AG: 3/9 = 0.33
Genotype frequency for GG: 2/9 = 0.22
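
The counts above can be reproduced with a short script. This is a minimal sketch that parses the nine genotypes exactly as written in the example; the variable names are illustrative.

```python
from collections import Counter

# The nine genotypes from the example above, one per individual.
genotypes = "A||A A||A A||A A||G A||G A||A A||G G||G G||G".split()

# Each diploid individual contributes two alleles.
allele_counts = Counter(allele for gt in genotypes for allele in gt.split("||"))
total_alleles = sum(allele_counts.values())
allele_freq = {a: n / total_alleles for a, n in allele_counts.items()}

# Sort alleles within a genotype so that A||G and G||A count as the same.
genotype_counts = Counter("".join(sorted(gt.split("||"))) for gt in genotypes)
genotype_freq = {gt: n / len(genotypes) for gt, n in genotype_counts.items()}

for allele, freq in sorted(allele_freq.items()):
    print(f"allele {allele}: {allele_counts[allele]}/{total_alleles} = {freq:.2f}")
for gt, freq in sorted(genotype_freq.items()):
    print(f"genotype {gt}: {genotype_counts[gt]}/{len(genotypes)} = {freq:.2f}")
```

Note that the denominator for allele frequency is twice the number of individuals, while the denominator for genotype frequency is the number of individuals.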

gnomAD: Can be used to determine allele frequency

Sometimes an allele looks very rare overall but is not rare in a particular population

We can only determine the allele frequency in the population we have data for, and it might not be representative of the population we are interested in

Penetrance

The proportion of individuals with a variant who develop the disease

Can be measured in different ways:

Often reported as a percentage
Often by age
Can be measured for a particular variant or gene

Penetrance vs. Expressivity

Penetrance: The proportion of individuals with a variant who develop the disease
Expressivity: The severity of the disease in individuals who develop it

Why is penetrance important?

To inform genetic counselling
Incidental findings

How to determine penetrance?

Ideally:
- Unbiased cohort study with particular genotype
- Long follow-up
- Determine the proportion of individuals who develop the disease
Family studies
Population studies
Bayes’ theorem:
- $P(\text{Disease}|\text{Allele}) = \frac{P(\text{Allele}|\text{Disease}) \times P(\text{Disease})}{P(\text{Allele})}$
- $\text{Penetrance} = \text{Disease prevalence} \times \frac{\text{Case allele frequency}}{\text{Population allele frequency}}$
- Disease prevalence: hard to estimate for rare diseases
- Case allele frequency: hard to estimate for rare alleles
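
The penetrance formula above is simple arithmetic once the three inputs are available. A minimal sketch follows; the prevalence and allele frequencies in the example are hypothetical values chosen for illustration, not data from any cohort.

```python
def penetrance(prevalence, case_allele_freq, population_allele_freq):
    """Penetrance = prevalence * case allele frequency / population allele frequency."""
    if population_allele_freq <= 0:
        raise ValueError("population allele frequency must be positive")
    return prevalence * case_allele_freq / population_allele_freq


# Hypothetical inputs: disease affects 1 in 2000 people, the allele is seen
# at 2% frequency in cases and at 0.01% frequency in the general population.
estimate = penetrance(
    prevalence=0.0005,
    case_allele_freq=0.02,
    population_allele_freq=0.0001,
)
print(f"estimated penetrance: {estimate:.0%}")
```

Because the population allele frequency sits in the denominator, small errors in a very low frequency change the estimate substantially, and a result above 1 signals that the inputs are inconsistent with each other.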