Sequencing and Population Datasets

Overview of Large-Scale Genomic Datasets

Three Types of Genomic Datasets

Clinical Cohorts:

  • Have a specific disease or condition.
  • Or a broader clinical recruitment cohort (e.g. rare diseases with a suspected genetic cause)

Healthy controls:

  • Free of the disease of interest
  • Only healthy at the time of recruitment

Population datasets:

  • Reflect general population genetics
  • Biases could be introduced by recruitment design
  • Retrospective, prospective, or longitudinal studies

Clinical Cohorts

Examples:

  • The Deciphering Developmental Disorders (DDD) study, which recruited approximately 14,000 individuals with severe, undiagnosed developmental disorders in the UK and performed exome sequencing and array-based genotyping.

  • Genomics England’s 100,000 Genomes Project, which performed whole-genome sequencing of rare disease and cancer patients.

Control Cohorts

Advantages:

  • Cleaner control set
  • Valuable for late-onset or common conditions, which are likely to be present in population datasets

Disadvantages:

  • Expensive to recruit and test
  • Limited sample size
  • Only healthy at the time of recruitment, unless followed up

Example:

  • ECCO-GEN: Egyptian Collaborative Cardiac Genomics

Population Datasets

Examples:

  • gnomAD: Genome Aggregation Database
    v2.1.1: 125,748 exomes and 15,708 genomes
    v3.1.2: 76,156 whole-genome sequences

    Considerations:

    • Includes individuals recruited through common disease studies
    • Individuals with known severe disease removed
    • Samples overlap between v2.1.1 and v3.1.2
    • Aggregate-level data only (counts of variants)
      • No individual-level data (e.g. cannot tell whether one person carries variants in different genes)
    • No phenotypic data
    • Sub-cohorts
  • UK Biobank

    • Recruited individuals between the ages of 40-70
    • One of the first large-scale population cohorts
    • Extensive phenotypic, imaging, and genetic data
    • Both common and rare diseases
    • The cohort is still being followed up
    • Accessible to researchers

Other Large-Scale Datasets

  • All of Us: US Biobank
  • FinnGen / deCODE: Finnish and Icelandic bottleneck populations, EHR data
  • 23andMe: Direct-to-consumer genetic testing, questionnaires
  • Wellcome Trust Case Control Consortium (WTCCC): Genotype chip data for common diseases and controls

Considerations for Using Large-Scale Datasets

General Considerations

  • Participants overlap between studies
  • Different genome builds
  • With phenotypic data, population datasets can be used as either disease or control cohorts
  • Analysis pipelines and quality control may not be consistent between datasets

Participation Bias

  • Willingness to participate in research studies

  • Methods of recruitment and specific locations or platforms can introduce bias

  • A population cohort is not necessarily representative of the population

Example:

  • Bias in UK Biobank towards healthier individuals

  • Overrepresentation of European ancestry in genetic studies

Diversity in Genetics

Diversity in genetic studies is important

  • Rates of genetic diseases differ around the world
  • Healthcare disparities
    • Likelihood of being diagnosed in rare diseases
    • Accuracy of genetic risk prediction

Methods:

  • Combining biobanks
  • Consortia for sharing ideas and data

Rare Variant Association Analysis

Methods to discover gene-disease associations

  • Linkage analysis

  • Segregation analysis

  • But these often rely on pedigree data and multiple affected individuals

  • How can we find associations without pedigrees?

    • When a variant is fully penetrant, we observe it only in affected individuals
    • If the variant is also present in controls, we need to test whether it is enriched in cases

Why do we study rare variants?

Variants with large effect on protein function are rare due to negative selection

Collectively, they are common
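
To make "collectively common" concrete, here is a minimal sketch. It assumes hypothetical numbers (1,000 distinct variants, each at an allele frequency of 1 in 10,000) and treats the variants as independent, which is a simplification.

```python
def carrier_fraction(allele_frequency, n_variants):
    """Probability that a diploid individual carries at least one of
    n_variants independent rare alleles, each at the same frequency."""
    # Each person has 2 chromosomes, so 2 * n_variants chances to carry an allele.
    return 1.0 - (1.0 - allele_frequency) ** (2 * n_variants)


# Hypothetical values chosen for illustration only.
single = carrier_fraction(1e-4, 1)       # one rare variant
combined = carrier_fraction(1e-4, 1000)  # 1,000 such variants together

print(f"one variant: {single:.5f}, all variants: {combined:.3f}")
```

Under these assumptions a single variant is carried by about 0.02% of people, while roughly 18% carry at least one of the 1,000.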

Single Variant Association Tests

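Fisher's Exact Test:

                     Has variant: Yes   Has variant: No
  Has disease: Yes          a                  b
  Has disease: No           c                  d

Used to test how likely the observed table is to arise by chance. With the row and column totals fixed, the probability of one particular table is:

$$P(\text{table}) = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,(a+b+c+d)!}$$

The p-value is the sum of this probability over the observed table and all tables at least as extreme.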

Odds ratio:

$$\text{Odds ratio} = \frac{a/b}{c/d} = \frac{ad}{bc}$$
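
The calculation for a 2x2 table with cells a, b, c, d can be sketched in plain Python as follows. This is an illustrative implementation, not a replacement for a statistics library. The two-sided rule used here (summing all tables no more probable than the observed one) is one common convention, and the counts in the usage lines are made up.

```python
from math import comb


def table_probability(a, b, c, d):
    """Probability of one 2x2 table given fixed row and column totals.
    Equivalent to the factorial formula, written with binomial coefficients."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)


def fisher_exact_two_sided(a, b, c, d):
    """Sum the probabilities of all tables that are no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    p_observed = table_probability(a, b, c, d)
    p_value = 0.0
    # x is the top-left cell; the margins determine the other three cells.
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, row2 - (col1 - x))
        if p <= p_observed * (1 + 1e-7):  # small tolerance for float comparison
            p_value += p
    return min(p_value, 1.0)


def odds_ratio(a, b, c, d):
    """(a/b) / (c/d), returned as infinity when b or c is zero."""
    return float("inf") if b * c == 0 else (a * d) / (b * c)


# Hypothetical counts: 12 of 1,000 cases and 3 of 1,000 controls carry the variant.
print(fisher_exact_two_sided(12, 988, 3, 997), odds_ratio(12, 988, 3, 997))
```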

Rare Variant Collapsing Tests

Aggregate the number of rare variants in a gene or functional region

  • Burden test: Test if the number of rare variants is higher in cases than controls

  • Limitations:

    • Assumes all rare variants are ‘causal’
    • Assumes the same direction of effect (e.g. gain/loss of function)
    • If these assumptions do not hold, power is reduced
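
A simple form of burden test can be sketched as follows: collapse each individual to carrier or non-carrier of any rare variant in the gene, then ask whether carriers are enriched in cases. The function names and the toy genotypes are hypothetical. The one-sided hypergeometric test is one possible choice among several burden-test formulations.

```python
from math import comb


def collapse_carriers(genotypes):
    """genotypes: one list per individual of alternate-allele counts,
    one entry per rare variant in the gene. Returns 1 for carriers, 0 otherwise."""
    return [int(any(count > 0 for count in person)) for person in genotypes]


def burden_test(case_genotypes, control_genotypes):
    """One-sided exact test for enrichment of rare-variant carriers in cases."""
    case_carriers = sum(collapse_carriers(case_genotypes))
    control_carriers = sum(collapse_carriers(control_genotypes))
    n_cases, n_controls = len(case_genotypes), len(control_genotypes)
    total_carriers = case_carriers + control_carriers
    # P(X >= observed case carriers) under the hypergeometric null.
    upper = min(n_cases, total_carriers)
    numerator = sum(
        comb(n_cases, x) * comb(n_controls, total_carriers - x)
        for x in range(case_carriers, upper + 1)
    )
    p_value = numerator / comb(n_cases + n_controls, total_carriers)
    return case_carriers, control_carriers, p_value


# Toy data: 4 cases and 4 controls, 2 rare variants in the gene.
cases = [[1, 0], [0, 1], [1, 1], [0, 0]]
controls = [[0, 0], [0, 0], [0, 1], [0, 0]]
print(burden_test(cases, controls))
```

Because every variant counts equally towards carrier status, neutral or opposite-effect variants dilute the signal, which is the limitation listed above.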

SKAT / SKAT-O

  • SNP-set (Sequence) Kernel Association Test

  • Works with both rare and common variants

  • Can be adjusted for covariates

  • Allows variants to have different effect sizes and directions

Limitations:

  • Lower power than burden (collapsing) tests when most variants are causal and have the same direction of effect
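
The difference between the two approaches can be seen in their score statistics. In this minimal sketch (no covariates, toy data, unit weights, all assumed for illustration) the burden statistic squares the sum of the per-variant scores, whereas the SKAT statistic sums the squared scores. Obtaining a p-value requires the null distribution of each statistic, which is not shown here.

```python
def variant_scores(genotypes, phenotypes):
    """Per-variant score S_j = sum_i g_ij * (y_i - mean(y)), with no covariates."""
    mean_y = sum(phenotypes) / len(phenotypes)
    n_variants = len(genotypes[0])
    return [
        sum(person[j] * (y - mean_y) for person, y in zip(genotypes, phenotypes))
        for j in range(n_variants)
    ]


def burden_statistic(scores, weights):
    """(sum_j w_j S_j)^2: opposite-direction effects cancel before squaring."""
    return sum(w * s for w, s in zip(weights, scores)) ** 2


def skat_statistic(scores, weights):
    """sum_j (w_j S_j)^2: each variant contributes regardless of direction."""
    return sum((w * s) ** 2 for w, s in zip(weights, scores))


# Toy data: variant 1 is seen only in cases, variant 2 only in controls.
genotypes = [[1, 0], [1, 0], [0, 1], [0, 1]]
phenotypes = [1, 1, 0, 0]  # 1 = case, 0 = control
scores = variant_scores(genotypes, phenotypes)
print(burden_statistic(scores, [1, 1]), skat_statistic(scores, [1, 1]))
```

With one risk variant and one protective variant, the burden statistic is zero while the SKAT statistic is not. When all variants act in the same direction the burden statistic concentrates the signal, consistent with the limitation above.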

Categories of Variants

Testing narrower functional categories (e.g. missense only, loss of function only) can give greater power when the causal variants are concentrated in that category

Variants can be weighted

Weighting variants by their likelihood of being causal can increase power

  • By allele frequency
  • By predicted deleteriousness - Loss of function, damaging missense, etc.

Examples: DeNovoWEST, SAIGE-GENE
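
As an illustration of weighting, the sketch below combines a frequency weight with a consequence weight. The Beta(1, 25) density is the default frequency weight in SKAT. The consequence weights and function names are hypothetical, and this is not how DeNovoWEST or SAIGE-GENE are implemented.

```python
from math import gamma


def beta_weight(maf, a=1.0, b=25.0):
    """Beta(a, b) density at the minor allele frequency; rarer variants get larger weights."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * maf ** (a - 1) * (1 - maf) ** (b - 1)


# Hypothetical consequence weights, for illustration only.
CONSEQUENCE_WEIGHT = {
    "loss_of_function": 1.0,
    "damaging_missense": 0.5,
    "other_missense": 0.1,
}


def weighted_burden(genotype, mafs, consequences):
    """Weighted sum of one individual's alternate-allele counts across variants."""
    return sum(
        count * beta_weight(maf) * CONSEQUENCE_WEIGHT[consequence]
        for count, maf, consequence in zip(genotype, mafs, consequences)
    )


# One individual carrying a very rare loss-of-function variant
# and a more frequent missense variant (made-up values).
score = weighted_burden(
    genotype=[1, 1],
    mafs=[0.0001, 0.01],
    consequences=["loss_of_function", "other_missense"],
)
print(round(score, 2))
```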

Considerations - Multiple Testing

When performing multiple tests, we need a stricter threshold for significance

  • Bonferroni correction: $P_{\text{threshold}} = \frac{\alpha}{\text{number of tests performed}}$, where $\alpha$ is the desired overall significance level (e.g. 0.05)

Key consideration: the correction treats the tests as independent, so it is conservative when tests are correlated
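
A short worked example of the correction, assuming a gene-based analysis that tests roughly 20,000 genes. The gene names and p-values are invented for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the overall error rate at alpha."""
    return alpha / n_tests


def significant_results(p_values, alpha=0.05, n_tests=None):
    """Return the entries of p_values (name -> p) that pass the corrected threshold."""
    if n_tests is None:
        n_tests = len(p_values)
    threshold = bonferroni_threshold(alpha, n_tests)
    return {name: p for name, p in p_values.items() if p < threshold}


# Testing ~20,000 genes at an overall alpha of 0.05 gives 2.5e-6 per gene.
print(bonferroni_threshold(0.05, 20_000))

# Hypothetical results; only GENE_A passes the corrected threshold.
results = {"GENE_A": 1e-8, "GENE_B": 1e-4, "GENE_C": 0.03}
print(significant_results(results, alpha=0.05, n_tests=20_000))
```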