Overview of Large-Scale Genomic Datasets
Three Types of Genomic Datasets
Clinical Cohorts:
- Have a specific disease or condition.
- Broader clinical recruitment cohorts (e.g. rare diseases, suspected genetic conditions)
Healthy controls:
- Free of the disease of interest
- Only healthy at the time of recruitment
Population datasets:
- Reflect general population genetics
- Biases could be introduced by recruitment design
- Retrospective, prospective, or longitudinal studies
Clinical Cohorts
Examples:
- The Deciphering Developmental Disorders (DDD) study, which recruited approximately 14,000 individuals with severe, undiagnosed developmental disorders in the UK and performed exome sequencing and array-based genotyping.
- Genomics England’s 100,000 Genomes Project, which recruited rare disease and cancer patients and performed whole-genome sequencing.
Control Cohorts
Advantages:
- Cleaner control set
- Valuable for late onset / common conditions that are likely to present in population datasets
Disadvantages:
- Expensive to recruit and test
- Limited sample size
- Only healthy at the time of recruitment, unless followed up
Example:
ECCO-GEN: Egyptian Collaborative Cardiac Genomics
Population Datasets
Examples:
gnomAD: Genome Aggregation Database
- v2.1.1: 125,748 exomes and 15,708 genomes
- v3.1.2: 76,156 whole-genome sequences
Considerations:
- Individuals drawn from common disease studies
- Individuals with severe disease removed
- Overlap between v2.1.1 and v3.1.2
- Aggregate-level data only (counts of variants)
- No individual-level data (e.g. whether one person carries variants in different genes)
- No phenotypic data
- Sub-cohorts
UK Biobank:
- Recruited individuals between the ages of 40 and 70
- One of the first large-scale population cohorts
- Extensive phenotypic, imaging, and genetic data
- Both common and rare diseases
- Cohort is still being followed up
- Accessible to researchers
Other Large-Scale Datasets
- All of Us: US biobank
- FinnGen / deCODE: Finnish and Icelandic bottleneck populations, EHR data
- 23andMe: Direct-to-consumer genetic testing, questionnaires
- Wellcome Trust Case Control Consortium (WTCCC): Genotype chip data for common diseases and controls
Considerations for Using Large-Scale Datasets
General Considerations
- Participants may overlap between studies
- Datasets may use different genome builds
- Population datasets can serve as disease or control cohorts when phenotypic data are available
- Different pipelines and quality control may not be consistent across datasets
Participation Bias
- Willingness to participate in research studies
- Methods of recruitment and specific locations or platforms can introduce bias
- A population cohort is not necessarily representative of the population
Examples:
- Bias in UK Biobank towards healthier individuals
- Bias towards European ancestry in genetic studies
Diversity in Genetics
Diversity in genetic studies is important:
- Rates of genetic diseases differ around the world
- Healthcare disparities
- Likelihood of receiving a diagnosis for a rare disease
- Accuracy of genetic risk prediction
Methods:
- Combining biobanks
- Consortia for sharing ideas and data
Rare Variant Association Analysis
Methods to discover gene-disease associations exist, but they often rely on pedigree data and multiple affected individuals.
How to find associations without pedigrees?
- When a variant is fully penetrant, we only observe it in affected individuals
- If the variant is also observed in controls, we need to test whether it is enriched in cases
Why do we study rare variants?
Variants with large effect on protein function are rare due to negative selection
Collectively, they are common
Single Variant Association Tests
Fisher's Exact Test:
| Has disease? (rows) / Has variant? (columns) | Variant: Yes | Variant: No |
|---|---|---|
| Disease: Yes | a | b |
| Disease: No | c | d |
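Used to test the probability of observing a table at least this extreme by chance, given the row and column totals.
Odds ratio: $\mathrm{OR} = \frac{a \cdot d}{b \cdot c}$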
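The test and the odds ratio can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a replacement for a statistics package. The function name `fisher_exact_greater` and the example counts are assumptions made for illustration. It computes the one-sided p-value (enrichment of the top-left cell `a`) from the hypergeometric distribution.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns (odds_ratio, p_value). The p-value is the probability of
    observing `a` or more in the top-left cell when the row and column
    totals are held fixed (hypergeometric distribution).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)
    # Sum the probability of every table at least as extreme as the observed one.
    p_value = sum(
        comb(row1, x) * comb(row2, col1 - x)
        for x in range(a, min(row1, col1) + 1)
    ) / total_tables
    # Odds ratio ad / bc; infinite when b or c is zero.
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")
    return odds_ratio, p_value

# Illustrative counts only (not real data).
odds_ratio, p_value = fisher_exact_greater(3, 1, 1, 3)
print(odds_ratio, p_value)
```

In practice a library routine such as `scipy.stats.fisher_exact` would normally be used; the hand-written version is only meant to show where the p-value comes from.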
Rare Variant Collapsing Tests
Aggregate the number of rare variants in a gene or functional region
- Burden test: tests whether the number of rare variants is higher in cases than in controls
Limitations:
- Assumes all rare variants are ‘causal’
- Assumes the same direction of effect (e.g. gain/loss of function)
- If these assumptions do not hold, power is limited
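A simple collapsing test can be sketched as follows: each individual is reduced to carrier or non-carrier of any qualifying rare variant in the gene, and carrier counts are then compared between cases and controls with a one-sided Fisher's exact test. The function names and the toy genotypes are assumptions made for illustration; real burden tests usually also adjust for covariates.

```python
from math import comb

def count_carriers(genotypes):
    """Count individuals with at least one alternate allele in the gene.

    `genotypes` is a list with one entry per individual; each entry is a
    list of alternate-allele counts, one per rare variant in the gene.
    """
    return sum(1 for person in genotypes if any(g > 0 for g in person))

def collapsing_test(case_genotypes, control_genotypes):
    """Return the collapsed 2x2 counts and a one-sided enrichment p-value."""
    a = count_carriers(case_genotypes)          # carrier cases
    b = len(case_genotypes) - a                 # non-carrier cases
    c = count_carriers(control_genotypes)       # carrier controls
    d = len(control_genotypes) - c              # non-carrier controls
    n_cases, n_controls, carriers = a + b, c + d, a + c
    # Hypergeometric tail: P(at least `a` carriers fall among the cases).
    p_value = sum(
        comb(n_cases, x) * comb(n_controls, carriers - x)
        for x in range(a, min(n_cases, carriers) + 1)
    ) / comb(n_cases + n_controls, carriers)
    return (a, b, c, d), p_value

# Toy data: three different rare variants, each seen once among the cases.
cases = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
controls = [[0, 0, 0], [0, 0, 0], [0, 0, 0], [1, 0, 0]]
print(collapsing_test(cases, controls))
```

In the toy data no single variant is seen more than once among the cases, so a per-variant test has little to work with; collapsing pools the evidence at the gene level.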
SKAT / SKAT-O
- SNP-set (Sequence) Kernel Association Test
- Works with both rare and common variants
- Can be adjusted for covariates
- Variants can have different effect sizes and directions
Limitations:
- Lower power than collapsing (burden) tests when all variants have an effect and share the same direction of effect
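The contrast with a burden test can be illustrated with a simplified, covariate-free sketch. One common formulation sums squared per-variant scores for the kernel statistic, $Q_{\text{SKAT}} = \sum_j w_j^2 S_j^2$, and squares the summed scores for the burden statistic, $Q_{\text{burden}} = (\sum_j w_j S_j)^2$, where $S_j = \sum_i g_{ij}(y_i - \bar{y})$. The function names are hypothetical, and the permutation p-value is a stand-in assumption: the SKAT software derives p-values analytically rather than by permutation.

```python
import random

def variant_scores(genotypes, phenotypes):
    """Per-variant score S_j = sum_i g_ij * (y_i - mean(y)), no covariates."""
    n = len(phenotypes)
    mean_y = sum(phenotypes) / n
    residuals = [y - mean_y for y in phenotypes]
    n_variants = len(genotypes[0])
    return [
        sum(genotypes[i][j] * residuals[i] for i in range(n))
        for j in range(n_variants)
    ]

def skat_q(genotypes, phenotypes, weights):
    """Kernel-type statistic: weighted sum of squared scores."""
    scores = variant_scores(genotypes, phenotypes)
    return sum((w * s) ** 2 for w, s in zip(weights, scores))

def burden_q(genotypes, phenotypes, weights):
    """Burden-type statistic: square of the weighted sum of scores."""
    scores = variant_scores(genotypes, phenotypes)
    return sum(w * s for w, s in zip(weights, scores)) ** 2

def permutation_p(stat, genotypes, phenotypes, weights, n_perm=999, seed=0):
    """Empirical p-value from shuffling phenotype labels (illustrative only)."""
    rng = random.Random(seed)
    observed = stat(genotypes, phenotypes, weights)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if stat(genotypes, shuffled, weights) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data: variant 0 appears in a case, variant 1 in a control.
G = [[1, 0], [0, 1], [0, 0], [0, 0]]
y = [1, 0, 1, 0]
print(skat_q(G, y, [1.0, 1.0]), burden_q(G, y, [1.0, 1.0]))
```

In the toy data the two variants have scores of opposite sign, so they cancel in the burden-type statistic but still contribute to the kernel-type statistic. This is the behaviour that lets SKAT tolerate mixed directions of effect.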
Categories of Variants
Testing narrower functional categories (e.g. missense, loss of function, etc.) can give greater power
Variants can be weighted
Weighting variants by their likelihood of being causal can increase power
- By allele frequency
- By predicted deleteriousness
- Loss of function, damaging missense, etc.
Examples: DenovoWEST, SAIGE-GENE
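As a rough sketch of how such weights might enter a test, the snippet below combines an allele-frequency weight with a consequence weight and uses the result in a weighted burden score per individual. The Beta(1, 25) density is the default frequency weighting used by SKAT. The consequence weights, their multiplication with the frequency weight, and all function names are hypothetical choices made for illustration; this is not the scheme used by DenovoWEST or SAIGE-GENE.

```python
def beta_1_25_weight(maf):
    """Density of a Beta(1, 25) distribution at `maf`; larger for rarer variants."""
    return 25.0 * (1.0 - maf) ** 24

# Hypothetical consequence weights, chosen only for illustration.
CONSEQUENCE_WEIGHT = {
    "loss_of_function": 1.0,
    "damaging_missense": 0.5,
    "other_missense": 0.1,
}

def variant_weight(maf, consequence):
    """Combine frequency and consequence weights (an illustrative choice)."""
    return beta_1_25_weight(maf) * CONSEQUENCE_WEIGHT[consequence]

def weighted_burden(genotype, weights):
    """Weighted burden score for one individual: sum_j w_j * g_j."""
    return sum(w * g for w, g in zip(weights, genotype))

# Toy variants in one gene: (minor allele frequency, predicted consequence).
variants = [(0.0001, "loss_of_function"), (0.001, "damaging_missense"), (0.01, "other_missense")]
weights = [variant_weight(maf, csq) for maf, csq in variants]
print(weighted_burden([1, 0, 1], weights))
```

The per-individual scores would then replace the simple carrier indicator in a burden-style comparison between cases and controls.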
Considerations - Multiple Testing
When running multiple tests, we need a stricter threshold for significance
- Bonferroni correction: $\alpha_{\text{adjusted}} = \alpha / m$, where $m$ is the number of tests
Key assumption: tests are treated as independent; when tests are correlated, the correction is conservative
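The arithmetic behind a Bonferroni threshold is short enough to show directly. The sketch below assumes a gene-based analysis with one test per gene and uses 20,000 genes as a round, illustrative number; the function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag which p-values pass the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]

# One test per gene, assuming roughly 20,000 genes (illustrative figure).
print(bonferroni_threshold(0.05, 20_000))

# Three tests: the threshold becomes 0.05 / 3.
print(significant_after_bonferroni([1e-7, 0.01, 0.04]))
```

Testing several variant categories or several statistics per gene multiplies the number of tests, so the threshold tightens accordingly.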