Common Disease and GWAS

Design GWAS

What is GWAS?

GWAS is useful for identifying common variants with small effects that increase or decrease the risk of disease.

We want these associations to be:

  • Unbiased
  • Well powered
  • High certainty
  • Meaningful (e.g. implicate new genes)

Process of running a GWAS

  1. Recruitment of cases and controls
  2. Collect DNA samples (blood, saliva, etc.)
  3. Use high-throughput technology to assay common variants (> 1 million)
  4. Test every variant for association with the disease
  5. Find regions of the genome with significant p-values (e.g. p < 5 x 10^-8)
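
Steps 4 and 5 can be illustrated with a minimal sketch. It assumes a simple allelic chi-squared test on a 2x2 table of allele counts (real pipelines usually use regression with covariates), and the variant names and counts are hypothetical.

```python
import math

GENOME_WIDE_P = 5e-8  # significance threshold quoted in the notes


def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-squared (1 df) on a 2x2 table of allele counts."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    num = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2
    den = ((case_alt + case_ref) * (ctrl_alt + ctrl_ref)
           * (case_alt + ctrl_alt) * (case_ref + ctrl_ref))
    return num / den


def chi2_1df_pvalue(chi2):
    """Upper-tail p-value of a chi-squared statistic with 1 df."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical allele counts (case_alt, case_ref, ctrl_alt, ctrl_ref).
variants = {
    "snp_null": (1000, 3000, 1000, 3000),  # same frequency in both groups
    "snp_risk": (1400, 2600, 1000, 3000),  # alt allele enriched in cases
}

results = {}
for name, counts in variants.items():
    chi2 = allelic_chi2(*counts)
    p = chi2_1df_pvalue(chi2)
    results[name] = (chi2, p, p < GENOME_WIDE_P)
```

Each entry of `results` holds the test statistic, its p-value, and whether the variant passes the genome-wide threshold.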

Example: Wellcome Trust Case Control Consortium (WTCCC)

A large-scale GWAS that aimed to identify genetic variants associated with 7 common diseases. Except for bipolar disorder (false positive), each disease had 1-9 associations.

Manhattan plots and locus plots

  • Manhattan plots: Show the p-values across the genome

Manhattan plot

  • Locus plots: Show the association of a specific region of the genome with the disease

Locus plot

A higher point indicates a more significant association (the y-axis shows -log10 of the p-value)

QQ (Quantile-Quantile) plots and lambdaGC

  • QQ plots: Compare the observed p-values to the expected p-values
  • LambdaGC ($\lambda_{GC}$): Measure of inflation (the deviation) of the test statistics

$$\lambda_{GC} = \frac{\mathrm{median}(\chi^2)}{0.455}$$

Under the null hypothesis, the median chi-squared statistic (1 degree of freedom) is 0.455, so $\lambda_{GC} = 1$. A high inflation value ($\lambda_{GC} > 1.2$) suggests a problem with the data, such as confounding.
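
As a rough sketch of the calculation, the snippet below converts p-values to 1-df chi-squared statistics and divides their median by the null median. The function names and the toy p-values are assumptions for illustration, not part of any particular GWAS tool.

```python
import math
from statistics import NormalDist, median

# Median of a chi-squared distribution with 1 df (about 0.455).
NULL_MEDIAN_CHI2 = NormalDist().inv_cdf(0.75) ** 2


def p_to_chi2(p):
    """Convert a two-sided p-value to the equivalent 1-df chi-squared."""
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    return z * z


def lambda_gc(p_values):
    """Genomic inflation factor: median observed chi2 / null median."""
    return median(p_to_chi2(p) for p in p_values) / NULL_MEDIAN_CHI2


# Evenly spaced p-values, as expected under the null hypothesis.
n = 999
uniform_p = [(i + 1) / (n + 1) for i in range(n)]
lam_null = lambda_gc(uniform_p)

# Toy inflated study: every chi-squared statistic scaled up by 1.3.
inflated_p = [math.erfc(math.sqrt(1.3 * p_to_chi2(p) / 2.0)) for p in uniform_p]
lam_inflated = lambda_gc(inflated_p)
```

Here `lam_null` comes out at 1 and `lam_inflated` at about 1.3, which would fall above the 1.2 warning level mentioned above.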

GWAS design considerations and questions

Collecting cases and controls

  • Define and recruit cases
  • Define and recruit controls
  • Match cases and controls for confounding

Measuring genotypes

  • Genotyping technologies (e.g. microarrays, sequencing)
  • Genotype imputation (i.e. inferring untyped variants)

Maximising power and reproducibility

  • Replication
  • Meta-analysis

Finding cases for GWAS

  • Strict definitions (e.g. clinical diagnosis) vs loose definitions (e.g. any symptoms)
  • Recruitment via clinics vs population datasets (e.g. UK Biobank)
  • Define in clinic (e.g. physician judgement) vs self-report (e.g. questionnaire)

Finding controls for GWAS

  • Cases vs population or healthy controls
  • Depending on prevalence
  • Is the variant more common in cases or controls?

Impact of control selection on power

  • Low impact for low-prevalence diseases
  • High impact for high-prevalence diseases
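
One way to see why prevalence matters is the sketch below. It assumes unscreened population controls are a random population sample, so a fraction equal to the prevalence are actually cases; the allele frequencies are hypothetical.

```python
def population_control_freq(f_case, f_healthy, prevalence):
    """Allele frequency in unscreened population controls.

    A fraction `prevalence` of such controls are in fact cases.
    """
    return prevalence * f_case + (1.0 - prevalence) * f_healthy


def retained_difference(f_case, f_healthy, prevalence):
    """Share of the case-vs-healthy frequency difference that remains
    when population controls replace screened healthy controls."""
    f_pop = population_control_freq(f_case, f_healthy, prevalence)
    return (f_case - f_pop) / (f_case - f_healthy)


# Hypothetical risk-allele frequencies: 0.25 in cases, 0.20 in healthy people.
rare = retained_difference(0.25, 0.20, prevalence=0.001)   # rare disease
common = retained_difference(0.25, 0.20, prevalence=0.20)  # common disease
```

Under these assumptions the detectable frequency difference shrinks by a factor of (1 - prevalence): almost nothing is lost for the rare disease, while a fifth of the signal is lost for the common one.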

Matching cases and controls

  • Cases and controls should be matched

    • Ancestry (e.g. ethnicity, country of origin)
    • Technical factors (e.g. genotyping platform, batch)
  • Other factors to consider

    • Age and sex
    • Environmental factors (e.g. smoking, diet)

Matching can increase power

Genotyping technologies

  • Microarrays: Genotype common variants
    • Illumina, Affymetrix
    • 60k - 4 million variants
    • Low cost, high throughput

Other than common variants?

  • We use genotype imputation to infer untyped variants

  • Nearby SNPs are correlated, so we can use tag SNPs to impute untyped variants

  • Needs a good reference set

    • Large, diverse, and matched to the study population

Imputation works well for common variants, but not for rare variants
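
The correlation between nearby SNPs that imputation relies on is commonly measured as r² between alleles. The sketch below computes it from phased haplotypes; the haplotype lists are hypothetical toy data, and both SNPs are assumed to be polymorphic.

```python
def ld_r2(haplotypes):
    """r^2 between two biallelic SNPs from phased haplotypes.

    Each haplotype is an (a, b) pair of 0/1 alleles at SNP A and SNP B.
    Assumes both SNPs are polymorphic (otherwise the denominator is 0).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# SNP B always carries the same allele as SNP A: a perfect tag.
perfect_tag = [(0, 0)] * 6 + [(1, 1)] * 4
# All four haplotypes equally frequent: the SNPs are independent.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]

r2_tag = ld_r2(perfect_tag)
r2_indep = ld_r2(independent)
```

A tag SNP with r² near 1 carries almost all the information about the untyped variant, while r² near 0 means it says nothing about it.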

Other technologies

  • Low coverage whole genome sequencing

    • Sequencing followed by imputation
    • Low cost
    • Can be more accurate, give info about rare variants
    • Less routinely used
    • The UK10K project
      • Small sample size
      • Used as an imputation reference for chip studies for 10 years
  • High coverage whole genome sequencing

    • More accurate, but expensive
    • No imputation needed
    • UK Biobank
      • DMPK repeats can only be detected by sequencing

Replication in GWAS

  • Independent cohort
  • Independent technology

Advantages of replication

  1. Reduce false positives
    • Reduces the chance of a false-positive finding to < 1%
  2. Mitigates confounding or technical artefacts
    • False positives from population stratification are unlikely to replicate in a cohort from a different country with a different population structure
    • False positives from a genotyping artefact are unlikely to replicate on a different technology (e.g. a different chip)
  3. More accurate effect size estimation
    • Winner's curse: Significant associations tend to overestimate the true size of the effect.
    • Replication can give a more accurate estimate

Meta-analysis in GWAS

  • Combine results from multiple studies to increase power
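
A common way to combine studies is inverse-variance-weighted fixed-effect meta-analysis. The notes do not say which method is meant, so this choice is an assumption, and the per-study effect sizes below are hypothetical.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis.

    betas: per-study effect estimates (e.g. log odds ratios)
    ses:   per-study standard errors
    Returns (pooled beta, pooled SE, z-score, two-sided p-value).
    """
    weights = [1.0 / (se * se) for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return beta, se, z, p


# Hypothetical results for one variant in three studies.
betas = [0.10, 0.12, 0.08]
ses = [0.04, 0.05, 0.04]
meta_beta, meta_se, meta_z, meta_p = fixed_effect_meta(betas, ses)
```

The pooled standard error is smaller than that of any single study, which is where the gain in power comes from.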

Running GWAS

What affects GWAS?

Bad QC (Quality Control) > Bad Data > Bad Results

Quality Control steps

Sample QC

  • Sample Call Rate
    • Remove samples with low call rate
  • Autosomal Heterozygosity
    • Remove samples with deviant heterozygosity
    • Caused by inbreeding, contamination, ancestry, or data quality
      • In poor-quality data, heterozygous genotypes are more likely to be missing
  • Sex / Gender check (X chromosome Heterozygosity)
    • Sex check, mislabelled samples
  • Identity by descent (IBD) if too much relatedness
    • Exclude related samples
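
The first two sample filters can be sketched as follows. Genotypes are assumed to be coded 0/1/2 with `None` for a missing call, and the thresholds are hypothetical; in practice the heterozygosity bounds are usually derived from the cohort's own distribution.

```python
def call_rate(genotypes):
    """Fraction of genotypes that were successfully called (not None)."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def heterozygosity(genotypes):
    """Fraction of called genotypes that are heterozygous (coded 1)."""
    called = [g for g in genotypes if g is not None]
    return sum(g == 1 for g in called) / len(called)


def passes_sample_qc(genotypes, min_call_rate=0.95, het_range=(0.2, 0.4)):
    """Apply the call-rate and heterozygosity filters (thresholds are
    illustrative assumptions, not recommended values)."""
    low, high = het_range
    return (call_rate(genotypes) >= min_call_rate
            and low <= heterozygosity(genotypes) <= high)


# Hypothetical samples, 20 autosomal genotypes each.
samples = {
    "good": [0] * 10 + [1] * 6 + [2] * 4,
    "low_call": [None] * 4 + [0] * 8 + [1] * 5 + [2] * 3,
    "high_het": [0] * 3 + [1] * 14 + [2] * 3,  # e.g. contamination
    "low_het": [0] * 12 + [1] * 1 + [2] * 7,   # e.g. inbreeding
}
kept = [name for name, g in samples.items() if passes_sample_qc(g)]
```

Only the sample named `good` survives both filters in this toy data.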

SNP QC

  • SNP Call Rate
    • Calculate call rate for each SNP
    • Remove SNPs with low call rate
  • Hardy Weinberg Equilibrium (HWE)
    • Remove SNPs with significant deviation from HWE
    • Useful for random mating population
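
The HWE filter can be illustrated with a chi-squared goodness-of-fit test on genotype counts. This is a sketch: many tools use an exact test instead, and the genotype counts below are hypothetical.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-squared (1 df) test for Hardy-Weinberg equilibrium.

    Compares observed genotype counts with those expected from the
    allele frequency. Assumes the SNP is polymorphic.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical counts with the same allele frequency (0.6) in both SNPs.
p_in_hwe = hwe_chi2_pvalue(360, 480, 160)   # matches HWE expectations
p_deviant = hwe_chi2_pvalue(500, 200, 300)  # too few heterozygotes
```

A SNP like the second one, with far fewer heterozygotes than expected, would be removed as a likely genotyping error.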

Confounding: principal component analysis (PCA) if too little relatedness

  • Summarizing many variables with minimal loss of information
  • Needs clean, uncorrelated data
  • PCA can reveal
    • Population outliers
    • Population structure / confounding

Interpreting the results of GWAS

Finding risk variants and risk genes

  • Hard to pick because of Linkage Disequilibrium (LD)

  • Fine mapping

    • Association in a region
    • Use Posterior probability to find the causal variant
    • 95% credible set might not be enough
  • Finding function of a causal variant

    • Questions:
      • Does it modify protein code?
      • Does it lie in a promoter, enhancer, or other regulatory region?
      • Does it affect expression?
    • expression quantitative trait locus (eQTL)
      • GTEx project can be used to find eQTLs
    • Chromosome conformation capture (3C) and Hi-C
      • Find variant-enhancer effects
    • Finding the genes that are closest to the variant
    • Locus-to-gene (L2G) method: a machine learning method that gives a score to each gene
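
The fine-mapping step above ends with a credible set built from posterior probabilities. A minimal sketch, assuming the posteriors have already been computed and using hypothetical variant names:

```python
def credible_set(posteriors, coverage=0.95):
    """Smallest set of variants whose posterior probabilities sum to at
    least `coverage`, adding variants from most to least probable."""
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for variant, pp in ranked:
        chosen.append(variant)
        total += pp
        if total >= coverage:
            break
    return chosen, total


# Hypothetical posterior probabilities of causality in one region.
posteriors = {
    "rs_a": 0.60,
    "rs_b": 0.25,
    "rs_c": 0.08,
    "rs_d": 0.04,
    "rs_e": 0.03,
}
variants_95, covered = credible_set(posteriors)
```

Here the 95% credible set still contains four variants, so functional follow-up is needed to choose among them.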

Genetic architecture - how heritable is the trait

  • SNP heritability
    • Variance in the trait that is driven by variation in the SNPs in the study

Methods to estimate SNP heritability

  • GCTA uses the genome-based restricted maximum likelihood (GREML) method to estimate the heritability of a trait. It asks whether more closely related individuals have more similar traits than less closely related individuals.
  • LD Score Regression (LDSC) estimates heritability from GWAS test statistics and accounts for inflation. An intercept > 1 indicates confounding; a slope > 0 indicates true heritability.
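
In LD score regression, the expected chi-squared statistic of a variant is modelled as linear in its LD score: the intercept absorbs confounding and the slope is proportional to SNP heritability. The sketch below uses plain unweighted least squares on noise-free hypothetical data; the real method uses weighted regression with a block jackknife for standard errors.

```python
def ols(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def ldsc_estimate(ld_scores, chi2, n_samples, n_snps):
    """Return (intercept, h2) from E[chi2] = intercept + (N * h2 / M) * LD score."""
    intercept, slope = ols(ld_scores, chi2)
    return intercept, slope * n_snps / n_samples


# Hypothetical study: N = 50,000 samples, M = 1,000,000 SNPs,
# true h2 = 0.4 and a small amount of confounding (intercept 1.05).
N, M = 50_000, 1_000_000
ld_scores = [10.0, 50.0, 100.0, 200.0, 400.0]
chi2 = [1.05 + (N * 0.4 / M) * l for l in ld_scores]

intercept, h2 = ldsc_estimate(ld_scores, chi2, N, M)
```

On this toy input the regression recovers the intercept of 1.05 and a heritability of 0.4.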

Finding pathogenic cell types and pathways

  • Enrichment testing for GWAS data
  • We want to know whether the GWAS signal is enriched in some set of variants or regions of the genome.
  • Partitioned LDSC can stratify heritability into different categories of variants.

Finding relationships with other traits

  • Co-heritable traits: if the risk for one trait is associated with another trait.
    • By causal relationship
    • By Shared pathways
  • Measuring genetic correlation between traits
    • LDSC method

Mendelian randomization can be used to find causal relationships between traits

Making polygenic risk scores

Polygenic risk scores (PRS): See Polygenic Risk Scores

Test the model in an independent cohort to check that it works; good performance only in the training data indicates overfitting.

  1. Top-hit PRS: select independent, genome-wide significant variants and weight each by its GWAS-estimated effect size.
  2. Prune-and-threshold PRS: take a set of SNPs that are not in LD with each other, keep those with a p-value below a threshold, and weight each by its GWAS-estimated effect size.
  3. Genome-wide shrinkage: use a Bayesian prior to shrink the effect sizes of the variants, and use LD to “spread” the effects across correlated variants (e.g. LDpred2).

Genome-wide shrinkage generally performs better than top-hit PRS and prune-and-threshold PRS.
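
The two simpler scoring schemes can be sketched as a weighted sum of risk-allele counts. The SNP identifiers, effect sizes and p-values below are hypothetical, and the input is assumed to be already pruned so that no two SNPs are in LD.

```python
def select_snps(snps, p_threshold):
    """Keep SNPs whose GWAS p-value is below the threshold.

    Assumes `snps` has already been LD-pruned.
    """
    return [s for s in snps if s["p"] < p_threshold]


def polygenic_score(dosages, snps):
    """Sum of effect size x risk-allele count over the selected SNPs."""
    return sum(s["beta"] * dosages[s["id"]] for s in snps)


# Hypothetical GWAS summary statistics (already LD-pruned).
gwas = [
    {"id": "rs1", "beta": 0.20, "p": 1e-9},
    {"id": "rs2", "beta": -0.10, "p": 1e-4},
    {"id": "rs3", "beta": 0.05, "p": 0.2},
]
# Risk-allele counts (0, 1 or 2) for one hypothetical individual.
person = {"rs1": 2, "rs2": 1, "rs3": 0}

top_hit_score = polygenic_score(person, select_snps(gwas, 5e-8))
pt_score = polygenic_score(person, select_snps(gwas, 1e-3))
```

The top-hit score uses only `rs1`, while the looser threshold also brings in `rs2`. A shrinkage method such as LDpred2 would instead re-estimate every weight and is not reproduced here.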

Trans-ethnic PRS

The population studied matters for GWAS. A PRS built in one population typically performs worse when applied to a population of different ancestry.