Gene/pathway-based methods are drawing significant attention due to their usefulness in detecting rare and common variants that affect disease susceptibility. The biological mechanism of drug responses indicates that a gene-based analysis has even greater potential in pharmacogenetics. Motivated by a study from the Vitamin Intervention for Stroke Prevention (VISP) trial, we develop a gene-trait similarity regression for survival analysis to assess the effect of a gene or pathway on time-to-event outcomes. The similarity regression has a general framework that covers a range of survival models, such as the proportional hazards model and the proportional odds model. The inference procedure developed under the proportional hazards model is robust against model misspecification. We derive the equivalence between the similarity survival regression and a random effects model, which further unifies the current variance-component based methods. We demonstrate the effectiveness of the proposed method through simulation studies. In addition, we apply the method to the VISP trial data to identify the genes that exhibit an association with the risk of a recurrent stroke. TCN2 gene was found to be associated with the recurrent stroke risk in the low-dose arm. This gene may impact recurrent stroke risk in response to cofactor therapy.
We assessed gene expression profiles in 2,752 twins, using a classic twin design to quantify expression heritability and quantitative trait loci (eQTLs) in peripheral blood. The most highly heritable genes (?777) were grouped into distinct expression clusters, enriched in gene-poor regions, associated with specific gene function or ontology classes, and strongly associated with disease designation. The design enabled a comparison of twin-based heritability to estimates based on dizygotic identity-by-descent sharing and distant genetic relatedness. Consideration of sampling variation suggests that previous heritability estimates have been upwardly biased. Genotyping of 2,494 twins enabled powerful identification of eQTLs, which we further examined in a replication set of 1,895 unrelated subjects. A large number of non-redundant local eQTLs (6,756) met replication criteria, whereas a relatively small number of distant eQTLs (165) met quality control and replication standards. Our results provide a new resource toward understanding the genetic control of transcription.
Assessing gene-gene interactions (GxG) at the gene level can permit examination of epistasis at biologically functional units with amplified interaction signals from marker-marker pairs. While current gene-based GxG methods tend to be designed for two or a few genes, for complex traits, it is often common to have a list of many candidate genes to explore GxG. We propose a regression model with pathway-guided regularization for detecting interactions among genes. Specifically, we use the principal components to summarize the SNP-SNP interactions between a gene pair, and use an L1 penalty that incorporates adaptive weights based on biological guidance and trait supervision to identify important main and interaction effects. Our approach aims to combine biological guidance and data adaptiveness, and yields credible findings that may be likely to shed insights in order to formulate biological hypotheses for further molecular studies. The proposed approach can be used to explore the GxG with a list of many candidate genes and is applicable even when sample size is smaller than the number of predictors studied. We evaluate the utility of the proposed method using simulation and real data analysis. The results suggest improved performance over methods not utilizing pathway and trait guidance.
Gene-gene interactions (G×G) are important to study because of their extensiveness in biological systems and their potential in explaining missing heritability of complex traits. In this work, we propose a new similarity-based test to assess G×G at the gene level, which permits the study of epistasis at biologically functional units with amplified interaction signals.
Gene set analysis is a popular method for large-scale genomic studies. Because genes that have common biological features are analyzed jointly, gene set analysis often achieves better power and generates more biologically informative results. With the advancement of technologies, genomic studies with multi-platform data have become increasingly common. Several strategies have been proposed that integrate genomic data from multiple platforms to perform gene set analysis. To evaluate the performances of existing integrative gene set methods under various scenarios, we conduct a comparative simulation analysis based on The Cancer Genome Atlas breast cancer dataset.
Most psychiatric disorders are moderately to highly heritable. The degree to which genetic variation is unique to individual disorders or shared across disorders is unclear. To examine shared genetic etiology, we use genome-wide genotype data from the Psychiatric Genomics Consortium (PGC) for cases and controls in schizophrenia, bipolar disorder, major depressive disorder, autism spectrum disorders (ASD) and attention-deficit/hyperactivity disorder (ADHD). We apply univariate and bivariate methods for the estimation of genetic variation within and covariation between disorders. SNPs explained 17-29% of the variance in liability. The genetic correlation calculated using common SNPs was high between schizophrenia and bipolar disorder (0.68 ± 0.04 s.e.), moderate between schizophrenia and major depressive disorder (0.43 ± 0.06 s.e.), bipolar disorder and major depressive disorder (0.47 ± 0.06 s.e.), and ADHD and major depressive disorder (0.32 ± 0.07 s.e.), low between schizophrenia and ASD (0.16 ± 0.06 s.e.) and non-significant for other pairs of disorders as well as between psychiatric disorders and the negative control of Crohns disease. This empirical evidence of shared genetic etiology for psychiatric disorders can inform nosology and encourages the investigation of common pathophysiologies for related disorders.
Methods that collapse information across genetic markers when searching for association signals are gaining momentum in the literature. Although originally developed to achieve a better balance between retaining information and controlling degrees of freedom when performing multimarker association analysis, these methods have recently been proven to be a powerful tool for identifying rare variants that contribute to complex phenotypes. The information among markers can be collapsed at the genotype level, which focuses on the mean of genetic information, or the similarity level, which focuses on the variance of genetic information. The aim of this work is to understand the strengths and weaknesses of these two collapsing strategies. Our results show that neither collapsing strategy outperforms the other across all simulated scenarios. Two factors that dominate the performance of these strategies are the signal-to-noise ratio and the underlying genetic architecture of the causal variants. Genotype collapsing is more sensitive to the marker set being contaminated by noise loci than similarity collapsing. In addition, genotype collapsing performs best when the genetic architecture of the causal variants is not complex (e.g., causal loci with similar effects and similar frequencies). Similarity collapsing is more robust as the complexity of the genetic architecture increases and outperforms genotype collapsing when the genetic architecture of the marker set becomes more sophisticated (e.g., causal loci with various effect sizes or frequencies and potential non-linear or interactive effects). Because the underlying genetic architecture is not known a priori, we also considered a two-stage analysis that combines the two top-performing methods from different collapsing strategies. We find that it is reasonably robust across all simulated scenarios.
Genomic association analyses of complex traits demand statistical tools that are capable of detecting small effects of common and rare variants and modeling complex interaction effects and yet are computationally feasible. In this work, we introduce a similarity-based regression method for assessing the main genetic and interaction effects of a group of markers on quantitative traits. The method uses genetic similarity to aggregate information from multiple polymorphic sites and integrates adaptive weights that depend on allele frequencies to accomodate common and uncommon variants. Collapsing information at the similarity level instead of the genotype level avoids canceling signals that have the opposite etiological effects and is applicable to any class of genetic variants without the need for dichotomizing the allele types. To assess gene-trait associations, we regress trait similarities for pairs of unrelated individuals on their genetic similarities and assess association by using a score test whose limiting distribution is derived in this work. The proposed regression framework allows for covariates, has the capacity to model both main and interaction effects, can be applied to a mixture of different polymorphism types, and is computationally efficient. These features make it an ideal tool for evaluating associations between phenotype and marker sets defined by linkage disequilibrium (LD) blocks, genes, or pathways in whole-genome analysis.
With the completion of the international HapMap project, many studies have been conducted to investigate the association between complex diseases and haplotype variants. Such haplotype-based association studies, however, often face two difficulties; one is the large number of haplotype configurations in the chromosome region under study, and the other is the ambiguity in haplotype phase when only genotype data are observed. The latter complexity may be handled based on an EM algorithm with family data incorporated, whereas the former can be more problematic, especially when haplotypes of rare frequencies are involved. Here based on family data we propose to cluster long haplotypes of linked SNPs in a biological sense, so that the number of haplotypes can be reduced and the power of statistical tests of association can be increased.
Penalized likelihood methods have become increasingly popular in recent years for evaluating haplotype-phenotype association in case-control studies. Although a retrospective likelihood is dictated by the sampling scheme, these penalized methods are typically built on prospective likelihoods due to their modeling simplicity and computational feasibility. It has been well documented that for unpenalized methods, prospective analyses of case-control data can be valid but less efficient than their retrospective counterparts when testing for association, and result in substantial bias when estimating the haplotype effects. For penalized methods, which combine effect estimation and testing in one step, the impact of using a prospective likelihood is not clear. In this work, we examine the consequences of ignoring the sampling scheme for haplotype-based penalized likelihood methods. Our results suggest that the impact of prospective analyses depends on (1) the underlying genetic mode and (2) the genetic model adopted in the analysis. When the correct genetic model is used, the difference between the two analyses is negligible for additive and slight for dominant haplotype effects. For recessive haplotype effects, the more appropriate retrospective likelihood clearly outperforms the prospective likelihood. If an additive model is incorrectly used, as the true underlying genetic mode is unknown a priori, both retrospective and prospective penalized methods suffer from a sizeable power loss and increase in bias. The impact of using the incorrect genetic model is much bigger on retrospective analyses than prospective analyses, and results in comparable performances for both methods. An application of these methods to the Genetic Analysis Workshop 15 rheumatoid arthritis data is provided.
The quality control (QC) filtering of single nucleotide polymorphisms (SNPs) is an important step in genome-wide association studies to minimize potential false findings. SNP QC commonly uses expert-guided filters based on QC variables [e.g. Hardy-Weinberg equilibrium, missing proportion (MSP) and minor allele frequency (MAF)] to remove SNPs with insufficient genotyping quality. The rationale of the expert filters is sensible and concrete, but its implementation requires arbitrary thresholds and does not jointly consider all QC features.
We propose a variety of methods based on the generalized estimation equations to address the issues encountered in haplotype-based pharmacogenetic analysis, including analysis of longitudinal data with outcome-dependent dropouts, and evaluation of the high-dimensional haplotype and haplotype-drug interaction effects in an overall manner. We use the inverse probability weights to handle the outcome-dependent dropouts under the missing-at-random assumption, and incorporate the weighted L(1) penalty to select important main and interaction effects with high dimensionality. The proposed methods are easy to implement, computationally efficient, and provide an optimal balance between false positives and false negatives in detecting genetic effects.
Association methods based on haplotype similarity (HS) can overcome power and stability issues encountered in standard haplotype analyses. Current HS methods can be generally classified into evolutionary and two-sample approaches. We propose a new regression-based HS association method for case-control studies that incorporates covariate information and combines the advantages of the two classes of approaches by using inferred ancestral haplotypes. We first estimate the ancestral haplotypes of case individuals and then, for each individual, an ancestral-haplotype-based similarity score is computed by comparing that individuals observed genotype with the estimated ancestral haplotypes. Trait values are then regressed on the similarity scores. Covariates can easily be incorporated into this regression framework. To account for the bias in the raw p-values due to the use of case data in constructing ancestral haplotypes, as well as to account for variation in ancestral haplotype estimation, a permutation procedure is adopted to obtain empirical p-values. Compared with the standard haplotype score test and the multilocus T(2) test, our method improves power when neither the allele frequency nor linkage disequilibrium between the disease locus and its neighboring SNPs is too low and is comparable in other scenarios. We applied our method to the Genetic Analysis Workshop 15 simulated SNP data and successfully pinpointed a stretch of SNPs that covers the fine-scale region where the causal locus is located.
We propose a similarity-based regression method to detect associations between traits and multimarker genotypes. The model regresses similarity in traits for pairs of "unrelated" individuals on their haplotype similarities, and detects the significance by a score test for which the limiting distribution is derived. The proposed method allows for covariates, uses phase-independent similarity measures to bypass the needs to impute phase information, and is applicable to traits of general types (e.g., quantitative and qualitative traits). We also show that the gene-trait similarity regression is closely connected with random effects haplotype analysis, although commonly they are considered as separate modeling tools. This connection unites the classic haplotype sharing methods with the variance-component approaches, which enables direct derivation of analytical properties of the sharing statistics even when the similarity regression model becomes analytically challenging.
Prior genome-wide association studies (GWAS) of major depressive disorder (MDD) have met with limited success. We sought to increase statistical power to detect disease loci by conducting a GWAS mega-analysis for MDD. In the MDD discovery phase, we analyzed more than 1.2 million autosomal and X chromosome single-nucleotide polymorphisms (SNPs) in 18?759 independent and unrelated subjects of recent European ancestry (9240 MDD cases and 9519 controls). In the MDD replication phase, we evaluated 554 SNPs in independent samples (6783 MDD cases and 50?695 controls). We also conducted a cross-disorder meta-analysis using 819 autosomal SNPs with P<0.0001 for either MDD or the Psychiatric GWAS Consortium bipolar disorder (BIP) mega-analysis (9238 MDD cases/8039 controls and 6998 BIP cases/7775 controls). No SNPs achieved genome-wide significance in the MDD discovery phase, the MDD replication phase or in pre-planned secondary analyses (by sex, recurrent MDD, recurrent early-onset MDD, age of onset, pre-pubertal onset MDD or typical-like MDD from a latent class analyses of the MDD criteria). In the MDD-bipolar cross-disorder analysis, 15 SNPs exceeded genome-wide significance (P<5 × 10(-8)), and all were in a 248?kb interval of high LD on 3p21.1 (chr3:52?425?083-53?822?102, minimum P=5.9 × 10(-9) at rs2535629). Although this is the largest genome-wide analysis of MDD yet conducted, its high prevalence means that the sample is still underpowered to detect genetic effects typical for complex traits. Therefore, we were unable to identify robust and replicable findings. We discuss what this means for genetic research for MDD. The 3p21.1 MDD-BIP finding should be interpreted with caution as the most significant SNP did not replicate in MDD samples, and genotyping in independent samples will be needed to resolve its status.
Related JoVE Video
Journal of Visualized Experiments
What is Visualize?
JoVE Visualize is a tool created to match the last 5 years of PubMed publications to methods in JoVE's video library.
How does it work?
We use abstracts found on PubMed and match them to JoVE videos to create a list of 10 to 30 related methods videos.
Video X seems to be unrelated to Abstract Y...
In developing our video relationships, we compare around 5 million PubMed articles to our library of over 4,500 methods videos. In some cases the language used in the PubMed abstracts makes matching that content to a JoVE video difficult. In other cases, there happens not to be any content in our video library that is relevant to the topic of a given abstract. In these cases, our algorithms are trying their best to display videos with relevant content, which can sometimes result in matched videos with only a slight relation.