Multivariate generalized multifactor dimensionality reduction to detect gene-gene interactions
Jiin Choi
Taesung Park
https://doi.org/10.1186/1752-0509-7-S6-S15
© Choi and Park; licensee BioMed Central Ltd. 2013
Published: 13 December 2013
Abstract
Background
One of the greatest challenges in genome-wide association studies is the detection of gene-gene and/or gene-environment interactions for common complex human diseases. Ritchie et al. (2001) proposed the multifactor dimensionality reduction (MDR) method for interaction analysis. MDR is a combinatorial approach that reduces multi-locus genotypes into high-risk and low-risk groups. MDR has been widely used for case-control studies with binary phenotypes, and several extensions have been proposed. One of these, the generalized MDR (GMDR) proposed by Lou et al. (2007), allows adjustment for covariates and applies to both dichotomous and continuous phenotypes. GMDR uses the residual score of a generalized linear model of phenotypes to assign each genotype combination to the high-risk or low-risk group, whereas MDR uses the ratio of cases to controls.
Methods
In this study, we propose multivariate GMDR, an extension of GMDR for multivariate phenotypes. Jointly analysing correlated multivariate phenotypes may have more power to detect susceptibility genes and gene-gene interactions. We construct generalized estimating equations (GEE) with multivariate phenotypes to extend generalized linear models. Using the score vectors from GEE, we discriminate high-risk from low-risk groups. We applied the multivariate GMDR method to the blood pressure data of 7,546 subjects from the Korean Association Resource study: systolic blood pressure (SBP) and diastolic blood pressure (DBP). We compared the results of multivariate GMDR for SBP and DBP with the results from separate univariate GMDR analyses of SBP and DBP. We also applied the multivariate GMDR method to the repeatedly measured hypertension status of 5,466 subjects and compared its results with those of univariate GMDR at each time point.
Results
Results from the univariate and multivariate GMDR in the two-locus model, for both the blood pressure and the hypertension phenotypes, indicate the best combinations of SNPs whose interaction is significantly associated with the risk of high blood pressure or hypertension. Although the test balanced accuracy (BA) of the multivariate analysis was not always greater than that of the univariate analysis, the multivariate BAs were more stable, with smaller standard deviations.
Conclusions
In this study, we have developed the multivariate GMDR method using the GEE approach. Multivariate GMDR is useful when multiple correlated phenotypes are of interest.
Keywords
- Generalized Estimating Equation
- Multifactor Dimensionality Reduction
- Generalized Estimating Equation Model
- Continuous Phenotype
- FBN1 Gene
Background
Genome-wide association studies (GWAS) have been successfully conducted to detect disease susceptibility genes for common complex human diseases by focusing on associations between single-nucleotide polymorphisms (SNPs) and phenotypes [1]. While traditional methods for GWAS consider only one SNP at a time, some common complex human diseases such as diabetes, hypertension, and various types of cancers are known to be influenced by multiple genetic variants [2]. In addition, one of the greatest challenges in GWAS is to discover gene-gene and/or gene-environment interactions.
Classic logistic regression can be used to analyze gene-gene interactions [3]. However, logistic regression suffers from overfitting when modelling high-order interactions [4]. The multifactor dimensionality reduction (MDR) method is a nonparametric, model-free, combinatorial approach to interaction analysis that identifies a multi-locus model for association in case-control studies [5–9]. MDR reduces multi-locus genotypes into two disease risk groups: high-risk and low-risk. If the ratio of cases to controls in a combination of genotypes is larger than a pre-assigned threshold T (e.g., T = 1), the cell for that combination is labelled "high risk"; otherwise, it is labelled "low risk". MDR shows greater power for testing high-order interactions than logistic regression analysis [10]. Several statistical methods have been developed as extensions of the MDR approach [11–16]. One of these extensions is the generalized MDR (GMDR) proposed by Lou et al. [16]. GMDR allows adjustment for covariates and applies to both dichotomous and continuous phenotypes; instead of the ratio of cases to controls used in the original MDR method, it uses a score-based statistic obtained from a generalized linear model of the phenotype on the predictor variable and covariates.
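The cell-labelling rule of the original MDR can be illustrated with a minimal Python sketch. The function name `mdr_label_cells`, the genotype coding, and the toy data are hypothetical and are not taken from any MDR software; the sketch only shows the ratio-versus-threshold comparison described above.

```python
from collections import defaultdict


def mdr_label_cells(genotypes, is_case, threshold=1.0):
    """Label each multi-locus genotype cell as 'high' or 'low' risk.

    genotypes: one tuple of genotype codes per subject (hypothetical coding).
    is_case:   one boolean per subject, True for cases, False for controls.
    threshold: the pre-assigned ratio threshold T (T = 1 in the text's example).
    """
    cases = defaultdict(int)
    controls = defaultdict(int)
    for cell, case in zip(genotypes, is_case):
        if case:
            cases[cell] += 1
        else:
            controls[cell] += 1

    labels = {}
    for cell in set(cases) | set(controls):
        # Compare cases > T * controls instead of dividing, so that a cell
        # without controls does not cause a division by zero.
        is_high = cases[cell] > threshold * controls[cell]
        labels[cell] = "high" if is_high else "low"
    return labels


# Toy data (hypothetical): two loci, six subjects.
genotypes = [(0, 1), (0, 1), (0, 1), (2, 0), (2, 0), (2, 0)]
is_case = [True, True, False, False, False, True]
labels = mdr_label_cells(genotypes, is_case)
```

Here the cell (0, 1) holds two cases and one control and is labelled high risk, while the cell (2, 0) holds one case and two controls and is labelled low risk.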
These GWAS methods are generally implemented in a univariate framework that analyses one phenotype at a time, even though multiple phenotypes of interest are collected from a study population. In particular, pleiotropy, which arises from potential genetic correlation between multiple phenotypic traits, plays a role in the pathogenesis of correlated human diseases [17]. Jointly analysing correlated multivariate phenotypes may have more power to detect susceptibility genes and gene-gene interactions because it uses more information from the data. Classic multivariate methods such as likelihood-based mixed effects models [18, 19] and generalized estimating equations (GEE) [20], as well as extended versions of these methods [21, 22], can be applied to the multivariate phenotypes of GWAS.
In this study, we propose the multivariate GMDR method, which extends GMDR to multivariate phenotypes. We construct a GEE model with multivariate phenotypes to extend generalized linear models. The GEE approach is an exceptionally useful method for the analysis of longitudinal data, especially when the response variable is discrete [23]. Using the score vectors from GEE, we discriminate high-risk from low-risk groups. The proposed multivariate GMDR method can also handle repeatedly measured phenotypes.
We apply the proposed multivariate GMDR method to the blood pressure data of the Korean Association Resource study: systolic blood pressure (SBP) and diastolic blood pressure (DBP). A number of authors have conducted genome-wide association studies on blood pressure and hypertension in the Korean population [24–26] and in other populations [27–30]. However, little work has been done on gene-gene interaction analyses. We compare the results of multivariate GMDR for SBP and DBP with the results from the original univariate GMDR for SBP and DBP, respectively. We also apply the multivariate GMDR method to the repeatedly measured hypertension phenotypes and compare its results with those from univariate GMDR at each time point.
Methods
Multivariate GMDR
where $\widehat{\mu}_{i} = g^{-1}(Z_{i}\widehat{\gamma})$ and $\widehat{\gamma}$ is the estimator obtained from the estimating equations under the null hypothesis $H_{0}: \beta = 0$. $\widehat{B}_{i}$ and $\widehat{V}_{i}$ are calculated using $\widehat{\mu}_{i}$. The residual score vector is used to discriminate between case and control status for each individual. From the residual score vector of each individual, we propose aggregating its elements, $S_{i} = \sum_{j=1}^{t} S_{ij}$, and using this sum as the prediction score for that individual. If the sum of the prediction scores over the individuals who have a given genotype combination is greater than a threshold value, the cell corresponding to that genotype combination is assigned 'high-risk'. Otherwise, the cell is assigned 'low-risk'.
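The aggregation and cell assignment described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `mgmdr_label_cells` and the toy score vectors are hypothetical, the score vectors are taken as given rather than computed from a fitted GEE model, and the default threshold of 0 is an assumption because the text does not state the value used.

```python
from collections import defaultdict


def mgmdr_label_cells(genotypes, score_vectors, threshold=0.0):
    """Assign 'high' or 'low' risk to each genotype combination.

    genotypes:     one tuple of genotype codes per individual.
    score_vectors: one sequence (S_i1, ..., S_it) per individual, assumed to
                   be the residual score vector from a fitted GEE null model.
    threshold:     cut-off for the summed prediction scores
                   (0 is an assumption, not a value stated in the text).
    """
    cell_totals = defaultdict(float)
    for cell, scores in zip(genotypes, score_vectors):
        prediction_score = sum(scores)  # S_i = sum over j of S_ij
        cell_totals[cell] += prediction_score

    return {
        cell: "high" if total > threshold else "low"
        for cell, total in cell_totals.items()
    }


# Toy data (hypothetical): two loci, two phenotypes (t = 2), four individuals.
genotypes = [(0, 0), (0, 0), (1, 2), (1, 2)]
score_vectors = [(0.5, 0.25), (-0.25, 0.0), (-0.5, -0.25), (0.25, 0.0)]
risk_labels = mgmdr_label_cells(genotypes, score_vectors)
```

In this toy example the summed prediction score is 0.5 for the cell (0, 0) and -0.5 for the cell (1, 2), so the first cell is labelled high risk and the second low risk.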
Data
Subject characteristics of the KARE.
Phenotype | N(=7,546) | % | |
---|---|---|---|
Recruit area | |||
Ansung | 3,466 | 45.9 | |
Ansan | 4,080 | 54.1 | |
Gender | |||
Male | 3,743 | 49.6 | |
Female | 3,803 | 50.4 | |
Systolic blood pressure | |||
≥ 140 | 701 | 9.3 | |
< 140 | 6,845 | 90.7 | |
Diastolic blood pressure | |||
≥ 90 | 693 | 9.2 | |
< 90 | 6,853 | 90.8 | |
Age (years) | Mean | SD | |
Overall | 51.4 | 8.79 | |
Ansung | 55.0 | 8.82 | |
Ansan | 48.4 | 7.51 | |
Body mass index (kg/m^{2}) | |||
Overall | 24.4 | 3.08 | |
Hypertensive cases (SBP ≥ 140 or DBP ≥ 90) | N*(=5,466) | % | |
HP_{1} (Time 1) | 716 | 13.1 | |
HP_{2} (Time 2) | 706 | 12.9 | |
HP_{3} (Time 3) | 698 | 12.8 | |
Results
Preliminary analyses
To compare multivariate analysis with univariate analysis, we first fit a separate logistic regression model for each dichotomized blood pressure measurement, SBP_{B} and DBP_{B}, with covariate adjustment for recruitment area, age, sex, and BMI. The correlation between SBP_{B} and DBP_{B} is 0.48. The multivariate analysis with the two binary phenotypes (SBP_{B}, DBP_{B}) was conducted using the GEE approach. For the repeatedly measured hypertension status HP_{1}, HP_{2}, and HP_{3}, we fit a logistic model for each HP_{i} and fit the GEE model for the three HPs simultaneously. The pairwise correlations range from 0.32 to 0.36. In the GEE model, we considered two types of genetic effect for multivariate phenotypes: a homogeneous genetic effect and a heterogeneous genetic effect. However, when we compared the effect sizes and p-values of the homogeneous model with those of the heterogeneous model, there was no strong evidence supporting the homogeneous genetic effect. Therefore, we present the results of the GEE model with heterogeneous genetic effects for both the blood pressures and the repeatedly measured hypertension status.
To perform the gene-gene interaction analysis using GMDR, we first selected SNPs with strong marginal effects in the univariate models and, among those, selected the ones with strong effects in the multivariate models. For the SBP_{B} and DBP_{B} analysis, we selected the top 50 SNPs for each of SBP_{B} and DBP_{B}. From these 100 SNPs, we chose 35 SNPs using a p-value criterion (< 1 × 10^{-4}) from the GEE model. In a similar manner, we chose 34 SNPs for HP_{1}, HP_{2}, and HP_{3} by selecting the top 50 SNPs for each HP_{i} and applying the same p-value criterion from their GEE model.
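The two-stage selection can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `select_snps`, the SNP labels, and the p-values are hypothetical, and the p-values are taken as already computed by the univariate and GEE models. The sketch also takes the union of the per-phenotype top lists, so a SNP that ranks highly for more than one phenotype is counted once.

```python
def select_snps(univariate_pvalues, multivariate_pvalues, top_k=50, cutoff=1e-4):
    """Two-stage SNP filter.

    univariate_pvalues:   dict mapping phenotype -> {snp: p-value}.
    multivariate_pvalues: dict mapping snp -> p-value from the joint model.
    top_k:                number of top-ranked SNPs kept per phenotype.
    cutoff:               SNPs are kept if the joint p-value is below this.
    """
    candidates = set()
    for pvalues in univariate_pvalues.values():
        # Stage 1: rank SNPs by univariate p-value, keep the top_k smallest.
        ranked = sorted(pvalues, key=pvalues.get)
        candidates.update(ranked[:top_k])

    # Stage 2: keep candidates that also pass the multivariate criterion.
    return sorted(snp for snp in candidates if multivariate_pvalues[snp] < cutoff)


# Toy p-values (hypothetical), with top_k=1 to keep the example small.
univariate = {
    "SBP": {"snpA": 1e-6, "snpB": 1e-3, "snpC": 0.2},
    "DBP": {"snpA": 0.01, "snpB": 0.5, "snpC": 1e-5},
}
multivariate = {"snpA": 5e-6, "snpB": 0.3, "snpC": 2e-4}
selected = select_snps(univariate, multivariate, top_k=1, cutoff=1e-4)
```

With these toy values, stage 1 keeps snpA (top for SBP) and snpC (top for DBP), and stage 2 retains only snpA because the joint p-value of snpC is above the cut-off.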
Univariate logistic and multivariate GEE analyses of SBP_{B} and DBP_{B}
Selected SNPs of SBP and DBP from univariate and multivariate analyses.
CHR | SNP | Gene symbol | SBP Beta | SBP P-value | DBP Beta | DBP P-value | Multivariate Beta1 | Multivariate Beta2 | Multivariate P-value |
---|---|---|---|---|---|---|---|---|---|
1 | rs7555790 | PEX14 | 0.117 | 4.16E-03 | 0.184 | 4.46E-06 | 0.046 | 0.116 | 2.35E-05 |
2 | rs2111464 | | 0.200 | 1.11E-06 | 0.100 | 1.28E-02 | 0.293 | 0.195 | 8.77E-06 |
2 | rs1549022 | | 0.207 | 6.52E-07 | 0.111 | 5.89E-03 | 0.295 | 0.202 | 5.23E-06 |
3 | rs1768145 | | 0.169 | 8.24E-06 | 0.090 | 2.01E-02 | 0.233 | 0.161 | 7.95E-05 |
4 | rs17045441 | ANK2 | 0.065 | 1.06E-01 | 0.199 | 7.69E-08 | -0.090 | 0.058 | 3.91E-08 |
4 | rs2088983 | | 0.168 | 6.96E-06 | 0.090 | 1.82E-02 | 0.234 | 0.162 | 4.54E-05 |
15 | rs1378942 | CSK | -0.189 | 2.50E-05 | -0.192 | 1.85E-05 | -0.167 | -0.182 | 3.49E-06 |
16 | rs11866964 | ZNF423 | -0.089 | 3.66E-02 | -0.206 | 2.78E-06 | 0.036 | -0.087 | 3.26E-05 |
17 | rs12942470 | | 0.186 | 4.36E-06 | 0.041 | 3.12E-01 | 0.326 | 0.180 | 4.25E-06 |
20 | rs927833 | LOC100270679 | -0.127 | 2.31E-02 | 0.074 | 4.53E-02 | -0.343 | -0.130 | 7.43E-06 |
Univariate logistic and multivariate GEE analyses of HP_{1}, HP_{2}, and HP_{3}
Selected SNPs of longitudinal hypertension from univariate and multivariate analyses.
CHR | SNP | Gene symbol | HP_{1} Beta | HP_{1} P-value | HP_{2} Beta | HP_{2} P-value | HP_{3} Beta | HP_{3} P-value | Multivariate Beta1 | Multivariate Beta2 | Multivariate Beta3 | Multivariate P-value |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | rs4908736 | | 0.111 | 6.02E-03 | 0.178 | 8.83E-06 | 0.079 | 5.17E-02 | 0.110 | 0.178 | 0.079 | 1.21E-04 |
4 | rs17675997 | | 0.176 | 6.16E-06 | 0.051 | 2.11E-01 | 0.049 | 2.27E-01 | 0.175 | 0.051 | 0.048 | 1.27E-04 |
4 | rs2411259 | LOC152578 | 0.176 | 5.33E-06 | 0.051 | 2.06E-01 | 0.061 | 1.28E-01 | 0.174 | 0.051 | 0.061 | 1.13E-04 |
5 | rs12054837 | ARSB | -0.029 | 4.83E-01 | -0.042 | 3.18E-01 | 0.162 | 2.12E-05 | -0.031 | -0.043 | 0.168 | 2.95E-06 |
5 | rs294082 | | 0.067 | 1.02E-01 | 0.087 | 3.28E-02 | 0.181 | 5.80E-06 | 0.068 | 0.087 | 0.181 | 1.71E-04 |
5 | rs17677051 | | -0.086 | 3.79E-02 | -0.188 | 7.84E-06 | -0.089 | 3.16E-02 | -0.081 | -0.188 | -0.093 | 8.09E-05 |
5 | rs4867707 | | -0.088 | 3.22E-02 | -0.189 | 7.00E-06 | -0.091 | 2.77E-02 | -0.083 | -0.188 | -0.095 | 6.80E-05 |
6 | rs4084097 | | 0.163 | 9.61E-06 | -0.004 | 9.27E-01 | 0.092 | 1.68E-02 | 0.158 | -0.005 | 0.095 | 8.05E-06 |
6 | rs7751214 | EPHA7 | -0.191 | 9.16E-06 | -0.009 | 8.26E-01 | -0.099 | 1.85E-02 | -0.190 | -0.008 | -0.100 | 1.39E-05 |
8 | rs4495407 | | 0.038 | 3.60E-01 | 0.012 | 7.74E-01 | 0.185 | 8.40E-06 | 0.036 | 0.012 | 0.189 | 5.59E-05 |
8 | rs10956596 | | -0.044 | 2.82E-01 | -0.047 | 2.58E-01 | -0.185 | 8.82E-06 | -0.043 | -0.047 | -0.188 | 1.23E-04 |
8 | rs6470947 | | 0.053 | 1.94E-01 | 0.023 | 5.69E-01 | 0.187 | 6.69E-06 | 0.053 | 0.023 | 0.190 | 6.05E-05 |
8 | rs4615555 | | 0.051 | 2.17E-01 | 0.030 | 4.69E-01 | 0.191 | 3.81E-06 | 0.049 | 0.029 | 0.194 | 3.34E-05 |
8 | rs4279577 | | 0.052 | 2.06E-01 | 0.031 | 4.56E-01 | 0.192 | 3.44E-06 | 0.051 | 0.030 | 0.196 | 3.26E-05 |
8 | rs7465333 | | 0.050 | 2.31E-01 | 0.031 | 4.57E-01 | 0.189 | 6.33E-06 | 0.048 | 0.031 | 0.193 | 5.75E-05 |
11 | rs550214 | | 0.081 | 4.38E-02 | 0.175 | 6.09E-06 | 0.102 | 1.01E-02 | 0.077 | 0.174 | 0.106 | 8.32E-05 |
15 | rs11636344 | FBN1 | 0.075 | 5.81E-02 | 0.167 | 6.51E-06 | 0.035 | 3.88E-01 | 0.073 | 0.166 | 0.037 | 1.06E-04 |
16 | rs17722281 | WWOX | -0.142 | 7.68E-04 | -0.160 | 1.52E-04 | 0.034 | 4.16E-01 | -0.140 | -0.161 | 0.034 | 7.66E-06 |
Transition of hypertensive case over time.
HP_{2} Time 2 | HP_{1} Hypertension (716), HP_{3} Hypertension (288) | HP_{1} Hypertension (716), HP_{3} Normal | HP_{1} Normal, HP_{3} Hypertension (410) | HP_{1} Normal, HP_{3} Normal |
---|---|---|---|---|
Hypertension (706) | 166 | 154 | 147 | 239 |
Normal | 122 | 274 | 263 | 4,101 |
Univariate GMDR and multivariate GMDR analyses of SBP_{B} and DBP_{B}
We present the GMDR results for discovering gene-gene and/or gene-environment interactions. For the univariate GMDR analysis, logistic regression models for the dichotomized SBP_{B} and DBP_{B} were constructed with area, age, sex, and BMI as covariates under the null hypothesis of no genetic effect. For the multivariate GMDR analysis, the GEE model with the same covariates was constructed. To reduce the computational burden, we focused on the 35 SNPs selected in the preliminary analysis. All possible one- and two-locus models were fit for these 35 SNPs. Through 10-fold cross-validation, the best combination of loci, namely the one with the maximum training balanced accuracy (BA), which is the average of sensitivity and specificity, was chosen at each fold. To choose the final model, we considered the cross-validation consistency (CVC) among the set of best combinations.
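The two model-selection quantities can be illustrated with a short Python sketch. It is a simplified illustration, not the GMDR software: the function names, SNP labels, and toy data are hypothetical, BA is computed from plain counts as the average of sensitivity and specificity, and CVC is assumed to be the number of folds in which the same combination is chosen as the best model, which is the usual MDR convention.

```python
from collections import Counter


def balanced_accuracy(is_high_truth, is_high_predicted):
    """Average of sensitivity and specificity.

    Assumes both classes occur in is_high_truth; otherwise one of the
    divisions below would be by zero.
    """
    pairs = list(zip(is_high_truth, is_high_predicted))
    tp = sum(1 for truth, pred in pairs if truth and pred)
    fn = sum(1 for truth, pred in pairs if truth and not pred)
    tn = sum(1 for truth, pred in pairs if not truth and not pred)
    fp = sum(1 for truth, pred in pairs if not truth and pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def cross_validation_consistency(best_model_per_fold):
    """Return the most frequently chosen model and its fold count."""
    model, count = Counter(best_model_per_fold).most_common(1)[0]
    return model, count


# Toy example (hypothetical): 8 subjects, then 10 cross-validation folds.
truth = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, False, False, True, True]
ba = balanced_accuracy(truth, predicted)

folds = [("snp1", "snp2")] * 7 + [("snp1", "snp3")] * 3
final_model, cvc = cross_validation_consistency(folds)
```

In the toy example, sensitivity is 3/4 and specificity is 2/4, giving a BA of 0.625, and the pair (snp1, snp2) is selected in 7 of the 10 folds, giving a CVC of 7.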
Comparison of results for SBP and DBP by GMDR and multivariate GMDR.
No. of Loci | Method | Best model | Train BA | Test BA | CVC |
---|---|---|---|---|---|
1 | GMDR_SBP | rs1549022 | 0.544 | 0.544 | 6 |
1 | GMDR_DBP | rs11077135 | 0.548 | 0.547 | 7 |
1 | Multivariate GMDR | rs11866964 | 0.539 | 0.536 | 8 |
2 | GMDR_SBP | rs2111464, rs12942470 | 0.566 | 0.566 | 7 |
2 | GMDR_DBP | rs1378942, rs11866964 | 0.566 | 0.566 | 3 |
2 | Multivariate GMDR | rs7555790, rs11077135 | 0.551 | 0.546 | 2 |
Univariate GMDR and multivariate GMDR analyses of HP_{1}, HP_{2}, and HP_{3}
Comparison of results for longitudinal hypertension by GMDR and multivariate GMDR.
No. of Loci | Method | Best model | Train BA | Test BA | CVC |
---|---|---|---|---|---|
1 | GMDR_HP_{1} | rs11097953 | 0.542 | 0.543 | 9 |
1 | GMDR_HP_{2} | rs11115097 | 0.545 | 0.546 | 5 |
1 | GMDR_HP_{3} | rs7465333 | 0.540 | 0.542 | 5 |
1 | Multivariate GMDR | rs7168365 | 0.529 | 0.528 | 9 |
2 | GMDR_HP_{1} | rs11097953, rs7751214 | 0.555 | 0.540 | 6 |
2 | GMDR_HP_{2} | rs11115097, rs17722281 | 0.566 | 0.566 | 8 |
2 | GMDR_HP_{3} | rs7791839, rs6470947 | 0.563 | 0.563 | 9 |
2 | Multivariate GMDR | rs7791839, rs7168365 | 0.544 | 0.544 | 7 |
Comparison of univariate GMDR and multivariate GMDR
Comparison of results for SBP and DBP by multivariate GMDR and hypertension at time 1 (HP_{1}) by GMDR.
No. of Loci | Method | Best model | Train BA | Test BA | CVC |
---|---|---|---|---|---|
1 | Multivariate GMDR with BPs | rs11866964 | 0.539 | 0.536 | 9 |
1 | GMDR with HP_{1} | rs4811719 | 0.542 | 0.541 | 4 |
2 | Multivariate GMDR with BPs | rs1338574, rs4811719 | 0.560 | 0.557 | 7 |
2 | GMDR with HP_{1} | rs1338574, rs4811719 | 0.560 | 0.554 | 7 |
Conclusions
In this paper, we have developed a multivariate analysis for discovering gene-gene interactions, namely multivariate GMDR. Our multivariate GMDR analysis applies a GEE approach to multivariate phenotypes. Many studies have emphasized the importance of multivariate analysis in GWAS and the increase in power it provides [33–35]. Although the MDR method has been extended in a variety of ways [5–9], there have been no extensions to multivariate analysis. We proposed multivariate GMDR, which uses the GEE model to calculate the prediction score that serves as the tool for reducing the multifactor dimensionality. The GEE approach is an extension of generalized linear models to longitudinal data and handles both discrete and continuous phenotypes. Thus, our multivariate GMDR is applicable to both discrete and continuous phenotypes.
Through real GWAS data analysis, we investigated the properties of multivariate GMDR. Firstly, the result of multivariate GMDR does not always coincide with that of the GEE approach. That is, the best SNP set selected by multivariate GMDR does not always have the smallest p-value from the GEE model. In our analysis, however, the SNP set selected by multivariate GMDR still tends to have quite a small p-value. Secondly, the test BAs of multivariate GMDR are not always larger than those of univariate GMDR. As shown in Figures 3 to 5, the distribution of test BAs from multivariate GMDR differs from that of univariate GMDR. The test BAs of multivariate GMDR are more densely distributed, with a smaller standard deviation, than those of univariate GMDR. Thus, a direct comparison of test BAs between multivariate GMDR and univariate GMDR may lead to a misleading conclusion.
The proposed multivariate GMDR can be extended in many different ways. A modified version of BA that takes account of the distributional difference is expected to improve the performance of multivariate GMDR. A testing procedure using the modified BAs under the null distribution would enable us to demonstrate the increase in power of multivariate GMDR. At present, the prediction score is defined as the sum of the elements of the score vector from the GEE model. We are currently working on several different weighting schemes that account for various relationships between phenotypes. The weighted prediction score is also expected to improve the performance of multivariate GMDR. In future studies, all these extensions will be evaluated through extensive simulation studies.
