- Report
- Open Access
Conditional random field approach to prediction of protein-protein interactions using domain information
https://doi.org/10.1186/1752-0509-5-S1-S8
© Hayashida et al; licensee BioMed Central Ltd. 2011
- Published: 20 June 2011
Abstract
Background
To understand cellular systems and biological networks, it is important to analyze the functions and interactions of proteins and domains. Many methods for predicting protein-protein interactions have been developed. It is known that mutual information between residues at interacting sites can be higher than that at non-interacting sites. This is based on the idea that amino acid residues at interacting sites have coevolved with the corresponding residues in the partner proteins. Several studies have shown that such mutual information is useful for identifying contact residues in interacting proteins.
Results
We propose novel methods using conditional random fields for predicting protein-protein interactions. We focus on the mutual information between residues, and combine it with conditional random fields. In the methods, protein-protein interactions are modeled using domain-domain interactions. We perform computational experiments using protein-protein interaction datasets for several organisms, and calculate the AUC (area under the ROC curve) score. The results suggest that our proposed methods with and without mutual information outperform the EM (Expectation Maximization) method proposed by Deng et al., which is one of the best predictors based on domain-domain interactions.
Conclusions
We propose novel methods using conditional random fields with and without mutual information between domains. Our methods based on domain-domain interactions are useful for predicting protein-protein interactions.
Keywords
- Mutual Information
- Markov Random Field
- Protein Pair
- Conditional Random Field
- Domain Pair
Background
Understanding protein functions and protein-protein interactions is one of the important topics in molecular biology and bioinformatics. Recently, many researchers have investigated amino acid residues of proteins to reveal interactions and contacts between residues [1–4]. If residues at sites important for interactions between proteins are substituted in one protein, the corresponding residues in interacting partner proteins are expected to be substituted as well under selection pressure. Otherwise, such mutated proteins may lose the interactions. Fraser et al. confirmed that interacting proteins evolve at similar evolutionary rates by comparing putatively orthologous protein sequences between S. cerevisiae and C. elegans [5]. This means that substitutions of contact residues occur in both interacting proteins as long as the proteins keep interacting with each other. Therefore, mutual information (MI) between residues is useful for predicting protein-protein interactions for proteins of unknown function. MI is calculated from multiple sequence alignments of homologous protein sequences. Weigt et al. identified direct residue contacts between sensor kinase and response regulator proteins by message passing, which is an improvement of MI [4]. Burger and van Nimwegen used a dependence tree in which a node corresponds to a position in the amino acid sequences, and predicted interactions using a Bayesian network method [2].
Markov random field and conditional random field models have been well studied in natural language processing [6, 7]. In bioinformatics as well, methods for predicting protein function from protein-protein interaction networks and other biological networks have been developed using Markov random fields [8, 9]. Separately, several prediction methods have been developed based on domain-domain interactions. Deng et al. proposed a domain-based probabilistic model of protein-protein interactions and developed the EM (Expectation Maximization) method [10]. Based on this probabilistic model, LP (Linear Programming)-based methods were developed [11], and Chen et al. improved the accuracy of interaction strength prediction with the APM (Association Probabilistic Method) [12].
In this paper, we propose prediction methods based on domain-domain interactions using conditional random fields with and without mutual information. Furthermore, we perform computational experiments on several protein-protein interaction datasets and compare the methods with the EM method proposed by Deng et al. [10], which is one of the best predictors based on domain-domain interactions, and with the association method proposed by Sprinzak and Margalit [13] (the APM method for binary interaction data is equivalent to the association method). We show that our methods outperform the EM method and the association method.
Mutual information between domains
In order to investigate the relationship between two positions of proteins, MI for distributions of amino acids at the positions is used. Such distributions can be obtained from multiple alignments of protein sequences and domain sequences. In this section, we briefly review MI for distributions of amino acids, and explain MI between domains.
Figure 1. Illustration of the calculation of mutual information from multiple alignments of domains. Domains $D_m$ and $D_n$ each have multiple alignments of sequences from several organisms. Mutual information is calculated for each pair of positions $i$ and $j$.
Let $\mathcal{A}$ be the set of amino acids, $f_i(A)$ be the appearance frequency of amino acid $A$ at position $i$ in domains $D_m$ and $D_n$, and $f_{ij}(A, B)$ be the joint appearance frequency of a pair of amino acids, $A$ at position $i$ in $D_m$ and $B$ at position $j$ in $D_n$, where each frequency is divided by the number of paired sequences $M$ in the multiple alignments such that $\sum_{A \in \mathcal{A}} f_i(A) = \sum_{A,B \in \mathcal{A}} f_{ij}(A,B) = 1$.
If the frequency distributions of amino acids at positions $i$ and $j$ are independent of each other, $f_{ij}(A,B) \approx f_i(A) f_j(B)$, and $MI_{ij}$ approaches zero. This means that the two positions are not related to each other in the evolutionary process. If domains $D_m$ and $D_n$ interact at the positions, $MI_{ij}$ is expected to be high because the positions have coevolved in order to keep the interaction. It should be noted that two positions $i$ and $j$ do not always interact directly even if $MI_{ij}$ is high [4]. However, proteins with such high MI values may interact directly with each other at other positions.
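The behaviour described above can be checked on toy data. The following is a minimal sketch that assumes the standard definition $MI_{ij} = \sum_{A,B} f_{ij}(A,B) \log \frac{f_{ij}(A,B)}{f_i(A) f_j(B)}$ with plain empirical frequencies; the function name `mutual_information` and the toy columns are hypothetical, and no pseudocount is applied, so this may differ in detail from the frequencies used in the paper.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """MI between two alignment columns (assumed standard definition).

    col_i[k] and col_j[k] must come from the same paired sequence k.
    Frequencies are plain counts divided by M, without pseudocounts.
    """
    if len(col_i) != len(col_j) or not col_i:
        raise ValueError("columns must be non-empty and of equal length")
    m = len(col_i)  # number of paired sequences M
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        f_ab = n_ab / m
        mi += f_ab * math.log(f_ab / ((count_i[a] / m) * (count_j[b] / m)))
    return mi


# Toy columns: the first pair covaries perfectly, the second is independent.
covarying = mutual_information("AAGG", "LLKK")
independent = mutual_information("AAGG", "LKLK")
```

In this toy case the covarying columns give $\log 2$ and the independent columns give zero, matching the limiting behaviour described in the text.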

where $\eta$ is a constant; in this paper we use $\eta = 1$. It should be noted that the sum over all amino acids $A$, […] and […] because $\sum_{A \in \mathcal{A}} f_i(A) = \sum_{A,B \in \mathcal{A}} f_{ij}(A,B) = 1$.
where $\langle v \rangle$ means the average of $v$, and $i$ and $j$ are positions of $D_m$ and $D_n$, respectively. Since $MI_{ij}$ tends to be high for positions $i$ and $j$ that include many gaps, we exclude positions that include more than 20% gaps as in [14].
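The gap filter can be sketched as follows. This is a minimal illustration under stated assumptions: the alignment is a list of equal-length strings, gaps are written as `-` or `.`, and the helper name `keep_positions` is hypothetical rather than taken from the paper.

```python
def keep_positions(alignment, max_gap_fraction=0.2, gap_chars="-."):
    """Return the column indices whose gap fraction does not exceed the threshold.

    A column is dropped only if it contains MORE than max_gap_fraction gaps.
    """
    if not alignment:
        return []
    n_seqs = len(alignment)
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    kept = []
    for pos in range(length):
        gaps = sum(1 for seq in alignment if seq[pos] in gap_chars)
        if gaps / n_seqs <= max_gap_fraction:
            kept.append(pos)
    return kept


# Toy alignment: column 2 has 3 gaps out of 5 sequences and is excluded.
toy_alignment = [
    "AC-DE",
    "AC-DE",
    "A--DE",
    "ACGD-",
    "ACGDE",
]
kept = keep_positions(toy_alignment)
```

Columns with exactly 20% gaps are kept here, since the text excludes only positions with more than 20% gaps.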
Conditional random field model for PPI
Figure 2. Markov random field model for protein-protein interactions. Left: example of proteins $P_i$ and $P_j$, where $P_i$ consists of domains $D_1$ and $D_2$, and $P_j$ consists of domain $D_3$. Right: factor graph $G(U, V, E)$. There exists an edge between $P_{ij} \in U$ and $D_{mn} \in V$ if and only if $D_{mn} \in P_{ij}$.
where $\lambda(mn) = \log(1 - \Pr(D_{mn} = 1))$.
where $p_{ij} \in \{0, 1\}$, $d$ means a set of events on domain-domain interactions, $D_{mn} = d_{mn}$ ($d_{mn} \in \{0, 1\}$), […] denotes a local feature, […] is the corresponding weight parameter and is related to the joint probability $\Pr(P_{ij} = s, D_{mn} = t)$, and $Z_{ij}$ denotes the normalization constant. For instance, equation (8) for $p_{ij} = 0$ is equivalent to equation (7) in the case that […] for all protein pairs $(P_i, P_j)$ and […] if $s = t = 0$, otherwise 0.
In Markov random fields, random variables have Markov properties represented by an undirected graph [15]. The factor graph for our model is a bipartite graph $G(U, V, E)$ with a set of vertices $U$ corresponding to protein-protein interactions $P_{ij}$, a set of vertices $V$ corresponding to domain-domain interactions $D_{mn}$, and a set of edges $E$ between $U$ and $V$, as shown in the right panel of Figure 2. There exists an edge between $P_{ij} \in U$ and $D_{mn} \in V$ if and only if $D_{mn} \in P_{ij}$. In the left example of Figure 2, protein pair $(P_i, P_j)$ includes domain pairs $(D_1, D_3)$ and $(D_2, D_3)$. Then, in the factor graph, the vertex of $P_{ij}$ is connected with the vertices of $D_{13}$ and $D_{23}$. Although the vertex of $P_{ij}$ has no adjacent vertices other than those of $D_{13}$ and $D_{23}$, the vertices of $D_{13}$ and $D_{23}$ can be connected with vertices other than that of $P_{ij}$.
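The construction of the bipartite graph can be sketched in a few lines. This is an illustrative sketch only: the function `build_factor_graph`, the dictionary layout, and the extra protein `Pk` are hypothetical and are introduced to show how a domain-pair vertex can be shared by more than one protein pair.

```python
from itertools import product


def build_factor_graph(protein_domains, protein_pairs):
    """Return (U, V, E) of the bipartite graph described in the text.

    protein_domains: dict mapping a protein name to its list of domains.
    protein_pairs: iterable of (protein, protein) tuples, one per vertex in U.
    An edge joins a protein pair to every domain pair it contains.
    """
    U, V, E = [], set(), set()
    for p_i, p_j in protein_pairs:
        u = (p_i, p_j)
        U.append(u)
        for d_m, d_n in product(protein_domains[p_i], protein_domains[p_j]):
            v = tuple(sorted((d_m, d_n)))  # treat the domain pair as unordered
            V.add(v)
            E.add((u, v))
    return U, sorted(V), sorted(E)


# Pi = {D1, D2} and Pj = {D3} as in the example; Pk is a hypothetical extra protein.
domains = {"Pi": ["D1", "D2"], "Pj": ["D3"], "Pk": ["D1"]}
U, V, E = build_factor_graph(domains, [("Pi", "Pj"), ("Pk", "Pj")])
```

Here the vertex for the pair (D1, D3) is adjacent to both protein pairs, whereas the vertex for (Pi, Pj) is adjacent only to (D1, D3) and (D2, D3).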



where $p$ means a set of events on protein-protein interactions, $P_{ij} = p_{ij}$.
$\sigma(x) = 1/(1 + e^{-x})$ is an increasing function, and $c$ is a positive constant. It should be noted that a negative value, $-1$, is given to […] because it is undesirable that a pair of domains interacts although the proteins having the pair do not interact. In this way, the local feature […] correlates protein-protein interactions $P_{ij}$ with domain-domain interactions $D_{mn}$ (see Figure 2).

Parameter estimation


In the BFGS method, this equation is repeatedly applied for updating a solution.
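For orientation, a standard textbook form of the BFGS iteration for minimizing a function $f$ (for example, a negative log-likelihood) is given below. The symbols $x_k$, $\alpha_k$, $H_k$, $s_k$, $y_k$, and $\rho_k$ are generic notation assumed here, not taken from the paper, and the exact update used by the authors may be written differently.

```latex
\begin{align}
x_{k+1} &= x_k - \alpha_k H_k \nabla f(x_k), \\
s_k &= x_{k+1} - x_k, \qquad
y_k = \nabla f(x_{k+1}) - \nabla f(x_k), \qquad
\rho_k = \frac{1}{y_k^{\top} s_k}, \\
H_{k+1} &= \left(I - \rho_k s_k y_k^{\top}\right) H_k \left(I - \rho_k y_k s_k^{\top}\right) + \rho_k s_k s_k^{\top}.
\end{align}
```

Here $H_k$ approximates the inverse Hessian and $\alpha_k$ is a step size chosen by line search; the limited-memory variant stores only the most recent pairs $(s_k, y_k)$ instead of the full matrix $H_k$.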
Computational experiments
Data and implementation
We used protein-protein interaction data of H. sapiens, D. melanogaster, and C. elegans from the DIP database [17] (file 'dip20091230.txt'). We used the UniProt Knowledgebase (version 15.4) [18] as protein domain inclusion data. We deleted proteins that did not have any domain, and obtained as positive data 294 interacting protein pairs that included 300 distinct proteins and 320 domains for H. sapiens, 449 interacting pairs that included 562 proteins and 449 domains for D. melanogaster, and 250 interacting pairs that included 602 proteins and 476 domains for C. elegans.
Figure: Distributions of domain MIs for H. sapiens, D. melanogaster, and C. elegans.
We selected non-interacting protein pairs as negative data uniformly at random such that negative data did not overlap with the positive data. The number of negative data was the same as that of positive data for each organism.
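The negative sampling step could look like the sketch below. It assumes that protein pairs are unordered, that self-pairs are not candidates, and that enumerating all pairs is affordable (it is for a few hundred proteins); the function name `sample_negative_pairs` and the `seed` parameter are hypothetical.

```python
import random
from itertools import combinations


def sample_negative_pairs(proteins, positive_pairs, seed=0):
    """Draw as many non-interacting pairs as there are positive pairs.

    Candidates are all unordered pairs of distinct proteins that are not
    in the positive set; sampling is uniform and without replacement.
    """
    positives = {frozenset(pair) for pair in positive_pairs}
    candidates = [
        pair
        for pair in combinations(sorted(proteins), 2)
        if frozenset(pair) not in positives
    ]
    if len(candidates) < len(positives):
        raise ValueError("not enough non-interacting pairs to sample from")
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return rng.sample(candidates, len(positives))


# Toy data: four proteins and two known interactions.
toy_proteins = ["P1", "P2", "P3", "P4"]
toy_positives = [("P1", "P2"), ("P3", "P4")]
negatives = sample_negative_pairs(toy_proteins, toy_positives)
```

Pairs absent from the interaction database are treated as non-interacting here, which is the same working assumption as in the text.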
We used libLBFGS (version 1.9), a C implementation of the limited-memory BFGS method [20] available at http://www.chokkan.org/software/liblbfgs/, with default parameters to estimate the weight parameters.
Result
In order to evaluate our methods, we compared the proposed CRF methods with and without MI against the EM method by Deng et al. [10] and the association method proposed by Sprinzak and Margalit [13]. The association method and the APM method [12] estimate the probabilities $\lambda_{mn}$ that domains $D_m$ and $D_n$ interact as $\lambda_{mn} = I_{mn} / N_{mn}$ and $\lambda_{mn} = \sum_{(i,j):\, D_{mn} \in P_{ij}} \rho_{ij} / N_{mn}$, respectively, where $N_{mn}$ ($I_{mn}$) denotes the number of (interacting) protein pairs that include domain pair $(D_m, D_n)$, and $\rho_{ij}$ denotes the interaction strength of protein pair $(P_i, P_j)$, $0 \le \rho_{ij} \le 1$. However, our input interaction data are binary, that is, $\rho_{ij}$ takes only 0 or 1. Then, the numerator of the APM method becomes $I_{mn}$, which means that the APM method for binary interaction data is equivalent to the association method. In the EM method, the probabilities $\lambda_{mn}$ that domains $D_m$ and $D_n$ interact are estimated by the recursive formula […], where $o_{ij} = 1$ denotes that proteins $P_i$ and $P_j$ were observed to interact with each other, and $fn = 0.8$. In this paper, the solution of the association method was used as the initial value for the EM method.
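As a concrete illustration of the association baseline, the sketch below computes $I_{mn} / N_{mn}$ for every domain pair and then scores a protein pair with the domain-based model as it is usually stated for Deng et al., $\Pr(P_{ij} = 1) = 1 - \prod_{D_{mn} \in P_{ij}} (1 - \lambda_{mn})$. All function names and the toy data are hypothetical, and the EM refinement is not shown.

```python
from collections import defaultdict
from itertools import product


def domain_pairs(protein_domains, p_i, p_j):
    """Unordered domain pairs contained in the protein pair (p_i, p_j)."""
    return {
        tuple(sorted(dp))
        for dp in product(protein_domains[p_i], protein_domains[p_j])
    }


def association_scores(protein_domains, labeled_pairs):
    """Estimate lambda_mn = I_mn / N_mn from binary-labeled protein pairs."""
    n_mn = defaultdict(int)  # protein pairs containing the domain pair
    i_mn = defaultdict(int)  # interacting protein pairs containing it
    for (p_i, p_j), label in labeled_pairs:
        for dp in domain_pairs(protein_domains, p_i, p_j):
            n_mn[dp] += 1
            i_mn[dp] += label
    return {dp: i_mn[dp] / n_mn[dp] for dp in n_mn}


def interaction_probability(protein_domains, scores, p_i, p_j):
    """Pr(P_ij = 1) = 1 - prod(1 - lambda_mn); unseen domain pairs count as 0."""
    prob_none = 1.0
    for dp in domain_pairs(protein_domains, p_i, p_j):
        prob_none *= 1.0 - scores.get(dp, 0.0)
    return 1.0 - prob_none


# Toy data: label 1 means the protein pair interacts, 0 means it does not.
toy_domains = {"A": ["D1"], "B": ["D2"], "C": ["D1", "D3"], "D": ["D2"]}
toy_data = [(("A", "B"), 1), (("C", "D"), 0), (("C", "B"), 1)]
scores = association_scores(toy_domains, toy_data)
prob_cd = interaction_probability(toy_domains, scores, "C", "D")
```

In this toy case the pair (D1, D2) occurs in three protein pairs, two of which interact, so its score is 2/3, and (D2, D3) scores 1/2.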

The AUC results for the training and test datasets of H. sapiens by the CRF method with MI, the CRF method without MI, the EM method, and the association method (Assoc)

| iteration | CRF with MI (training) | CRF with MI (test) | CRF without MI (training) | CRF without MI (test) | EM (training) | EM (test) | Assoc (training) | Assoc (test) |
|---|---|---|---|---|---|---|---|---|
| 1st | 0.999366 | 0.989247 | 0.999366 | 0.881720 | 0.999819 | 0.709677 | 0.999602 | 0.709677 |
| 2nd | 0.998787 | 0.919355 | 0.999312 | 0.923387 | 0.999909 | 0.875000 | 0.999330 | 0.854839 |
| 3rd | 1.000000 | 0.847222 | 1.000000 | 0.833333 | 1.000000 | 0.861111 | 1.000000 | 0.861111 |
| 4th | 0.999351 | 0.989583 | 0.999369 | 1.000000 | 0.999856 | 0.989583 | 0.999351 | 0.989583 |
| 5th | 0.999333 | 0.842365 | 0.999369 | 0.827586 | 0.999982 | 0.798030 | 0.999802 | 0.798030 |
| average | 0.999367 | 0.917554 | 0.999483 | 0.893205 | 0.999913 | 0.846680 | 0.999617 | 0.842648 |
The AUC results for the training and test datasets of D. melanogaster by the CRF method with MI, the CRF method without MI, the EM method, and the association method (Assoc)

| iteration | CRF with MI (training) | CRF with MI (test) | CRF without MI (training) | CRF without MI (test) | EM (training) | EM (test) | Assoc (training) | Assoc (test) |
|---|---|---|---|---|---|---|---|---|
| 1st | 0.999255 | 0.707692 | 0.999977 | 0.738462 | 0.999961 | 0.769231 | 0.999938 | 0.769231 |
| 2nd | 0.997928 | 0.818182 | 0.997905 | 0.848485 | 0.999938 | 0.727273 | 0.999736 | 0.727273 |
| 3rd | 0.997920 | 0.708333 | 0.997920 | 0.562500 | 0.999922 | 0.645833 | 0.999884 | 0.625000 |
| 4th | 0.998660 | 0.863636 | 0.999318 | 0.886364 | 0.999814 | 0.840909 | 0.999853 | 0.840909 |
| 5th | 0.999234 | 0.819444 | 0.999954 | 0.833333 | 0.999861 | 0.527778 | 0.999923 | 0.527778 |
| average | 0.998599 | 0.783458 | 0.999015 | 0.773829 | 0.999899 | 0.702205 | 0.999867 | 0.698038 |
The AUC results for the training and test datasets of C. elegans by the CRF method with MI, the CRF method without MI, the EM method, and the association method (Assoc)

| iteration | CRF with MI (training) | CRF with MI (test) | CRF without MI (training) | CRF without MI (test) | EM (training) | EM (test) | Assoc (training) | Assoc (test) |
|---|---|---|---|---|---|---|---|---|
| 1st | 0.999975 | 0.657143 | 0.999975 | 0.514286 | 1.000000 | 0.542857 | 1.000000 | 0.542857 |
| 2nd | 0.997899 | 0.923077 | 0.996873 | 0.948718 | 0.999875 | 0.743590 | 0.999825 | 0.743590 |
| 3rd | 0.998775 | 0.900000 | 0.998825 | 0.933333 | 0.999875 | 0.866667 | 0.999825 | 0.866667 |
| 4th | 0.998950 | 0.966667 | 0.999850 | 0.966667 | 0.999850 | 0.633333 | 0.999850 | 0.633333 |
| 5th | 0.998900 | 1.000000 | 0.998875 | 1.000000 | 0.999675 | 1.000000 | 0.999700 | 1.000000 |
| average | 0.998900 | 0.889377 | 0.998879 | 0.872601 | 0.999855 | 0.757289 | 0.999840 | 0.757289 |
Figure: Average ROC curves for the test datasets of H. sapiens by the CRF method with MI, the CRF method without MI, the EM method, and the association method.
Figure: Average ROC curves for the test datasets of D. melanogaster by the CRF method with MI, the CRF method without MI, the EM method, and the association method.
Figure: Average ROC curves for the test datasets of C. elegans by the CRF method with MI, the CRF method without MI, the EM method, and the association method.
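The AUC values reported above summarize each ROC curve in a single number. A minimal way to compute such a score from predicted probabilities is sketched below; it uses the pairwise-ranking interpretation of AUC, and the function name `auc_score` and the toy labels and scores are hypothetical rather than the evaluation code used in the paper.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for s_pos in pos:
        for s_neg in neg:
            if s_pos > s_neg:
                wins += 1.0
            elif s_pos == s_neg:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy predictions: one positive (0.6) is ranked below one negative (0.7).
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_score(toy_labels, toy_scores)
```

The quadratic loop is adequate for test folds of the size used here; a rank-based formulation would be preferable for much larger datasets.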
Conclusions
We proposed novel methods that combine conditional random fields with the domain-based model of protein-protein interactions. To improve performance, we introduced mutual information into the probabilistic model. In the improved model, mutual information between domains is given as a condition, where MI between domains is defined as the maximum of the MIs between residues in the domains. This method is based on the fact that amino acid residues at sites important for interactions have coevolved with each other, and that MI has been used for identifying contact residues in interactions. We performed five-fold cross-validation experiments and calculated the AUC for the predicted probabilities that two proteins interact. The results suggest that our proposed methods, especially the CRF method with mutual information, are useful. However, the AUC results for the training datasets imply that the estimated parameters overfit the training datasets. To avoid this problem, the methods could be improved, for instance, by adding a regularization term such as the l1-norm of the parameters to the log-likelihood function. Since CRFs have the advantage of being able to incorporate a large number of features, improving the model itself to obtain better accuracy, for instance by modifying the local feature and adding new features, remains future work.
Authors’ contributions
JS proposed the use of mutual information for predicting protein-protein interactions. Methods were developed and implemented by MH. MK and TA participated in the discussion during development of the methods. The manuscript was prepared by MH, JS, and TA.
Declarations
Acknowledgements
This work was partially supported by Grants-in-Aid #22240009 and #21700323 from MEXT, Japan. JS would like to thank the National Health and Medical Research Council of Australia (NHMRC) and the Chinese Academy of Sciences (CAS) for financially supporting this research via the NHMRC Peter Doherty Fellowship and the Hundred Talents Program of CAS.
This article has been published as part of BMC Systems Biology Volume 5 Supplement 1, 2011: Selected articles from the 4th International Conference on Computational Systems Biology (ISB 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1752-0509/5?issue=S1.
References
1. White RA, Szurmant H, Hoch JA, Hwa T: Features of protein-protein interactions in two-component signaling deduced from genomic libraries. Methods Enzymol. 2007, 422: 75-101.
2. Burger L, van Nimwegen E: Accurate prediction of protein-protein interactions from sequence alignments using a Bayesian method. Molecular Systems Biology. 2008, 4: 165. 10.1038/msb4100203.
3. Halabi N, Rivoire O, Leibler S, Ranganathan R: Protein sectors: Evolutionary units of three-dimensional structure. Cell. 2009, 138: 774-786. 10.1016/j.cell.2009.07.038.
4. Weigt M, White RA, Szurmant H, Hoch JA, Hwa T: Identification of direct residue contacts in protein-protein interaction by message passing. Proc. Natl. Acad. Sci. USA. 2009, 106: 67-72. 10.1073/pnas.0805923106.
5. Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, Feldman MW: Evolutionary rate in the protein interaction network. Science. 2002, 296: 750-752. 10.1126/science.1068696.
6. Sha F, Pereira F: Shallow parsing with conditional random fields. Proc. HLT-NAACL 2003. 2003, 134-141.
7. Sutton C, McCallum A: An introduction to conditional random fields for relational learning. Introduction to statistical relational learning. 2006, MIT Press, 93-128.
8. Deng M, Zhang K, Mehta S, Chen T, Sun F: Prediction of protein function using protein-protein interaction data. Journal of Computational Biology. 2003, 10 (6): 947-960. 10.1089/106652703322756168.
9. Deng M, Chen T, Sun F: An integrated probabilistic model for functional prediction of proteins. Journal of Computational Biology. 2004, 11: 463-475. 10.1089/1066527041410346.
10. Deng M, Mehta S, Sun F, Chen T: Inferring domain-domain interactions from protein-protein interactions. Genome Research. 2002, 12: 1540-1548. 10.1101/gr.153002.
11. Hayashida M, Ueda N, Akutsu T: Inferring strengths of protein-protein interactions from experimental data using linear programming. Bioinformatics. 2003, 19 (suppl 2): ii58-ii65. 10.1093/bioinformatics/btg1061.
12. Chen L, Wu LY, Wang Y, Zhang XS: Inferring protein interactions from experimental data by association probabilistic method. Proteins. 2006, 62 (4): 833-837. 10.1002/prot.20783.
13. Sprinzak E, Margalit H: Correlated sequence-signatures as markers of protein-protein interaction. Journal of Molecular Biology. 2001, 311: 681-692. 10.1006/jmbi.2001.4920.
14. Little DY, Chen L: Identification of coevolving residues and coevolution potentials emphasizing structure, bond formation and catalytic coordination in protein evolution. PLoS One. 2009, 4: e4762. 10.1371/journal.pone.0004762.
15. Moussouri J: Gibbs and Markov random systems with constraints. Journal of Statistical Physics. 1974, 10: 11-33. 10.1007/BF01011714.
16. Bertsekas DP: Nonlinear Programming. 1999, Athena Scientific.
17. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Research. 2004, 32: D449-D451. 10.1093/nar/gkh086.
18. The UniProt Consortium: The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Research. 2010, 38: D142-D148.
19. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer ELL, Eddy SR, Bateman A: The Pfam protein families database. Nucleic Acids Research. 2010, 38: D211-D222. 10.1093/nar/gkp985.
20. Nocedal J: Updating quasi-Newton matrices with limited storage. Mathematics of Computation. 1980, 35 (151): 773-782. 10.1090/S0025-5718-1980-0572855-7.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.