

FunSAV: Predicting the Functional Effect of Single Amino Acid Variants Using a Two-Phase Random Forest Model

  • Mingjun Wang,
  • Xing-Ming Zhao,
  • Kazuhiro Takemoto,
  • Haisong Xu,
  • Yuan Li,
  • Tatsuya Akutsu,
  • Jiangning Song

PLOS


  • Published: August 24, 2012
  • https://doi.org/10.1371/journal.pone.0043847

Abstract

Single amino acid variants (SAVs) are the most abundant form of known genetic variations associated with human disease. Successful prediction of the functional impact of SAVs from sequences can thus lead to an improved understanding of the underlying mechanisms of why a SAV may be associated with a certain disease. In this work, we constructed a high-quality structural dataset that contained 679 high-quality protein structures with 2,048 SAVs by collecting the human genetic variant data from multiple resources and dividing them into two categories, i.e., disease-associated and neutral variants. We built a two-stage random forest (RF) model, termed FunSAV, to predict the functional effect of SAVs by combining sequence, structure and residue-contact network features with other additional features that were not explored in previous studies. Importantly, a two-step feature selection procedure was proposed to select the most important and informative features that contribute to the prediction of disease association of SAVs. In cross-validation experiments on the benchmark dataset, FunSAV achieved a good prediction performance with an area under the curve (AUC) of 0.882, which is competitive with and in some cases better than other existing tools including SIFT, SNAP, Polyphen2, PANTHER, nsSNPAnalyzer and PhD-SNP. The source code of FunSAV and the datasets can be downloaded at http://sunflower.kuicr.kyoto-u.ac.jp/sjn/FunSAV.

Introduction

With the rapid progress of genomic profiling technologies such as single nucleotide polymorphism allele genotyping arrays and next-generation DNA sequencing, an unprecedented amount of information about single amino acid variants (SAVs) has been produced. According to the recent results of the 1000 Genomes Project [1], there are approximately 15 million SNPs, 1 million short insertions and deletions, and 20,000 structural variants in the human genome, and these numbers are still rapidly increasing. It is estimated that there are 3∼5 million SAVs in an individual according to the recent sequencing of the whole human genome [2], [3], [4], [5].

SAVs, also known as non-synonymous SNPs (nsSNPs), are the most abundant form of single nucleotide polymorphisms (SNPs) that cause amino acid substitutions in the protein products [6]. Among various SAVs, some may cause deleterious diseases, while other amino acid substitutions are neutral and do not affect the function of the protein. Previous studies on protein structures and functions have suggested that some SAVs are responsible for certain disease types, and it is reported that about 60% of Mendelian diseases are caused by amino acid substitutions [7]. Information on SAVs can be used to trace the migration patterns of ancient humans and the ancestry of modern humans. Nevertheless, its most important application may be to interpret the functional effect and impact of genomic variation, relating complex interactions with phenotypes and translating these discoveries into medical practice [8]. Therefore, discriminating disease-associated (i.e. non-neutral) from neutral variants is of great importance in the post-genomic era, as it can help to understand genotype/phenotype correlations and to develop treatment strategies for diseases. It is also important to identify whether a SAV is neutral or non-neutral from the disease diagnosis perspective.

In the past few decades, a variety of computational methods have been developed to predict the functional impact of SAVs in a protein [9], [10], [11], [12], [13], [14], [15], [16], [17]. These methods typically use approaches such as statistical rules or machine learning algorithms. The input features used by these methods generally include amino acid sequence, 3D structure, physicochemical properties of amino acids, evolutionary information and complex residue-contact network features. Most of these methods have been implemented as standalone software or webservers that are freely available to the academic research community for predicting the functional impact of SAVs. Most of them were developed based on protein sequence analysis, such as SIFT [18], SNAP [15], PANTHER [19] and PhD-SNP [20]. The consensus of those studies is that sequence features are essential for making the prediction, while 3D structural features can further improve the prediction of disease-associated SAVs when structure data is available.

In this study, we present a novel approach for predicting the functional impact of SAVs based on a two-stage random forest algorithm. This approach, termed FunSAV (Functional effect predictor of SAVs; see Fig. 1 for an overview of the methodology), combines a variety of sequence and structural features as well as network properties, and uses an efficient two-step feature selection to remove noisy and redundant features and to characterize the relative importance of each feature type. The final two-stage FunSAV classifier takes as input the prediction outputs from the first-stage classifier and scores from other prediction tools. Extensive comparisons of FunSAV with six other popular tools on the benchmark dataset and another independent test dataset show that this two-stage predictor provides a competitive performance with most of the tools, illustrating the effectiveness and advantage of this new approach.


Figure 1. Overview of the FunSAV method for predicting the functional effect of SAVs.

Features used by FunSAV are derived from the amino acid sequence of the protein, the 3D structure of the protein, as well as network properties which are calculated based on the representation of the protein structure as a residue-residue contact network. A full list of the extracted features is given in Table 1. After feature selection, features that distinguish disease-associated from neutral SAVs are statistically analyzed and used as the input to construct RF models. Prediction performance is evaluated by both 5-fold cross-validation and independent tests.

https://doi.org/10.1371/journal.pone.0043847.g001

Materials and Methods

Datasets

We retrieved the disease-associated and neutral SAVs to compile a structural benchmark dataset of human genetic variants [21]. First, disease-associated variants were extracted from the UniProt [22] human sequence variations (release 2010_11 as of 02 November 2010), where variants were divided into three categories: disease, polymorphism or unclassified. Disease-associated variants were further filtered by removing non-Mendelian disease variants that have not been assigned any MIM number from the OMIM database [23]. Neutral variants were taken from the Ensembl human variation database [24] (version 59_37d). In this study, we only extracted the SAVs verified by the HapMap project [25] to construct a high-quality benchmark dataset. Cd-hit [26] was then used to cluster protein sequences and reduce sequence homology in the dataset at the sequence identity (SI) level of 40%, in order to minimize the dataset bias introduced by homologues. All the sequences in the initial dataset were further mapped to the PDB database [27] by BLAST search [28]. All the NMR structures and the structures solved by X-ray diffraction with resolutions lower than 2.5 Å were excluded. Details of how to map the locations of variants onto the corresponding PDB structure can be found in previous work [21]. Next, ambiguous and conflicting annotations of the disease-associated and neutral variant entries were removed. Finally, we obtained a dataset with 679 protein structures containing 1,056 disease-associated and 992 neutral SAVs, with a roughly balanced ratio of 1:1. We randomly chose 865 disease-associated and 801 neutral SAVs as the benchmark dataset, and used the rest, comprising 191 disease-associated and 191 neutral SAVs, as the independent dataset in order to validate our method.

Feature Extraction

Sequence or sequence-derived features.

We derived a variety of different sequence features that have proved useful in previous studies of the functional effect prediction of SAVs. These include: (1) position-specific scoring matrices (PSSMs) generated by PSI-BLAST [28]; (2) predicted secondary structure by PSIPRED [29]; (3) predicted solvent accessibility by the SSpro program from the SCRATCH package [30]; (4) predicted native disorder by DISOPRED [31]; (5) conservation score extracted from the PSSM generated by PSI-BLAST; (6) PSIC score, which represents how likely it is for a particular amino acid to occupy a specific position in the protein sequence, calculated by PSIC [32]; (7) aggregation properties calculated by TANGO [33], which were used to describe the residue β-aggregation properties at mutation sites [34]. The combination of these sequence-derived features has been shown to be useful for predicting structural or functional properties of proteins in our recent work [9] and that of others [35], [36], [37], [38].

Structure features.

We used DSSP [39] to extract the secondary structure annotations, including hydrogen bonds, solvent-accessible surface area, Cα atom coordinates and backbone torsion angles. The number of hydrogen bonds was calculated by HBPLUS [40].

Conservation score.

Evolutionary conservation is an important concept in bioinformatics. Disease-related mutations are often observed in evolutionarily conserved positions, as these positions are essential for maintaining the structure or function of the protein [18], [41]. In contrast, neutral variants often appear in positions that have the potential to be mutated during evolution [41]. Therefore, the conservation score is a critical feature for predicting the functional impact of SAVs.

The conservation score can be defined as:

$$\mathrm{Score}_i = -\sum_{j=1}^{20} p_{i,j} \log p_{i,j}$$

where $p_{i,j}$ is the frequency of amino acid j at position i. These parameters were extracted from the PSSM generated by PSI-BLAST. A lower value of the conservation score indicates a higher conservation at that position.
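As a minimal sketch of how such a score behaves, the snippet below computes the Shannon entropy of one alignment column, which is consistent with the statement that lower values indicate higher conservation. The base-2 logarithm and the function name `conservation_score` are assumptions made for illustration, not details taken from the paper.

```python
import math


def conservation_score(freqs):
    """Entropy of one alignment column (assumed base 2).

    `freqs` holds the amino acid frequencies p_{i,j} at position i.
    Zero frequencies are skipped because p * log(p) tends to 0.
    """
    return sum(-p * math.log2(p) for p in freqs if p > 0)


# A fully conserved column gives the minimum score, while a column
# where all 20 amino acids are equally frequent gives the maximum.
conserved = [1.0] + [0.0] * 19
uniform = [1.0 / 20] * 20
print(conservation_score(conserved))
print(conservation_score(uniform))
```

The fully conserved column scores lowest and the uniform column scores highest, matching the interpretation given in the text.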

Coevolutionary features.

Coevolutionary features have recently been found useful for identifying important co-evolving residues that are more likely to be disease-associated upon mutation [42]. We employed several algorithms and extracted their respective coevolutionary scores as the candidate features. Among them, MI (Mutual Information) is a quantity that measures the mutual dependence between two random variables [43]. MIr [44] is a refined method that normalizes the raw MI value using the pair entropy. MIp [45] is another improved measure which removes the background MI by subtracting the APC (Average Product Correction) from the original MI value. Kai is another method that uses chi-squared statistics [46] to detect residue co-evolution from sequence alignments.
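The MI and APC-corrected scores described above can be sketched as follows. This is an illustrative implementation under stated assumptions (base-2 logarithm, alignment columns given as equal-length strings, hypothetical function names); it is not the software used in the study.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """MI (in bits) between two alignment columns of equal length."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    return sum(
        (c / n) * math.log2((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), c in count_ab.items()
    )


def apc_corrected(columns):
    """Pairwise MI minus the average product correction (APC)."""
    k = len(columns)
    mi = [[mutual_information(columns[i], columns[j]) if i != j else 0.0
           for j in range(k)] for i in range(k)]
    # Mean MI of each column with all other columns, and the overall mean.
    col_mean = [sum(row) / (k - 1) for row in mi]
    overall = sum(col_mean) / k
    if overall == 0:
        return mi  # no background signal to subtract
    return [[mi[i][j] - col_mean[i] * col_mean[j] / overall if i != j else 0.0
             for j in range(k)] for i in range(k)]


columns = ["AACC", "AACC", "ACAC"]  # toy alignment columns
print(mutual_information(columns[0], columns[1]))
print(apc_corrected(columns)[0][1])
```

Subtracting the APC term lowers the score of pairs whose columns have high MI with everything, which is the background effect the correction is meant to remove.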

Residue-contact network features.

Graph-theoretic approaches based on residue-residue contact networks have become a powerful tool for analyzing and predicting the functional impact of SAVs in recent years [47]. The networks were constructed as follows: two residues in a structure are defined as being in contact if the distance between their centers is within 6.5 Å. In this study, we calculated a number of distinctive residue-contact network properties that describe the local environment of the mutated residue in the residue-contact network, including degree, closeness, status, hubscore, clustering coefficient, cyclic coefficient, constraint, betweenness, eigenvector, cocitation, coreness and eccentricity.
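To make the contact definition concrete, the sketch below builds a contact graph from residue center coordinates with the 6.5 Å cutoff and derives two of the listed properties, degree and closeness. How a residue "center" is defined is not stated here, so the input coordinates are an assumption, as are the function names; the study itself reports using the JUNG library for network features.

```python
import math
from collections import deque


def contact_graph(centers, cutoff=6.5):
    """Adjacency sets for residues whose centers lie within `cutoff` Å."""
    adj = {i: set() for i in range(len(centers))}
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) <= cutoff:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def closeness(adj, source):
    """Reachable nodes divided by the sum of shortest-path lengths."""
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search over the unweighted graph
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Four toy residue centers spaced 5 Å apart along the x axis.
centers = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
graph = contact_graph(centers)
print(len(graph[1]), closeness(graph, 1))
```

Degree is simply the size of a residue's adjacency set, while closeness summarizes its geodesic distance to all other reachable residues.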

Solvent accessibility.

Solvent accessibility has been shown to be a powerful feature in predicting disease association [48], [49]. Apart from the solvent accessibility predicted by SSpro from the protein sequence, we also used the NACCESS program [50] to calculate the absolute and relative solvent accessibilities of all atoms, total side chain, main chain, non-polar side chain and all polar side chain, respectively.

Solvent exposure features.

New solvent exposure features such as Half-Sphere Exposure were used as candidate features, which were calculated by the hsexpo program [51]. These include the coordination number (CN), number of Cα atoms in the upper Half-Sphere (HSEAU), number of Cβ atoms in the upper Half-Sphere (HSEBU), number of Cα atoms in the lower Half-Sphere (HSEAD), number of Cβ atoms in the lower Half-Sphere (HSEBD), residue depth (RD) and atom depth (RDa).

Annotations from database.

Annotations regarding the functional sites of a protein can be found in the "FT" line in UniProt [22]. We extracted nine different types of functional annotations: ACT_SITE, BINDING, CA_BIND, DISULFID, DNA_BIND, LIPID, METAL, NP_BIND and MOD_RES.

Prediction scores by other tools.

These include: (1) SIFT score, which was calculated by the SIFT program that uses sequence homology to predict whether a substitution affects protein function [10], [18]; (2) SNAP score: SNAP is a method that predicts the functional effect of single amino acid substitutions based on neural networks [15]; (3) Polyphen2 score [52]: Polyphen2 is a tool based on Naïve Bayes, and its output probability that a SAV is damaging was used as the input feature; (4) PANTHER score [19]: PANTHER uses Hidden Markov Models (HMMs) to predict the effect of missense SNPs on protein function and can output the probability that a variant is deleterious; (5) nsSNPAnalyzer [53], which is based on the RF algorithm and outputs the predicted phenotypic class. We encoded the disease-associated class as 1 and the neutral class as -1; (6) PhD-SNP [20], which is based on SVM [20] and outputs the predicted phenotypic class. Similarly, we encoded the predicted class into our RF models.

Feature vector encoding.

The extracted features are listed in Table 1. We used a sliding window approach with a size of 15 residues to extract the relevant features and used them as the input to build the RF models. In terms of feature nomenclature, the residues were named V1, V2, …, V15 according to their positions in the local window, with the centered residue denoted as V8. The elements in the PSSM (with a total dimension of 15×20 = 300) were denoted as V1, V2, …, V300, respectively. Table 2 lists the abbreviations of the 15 final selected features used in this study.
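The window naming scheme can be illustrated with a short sketch that extracts a 15-residue window around a mutation site and labels the positions V1 to V15, with the mutated residue at V8. Padding positions that fall outside the sequence with a constant is an assumption made here, since the handling of chain termini is not described; the function name is also hypothetical.

```python
def window_features(per_residue, center, size=15, pad=0.0):
    """Map a per-residue feature to window labels V1..V{size}.

    `per_residue` is a list with one value per sequence position and
    `center` is the 0-based index of the mutated residue.
    """
    half = size // 2
    window = {}
    for k in range(size):
        idx = center - half + k
        inside = 0 <= idx < len(per_residue)
        # Positions beyond either terminus receive the padding value.
        window[f"V{k + 1}"] = per_residue[idx] if inside else pad
    return window


values = list(range(100, 130))  # toy per-residue feature, 30 positions
features = window_features(values, center=10)
print(features["V1"], features["V8"], features["V15"])
```

With this convention a feature such as a network status value at V7 refers to the residue immediately preceding the mutation site in the sequence.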


Table 1. Features used in this study, which are categorized into nine major types: sequence or sequence-derived, structure, residue-contact network features, computed scores, annotations from database, solvent exposure features, coevolutionary features, solvent accessibilities and conservation score.

https://doi.org/10.1371/journal.pone.0043847.t001

Performance Evaluation

We used Sensitivity (SEN), Specificity (SPE), Precision (PRE), Accuracy (ACC), the Matthews correlation coefficient (MCC) and the area under the curve (AUC) to evaluate the predictive performance of our method.

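The Sensitivity (SEN) is defined as:

$$\mathrm{SEN} = \frac{TP}{TP + FN}$$

The Specificity (SPE) is defined as:

$$\mathrm{SPE} = \frac{TN}{TN + FP}$$

The Precision (PRE) is defined as:

$$\mathrm{PRE} = \frac{TP}{TP + FP}$$

The overall Accuracy (ACC) is defined as:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

The Matthews correlation coefficient (MCC) [54] is defined as:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives, respectively.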
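The five threshold-based measures can be computed directly from the confusion-matrix counts, as in the following sketch. It uses the standard definitions of these measures; the function name and the example counts are illustrative only and are not results from the study.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return SEN, SPE, PRE, ACC and MCC from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SEN": tp / (tp + fn),
        "SPE": tn / (tn + fp),
        "PRE": tp / (tp + fp),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        # MCC is set to 0 when any marginal total is empty.
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical counts, with disease-associated SAVs as the positive class.
for name, value in classification_metrics(tp=80, tn=70, fp=30, fn=20).items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four counts symmetrically, which makes it a reasonable single criterion for the feature selection steps described later.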

More specifically, AUC is the area under the receiver operating characteristic (ROC) curve, which is a plot of the true positive rate (TPR) against the false positive rate (FPR). TPR is the ratio of the number of correctly classified disease-associated SAVs to the total number of disease-associated variants, while FPR is the ratio of the number of neutral SAVs incorrectly classified as disease-associated to the total number of neutral variants.

Feature Selection
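One way to compute the AUC without drawing the ROC curve is to use its rank interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below follows that interpretation; the function name and toy scores are assumptions for illustration.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy predicted scores; label 1 marks disease-associated SAVs.
print(roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

This pairwise form is quadratic in the number of samples, which is acceptable for a dataset of a few thousand variants.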

We proposed a novel two-step feature selection process to select the most informative features for predicting the functional effect of SAVs. The first feature selection method is based on the mean decrease Gini index (MDGI), which was calculated by the RF package in R [55]. MDGI is the mean decrease of the Gini index, which is equal to the Gini coefficient multiplied by 100. The Gini coefficient is a measure of inequality of a distribution and is defined as a ratio of the areas on the Lorenz curve diagram [56]. MDGI represents the importance of an individual vector element for correctly classifying a SAV as being disease-associated or neutral. The mean MDGI Z-Score of each vector element is defined by the following equation:

$$Z_i = \frac{x_i - \bar{x}}{\sigma}$$

where $x_i$ is the mean MDGI of the i-th feature, $\bar{x}$ is the mean value of all elements of the feature x and σ is the standard deviation (SD), respectively. In this study, a vector element with an MDGI Z-Score larger than 1.0 was selected as an optimal feature candidate (OFC).
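The first selection step amounts to standardizing the per-feature MDGI values and keeping those above the threshold. The sketch below assumes the MDGI values have already been obtained from a trained random forest and uses the population standard deviation; which SD variant was used is not stated, and the feature names and values are hypothetical.

```python
from statistics import mean, pstdev


def select_ofcs(mdgi, threshold=1.0):
    """Keep features whose MDGI Z-Score exceeds `threshold`."""
    values = list(mdgi.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("all MDGI values are identical")
    z_scores = {name: (v - mu) / sigma for name, v in mdgi.items()}
    return {name: z for name, z in z_scores.items() if z > threshold}


# Hypothetical mean MDGI values for five features.
mdgi = {"f1": 1.0, "f2": 1.0, "f3": 1.0, "f4": 1.0, "f5": 10.0}
print(select_ofcs(mdgi))
```

Only features that stand out from the bulk of the importance distribution survive, which is how a large initial feature set can be reduced to a small candidate set before the costlier stepwise search.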

The second step is a stepwise feature selection performed by training and evaluating the corresponding RF classifiers based on 5-fold cross-validation tests. We randomly divided our benchmark dataset into five subsets in each validation step. Then, at each cross-validation step, four subsets were merged as the training set to train the model, while the remaining subset was singled out as the test set to validate the built model. This process was repeated five times such that each subset was used once for testing and four times for training. The above five-fold cross-validation procedure was then repeated 100 times. As a result, we calculated the average of the predicted scores of the RF classifiers, and then carried out the performance evaluation.
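The repeated cross-validation scheme can be sketched as an index generator that reshuffles the data before each of the repeats and yields one train/test split per fold. The fold assignment by striding and the fixed random seed are illustrative choices, not details from the study.

```python
import random


def repeated_kfold(n_samples, k=5, repeats=100, seed=0):
    """Yield (train, test) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # new random partition for every repeat
        folds = [order[i::k] for i in range(k)]
        for held_out in range(k):
            test = folds[held_out]
            train = [i for f, fold in enumerate(folds)
                     if f != held_out for i in fold]
            yield train, test


splits = list(repeated_kfold(n_samples=20, k=5, repeats=3))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Averaging each sample's predicted score over the repeats, as described above, then gives one score per SAV on which the performance measures can be evaluated.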

We performed the stepwise feature selection (also called backward feature selection) by training the RF model with all the initial OFC features in the first round. Then, in the second round, one feature was removed from the initial feature set. In this round, each feature was removed one at a time, and all of the 65 combinations (each containing the remaining 64 features) were used to train the corresponding RF models, whose performance was subsequently evaluated. If the resulting RF predictor achieved a higher MCC, that feature was removed and the corresponding combination was used in the next round. This stepwise feature selection procedure continued until the MCC no longer increased. In this way, the most important and informative features can be systematically identified.
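A greedy version of this backward selection can be sketched as below. The `evaluate` callback stands in for training an RF model with cross-validation and returning its MCC; choosing the single best removal in each round is one reasonable reading of the procedure, not a confirmed detail, and the toy scoring function is purely illustrative.

```python
def backward_elimination(features, evaluate):
    """Drop one feature per round while the score keeps improving."""
    current = list(features)
    best = evaluate(current)
    while len(current) > 1:
        # Score every subset obtained by leaving out exactly one feature.
        trials = [(evaluate([f for f in current if f != drop]), drop)
                  for drop in current]
        score, drop = max(trials, key=lambda t: t[0])
        if score <= best:
            break  # no removal improves the score, so stop
        best = score
        current.remove(drop)
    return current, best


# Toy scorer: informative features add 1, noisy features subtract 0.1.
informative = {"a", "b"}


def toy_score(subset):
    chosen = set(subset)
    return len(chosen & informative) - 0.1 * len(chosen - informative)


print(backward_elimination(["a", "b", "c", "d"], toy_score))
```

Each round costs one model evaluation per remaining feature, so starting from 65 candidates keeps the search far cheaper than an exhaustive search over all subsets.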

Random Forest

The random forest algorithm was originally developed by Leo Breiman [57] and has been implemented as the Random Forest package in R [55]. In this study, we designed and constructed the first-stage and two-stage RF models of FunSAV. Specifically, the first-stage RF classifier of FunSAV was trained based on the 15 optimal features that were selected through a two-step feature selection procedure. We further developed a two-stage predictor, which was trained using RF by combining the outputs of the first-stage classifier and the scores from six other tools: SIFT, SNAP, PolyPhen2, nsSNPAnalyzer, PANTHER and PhD-SNP.

Results and Discussion
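The input to the second-stage model can be pictured as a short vector that concatenates the first-stage score with the outputs of the six external tools. The sketch below only assembles that vector; the dictionary layout, the label strings "disease" and "neutral", and the function names are hypothetical, while the 1/-1 encoding of class outputs follows the description of the tool-derived features.

```python
TOOLS = ("SIFT", "SNAP", "PolyPhen2", "PANTHER", "nsSNPAnalyzer", "PhD-SNP")
# Tools that return a predicted class rather than a numeric score.
CLASS_OUTPUT = {"nsSNPAnalyzer", "PhD-SNP"}


def encode_class(label):
    """Encode a predicted class as 1 (disease-associated) or -1 (neutral)."""
    return 1.0 if label == "disease" else -1.0


def second_stage_vector(first_stage_score, tool_outputs):
    """Concatenate the first-stage score with the six tool outputs."""
    vector = [float(first_stage_score)]
    for tool in TOOLS:
        value = tool_outputs[tool]
        if tool in CLASS_OUTPUT:
            vector.append(encode_class(value))
        else:
            vector.append(float(value))
    return vector


# Hypothetical outputs for a single SAV.
outputs = {"SIFT": 0.02, "SNAP": 3, "PolyPhen2": 0.98, "PANTHER": 0.7,
           "nsSNPAnalyzer": "disease", "PhD-SNP": "neutral"}
print(second_stage_vector(0.81, outputs))
```

A second random forest trained on such seven-element vectors would then act as a meta-classifier over the first-stage model and the external predictors.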

Optimal Feature Candidate Selection

It is well known that efficient feature selection can significantly improve the prediction performance of machine learning-based classifiers. Furthermore, feature selection can be used to select the most relevant and informative features that contribute to the success of a classifier by reducing the initial high-dimensional feature space to a lower-dimensional but more compact one. In this work, we selected 15 optimal features that were shown to better distinguish disease-associated from neutral SAVs to train the first-stage FunSAV classifier based on the constructed benchmark dataset.

These 15 final optimal features were selected in two consecutive steps. In the first step, the mean MDGI Z-Scores of all the 1804 initial features (see Table S1 for a full list) were calculated by RF, and the relative importance of these features was sorted and evaluated. As a result, 65 features with a mean MDGI Z-Score >1.0 were selected as OFCs. The relative importance and ranking of the optimal feature groups are given in Figure 2. Among them, the feature with the highest mean MDGI Z-Score (>9.0) is the solvent accessibility feature calculated by NACCESS. Solvent exposure features and DSSP_ACC also have larger MDGI Z-Scores, while network and co-evolution features have moderate MDGI Z-Scores ranging from 1.0 to 4.0 (Fig. 2).


Figure 2. The relative importance and ranking of the optimal feature groups, as evaluated by the mean MDGI Z-Score.

Each bar represents the mean MDGI Z-Score of the corresponding feature group. NACCESS: solvent accessibilities calculated by NACCESS [50]; exposure: solvent exposure features calculated by the biopython package [51]; network: residue-contact network features calculated by the JUNG library available at http://jung.sourceforge.net/; PSSM: PSSM features calculated by PSI-BLAST [28]; co-evolution: coevolutionary features including MIr, MIp, MI and Kai value; DSSP_ACC: the number of water molecules in contact with the residue of interest extracted from DSSP [39]; conserve_score: conservation score defined in the Feature Extraction section; SSpro: solvent accessibility calculated by the SSpro program [30]; MW_change: molecular weight change upon mutation; B_factor: the temperature factor extracted from the PDB file; DISOPRED: predicted native disorder by DISOPRED [31].

https://doi.org/10.1371/journal.pone.0043847.g002

The second step is to further select the more important features in a stepwise manner. Figure S1 shows the performance of RF-based classifiers in terms of MCC when gradually incorporating stepwise selected features. The mean values and standard deviations of the 15 finally selected features, and the P-values indicating the statistical significance of the difference between the disease-associated and neutral SAVs, are provided in Table S2. It can be seen that four types of residue-contact network features and four types of solvent exposure features were included in the final feature set. The majority of the finally selected features are descriptors of the centered mutation residue (denoted as V8), including the solvent accessibility calculated by NACCESS, conservation score, SSpro (i.e. binary classification of relative solvent accessibility as exposed or buried), exposure_HSEBD and exposure_RD. However, other features that describe the neighboring residues of the mutation position were also included in the final feature set. These include network_status_V1, network_status_V7 and network_status_V9, where V1, V7 and V9 denote neighboring residues at positions surrounding the centered mutation residue V8. These indicate that descriptors of neighboring residues of the variants also play an important role in discriminating disease-associated from neutral SAVs.

Our two-step feature selection is similar to that of Ebina et al. [58]. The major difference is that they used SVM in the first step and RF in the second step to build their classifiers, whereas we used RF consistently in both steps. Another difference is that they removed or added individual features by dividing the OFCs into two subsets and examining the resulting performance of the classifiers, while we performed a less time-consuming backward feature selection from the whole set of OFCs in the second step. Generally speaking, this two-step feature selection has two attractive advantages: (1) It provides a realistic way of selecting an optimal subset of features with an acceptable computational burden [58] compared with other computationally intensive feature selection methods. The latter often rely on trial and error experiments to select the most relevant features from a relatively small set of arbitrarily selected features; (2) Although the stepwise feature selection does not perform an exhaustive search and may overlook certain effective combinations of candidate features, it manages to evaluate a sufficient number of feature combinations and results in one of the best combinations.

We also assessed the prediction performance by combining different feature groups that correspond to different ranges of MDGI Z-Scores. The resulting performances are given in Table S3. Our stepwise feature selection was performed using a set of 65 features with Z-Score >1.0. After feature selection, the RF classifier based on the final optimal feature set attained an MCC value of 0.510, up from the initial value of 0.458. Meanwhile, the number of selected features decreased from 65 to 15. The results indicate that stepwise feature selection is effective at identifying the more important and informative features. By removing redundant and less informative features with feature selection based on the Z-Scores, we can efficiently improve the performance of the RF-based classifiers.

Feature Importance and Contribution

In this section, we elaborate on the 15 finally selected optimal features. We compared the MDGI Z-Scores of these features and performed the unpaired two-sample t-test (Table S2). The t-test is a statistical test of whether the mean values of a given feature in the two groups (i.e. disease-associated and neutral SAVs) are equal, and thus evaluates the potential of such a feature to discriminate between the two sample sets. The results are illustrated in Figure 3. It can be seen that for most of the selected features, the mean values of the disease-associated and neutral SAVs are significantly different, with a P-value <<0.01. The only exception is the MW_change feature, which has a P-value of 0.0289.
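For readers who want to reproduce this kind of comparison, the sketch below computes the test statistic and degrees of freedom of an unpaired two-sample t-test in its unequal-variance (Welch) form. Whether the study used the Welch or the pooled-variance form is not stated, so this choice is an assumption; converting the statistic to a P-value additionally needs the t-distribution, which is left out here.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for two unpaired samples."""
    va = variance(sample_a) / len(sample_a)  # squared standard errors
    vb = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Toy feature values for two groups of SAVs.
disease = [1.0, 2.0, 3.0, 4.0, 5.0]
neutral = [2.0, 3.0, 4.0, 5.0, 6.0]
print(welch_t(disease, neutral))
```

A large absolute t value relative to the degrees of freedom corresponds to a small P-value, i.e. a feature whose mean differs clearly between the two groups.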


Figure 3. Comparison of the mean values and standard deviations of the 15 optimal features of disease-associated and neutral SAVs.

"*" represents a P-value in the range of 0.01∼0.05, "**" represents a P-value in the range of 2.2e-16∼0.01, while "***" represents a P-value <2.2e-16, respectively. See Table 2 for more details about feature abbreviations.

https://doi.org/10.1371/journal.pone.0043847.g003

Previous studies have found solvent accessibilities to be powerful features for improving the performance [34]. In this study, we confirm that the most important and contributing features are related to solvent accessibility, including the solvent accessibility feature calculated by NACCESS, DSSP_ACC, solvent exposure features, and the SSpro score, which is a descriptor of binary burial status. We note that most features with higher Z-Scores belong to the solvent accessibility feature group (Fig. 2). Among them, the SSpro score is the most important feature in terms of its contribution to the performance improvement (see Table S4 and Fig. 4). If the SSpro feature was removed from the final feature set of 15 optimal features, the MCC of the resulting classifier would decrease from 0.510 to 0.474. Moreover, the classifier that was trained using only the SSpro feature achieved an MCC of 0.337, which is the highest value among all the individual classifiers trained on single optimal features (Table S4). This observation is consistent with previous studies that suggest disease-associated SAVs are more frequently observed in buried sites [59]. We find that for disease-associated SAVs this feature is significantly different from that of neutral SAVs (P-value <2.2e-16). Although SSpro primarily predicts solvent accessibility from sequence information, its prediction performance has also benefited from the incorporation of high-quality structural templates [30]. Therefore, the prediction of SSpro essentially relies on an effective combination of complementary sequence and structural information. This is particularly advantageous and has an important implication for improving the training quality of machine learning predictors to learn the complex sequence-structure-function relationship of proteins. Thus, inclusion of this feature in the classifier is useful for improving the performance.

Further analysis of the solvent accessibility features calculated by NACCESS revealed a different trend for disease-associated and neutral SAVs. Neutral SAVs have higher NACCESS scores on average than disease-associated SAVs (Fig. 3). This means that disease-associated variants are more likely to occur at positions with lower solvent accessibility compared with neutral variants, that is, they tend to be relatively deeply buried in the structure. We calculated the solvent accessibilities of the total (all atoms), non-polar side chain, polar side chain, total side chain and main chain using NACCESS [50]. After stepwise feature selection, only the solvent accessibility feature of the non-polar side chain was retained in the final feature set, which was calculated based on all non-oxygens and non-nitrogens in the side chain. The solvent accessibility of all non-oxygens and non-nitrogens in the side chain is more important than other solvent accessibility features. A possible explanation is that the atoms (the oxygen or nitrogen) in the side chain play an important part in forming interactions with other residues of the protein and the water molecules. These interactions among the side-chain atoms, other residues and solvent molecules are often critical for the functionality of the protein.


Figure 4. Effect of the removal or inclusion of the 15 individual optimal features on the prediction performance of the first-stage FunSAV classifier.

Performance was evaluated using MCC. A: Performance of the classifier trained using the individual feature; B: MCC decrease of the trained classifier upon removal of the corresponding feature. See Table 2 for more details about feature abbreviations.

https://doi.org/10.1371/journal.pone.0043847.g004

We noticed several solvent exposure features that were not employed in previous studies but were found to be useful for the prediction. These include four solvent exposure features selected in the final optimal feature set: HSEBD, RD, HSEBU and CN. These features have distinctive distribution tendencies between disease-associated and neutral SAVs, i.e. higher for disease-associated SAVs and lower for neutral SAVs (Fig. 3). For example, disease-associated SAVs have relatively higher RD values, which means that they are more likely to appear in the inner layer of the protein. CN is a feature that counts the number of Cα atoms within a sphere around the Cα atom of the centered residue, which has been shown to be correlated with the change in protein stability (measured by the free energy of unfolding) [51]. HSEBU and HSEBD were calculated by dividing the sphere into two half spheres and then counting the numbers of neighboring residues in each half sphere. Although these four features are inter-correlated, they complement each other well and thus collectively contribute to the performance improvement.

Another important feature that was not found useful in previous studies is co-evolution. It refers to a phenomenon induced by the need to maintain the structure and/or function of a protein during its evolution. We find that neutral SAVs have relatively higher co-evolution values than disease-associated SAVs. This suggests that neutral variants are more likely to be involved in co-evolution, while disease-associated SAVs are more conserved in the evolution process. The MCC of the classifier trained using the co-evolution feature is 0.167, and the MCC decrease of the resultant classifier after removal of this feature is 0.016 (Table S4), which is a moderate decrease compared to the other 14 individual classifiers. This shows that co-evolution is also a relatively important feature for distinguishing disease-associated SAVs from neutral SAVs.

The B-factor of protein crystal structures is a feature that tends to be overlooked in the functional effect prediction of SAVs. In this work, the B-factor was selected in the final subset of 15 features from the initial 1804 features. It reflects the fluctuation of atoms about their average positions and contains important information about protein dynamics [60]. It can be seen from Figure 3 that the B-factor of the V7 position is significantly different between disease-associated and neutral SAVs. The neutral SAVs have higher B-factor values than the disease-associated SAVs, suggesting that the V7 position fluctuates more in neutral SAVs than in disease-associated SAVs. There is one possible explanation for this. Since proteins are composed of consecutive polypeptide backbones and the V7 position is very close to the V8 position where the mutation actually takes place, the fluctuation of the V7 position also reflects the fluctuation of the V8 position, and hence this feature was selected as one of the important final features. However, the reason why the position is V7 rather than V9 is not clear to us. It may be because the Cα atom of the V8 residue is closer to the V7 residue than to the V9 residue, as the lengths of the Cα-N and Cα-C bonds are 0.145 and 0.152 nm, respectively. As a result, the V7 position has a greater influence on the variant than the V9 position.

Our study also revealed the significance of residue-contact network features for predicting the functional effect of SAVs. A number of such features have previously been used to predict disease-associated SAVs, such as degree, clustering coefficient, betweenness and closeness [47]. Here, we included and examined more residue-contact network features. After feature selection, four such features, i.e. Status.V1, Status.V7, Status.V9 and Closeness.V7, were selected in the final set of 15 optimal features. They belong to two generic categories of network properties: closeness and status. Both describe the geodesic distances between the vertex of interest and all other vertices within the residue-contact graph of a protein chain.

Figure 6. Prediction examples of the functional effect of SAVs in two proteins by FunSAV.

(A) and (B) the all-atom; (C) and (D) surface; (E) and (F) network representations of proteins hATR (PDB ID: 2IDX, chain A) and PAF-AH (PDB ID: 3D59, chain A), respectively. Red color denotes disease-associated variants, while green color represents neutral variants. 3D structures were rendered using PyMOL [71] and network graphs were drawn using Cytoscape [72].

https://doi.org/10.1371/journal.pone.0043847.g006

More specifically, closeness is a centrality measure of a vertex that describes how central a residue is within the entire protein structure [47]; highly central residues have higher closeness values [61]. Such residues interact with a large number of other residues. Previous studies show that closeness can be effectively used to identify functionally important residues [61], [62], and that disease-associated SAVs can be identified by a higher closeness measure [47]. Our results are in good agreement with these studies. In addition, we find status to be a useful feature for the prediction, which was not previously recognized. It represents the sum of all geodesic distances between the residue of interest and all other residues in the residue-contact graph. From Figure 3, we can see that the neighboring residues of disease-associated SAVs, including V1, V7 and V9, have on average lower status values than those of neutral SAVs. The relationship between closeness and status can be expressed as $\mathrm{Closeness}(i) = (N-1)/\mathrm{Status}(i)$, where N is the number of residues within the residue-contact graph. As disease-associated SAVs can be identified by a higher closeness measure, this means that they have lower status values than neutral SAVs. From the definition of status ($\mathrm{Status}(i) = \sum_{j \neq i} d(i,j)$, where $d(i,j)$ is the geodesic distance between residues i and j), we can see that the status value is determined by two important factors: the distance between residues i and j, and the number of residues (i.e. N) in the residue-contact graph. In other words, neighboring residues of disease-associated SAVs have lower status because either (i) the distance between the contacted residues is shorter than that for the neighbors of neutral SAVs, or (ii) the neighboring residues of disease-associated SAVs are located on the periphery of the structure and accordingly have a smaller N and thus a smaller status.
Altogether, the closeness and status features of the residues neighboring the mutant residue (such as the V7, V9 and V1 positions) were selected as important residue-contact network features in the final feature set. The reason why only the network properties of neighboring residues, rather than those of the mutant residue itself, were selected might be that these features reflect the interactions between the neighboring residues surrounding the centered residue, and can thus provide sufficient information about the critical local microenvironment of the mutant residue to improve the performance of the RF classifiers.
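
The two network measures discussed above can be illustrated with a short breadth-first search over a residue-contact graph. This is a minimal sketch, not the code used in the study: it assumes an unweighted, connected graph given as an adjacency dictionary, takes status as the sum of geodesic distances from one residue to all others, and assumes the common normalization of closeness as (n - 1) / status, where n is the number of vertices. The function names are hypothetical.

```python
from collections import deque


def status(graph, source):
    """Sum of geodesic (shortest-path) distances from source to all
    reachable vertices of an unweighted graph {vertex: [neighbors]}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1  # first visit gives the shortest path
                queue.append(v)
    return sum(dist.values())


def closeness(graph, source):
    """Assumed normalization: (n - 1) / status, with n the vertex count."""
    s = status(graph, source)
    return (len(graph) - 1) / s if s else 0.0
```

For a four-residue chain `{0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}`, the terminal residue 0 has a larger status than the interior residue 1 and therefore a lower closeness, which is the inverse relationship the text relies on.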

Figure 7. Prediction example of a false negative of the functional effect of SAVs by FunSAV for the Noggin protein.

(A) The all-atom; (B) surface; (C) network representations of the Noggin protein. Red color denotes the disease-associated variant. 3D structures were rendered using PyMOL [71] and network graphs were drawn using Cytoscape [72].

https://doi.org/10.1371/journal.pone.0043847.g007

Prediction Performance of FunSAV Classifiers

In this study, we chose to employ RF instead of SVM as the classifier because RF has been shown to outperform SVM in the prediction of the functional impact of SAVs [63], [64], and because RF classifiers do not involve a time-consuming parameter optimization process and are thus much faster to train than SVMs. The 15 optimal features were used to build the first-stage FunSAV classifier, which produced a probability score of a SAV being disease-associated or not. This score was then combined with the prediction scores from six other popular tools (SNAP, SIFT, PolyPhen2, nsSNPAnalyzer, PANTHER and PhD-SNP) and used as the input to train the two-stage classifier. As some variants could not be predicted by PANTHER or nsSNPAnalyzer, we performed 5-fold cross-validation tests and evaluated the performance of each classifier using the subset of the benchmark dataset for which PANTHER and nsSNPAnalyzer generated valid predictions (see Table 3 and Fig. 5A). The first-stage classifier achieved the highest MCC (0.535) among the individual classifiers and the second highest AUC (0.824), which is lower only than that of PolyPhen2 and better than those of the other five tools. Both PolyPhen2 and our method use structural features. These results indicate that, when a structure is available, incorporating structural features is critical for improving the prediction of the functional impact of SAVs.
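
The way the second-stage input is assembled can be sketched as follows. This is an illustrative outline only: the data layout (dictionaries keyed by a variant identifier) and the function name `second_stage_inputs` are assumptions made here, and the actual training of the random forest on these rows is left out. The sketch shows the two points made in the text: the first-stage probability is concatenated with the six tool scores, and variants for which any tool (for example PANTHER or nsSNPAnalyzer) returned no prediction are excluded.

```python
# Tool order is an arbitrary choice for this sketch.
TOOLS = ("SNAP", "SIFT", "PolyPhen2", "nsSNPAnalyzer", "PANTHER", "PhD-SNP")


def second_stage_inputs(first_stage_prob, tool_scores):
    """Build second-stage feature rows.

    first_stage_prob : {variant_id: probability from the first-stage RF}
    tool_scores      : {variant_id: {tool_name: score or None}}
    Returns {variant_id: [first-stage probability, six tool scores]},
    keeping only variants with a valid score from every tool.
    """
    rows = {}
    for variant, prob in first_stage_prob.items():
        scores = tool_scores.get(variant, {})
        values = [scores.get(tool) for tool in TOOLS]
        if any(value is None for value in values):
            continue  # at least one tool gave no prediction for this variant
        rows[variant] = [prob] + values
    return rows
```

Each returned row has seven entries, which would then serve as the feature vector of one variant for the second-stage classifier.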

We built an integrated classifier by combining the prediction scores of the six tools (PolyPhen2+SIFT+SNAP+nsSNPAnalyzer+PANTHER+PhD-SNP) and achieved an MCC of 0.540 and an AUC of 0.852. Moreover, we combined the prediction output of the first-stage FunSAV classifier with the prediction scores from SIFT, SNAP, PolyPhen2, nsSNPAnalyzer, PANTHER and PhD-SNP to build a two-stage FunSAV classifier. As a result, the prediction performance was significantly improved relative to the first-stage classifier, with the MCC increasing from 0.535 to 0.598 and the AUC from 0.824 to 0.882. Although SNAP, SIFT and PolyPhen2 achieved higher sensitivity (93.2, 80.6 and 87.9%, respectively), they had lower specificity (44.1, 66.5 and 61.8%, respectively). In contrast, the first-stage FunSAV classifier achieved a balanced sensitivity and specificity (77.2 and 76.3%, respectively), while the two-stage FunSAV classifier achieved a sensitivity of 79.7% and a specificity of 80.1%. We also evaluated the prediction performance of the first-stage and two-stage FunSAV classifiers on an independent test dataset. The results are given in Table S5 and Figure 5B. The final FunSAV classifier, built by combining the first-stage classifier with the scores of all six other tools (SIFT, SNAP, PolyPhen2, nsSNPAnalyzer, PANTHER and PhD-SNP), is more robust than the other classifiers and outperformed the first-stage FunSAV classifier and the six individual tools on the independent test dataset: the AUC increased from 0.793 to 0.872 and the MCC from 0.482 to 0.606, both of which are the overall best performance.
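
For readers less familiar with the measures quoted above, the following sketch shows how sensitivity, specificity, MCC and AUC can be computed. It uses the standard textbook definitions with disease-associated variants as the positive class; the function names and the pairwise (rank-based) AUC formulation are choices made for this illustration and are not taken from the FunSAV source code.

```python
import math


def sensitivity(tp, fn):
    """Fraction of disease-associated variants predicted correctly."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of neutral variants predicted correctly."""
    return tn / (tn + fp)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(pos_scores, neg_scores):
    """Area under the ROC curve as the probability that a random positive
    scores higher than a random negative (ties count as one half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

A predictor can reach high sensitivity at the cost of low specificity simply by calling most variants disease-associated; MCC penalizes that imbalance, which is why the text reports it alongside the AUC.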

Case Study

To further illustrate the effectiveness of FunSAV in distinguishing disease-associated from neutral variants, we present in this section a case study of three proteins. The first two proteins contain both disease-associated and neutral variants and are not present in the benchmark dataset used for building the FunSAV classifiers. FunSAV correctly identified the functional effect of all the variants in these two proteins. The third protein is provided as an illustration of the false negatives generated by FunSAV.

The first example is human ATP:cobalamin adenosyltransferase (hATR) [65]. This enzyme catalyzes the last step in the conversion of vitamin B12 to the human cofactor adenosylcobalamin. Mutations in hATR result in the metabolic disorder methylmalonic aciduria (MMA). The variant M239K (dbSNP: rs9593) is a neutral substitution, while the variants E193K and I96T result in MMA, an inborn error of metabolism due to the impaired isomerization of L-methylmalonyl-CoA to succinyl-CoA during the oxidation of propionate towards the TCA cycle [66]. From Figure 6A, we can see that M239 is relatively exposed at the surface of the protein, while E193K and I96T are relatively buried in the structure. Table S6 also indicates that the disease-associated variants are located in buried areas. The neutral variant M239 has fewer interacting residues, while the disease-associated variants E193K and I96T have more densely connected edges with neighboring residues in the residue-contact network (Fig. 6E). Hence, mutations at these positions tend to disrupt the local residue-contact network and are thus more likely to cause disease.

The second example is human plasma platelet-activating factor (PAF) acetylhydrolase (PAF-AH) [67]. It reduces PAF levels by functioning as a general anti-inflammatory scavenger and is linked to anaphylactic shock, asthma, and allergic reactions. The variants V279F (dbSNP: rs16874954) and Q281R result in a loss of plasma PAF-AH activity, which is observed in 4% of the Japanese population. The polymorphic site R92H is a neutral variant, which upon mutation is more likely to exhibit phenotypic differences through interactions with lipoproteins or other binding partners [67]. Similar to the neutral variant in the above example, R92H is solvent-exposed and located in the outer layer of the structure (Fig. 6B), with fewer interactions with other residues than the two disease-associated variants V279F and Q281R. The latter two variants are deeply buried in the inner layer of the structure and accordingly form highly connected residue-interacting networks. These case studies suggest that FunSAV is an effective tool for identifying the functional impact of SAVs.

The third example is Noggin, whose primary physiological role is to antagonize the action of bone morphogenetic proteins (BMPs) [68]. Noggin can bind to BMP and inhibit BMP signaling by blocking the molecular interfaces of the binding epitopes. The residue R204 of Noggin can form an ion pair with E48 of BMP. The variant R204L causes tarsal/carpal coalition syndrome (TCC) because it disrupts the ion pair with E48 of BMP. From Figure 7, it can be seen that the variant R204L is located at the surface of Noggin, and the values of some important features in the final feature set, for example the exposure features, are closer to the mean of neutral SAVs (listed in Table S2) than to that of disease-associated SAVs (see Table S6 for more detail). In such cases, it is more difficult for FunSAV to correctly predict the functional effect, whereas other software such as SIFT, SNAP, PolyPhen2, nsSNPAnalyzer, PANTHER and PhD-SNP could correctly predict the functional impact of this variant. Therefore, including more relevant features that describe the interactions of the protein of interest with its interaction partners may prove to be an effective way to further improve the performance of FunSAV.

Conclusions

We developed FunSAV, a new bioinformatics tool based on the random forest algorithm, to predict the functional effect of SAVs. Extensive 5-fold cross-validation and independent tests demonstrate that FunSAV achieves better performance than six other competitive tools. The performance improvement of FunSAV can be attributed to the combination of four critical factors: (i) use of a high-quality balanced structural dataset; (ii) training of the classifier on a large feature set with a variety of important and complementary features, including sequence, structure, network and other types of features that describe the local environments proximal to the centered variant and neighboring residues; (iii) efficient feature selection to remove noisy and redundant features and prevent overfitting; and (iv) training of robust two-stage RF classifiers in combination with scores from other tools. We show that it is especially useful to build better classifiers through efficient feature selection from a large initial set of various features and through integration with scores from other tools. To make an accurate prediction, FunSAV requires the 3D structure of the protein in which the SAVs are located, which may limit its broader application. However, with the increasing availability of target structures solved by structural genomics initiatives, genome-wide protein 3D modeling projects [69] and predicted 3D structures [70], it is expected that FunSAV can be used as a powerful tool to prioritize disease-associated variants and assist the functional annotation of these targets.

Supporting Information

Figure S1.

The feature selection curve in stepwise feature selection describes the performance change (in terms of MCC) upon gradual inclusion of individual features into the trained classifiers. MCC_FSS (feature selection stepwise, FSS) indicates the MCC change in this stepwise feature selection process.

https://doi.org/10.1371/journal.pone.0043847.s001

(TIF)

Table S1.

All initial 1804 features used in this study. "OFC" indicates that the feature was selected as one of the 65 optimal feature candidates (OFCs), while "FINAL" indicates that the feature was selected as one of the 15 final optimal features.

https://doi.org/10.1371/journal.pone.0043847.s002

(DOC)

Table S2.

The mean values and standard deviations of the 15 final selected optimal features for the disease-associated and neutral SAVs. Mean: mean value; SD: standard deviation. P-value was calculated using the unpaired two-sample t-test.

https://doi.org/10.1371/journal.pone.0043847.s003

(DOC)

Author Contributions

Conceived and designed the experiments: JS. Performed the experiments: MW. Analyzed the data: MW XMZ JS. Contributed reagents/materials/analysis tools: MW XMZ KT HX YL TA. Wrote the paper: MW JS.

References

  1. Durbin RM, Altshuler D, Abecasis GR, Bentley DR (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073.
  2. Cline MS, Karchin R (2011) Using bioinformatics to predict the functional impact of SNVs. Bioinformatics 27: 441–448.
  3. Schuster SC, Miller W, Ratan A, Tomsho LP, Giardine B, et al. (2010) Complete Khoisan and Bantu genomes from southern Africa. Nature 463: 943–947.
  4. Ahn SM, Kim TH, Lee S, Kim D, Ghang H, et al. (2009) The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Res 19: 1622–1629.
  5. Wang J, Wang W, Li R, Li Y, Tian G, et al. (2008) The diploid genome sequence of an Asian individual. Nature 456: 60–65.
  6. Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, et al. (2004) The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Human Mutation 23: 464–470.
  7. Botstein D, Risch N (2003) Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet 33 Suppl: 228–237.
  8. Fernald GH, Capriotti E, Daneshjou R, Karczewski KJ, Altman RB (2011) Bioinformatics challenges for personalized medicine. Bioinformatics 27: 1741–1748.
  9. Wang M, Shen H-B, Akutsu T, Song J (2011) Predicting functional impact of single amino acid polymorphisms by integrating sequence and structural features. The 5th IEEE International Conference on Systems Biology (ISB 2011). Zhuhai, China: IEEE eXpress Conference Publishing. 18–26.
  10. Ng PC, Henikoff S (2001) Predicting deleterious amino acid substitutions. Genome Res 11: 863–874.
  11. Bork P, Sunyaev S, Hanke J, Aydin A, Wirkner U, et al. (1999) Prediction of nonsynonymous single nucleotide polymorphisms in human disease-associated genes. J Mol Med 77: 754–760.
  12. Cargill M, Altshuler D, Ireland J, Sklar P, Ardlie K, et al. (1999) Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat Genet 22: 231–238.
  13. Chasman D, Adams RM (2001) Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J Mol Biol 307: 683–706.
  14. Wang Z, Moult J (2001) SNPs, protein structure, and disease. Human Mutation 17: 263–270.
  15. Bromberg Y, Yachdav G, Rost B (2008) SNAP predicts effect of mutations on protein function. Bioinformatics 24: 2397–2398.
  16. Wainreb G, Ashkenazy H, Bromberg Y, Starovolsky-Shitrit A, Haliloglu T, et al. (2010) MuD: an interactive web server for the prediction of non-neutral substitutions using protein structural data. Nucleic Acids Res 38: 7869–7869.
  17. Gao S, Zhang N, Duan GY, Yang Z, Ruan JS, et al. (2009) Prediction of function changes associated with single-point protein mutations using support vector machines (SVMs). Hum Mutat 30: 1161–1166.
  18. Ng PC, Henikoff S (2003) SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res 31: 3812–3814.
  19. Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, et al. (2003) PANTHER: a library of protein families and subfamilies indexed by function. Genome Res 13: 2129–2141.
  20. Capriotti E, Calabrese R, Casadio R (2006) Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information. Bioinformatics 22: 2729–2734.
  21. Gong S, Blundell TL (2010) Structural and functional restraints on the occurrence of single amino acid variations in human proteins. PLoS One 5: e9186.
  22. Bairoch A, Apweiler R, Wu C, Barker W, Boeckmann B, et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res 33: D154–159.
  23. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 33: D514–D517.
  24. Flicek P, Amode MR, Barrell D, Beal K, Brent S, et al. (2011) Ensembl 2011. Nucleic Acids Res 39: D800–D806.
  25. The International HapMap Consortium (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851–861.
  26. Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22: 1658–1659.
  27. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, et al. (2000) The Protein Data Bank. Nucleic Acids Res 28: 235–242.
  28. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.
  29. Jones DT (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292: 195–202.
  30. Cheng J, Randall AZ, Sweredoski MJ, Baldi P (2005) SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res 33: W72–W76.
  31. Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT (2004) Prediction and Functional Analysis of Native Disorder in Proteins from the Three Kingdoms of Life. J Mol Biol 337: 635–645.
  32. Sunyaev SR, Eisenhaber F, Rodchenkov IV, Eisenhaber B, Tumanyan VG, et al. (1999) PSIC: profile extraction from sequence alignments with position-specific counts of independent observations. Protein Eng 12: 387–394.
  33. Fernandez-Escamilla A-M, Rousseau F, Schymkowitz J, Serrano L (2004) Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nat Biotech 22: 1302–1306.
  34. Ye ZQ, Zhao S-Q, Gao M, Liu X-Q, Langlois RE, et al. (2007) Finding new structural and sequence attributes to predict possible disease association of single amino acid polymorphism (SAP). Bioinformatics 23: 1444–1450.
  35. Song J, Tan H, Takemoto K, Akutsu T (2008) HSEpred: predict half-sphere exposure from protein sequences. Bioinformatics 24: 1489–1497.
  36. Song J, Tan H, Shen H, Mahmood K, Boyd SE, et al. (2010) Cascleave: towards more accurate prediction of caspase substrate cleavage sites. Bioinformatics 26: 752–760.
  37. Song J, Yuan Z, Tan H, Huber T, Burrage K (2007) Predicting disulfide connectivity from protein sequence using multiple sequence feature vectors and secondary structure. Bioinformatics 23: 3147–3154.
  38. Zhao XM, Cheung YM, Huang DS (2005) A novel approach to extracting features from motif content and protein composition for protein sequence classification. Neural Netw 18: 1019–1028.
  39. Kabsch W, Sander C (1983) Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22: 2577–2637.
  40. McDonald IK, Thornton JM (1994) Satisfying hydrogen bonding potential in proteins. J Mol Biol 238: 777–793.
  41. Miller MP, Kumar S (2001) Understanding human disease mutations through the use of interspecific genetic variation. Hum Mol Genet 10: 2319–2328.
  42. Kowarsch A, Fuchs A, Frishman D, Pagel P (2010) Correlated mutations: a hallmark of phenotypic amino acid substitutions. PLoS Comput Biol 6: e1000923.
  43. Korber BT, Farber RM, Wolpert DH, Lapedes AS (1993) Covariation of mutations in the V3 loop of human immunodeficiency virus type 1 envelope protein: an information theoretic analysis. Proc Natl Acad Sci USA 90: 7176–7180.
  44. Martin LC, Gloor GB, Dunn SD, Wahl LM (2005) Using information theory to search for co-evolving residues in proteins. Bioinformatics 21: 4116–4124.
  45. Dunn SD, Wahl LM, Gloor GB (2008) Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics 24: 333–340.
  46. Larson SM, Di Nardo AA, Davidson AR (2000) Analysis of covariation in an SH3 domain sequence alignment: applications in tertiary contact prediction and the design of compensating hydrophobic core substitutions. J Mol Biol 303: 433–446.
  47. Li YZ, Wen ZN, Xiao JM, Yin H, Yu LZ, et al. (2011) Predicting disease-associated substitution of a single amino acid by analyzing residue interactions. BMC Bioinformatics 12: 14.
  48. Dobson R, Munroe P, Caulfield M, Saqi M (2006) Predicting deleterious nsSNPs: an analysis of sequence and structural attributes. BMC Bioinformatics 7: 217.
  49. Saunders CT, Baker D (2002) Evaluation of Structural and Evolutionary Contributions to Deleterious Mutation Prediction. J Mol Biol 322: 891–901.
  50. Hubbard SJ, Thornton JM (1993) 'NACCESS'. Computer Program: Department of Biochemistry and Molecular Biology, University College, London.
  51. Hamelryck T (2005) An amino acid has two sides: A new 2D measure provides a different view of solvent exposure. Proteins 59: 38–48.
  52. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, et al. (2010) A method and server for predicting damaging missense mutations. Nat Meth 7: 248–249.
  53. Bao L, Zhou M, Cui Y (2005) nsSNPAnalyzer: identifying disease-associated nonsynonymous single nucleotide polymorphisms. Nucleic Acids Res 33: W480–W482.
  54. Matthews BW (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta-Protein Structure 405: 442–451.
  55. Liaw A, Wiener M (2002) Classification and Regression by randomForest. R News 2: 18–22.
  56. Sadras V, Bongiovanni R (2004) Use of Lorenz curves and Gini coefficients to assess yield inequality within paddocks. Field Crops Research 90: 303–310.
  57. Breiman L (2001) Random forests. Mach Learn 45: 5–32.
  58. Ebina T, Toh H, Kuroda Y (2011) DROP: an SVM domain linker predictor trained with optimal features selected by random forest. Bioinformatics 27: 487–494.
  59. Stitziel NO, Tseng YY, Pervouchine D, Goddeau D, Kasif S, et al. (2003) Structural location of disease-associated single-nucleotide polymorphisms. J Mol Biol 327: 1021–1030.
  60. Yuan Z, Bailey TL, Teasdale RD (2005) Prediction of protein B-factor profiles. Proteins 58: 905–912.
  61. Amitai G, Shemesh A, Sitbon E, Shklar M, Netanely D, et al. (2004) Network analysis of protein structures identifies functional residues. J Mol Biol 344: 1135–1146.
  62. Wang XF, Chen Z, Wang C, Yan RX, Zhang Z, et al. (2011) Predicting residue-residue contacts and helix-helix interactions in transmembrane proteins using an integrative feature-based random forest approach. PLoS One 6: e26767.
  63. Bao L, Cui Y (2005) Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information. Bioinformatics 21: 2185–2190.
  64. Li B, Krishnan VG, Mort ME, Xin F, Kamati KK, et al. (2009) Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics 25: 2744–2750.
  65. Schubert HL, Hill CP (2006) Structure of ATP-bound human ATP:cobalamin adenosyltransferase. Biochemistry 45: 15188–15196.
  66. Martínez MA, Rincón A, Desviat LR, Merinero B, Ugarte M, et al. (2005) Genetic analysis of three genes causing isolated methylmalonic acidemia: identification of 21 novel allelic variants. Mol Genet Metab 84: 317–325.
  67. Samanta U, Bahnson BJ (2008) Crystal structure of human plasma platelet-activating factor acetylhydrolase: structural implication to lipoprotein binding and catalysis. J Biol Chem 283: 31617–31624.
  68. Groppe J, Greenwald J, Wiater E, Rodriguez-Leon J, Economides AN, et al. (2002) Structural basis of BMP signalling inhibition by the cystine knot protein Noggin. Nature 420: 636–642.
  69. Schwede T, Kopp J, Guex N, Peitsch MC (2003) SWISS-MODEL: An automated protein homology-modeling server. Nucleic Acids Res 31: 3381–3385.
  70. Hardin C, Pogorelov TV, Luthey-Schulten Z (2002) Ab initio protein structure prediction. Curr Opin Struct Biol 12: 176–181.
  71. DeLano WL (2002) The PyMOL Molecular Graphics System. DeLano Scientific LLC San Carlos, CA.
  72. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, et al. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13: 2498–2504.
  73. Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, et al. (2012) The Pfam protein families database. Nucleic Acids Res 40: D290–301.

Source: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0043847
