1Center for Computational Medicine and Bioinformatics, University of Michigan, 2Center for Bioinformatics and Department of Molecular Bioscience, University of Kansas
Roy, A., Xu, D., Poisson, J., Zhang, Y. A Protocol for Computer-Based Protein Structure and Function Prediction. J. Vis. Exp. (57), e3259, doi:10.3791/3259 (2011).
Genome sequencing projects have ciphered millions of protein sequence, which require knowledge of their structure and function to improve the understanding of their biological role. Although experimental methods can provide detailed information for a small fraction of these proteins, computational modeling is needed for the majority of protein molecules which are experimentally uncharacterized. The I-TASSER server is an on-line workbench for high-resolution modeling of protein structure and function. Given a protein sequence, a typical output from the I-TASSER server includes secondary structure prediction, predicted solvent accessibility of each residue, homologous template proteins detected by threading and structure alignments, up to five full-length tertiary structural models, and structure-based functional annotations for enzyme classification, Gene Ontology terms and protein-ligand binding sites. All the predictions are tagged with a confidence score which tells how accurate the predictions are without knowing the experimental data. To facilitate the special requests of end users, the server provides channels to accept user-specified inter-residue distance and contact maps to interactively change the I-TASSER modeling; it also allows users to specify any proteins as template, or to exclude any template proteins during the structure assembly simulations. The structural information could be collected by the users based on experimental evidences or biological insights with the purpose of improving the quality of I-TASSER predictions. The server was evaluated as the best programs for protein structure and function predictions in the recent community-wide CASP experiments. There are currently >20,000 registered scientists from over 100 countries who are using the on-line I-TASSER server.
Following the sequence-to-structure-to-function paradigm, the I-TASSER procedure1-4 for structure and function modeling involves four consecutive steps of: (a) template identification by LOMETS5; (b) fragment structure reassembly by replica-exchange Monte Carlo simulations6; (c) atomic level structure refinement using REMO7 and FG-MD8; and (d) structure-based function interpretations using COFACTOR9.
Template identification: For a query sequence submitted by the user, the sequence is first threaded through a representative PDB structure library by a locally installed LOMETS meta-threading server. Threading is a sequence-structure alignment procedure used for identifying template proteins which may have similar structure or contain similar structural motif as the query protein. To increase the coverage of homologous template detections, LOMETS combines multiple state-of-the-art algorithms covering different threading methodologies. Since different threading programs have different scoring systems and alignment sensitivities, the quality of the generated threading alignments from each threading program is assessed by normalized Z-score, which is defined as:
where Z-score is the score in standard deviation units relative to the statistical mean of all alignments generated by the program; and Z0 is a program-specific Z-score cutoff determined based on large-scale threading benchmark tests5 to differentiate 'good' and 'bad' templates. A template with a high Z-score means that the top templates have an alignment score significantly higher than most of other templates, which usually implies that the alignment corresponds to a good model. If most of the top threading templates have high normalized Z-scores, the accuracy of the final I-TASSER model is usually high. However, if the protein is large and the coverage of the threading alignments is confined to a small region of the query protein, a high normalized Z-score does not necessarily mean a high modeling accuracy for the full-length model. Top two threading alignments from each threading program are collected and used for the next step of structure assembly.
Iterative structure assembly simulation: Following the threading procedure, query sequence is split into threading aligned and unaligned regions. Continuous fragments in threading alignment are excised from templates and used directly for the structure assembly, while the unaligned loop regions are built by ab initio modeling. The structure assembly procedure is performed on a lattice system guided by the replica exchange Monte Carlo simulations6. The I-TASSER force field includes hydrogen-bonding interactions10, knowledge-based statistical energy terms derived from known protein structures in the PDB11, sequence-based contact predictions from SVMSEQ12, and spatial restraints collected from LOMETS5 threading templates. The conformational decoys generated in the low-temperature replicas during the simulations are clustered by SPICKER13 to identify structures of low free-energy states. Cluster centroids of the top clusters are obtained by averaging the 3D coordinates of all clustered structural decoys and used for the final model generation. The simulation and clustering procedure are repeated twice for removing steric clashes and further refining the global topology.
Atomic-level model construction and refinement: The cluster centroids obtained after SPICKER clustering are reduced protein models (each residue represented by its Cα and side-chain center of mass) and have limited biological application. The construction of full-atomic model from the reduced models is done in two steps. In the first step, REMO7 is used to construct full-atomic models from C-alpha traces by optimizing the H-bond networks. In the second step, REMO full-atomic models are further refined by FG-MD14, which improves the backbone torsion angles, bond lengths, and side-chain rotamer orientations, by molecular dynamic simulations, as guided by the structural fragments searched from the PDB structures by TM-align. The FG-MD refined models are used as the final models for tertiary structure predictions by I-TASSER.
The quality of the generated models are estimated based on a confidence score (C-score), which is defined based on the Z-score of LOMETS threading alignments and the convergence of I-TASSER simulations, mathematically formulated as:
where M is the multiplicity of structure decoys in the structural clusters identified by the SPICKER13; Mtot is the total number of decoys submitted to the clustering;
The C-score has a strong correlation with the quality of the I-TASSER models. By combining C-score and protein length, the accuracy of the first I-TASSER models can be estimated with an average error of 0.08 for the TM-score and 2 Å for the RMSD15. In general, models with C-score > - 1.5 are expected to have a correct fold. Here, RMSD and TM-score are both well known measures of topological similarity between the model and native structure. TM-score values range in [0, 1], where a higher score indicates a better structure match16,17. However for lower-ranked models (i.e. 2nd-5th models), the correlation of C-score with TM-score and RMSD is much weaker (~ 0.5), and cannot be used for reliable estimation of absolute model quality.
Is first model always the best model in I-TASSER simulations? The answer to this question depends on the target type. For easy targets, the first model is usually the best model and its C-score is usually much higher than the rest of the models. However, for hard targets, where threading does not have significant template hits, the first model is not necessarily the best model and I-TASSER actually has difficulty in selecting the best template and models. It is therefore recommended to analyze all the 5 models for hard targets and select them based on the experimental information and biological knowledge.
Function predictions: In the last step, final 3D-models generated from FG-MD are used to predict three aspects of protein function, namely: a) Enzyme Commission (EC) numbers18 and (b) Gene Ontology (GO)19 terms and (c) binding sites for small molecule ligands. For all the three aspects, functional interpretations are generated using COFACTOR, which is a new approach to predict protein's function based on global and local similarity to template proteins in PDB with known structure and functions. First, the global topology of the predicted models is matched against functional template libraries using structural alignment program TM-align20. Next, a set of proteins most similar to the target models are selected from the library based on their global structure similarity, and an extensive local search is performed to identify structure and sequence similarity near the active/binding site region. The resultant global and local similarity scores are used to rank the template proteins (functional homologues) and transfer the annotation (EC numbers and Gene Ontology19 terms) based on the top scoring hits. Similarly, ligand binding site residues and the ligand binding mode are inferred based on the local alignment of query with known ligand binding site residues in the top scoring function templates9.
The quality of function (EC and GO term) prediction in I-TASSER is evaluated based on functional homology score (Fh-score) which is a measure of global and local similarity between the query and template, and is defined as:
where C-score is an estimate of the quality of predicted model as defined in Eq. (2); TM-score measures the global structural similarity between the model and template proteins; RMSDali is the RMSD between the model and the template structure in the structurally aligned region from TM-align20; Cov represents the coverage of the structural alignment (i.e. the ratio of the structurally aligned residues divided by the query length); IDali is the sequence identity in the TM-align alignment. The estimated confidence score for EC number predictions also includes a term for evaluating active site match (AcM) between query and template within a defined local region, calculated as:
where Nt represents the number of template residues present within the local area, Nali is the number of the aligned query-template residue pairs, dii is the Cα distance between ith pair of aligned residues, d0 = 3.0 Å is the distance cutoff, Mii is the BLOSUM scores between ith pair of aligned residues. In general, the Fh-score is in the range [0, 5] and AcM score is between [0, 2], where higher scores indicating more confident functional assignments. AcM score is also used for evaluating the local structure and sequence similarity near the ligand-binding sites, which is referred to as BS-score.
1. Submission of protein sequence
2. Availability of results
3. Secondary structure and solvent accessibility predictions
4. Tertiary structure predictions
5. LOMETS target template alignment
6. Structural analogs in PDB
7. Function prediction using COFACTOR
8. Enzyme Commission number prediction
9. Gene Ontology (GO) term predictions
10. Protein-ligand binding site predictions
11. Representative results
Figure 1. An excerpt of I-TASSER result page showing (A) FASTA formatted query sequence; (B) predicted secondary structure and associated confidence scores; and (C) predicted solvent accessibility of the residues. Analyzed core region and potential hydration site in the query are highlighted in cyan and red rectangles, respectively.
Figure 2. An example of I-TASSER result page showing tertiary structure predictions for the query proteins. The predicted models are displayed in an interactive Jmol applet, allowing the user to change the display of the molecule. The models can also be downloaded by clicking on the "Download" links. The confidence score to estimate the quality of the model is reported as C-score.
Figure 3. An example of I-TASSER result page showing top ten identified threading templates and alignments by LOMETS5 threading programs. The quality of the threading alignments is evaluated based on normalized Z-score (highlighted in green), where a value >1 reflects a confident alignment. Aligned residues in the template that are identical to the corresponding query residues are highlighted in color to indicate presence of conserved residue/motif, while a lack of alignment in most of the top templates indicates presence of multiple domains in the query protein and the unaligned residues correspond to domain linker regions. Click here to view the full-sized version of figure 3.
Figure 4. An example of result page showing top ten identified structural analogs and structural alignments, identified by TM-align20 structural alignment program. The ranking of the analogs shown in is based on the TM-score (highlighted in blue) of the structural alignment. A TM-score >0.5 indicates that the two compared structures have a similar topology, while a TM-score <0.3 means a similarity between two random structures. Structurally aligned residue pairs are highlighted in color based on their amino-acid property, while the unaligned regions are indicated by "-". Click here to view the full-sized version of figure 4.
Figure 5. An example of I-TASSER result page showing identified enzyme homologues of the query protein in the PDB library. The confidence level of EC number prediction is analyzed based on EC-score (highlighted in green), where EC-score > 1.1 indicates functional similarity (same first 3 digits of EC number) between query and template protein.
Figure 6. An example of I-TASSER result page showing GO term predictions for the query protein. Functional homologues for the query protein in the Gene Ontology template library are ranked based on their Fh-score (in orange rectangle). Common functional features from these top-scoring hits are derived to generate the final GO term predictions for the query protein. The quality of the predicted GO terms is estimated based on GO-score (shown in green), where a GO-score >0.5 indicates a reliable prediction. Click here to view the full-sized version of figure 6.
Figure 7. An example of I-TASSER result page showing top ten protein ligand binding site predictions using the COFACTOR9 algorithm. The ranking of the predicted binding sites is based on the number of predicted ligand conformations that share common binding pocket in the query. BS-score (highlighted in red) is a measure of local sequence and structure similarity between the predicted and template's binding site; and is useful for analyzing the conservation of binding site pockets.
Figure 8. An example of external restraint files used to for specifying residue-residue contact/distance restraints.
Figure 9. Example of restraint files used for specifying a template protein to the I-TASSER server. User can specify the query-template alignment either in (A) FASTA format; or (B) 3D format.
Figure 10. An example file used for excluding template during the I-TASSER structure modeling procedure. The first column contains the PDB ID of the template proteins to be excluded. The second column is used to specify the sequence identity cutoff which will be used for other similar templates in the template library.
The protocol presented above is a general guideline for structure and function modeling using the I-TASSER server. Although, this automated procedure works very well for most of the proteins, human interventions often help to significantly improve the modeling accuracy, especially for the proteins which lack close templates in the PDB library. Users can intervene during the I-TASSER modeling in the following ways: (a) splitting of multi-domain proteins; (b) providing external restraints to improve the structure assembly; and (c) removing templates during the modeling.
Splitting multi-domain protein:
Many long protein sequences frequently contain multiple domains tethered by flexible linker regions, which makes their structure elucidation difficult using both experimental and computational techniques. Nevertheless, as domains are independently folding entities and can perform distinct molecular function; it is desirable to split long multi-domain proteins and model each domain separately. Modeling domains individually will not only speed up the prediction process, but also increases the quality of query-template alignment, resulting in more reliable structure and function predictions.
Domain boundaries in protein sequences can be predicted using freely available external online programs such as NCBI CDD24, PFAM25 or InterProScan26. Also, if LOMETS threading alignments are available for the query protein, domain boundaries can be located by visually identifying long stretches of unaligned residues in the top threading templates (see Step 5.4). These unaligned regions mostly correspond to domain linker regions. If multi-domain templates are already available in the template PDB library with all the query domains aligned, then the query protein can be modeled as full length.
Provide external restraints
The structure assembly simulations in I-TASSER are mainly guided by spatial restraints collected from the LOMETS threading templates. For query proteins that have good threading hit (Norm. Z-score > 1) in the template library, derived spatial restraints are mostly of high accuracy and I-TASSER will generate high resolution structural models for these proteins. Contrarily, for query proteins that have weak or no threading hit (Norm. Z-score < 1), collected spatial restraints often contain errors because of the uncertainty of the template and the alignment. For these protein targets, user-specified spatial information can be very helpful to improve the quality of the predicted model. Users can provide external restraints to the I-TASSER server in two ways:
A. Specify contact/distance restraints
Experimentally characterized inter-residue contacts/distances, for example from NMR or cross-linking experiments, can be specified by uploading a restraint file. An example file is shown in Figure 8, where Column 1 specifies the type of restraint, i.e. "DIST" or "CONTACT". For distance restraint (DIST), columns 2 and 4 contain residue positions (i, j), columns 3 and 5 contain the atom-types in the residue and column 6 specifies the distance between the two specified atoms. For contact restraints (CONTACT), columns 2 and 3 contain the positions (i, j) of residues which should be in contact. The distance between the side chains center of these contacting residue pairs is decided based on observed distances in known structures in PDB. I-TASSER will try to draw these atom pairs close to the specified distance during the structure refinement simulations.
B. Specify a protein structure template
LOMETS threading programs use a representative PDB library to find plausible folds for the query protein. Although using a representative structure library helps to reduce the time required to compute the sequence-structure alignments, it is possible that a good template protein is missed in the library or the template may not have been identified by LOMETS threading programs, even though it is present in the library. In these cases, the user should specify the desired protein structure as the template.
To specify protein structure as an additional template, users can either upload a PDB formatted structure file or specify the PDB ID of a deposited protein structure in PDB library. The I-TASSER will generate the query-template alignment using MUSTER program23 and will collect spatial restraints from both the user specified template and LOMETS templates to guide the structure assembly simulation. Because the accuracy of the LOMETS restraints is different for different targets, the weight of the LOMETS restraints is stronger in easy (homologous) targets than that in hard (non-homologous) targets, which have been systematically tuned in our benchmark training.
Users can also specify their own query-template alignments. The server accepts alignment in two formats: the FASTA format (Figure 9A) and the 3D format (Figure 9B). The FASTA format is standard and described at http://zhanglab. ccmb.med.umich.edu/FASTA/. The 3D format is similar to the standard PDB format (http://www.wwpdb.org/documentation/format32/sect9.html), but two additional columns derived from the templates are added to the ATOM records (see Figure 9B):
Columns 1-30: Atom (C-alpha only) and residue names for the query sequence.
Columns 31-54: Coordinates of C-alpha atoms of the query copied from the corresponding atoms in the template.
Columns 55-59: Corresponding residue number in the template based on alignment
Columns 60-64: Corresponding residue name in the template
Exclude templates proteins
Proteins are flexible molecules and can adopt multiple conformational states to change their biological activity. For example, structures of many protein kinases and membrane proteins have been solved in both active and inactive conformation. Also presence or absence of bound ligand can cause large structural movements. While all the conformational states of the template are alike for the threading programs, it is desirable to model the query using templates in only one particular state. A new option on the server allows the user to exclude template proteins during structure modeling. This feature would also allow the user to choose the homology level of templates to be used for the modeling. Users can exclude template proteins from the I-TASSER library by:
A. Specifying a sequence identity cutoff
Users can use this option to exclude homologous proteins from the I-TASSER template library. The homology level is set based on the sequence identity cutoff, i.e. the number of identical residue between the query and the template protein divided by the sequence length of the query sequence. For example, if the user types in "70%" in the provided form, all templates proteins which have a sequence identity >70% to the query protein I-will be excluded from the I-TASSER template library.
B. Exclude specific template proteins
Specific template proteins can be excluded from the I-TASSER template library by uploading a list containing PDB IDs of the structures to be excluded. An example file is shown in Figure 10. As the same protein can exist as multiple entries in the PDB library, I-TASSER server will by default exclude the specified templates (in Column1) as well as all other templates from the library that have an identity >90% to the specified templates. Users can also specify a different identity cutoff, e.g. 70%, where all templates with identity >70% to specified template proteins will be excluded.
No conflicts of interest declared.
The project is supported in part by Alfred P. Sloan Foundation, NSF Career Award (DBI 1027394), and the National Institute of General Medical Sciences (GM083107, GM084222).
|Material Name||Type||Company||Catalogue Number|
|FASTA formatted amino acid sequence of the protein to be modeled (see, http://www.ncbi.nlm.nih.gov/BLAST/fasta.shtml).|
|A personal computer with access to the internet and a web browser.|
|Molecular visualizing software, e.g. RASMOL or PYMOL, for analyzing the predicted tertiary structure and functional sites.|