MSLoc-DT: A new method for predicting the protein subcellular location of multispecies based on decision templates.
Revealing the subcellular location of newly discovered protein sequences can provide insight into their function and guide research at the cellular level. The rapidly increasing number of sequences entering the genome databanks calls for the development of automated analysis methods. Currently, most existing methods for predicting protein subcellular locations cover only one species or a very limited number of species. It is therefore necessary to develop reliable and effective computational approaches that further improve the performance of protein subcellular location prediction and, at the same time, cover more species. The present study reports the development of a novel predictor, MSLoc-DT, which predicts the subcellular locations of human, animal, plant, bacteria, virus, fungi and archaea proteins. It introduces a novel feature extraction approach termed Amino Acid Index Distribution (AAID) and then fuses gene ontology information, sequential evolutionary information and sequence statistical information through four different modes of pseudo amino acid composition (PseAAC) with a decision template rule. Using the jackknife test, MSLoc-DT achieves 86.5%, 98.3%, 90.3%, 98.5%, 95.9%, 98.1% and 99.3% overall accuracy for human, animal, plant, bacteria, virus, fungi and archaea, respectively, on seven stringent benchmark datasets. Compared with other predictors (e.g., Gpos-PLoc, Gneg-PLoc, Virus-PLoc, Plant-PLoc, Plant-mPLoc, ProLoc-Go, Hum-PLoc and GOASVM) on the Gram-positive, Gram-negative, Virus, Plant, Eukaryotic and Human datasets, the new MSLoc-DT predictor is much more effective and robust. Although MSLoc-DT is designed to predict a single location per protein, the method can be extended to proteins with multiple locations by introducing multilabel machine learning approaches, such as the support vector machine or deep learning, as substitutes for the K-Nearest Neighbor (KNN) method.
As a user-friendly web server, MSLoc-DT is freely accessible at http://bioinfo.ibp.ac.cn/MSLOC_DT/index.html.
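The decision template rule used for fusion can be illustrated with a minimal sketch. It assumes each base classifier (for example, one KNN model per PseAAC mode) returns a vector of class supports, so that a sample is described by a "decision profile" with one row per classifier and one column per class. The function names (`build_templates`, `similarity`, `predict`), the toy numbers, and the similarity measure based on squared Euclidean distance are illustrative assumptions, not details taken from MSLoc-DT itself.

```python
from typing import Dict, List, Sequence

# A decision profile: one row per base classifier, one column per class.
Profile = List[List[float]]


def build_templates(profiles: Sequence[Profile], labels: Sequence[int],
                    n_classes: int) -> Dict[int, Profile]:
    """Average the decision profiles of the training samples of each class."""
    n_rows, n_cols = len(profiles[0]), len(profiles[0][0])
    templates: Dict[int, Profile] = {}
    for c in range(n_classes):
        members = [p for p, y in zip(profiles, labels) if y == c]
        if not members:
            continue  # no training sample for this class, so no template
        templates[c] = [
            [sum(p[i][j] for p in members) / len(members) for j in range(n_cols)]
            for i in range(n_rows)
        ]
    return templates


def similarity(profile: Profile, template: Profile) -> float:
    """One minus the mean squared difference (an assumed similarity measure)."""
    n_rows, n_cols = len(profile), len(profile[0])
    sq = sum((profile[i][j] - template[i][j]) ** 2
             for i in range(n_rows) for j in range(n_cols))
    return 1.0 - sq / (n_rows * n_cols)


def predict(profile: Profile, templates: Dict[int, Profile]) -> int:
    """Return the class whose template is most similar to the profile."""
    return max(templates, key=lambda c: similarity(profile, templates[c]))


# Toy data: 2 hypothetical base classifiers, 2 classes, 4 training samples.
train_profiles = [
    [[0.9, 0.1], [0.8, 0.2]],
    [[0.7, 0.3], [0.6, 0.4]],
    [[0.2, 0.8], [0.3, 0.7]],
    [[0.4, 0.6], [0.1, 0.9]],
]
train_labels = [0, 0, 1, 1]
templates = build_templates(train_profiles, train_labels, n_classes=2)

# The two classifiers disagree on this query; the fused decision is class 1.
query = [[0.6, 0.4], [0.35, 0.65]]
predicted = predict(query, templates)
```

In this toy case the first classifier leans towards class 0 and the second towards class 1. Because the whole profile is compared with each class template, the fused decision depends on the joint pattern of outputs and not on a simple vote.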