This tutorial describes a simple method to construct a deep learning algorithm for performing 2-class sequence classification of metagenomic data.
A variety of biological sequence classification tasks, such as species classification, gene function classification and viral host classification, are expected processes in many metagenomic data analyses. Since metagenomic data contain a large number of novel species and genes, high-performing classification algorithms are needed in many studies. Biologists often encounter challenges in finding suitable sequence classification and annotation tools for a specific task and are often not able to construct a corresponding algorithm on their own because of a lack of the necessary mathematical and computational knowledge. Deep learning techniques have recently become a popular topic and show strong advantages in many classification tasks. To date, many highly packaged deep learning packages, which make it possible for biologists to construct deep learning frameworks according to their own needs without in-depth knowledge of the algorithm details, have been developed. In this tutorial, we provide a guideline for constructing an easy-to-use deep learning framework for sequence classification without the need for sufficient mathematical knowledge or programming skills. All the code is optimized in a virtual machine so that users can directly run the code using their own data.
The metagenomic sequencing technique bypasses the strain isolation process and directly sequences the total DNA in an environmental sample. Thus, metagenomic data contain DNA from different organisms, and most biological sequences are from novel organisms that are not present in the current database. According to different research purposes, biologists need to classify these sequences from different perspectives, such as taxonomic classification1, virus-bacteria classification2,3,4, chromosome-plasmid classification3,5,6,7, and gene function annotation (such as antibiotic resistance gene classification8 and virulence factor classification9). Because metagenomic data contain a large number of novel species and genes, ab initio algorithms, which do not rely on known databases for sequence classification (including DNA classification and protein classification), are an important approach in metagenomic data analysis. However, the design of such algorithms requires professional mathematics knowledge and programming skills; therefore, many biologists and algorithm design beginners have difficulty constructing a classification algorithm to suit their own needs.
With the development of artificial intelligence, deep learning algorithms have been widely used in the field of bioinformatics to complete tasks such as sequence classification in metagenomic analysis. To help beginners understand deep learning algorithms, we describe the algorithm in an easy-to-understand fashion below.
An overview of a deep learning technique is shown in Figure 1. The core technology of a deep learning algorithm is an artificial neural network, which is inspired by the structure of the human brain. From a mathematical point of view, an artificial neural network may be regarded as a complex function. Each object (such as a DNA sequence, a photo or a video) is first digitized. The digitized object is then imported to the function. The task of the artificial neural network is to give a correct response according to the input data. For example, if an artificial neural network is constructed to perform a 2-class classification task, the network should output a probability score that is between 0-1 for each object. The neural network should give the positive object a higher score (such as a score higher than 0.5) while giving the negative object a lower score. To obtain this goal, an artificial neural network is constructed with the training and testing processes. During these processes, data from the known database are downloaded and then divided into a training set and test set. Each object is digitized in a proper way and given a label ("1" for positive objects and "0" for negative objects). In the training process, the digitized data in the training set are inputted into the neural network. The artificial neural network constructs a loss function that represents the dissimilarity between the output score of the input object and the corresponding label of the object. For example, if the label of the input object is "1" while the output score is "0.1", the loss function will be high; and if the label of the input object is "0" while the output score is "0.1", the loss function will be low. The artificial neural network employs a specific iterative algorithm that adjusts the parameters of the neural network to minimize the loss function. The training process finishes when the loss function cannot be obviously further decreased. Finally, the data in the test set are used to test the fixed neural network, and the ability of the neural network to calculate the correct labels for the novel objects is evaluated. More principles of deep learning algorithms can be found in the review in LeCun et al.10.
Although the mathematical principles of deep learning algorithms may be complex, many highly packaged deep learning packages have recently been developed, and programmers can directly construct a simple artificial neural network with a few lines of code.
To assist biologists and algorithm design beginners in getting started in using deep learning more quickly, this tutorial provides a guideline for constructing an easy-to-use deep learning framework for sequence classification. This framework uses the "one-hot" encoding form as the mathematical model to digitize the biological sequences and uses a convolution neural network to perform the classification task (see the Supplementary Material). The only thing that the users need to do before using this guideline is to prepare four sequence files in "fasta" format. The first file contains all sequences of the positive class for the training process (referred to "p_train.fasta"); the second file contains all sequences of the negative class for the training process (referred to "n_train.fasta"); the third file contains all sequences of the positive class for the testing process (referred to "p_test.fasta"); and the last file contains all sequences of the negative class for the testing process (referred to "n_test.fasta"). The overview of the flowchart of this tutorial is provided in Figure 2, and more details will be mentioned below.
This tutorial provides an overview for biologists and algorithm design beginners on how to construct an easy-to-use deep learning framework for biological sequence classification in metagenomic data. This tutorial aims to provide intuitive understanding of deep learning and address the challenge that beginners often have difficulty installing the deep learning package and writing the code for the algorithm. For some simple classification tasks, users can use the framework to perform the classification tasks.
<p class…The authors have nothing to disclose.
This investigation was financially supported by the National Natural Science Foundation of China (81925026, 82002201, 81800746, 82102508).
PC or server | NA | NA | Suggested memory: >6GB |
VirtualBox software | NA | NA | Link: https://www.virtualbox.org |