September 25, 2021
A variety of biological sequence classification tasks, such as species classification, gene function classification, and virus host classification, are expected processes in many metagenomic data analyses. Since metagenomic data contain a large number of novel species and genes, high-performing classification algorithms are needed in many studies. Biologists often encounter challenges in finding suitable sequence classification and annotation tools for a specific task and are often unable to construct a corresponding algorithm on their own because they lack the necessary mathematical and computational knowledge.
Deep learning techniques have recently become a popular topic and show strong advantages in many classification tasks. To date, many highly packaged deep learning packages have been developed, which make it possible for biologists to construct deep learning frameworks according to their own needs without in-depth knowledge of the algorithm details. In this tutorial, we provide a guideline for constructing an easy-to-use deep learning framework for sequence classification without the need for extensive mathematical knowledge or programming skills.
The following video shows how to use the virtual machine to perform biological sequence classification. Users need to download the virtual machine file from the tutorial homepage and then download the VirtualBox software. The virtual machine is compressed as a 7z file.
The 7z file can easily be decompressed using common compression software such as WinRAR, WinZip, or 7-Zip. We decompressed the virtual machine using 7-Zip. The decompression may take some time.
Please wait for a while. After decompression, users need to install the VirtualBox software. Create a folder in which to install VirtualBox.
Open the VirtualBox installation package. Select the folder you created. Then install the VirtualBox software by clicking the Next button in each step.
The installation may take some time; please wait for a while. Open the VirtualBox software. Click the New button to create a virtual machine.
Enter a virtual machine name of your choice in the Name field. Select Linux as the operating system in the Type field. Select Ubuntu in the Version field and click the Next button.
If possible, allocate a larger amount of memory to the virtual machine. Choose the Use an existing virtual hard disk file option. Select the virtual machine file downloaded from the tutorial homepage.
Then click the Create button. Click the Start button to open the virtual machine. Starting up the virtual machine may take a while.
Please wait a moment before the next step. Users then need to create a shared folder on both the physical host and the virtual machine to exchange files. On your physical host, create a shared folder named shared_host, and on the desktop of the virtual machine, create a shared folder named shared_VM. In the menu bar of the virtual machine, click Devices, Shared Folders, and Shared Folders Settings in succession.
Click the button in the upper right corner. Select the shared folder you created on the physical host. Select the Auto-mount option.
Click the OK button. Then restart the virtual machine. Restarting the virtual machine may take a while.
Please wait a moment before the next step. Right-click on the desktop of the virtual machine and open the terminal. Type the following command in the terminal, with a space between each part:
sudo mount -t vboxsf shared_host ./Desktop/shared_VM
When prompted for a password, enter 1 and press the Enter key. Copy all four sequence files in FASTA format for the training and testing process to the shared_host folder of the physical host. All the files will then also appear in the shared_VM folder of the virtual machine.
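Before copying the four FASTA files any further, it can help to confirm that each one arrived in the shared folder and actually contains sequences. The following is a minimal sketch of such a check, not part of the tutorial's own scripts; the function names are hypothetical, and it assumes only that every FASTA record starts with a header line beginning with ">".

```python
from pathlib import Path


def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    count = 0
    with open(path, "r") as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count


def check_fasta_files(paths):
    """Return {file name: record count} for the given files.

    Raises FileNotFoundError if a file is missing and ValueError if a
    file contains no FASTA records. (Hypothetical helper, for illustration.)
    """
    summary = {}
    for path in map(Path, paths):
        if not path.is_file():
            raise FileNotFoundError(f"{path} not found")
        n_records = count_fasta_records(path)
        if n_records == 0:
            raise ValueError(f"{path} contains no FASTA records")
        summary[path.name] = n_records
    return summary
```

Calling check_fasta_files with the paths of your own two training files and two testing files returns the number of sequences in each, which is a quick way to spot an empty or misnamed file before the encoding step.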
Then copy the files in the shared_VM folder to the DeepLearning folder of the virtual machine. Right-click, open the terminal, and type the following command to perform the one-hot encoding: ./onehot_encoding, followed by the files for training and testing and then the sequence type.
Then type the following command to start the training process: python train.py. The training process will then begin.
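For readers unfamiliar with one-hot encoding, the idea can be sketched in a few lines of Python. This is an illustration only and is not the encoding program shipped in the virtual machine; the function name, the A, C, G, T column order, the zero padding, and the all-zero row for unknown characters are assumptions made for the example.

```python
def one_hot_encode(sequence, alphabet="ACGT", length=None):
    """Encode a sequence as a list of rows, one row per position.

    Each row has one column per alphabet character, with a 1 in the
    column of the character at that position. Characters outside the
    alphabet (for example N) become all-zero rows. If `length` is given,
    shorter sequences are zero-padded and longer ones are truncated.
    (Illustrative assumptions, not the tutorial's actual scheme.)
    """
    sequence = sequence.upper()
    if length is None:
        length = len(sequence)
    column_of = {char: i for i, char in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in range(length)]
    for position, char in enumerate(sequence[:length]):
        column = column_of.get(char)
        if column is not None:
            matrix[position][column] = 1
    return matrix


# "ACGN" becomes four rows; the unknown base N is an all-zero row.
print(one_hot_encode("ACGN"))
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
```

Passing a protein alphabet as the alphabet argument gives the analogous encoding for amino acid sequences, which is why an encoder needs to be told the sequence type.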
Training may take a few hours or a few days, depending on the size of your data set. When the process is finished, the prediction results for the test data are written to the predict.csv file. In our previous work, we developed a series of sequence classification tools for metagenomic data using an approach similar to this tutorial.
For example, we developed a tool to identify complete and partial prokaryote virus virion proteins from virome data, and a tool to identify phage DNA fragments among bacterial chromosome DNA fragments in metagenomic data. The performance of the tools using the script of this tutorial is shown in figures a and b.
In conclusion, this tutorial provides an overview for biologists and algorithm design beginners on how to construct an easy-to-use deep learning framework for biological sequence classification in metagenomic data. This tutorial aims to provide an intuitive understanding of deep learning and to address the difficulty beginners often have in getting started with the deep learning package and writing the code for the algorithm. For some simple classification tasks, users can use our framework to perform the classification task.
This tutorial describes a simple method to construct a deep learning algorithm for performing 2-class sequence classification of metagenomic data.
Cite this Article
Fang, Z., Zhou, H. A Virtual Machine Platform for Non-Computer Professionals for Using Deep Learning to Classify Biological Sequences of Metagenomic Data. J. Vis. Exp. (175), e62250, doi:10.3791/62250 (2021).