Metagenomic Analysis of Silage

Metagenomics is the direct analysis of deoxyribonucleic acid (DNA) purified from environmental samples, and it enables taxonomic identification of the microbial communities present within them. Two main metagenomic approaches exist: sequencing of the 16S rRNA gene coding region, which exhibits sufficient variation between taxa for identification, and shotgun sequencing, in which the genomes of the organisms present in the sample are analyzed and ascribed to "operational taxonomic units" (species, genera or families, depending on the extent of sequencing coverage). In this study, shotgun sequencing was used to analyze the microbial community present in cattle silage. It was coupled with a range of bioinformatics tools to quality check and filter the DNA sequence reads, perform taxonomic classification of the microbial populations within the sampled silage, and achieve functional annotation of the sequences. These methods were employed to identify potentially harmful bacteria within the silage, whose presence is an indication of silage spoilage. If spoiled silage is not remediated, its ingestion could be fatal to livestock.


Introduction
Metagenomics is the direct analysis of DNA purified from biological communities found within environmental samples 1 and was originally used to detect unculturable bacteria found in sediments 2. It has since been applied widely, for example to characterize the human microbiome 3, to classify microbial populations within the ocean 4 and even to analyze the bacterial communities that develop on coffee machines 5. The introduction of next generation sequencing technologies resulted in greater sequencing throughput and output. Consequently, DNA sequencing has become more economical 6 and the achievable depth of sequencing has greatly increased, enabling metagenomics to become a powerful analytical tool.
"Front-end" enhancements in the practical, molecular aspect of metagenomic sequencing have driven the growth of the in silico bioinformatics tools available for the taxonomic classification 7-9, functional annotation 10,11 and visual representation 12,13 of DNA sequence data. The increasing number of available sequenced prokaryotic and eukaryotic 14 genomes allows greater accuracy in the classification of microbial communities, which is invariably performed against a "back-end" reference database of sequenced genomes 15. Two main approaches can be adopted for metagenomic analysis.
The more conventional method is analysis of the 16S rRNA gene coding region of the bacterial genome. The 16S rRNA gene is highly conserved between prokaryotic species but contains nine hyper-variable regions (V1-V9) which can be exploited for species identification 16. The introduction of longer sequencing reads (≤ 300 bp paired end) allowed for the analysis of DNA sequences spanning two hyper-variable regions, in particular the V3-V4 region 17. Advances in other sequencing technologies, such as Oxford Nanopore 18 and PacBio 19, allow the entire 16S rRNA gene to be sequenced contiguously.
While 16S rDNA-based libraries provide a targeted approach to species identification and enable the detection of low copy number DNA that naturally occurs within purified samples, shotgun sequencing libraries allow for the detection of species containing DNA regions that cannot be amplified by the 16S rRNA marker primers used, for instance because the differences between the template sequence and the amplifying primer sequence are too great 20,21. Furthermore, although DNA polymerases replicate DNA with high fidelity, base errors can nonetheless occur during PCR amplification, and these incorporated errors can result in incorrect classification of the originating species 22.
Biases in the PCR amplification of template sequences can also occur: DNA sequences with a high GC content can be underrepresented in the final amplicon pool 23 and, similarly, unnatural base modifications such as thymine glycol can stall DNA polymerases, causing amplification of the affected DNA sequences to fail 24. In contrast, a shotgun sequencing DNA library is prepared from all of the purified DNA extracted from a sample, which is fragmented into shorter lengths prior to preparation for sequencing. Taxonomic classification of DNA sequences generated by shotgun sequencing is more accurate than that obtained by 16S rRNA amplicon sequencing 25, although the financial cost required to reach a reliable sequencing depth is greater than that of amplicon sequencing.
Metagenomic sequence data are analyzed with an ever-increasing range of bioinformatic tools. These tools perform a wide variety of tasks, for example quality control analysis of the raw sequence data 28, overlapping of paired end reads 29, de novo assembly of sequence reads into contigs and scaffolds 30,31, taxonomic classification and visualization of sequence reads and assembled sequences 7,12,32,33 and functional annotation of assembled sequences 34,35.
Silage, produced by farmers throughout the world from fermented cereals such as maize (Zea mays), is predominantly used as cattle feed. Silage is treated with the bacterium Lactobacillus sp. to aid fermentation 36 but, to date, there is limited knowledge of the other microbial populations found in silage. The fermentation process can lead to undesirable and potentially harmful micro-organisms becoming prevalent within the silage 37. In addition to yeasts and molds, bacteria are particularly adaptable to the anaerobic environment in fermenting silage and are more frequently associated with diseases in livestock than with the degradation of the silage 38.
Butyric acid bacteria can be inadvertently introduced with soil residues when the silage silos are filled and are able to convert lactic acid, a product of anaerobic digestion, to butyric acid, thus increasing the pH of the silage 39. This increase in pH can lead to an upsurge in spoilage bacteria that would normally be unable to sustain growth under optimum silage fermentation conditions 38. Clostridium spp., Listeria spp. and Bacillus spp. are of particular concern, especially in silage for dairy cattle feed, as bacterial spores that have survived the gastrointestinal tract 40 can enter the food chain, lead to food spoilage and, in rare cases, cause animal and human fatalities 37,39,41-44. Moreover, while it is difficult to estimate the exact economic impact of veterinary treatment and livestock loss caused by silage spoilage, an outbreak is likely to be detrimental to a farm.
It is hypothesized that a metagenomic approach can classify the microbial populations present in silage samples and, furthermore, identify microbial communities associated with silage spoilage that could in turn have a detrimental effect on livestock, enabling remedial action to be taken before the silage is used as a food source.

1. [...] button. Upload the assembled scaffolds from Step 10.
2. Once the files have uploaded, click on "Submit", follow the instructions and await the completion of the analysis.
3. After the analysis is complete, view the link sent via email from MG-RAST or, alternatively, click on "Progress" to see a list of completed jobs. Click on the relevant job id and then on the link to the "download page".
4. On the download page, under the heading "Protein Clustering 90%", click on the protein button to download the predicted protein file, 550.cluster.aa90.faa.
5. To classify the proteins as putatively belonging to a particular CAZy enzyme class, compare the downloaded proteins to the CAZy database 48. Download the Carbohydrate-Active enZYmes Database (CAZy) from [...]. The files are: AA.zip, CE.zip, GH.zip, GT.zip and PL.zip. These files represent the following enzyme classes respectively: Auxiliary Activities (AA), Carbohydrate Esterases (CE), Glycoside Hydrolases (GH), Glycosyl Transferases (GT) and Polysaccharide Lyases (PL).
6. Unzip the database files and annotate the proteins by determining their similarity to the CAZy database proteins using the USEARCH UBLAST algorithm 49. To iterate through the five database .txt files with a bash loop, type "for i in *.txt; do".
7. Run USEARCH by typing /path-to-file/usearch8 with the parameter -ublast in order to use the UBLAST algorithm. Then type the name of the protein sequence file downloaded from MG-RAST, "mgmXXXXXX.3.550.cluster.aa90.faa".
8. To indicate the database file to be used, type "-db $i", and to set the E-value threshold to 1e-5, type "-evalue 1e-5".
9. To terminate the search after the discovery of a target sequence, thereby classifying that protein sequence as belonging to the target enzyme class (e.g. GH), type "-maxaccepts 1".
10. To specify that 16 CPUs should be used, type "-threads 16", and to set the format of the output file to tab-separated text, type "-blast6out".
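The command assembled piece by piece in the steps above can be sketched as a single shell loop. This is a minimal illustration under stated assumptions, not the authors' exact script: the option that stops the search after the first hit is spelled -maxaccepts here, as in the USEARCH documentation; -blast6out expects an output file name, and the "<class>_hits.b6" names used below are hypothetical; the usearch8 path and the MG-RAST file name are the placeholders from the protocol and must be replaced. By default the loop only prints each command (a dry run), and it executes USEARCH only when RUN=1 is set.

```shell
#!/bin/sh
# Sketch of the UBLAST loop over the CAZy database files.
# Placeholders from the protocol; replace with real values before running.
USEARCH="/path-to-file/usearch8"
QUERY="mgmXXXXXX.3.550.cluster.aa90.faa"

# Build, print and optionally run the UBLAST command for one database file.
run_ublast() {
    db="$1"
    # Output name is an assumption, e.g. GH.txt -> GH_hits.b6
    set -- "$USEARCH" -ublast "$QUERY" -db "$db" \
        -evalue 1e-5 -maxaccepts 1 -threads 16 \
        -blast6out "${db%.txt}_hits.b6"
    echo "$*"                      # show the full command line
    if [ "${RUN:-0}" = "1" ]; then
        "$@"                       # execute only when RUN=1 is set
    fi
}

# Iterate through the database .txt files in the current directory.
for i in *.txt; do
    [ -e "$i" ] || continue        # skip if no .txt files are present
    run_ublast "$i"
done
```

Running the script as-is lists one command per database file so that it can be checked first; running it with RUN=1 in the environment performs the searches and writes one tab-separated hit table per enzyme class.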

Discussion
While an in silico analysis can give excellent insight into the microbial communities present within environmental samples, it is critical that the taxonomic classifications demonstrated are performed in association with relevant controls and that a sufficient depth of sequencing has been achieved to capture the entire population present 51.
With any computational analysis, there are many routes to achieve a similar goal. The methods used in this study are examples of suitable and straightforward methods that have been brought together to achieve a range of analyses on the silage microbiome. A large and ever-increasing number of bioinformatics tools and techniques are available to analyze metagenomic data, for instance Phylosift 8 and MetaPhlAn2 52, and these should be evaluated prior to the investigation for their relevance to the sample and the analysis required 53. Metagenomic analysis methods are limited by the databases available for classification, the sequencing depth and the quality of sequencing.
The bioinformatic processing demonstrated here was performed on a local, high-powered machine; however, cloud-based systems are also available. These cloud-based services allow the necessary computational power to be rented without the high-cost investment in a suitably powerful local workstation. A potential application of this method would be to assess silage before its use in agriculture to ensure that no potentially harmful bacteria are present, thereby preventing them from entering the food chain.