Tick Microbiome Characterization by Next-Generation 16S rRNA Amplicon Sequencing

Lisa Couper; Andrea Swei

doi:10.3791/58239

Biology

Published: August 25, 2018 doi: 10.3791/58239

Lisa Couper¹, Andrea Swei¹

¹Department of Biology, San Francisco State University

Summary

Here we present a next-generation sequencing protocol for 16S rRNA sequencing that enables identification and characterization of microbial communities within vectors. This method involves DNA extraction, amplification and barcoding of samples through PCR, sequencing on a flow cell, and bioinformatics to match sequence data to phylogenetic information.

Abstract

In recent decades, vector-borne diseases have re-emerged and expanded at alarming rates, causing considerable morbidity and mortality worldwide. Effective and widely available vaccines are lacking for a majority of these diseases, necessitating the development of novel disease mitigation strategies. To this end, a promising avenue of disease control involves targeting the vector microbiome, the community of microbes inhabiting the vector. The vector microbiome plays a pivotal role in pathogen dynamics, and manipulations of the microbiome have led to reduced vector abundance or pathogen transmission for a handful of vector-borne diseases. However, translating these findings into disease control applications requires a thorough understanding of vector microbial ecology, historically limited by insufficient technology in this field. The advent of next-generation sequencing approaches has enabled rapid, highly parallel sequencing of diverse microbial communities. Targeting the highly-conserved 16S rRNA gene has facilitated characterizations of microbes present within vectors under varying ecological and experimental conditions. This technique involves amplification of the 16S rRNA gene, sample barcoding via PCR, loading samples onto a flow cell for sequencing, and bioinformatics approaches to match sequence data with phylogenetic information. Species or genus-level identification for a high number of replicates can typically be achieved through this approach, thus circumventing challenges of low detection, resolution, and output from traditional culturing, microscopy, or histological staining techniques. Therefore, this method is well-suited for characterizing vector microbes under diverse conditions but cannot currently provide information on microbial function, location within the vector, or response to antibiotic treatment. 
Overall, 16S next-generation sequencing is a powerful technique for better understanding the identity and role of vector microbes in disease dynamics.

Introduction

The resurgence and spread of vector-borne diseases in recent decades pose a serious threat to global human and wildlife health. Effective vaccines are lacking for a majority of these diseases, and control efforts are hindered by the complex biological nature of vectors and vector-host interactions. Understanding the role of microbial interactions within a vector in pathogen transmission can allow for the development of novel strategies which circumvent these challenges. In particular, interactions between vector-associated microbial commensals, symbionts, and pathogens, referred to as the microbiome, may have important consequences for pathogen transmission. Overwhelming evidence now supports this assertion, with examples demonstrating a link between the vector microbiome and competence for diseases such as malaria, Zika virus, and Lyme disease¹,²,³. However, translating these findings into strategies for disease control requires a far more detailed understanding of the structure, function, and origin of vector microbiomes. Identification and characterization of the vector microbial community under varying ecological and experimental conditions constitute an important path forward in this field.

A procedure for identifying the microbial residents of a pathogen vector is provided here by utilizing the Western black-legged tick, Ixodes pacificus, a vector species of the Lyme disease pathogen Borrelia burgdorferi. While ticks harbor more types of human pathogens than any other arthropod, relatively little is known about the biology and community ecology of tick microbiomes⁴. It is evident that ticks harbor a diverse array of viruses, bacteria, fungi, and protozoans which include commensals, endosymbionts, and transient microbial residents⁴,⁵. Prior work has demonstrated strong variations in Ixodes microbiomes associated with geography, species, sex, life stage, and blood meal source⁶,⁷,⁸. However, the mechanisms underlying this variation remain unknown and warrant more detailed investigations of the origin and assembly of these microbial communities. Ticks can acquire microbes through vertical transmission, contact with hosts, and uptake from the environment through the spiracles, mouth, and anal pore⁹. Understanding the factors shaping the initial formation and development of the tick microbiome, specifically the relative contribution of vertical and environmental transmission, is important for understanding the natural patterns and variations in tick microbiome diversity and how these communities interact during pathogen transmission, with possible applications to disease or vector control.

Powerful molecular techniques, such as next-generation sequencing, now exist for identifying microbial communities and can be employed to characterize vector microbiomes under diverse environmental or experimental conditions. Prior to the advent of these high-throughput sequencing approaches, the identification of microbes relied predominantly on microscopy and culture. While microscopy is a rapid and easy technique, morphological methods for identifying microbes are inherently subjective and coarse and limited by low sensitivity and detection¹⁰. Culture-based methods are broadly used for microbial identification and can be used to determine susceptibility of microbes to drug treatments¹¹. However, this method also suffers from low sensitivity, as it has been estimated that fewer than 2% of environmental microbes can be cultured in a laboratory setting¹². Histological staining approaches have also been employed to detect and localize specific microbes within vectors, enable investigations of various taxa distributions within the tick, and study hypotheses about microbial interactions. However, prior knowledge of microbial identity is required for selecting the appropriate stains, making this approach ill-suited for microbial characterization and identification. Furthermore, histological staining is a highly time-intensive, laborious process and does not scale well for large sample sizes. Traditional molecular approaches such as Sanger sequencing are similarly limited in their sensitivity and detection of diverse microbial communities.

Next-generation sequencing allows for the rapid identification of microbes from a large number of samples. The presence of standard marker genes and reference databases further enables enhanced taxonomic resolution, often to the genus or species level. Small subunit ribosomal RNAs are frequently used to achieve this goal, with 16S rRNA being the most common due to the presence of conserved and variable regions within the gene, allowing for the creation of universal primers with unique amplicons for each bacterial species¹³,¹⁴. This report details a procedure for identifying taxa in the tick microbiome through 16S rRNA next-generation sequencing. In particular, this protocol emphasizes the steps involved in preparing samples for sequencing. More generalized details on the sequencing and bioinformatics steps are provided, as there are a variety of sequencing platforms and analysis programs currently available, each with extensive existing documentation. The overall feasibility of this next-generation sequencing approach is demonstrated by applying it to an investigation of microbial community assembly within a key disease vector.

Protocol

1. Tick Collection and Surface Sterilization

Collect ticks by dragging a 1 m² white cloth over a tick-associated habitat, removing ticks attached to host species, or rearing ticks in the lab¹⁵,¹⁶. Use fine forceps to manipulate ticks and store them at -80 °C.
Place the ticks in individual PCR tubes and remove surface contaminants by vortexing for 15 s successively with 500 μL of hydrogen peroxide (H₂O₂), 70% ethanol, and ddH₂O.
Place the ticks in a new PCR tube and allow them to air-dry.
In this tube, mechanically disrupt tick tissues by crushing the ticks with a mortar and pestle (used in this study), using small beads in a bead beater, or cutting the tick apart with a scalpel.

2. DNA Purification

Purify DNA from individual ticks, following the instructions provided in a commercially available DNA extraction kit. Elute for a final volume of 100 μL.
NOTE: Refer to Figure 1 for an overview of steps 2-8. Alternative extraction methods include phenol-chloroform extraction¹⁷,¹⁸, ethanol precipitation¹⁹, or extraction with a chelating material²⁰,²¹.
Confirm successful DNA purification using a fluorometer or spectrophotometer. For fluorometry, use 10 μL of DNA template in a 190 μL double-stranded DNA assay. The expected yield is 0.1-1.0 ng/μL²². For spectrophotometry, use 1 μL of DNA template in a nucleic acid quantification. The expected A260/280 ratio is approximately 1.8²³.
If not proceeding immediately to amplification, store the samples at -20 °C.

3. 16S rRNA Gene Amplification

Set up the amplicon PCR in a 27.5 μL reaction containing: 5 μL of each primer at 1 μM; 12.5 μL of commercially available PCR mix; and 5 μL of DNA extracted from individual ticks at a 5 ng/μL concentration.
NOTE: Use the following primers for amplification of the hypervariable V3-V4 region of the 16S rRNA gene¹³:
5'--TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG--3',
5'--GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC--3'
Move tubes to a thermocycler programmed for: an initial denaturation at 95 °C for 3 min; 25 cycles of 1) 95 °C for 30 s, 2) 57 °C for 30 s, 3) 72 °C for 30 s; a final extension at 72 °C for 5 min; and a final hold at 10 °C.
Once the PCR is done, visualize the PCR product by loading 4-6 μL/sample on a 1.5% agarose gel. Look for a band at 460 bp to confirm amplification.
NOTE: Amplicon PCR product may not be visible on the gel if the sample concentration is too low (< 10 ng). Primer dimer presence is common in low-concentration samples. If other non-target bands are present, adjust the annealing temperature or decrease the number of cycles.
If not proceeding immediately to purification, store the samples at 4 °C.
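
The two amplicon primers in this section contain IUPAC degenerate bases (N, W, H, V), so each oligo is really a small pool of sequences. The following minimal Python sketch counts that pool for each primer. It assumes that the portion of each primer ending in AGATGTGTATAAGAGACAG is the adapter overhang and that the remainder is the 16S-specific part; the variable and function names are illustrative only.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Full primers as listed in the protocol.
FORWARD = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG"
REVERSE = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC"

# Assumption: everything after this shared motif is the 16S-specific part.
OVERHANG_END = "AGATGTGTATAAGAGACAG"


def locus_specific(primer):
    """Return the part of the primer that follows the assumed overhang."""
    return primer.split(OVERHANG_END, 1)[1]


def degeneracy(sequence):
    """Number of distinct unambiguous sequences the oligo represents."""
    count = 1
    for base in sequence:
        count *= len(IUPAC[base])
    return count


def expand(sequence):
    """List every unambiguous sequence encoded by a degenerate oligo."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in sequence))]


for name, primer in (("forward", FORWARD), ("reverse", REVERSE)):
    core = locus_specific(primer)
    print(name, core, degeneracy(core))
```

Listing the expanded sequences with `expand` can be useful when checking in silico whether a taxon of interest is matched by at least one primer variant.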

4. 16S Amplicon Purification

Using a new PCR tube for each sample, combine the PCR product with ddH₂O to obtain 60 μL total. For samples with low DNA concentrations, perform amplicon PCR in triplicate to reduce amplification bias and pool the samples to concentrate.
Bring paramagnetic beads to room temperature and vortex them well before use.
Add 48 μL of paramagnetic beads to 60 μL of the sample and incubate for 5 min. Place the tubes on a magnetic rack for 5 min, until the solution becomes clear. Remove the supernatant.
NOTE: This bead concentration targets the removal of primer dimers (< 60 bp). If needed, adjust the bead concentration to remove other non-specific banding.
With the tubes on the magnetic rack, add 500 μL of freshly prepared 80% EtOH. Immediately after adding EtOH to all tubes, pipette out the liquid. Do not remove the beads. Repeat the EtOH addition and removal one more time.
Air-dry the samples to remove excess EtOH by leaving the tubes open on the magnetic rack for 5 min, or until small cracks are visible in the beads.
Add 20 μL of TE buffer and incubate the samples off the magnet, at room temperature, for 5-10 min.
Place the tubes back on the magnet. Once the beads and liquid are separated, transfer the supernatant to a fresh tube to obtain the cleaned PCR product.
(Optional) Visualize the product to confirm successful amplicon purification by loading 4-6 μL/sample on a 1.5% gel. The 460 bp band should be visible, and there should be no primer dimer.
If not proceeding immediately to index PCR, store the samples at 4 °C.
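
The bead step in this section uses 48 μL of beads for 60 μL of sample, i.e., a bead-to-sample ratio of 0.8. If the sample volume differs, the same ratio can be kept by scaling the bead volume. The short Python sketch below does only that arithmetic; the function name and the example volume of 45 μL are hypothetical.

```python
# Bead-to-sample ratio implied by the protocol: 48 uL beads per 60 uL sample.
PROTOCOL_BEAD_UL = 48.0
PROTOCOL_SAMPLE_UL = 60.0
PROTOCOL_RATIO = PROTOCOL_BEAD_UL / PROTOCOL_SAMPLE_UL  # 0.8


def bead_volume(sample_ul, ratio=PROTOCOL_RATIO):
    """Bead volume (uL) that keeps the given bead-to-sample ratio."""
    if sample_ul <= 0:
        raise ValueError("sample volume must be positive")
    return sample_ul * ratio


# Hypothetical example: a pooled sample brought to 45 uL instead of 60 uL.
print(round(bead_volume(45.0), 1))
```

Raising the ratio retains shorter fragments and lowering it removes longer ones, so any change to the ratio itself should be checked on a gel as described in the optional visualization step.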

5. Sample Barcoding and Purification

Assign unique primer combinations to each sample by selecting either forward or reverse primers, or both, from a commercially available library index kit.
NOTE: Uniquely labeling samples allows for differentiation after sequencing. Kits typically provide enough primers to sequence 96-384 samples.
Attach the dual primers, or indices, to samples by performing PCR in a 25 μL reaction containing 2.5 μL of each primer (N7xx and S5xx), 12.5 μL of commercially available PCR master mix, 5 μL of ddH₂O, and 2.5 μL of cleaned amplicon product.
Move the tubes to a thermocycler programmed for: an initial denaturation at 95 °C for 3 min; 8-14 cycles of 1) 95 °C for 30 s, 2) 55 °C for 30 s, 3) 72 °C for 30 s; a final extension at 72 °C for 5 min; and a final hold at 10 °C.
Once the PCR is done, visualize the PCR product by loading 4-6 μL/sample on a 1.5% agarose gel. Look for a band at 550 bp to confirm amplification.
NOTE: Use the visualization results to inform index PCR cycling conditions. Lower the cycle count to mitigate non-specific binding or increase the cycle count to obtain visible bands for each sample.
Repeat the clean-up procedure listed in step 4. To avoid dilution during clean-up, perform index PCR in duplicate and pool the product here.
If not proceeding immediately to library quantification and normalization, store the samples at 4 °C.
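
Assigning a unique index combination to every sample, as described at the start of this section, amounts to pairing each sample with one element of the set of all forward-by-reverse index combinations. The following Python sketch shows one way to do this. The index names only follow the N7xx/S5xx pattern mentioned above and are placeholders; the actual names and sequences must be taken from the index kit documentation. The sample IDs are likewise hypothetical.

```python
from itertools import product


def assign_index_pairs(sample_ids, n7_indices, s5_indices):
    """Give each sample a distinct (N7, S5) index combination."""
    combinations = list(product(n7_indices, s5_indices))
    if len(sample_ids) > len(combinations):
        raise ValueError("more samples than unique index combinations")
    return dict(zip(sample_ids, combinations))


# Placeholder index names: 12 x 8 = 96 possible combinations.
n7 = [f"N7{i:02d}" for i in range(1, 13)]
s5 = [f"S5{i:02d}" for i in range(1, 9)]

# Hypothetical sample IDs for 42 ticks.
samples = [f"tick{i:02d}" for i in range(1, 43)]

layout = assign_index_pairs(samples, n7, s5)
for sample in samples[:3]:
    print(sample, layout[sample])
```

Recording this layout alongside the sample metadata makes it straightforward to fill in the mapping file used later for demultiplexing.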

6. Library Quantification and Normalization

Estimate the concentration of each purified, barcoded product from step 5.5 using a fluorometer or spectrophotometer (see step 2.2). Dilute the samples in TE buffer to obtain concentrations of approximately 1 pM.
NOTE: Library quantification is necessary to achieve the sample loading concentration recommended for the sequencing platform. Library quantification is achieved through qPCR, but estimating the sample concentration prior to qPCR saves reagents and time. The 1 pM concentration is recommended to maximize the accuracy of library quantification, as this is the mean concentration of the qPCR standards provided in the quantification kit²⁴.
Perform qPCR in a 10 μL reaction containing 6.0 μL of qPCR master mix (from the library quantification kit), 2.0 μL of ddH₂O, and 2.0 μL of the sample or standard. Run each sample and standard in triplicate for greater accuracy and precision. Run each sample at three or more dilution levels (e.g., 0.1 pM, 1 pM, and 10 pM for estimated starting concentrations) to ensure accurate quantification.
NOTE: Detailed information about the provided primers and standards varies with the quantification kit used and is available in the product's technical data sheet. Refer to the Table of Materials for the quantification kit used in this study.
Move the qPCR plate or tubes to a real-time PCR instrument programmed for: an initial denaturation at 95 °C for 5 min; 35 cycles of 1) 95 °C for 30 s and 2) 60 °C for 45 s; and a dissociation step of 1) 95 °C for 15 s, 2) 60 °C for 30 s, and 3) 95 °C for 15 s.
Calculate the average starting concentration for each sample, using the quantification values obtained from the qPCR results and the sample dilution levels used.
NOTE: For example, an average concentration of 2 pM for a sample diluted 1:100,000 yields an original starting concentration of 200 nM. If none of the dilution levels for a given sample falls within the range of the standard curves, quantification results may not be accurate; in which case, adjust the dilution levels and re-perform qPCR.
Dilute each purified, barcoded sample to 4 nM in TE buffer, based on the average concentrations calculated in step 6.4.
NOTE: For example, for an average sample concentration of 200 nM, dilute the sample 1:50 in TE buffer to achieve a 4 nM concentration. The precise sample concentration will vary based on the sequencing system and reagent kit used. Refer to the sequencing system user guide for the recommended concentration.
Create the combined library by adding equal volumes (typically 5-10 μL) of all individual libraries into a single tube.
NOTE: As all samples should be at a 4 nM concentration, adding equal volumes achieves an equal concentration of all samples in the pooled library.
Repeat steps 6.2-6.4 on the combined library to confirm the 4 nM concentration.
Calculate the combined library concentration and dilute or re-constitute the final, pooled library using Tris buffer as necessary to achieve 4 nM.
If not proceeding immediately to sample loading, store the samples at -20 °C. Perform the sequencing run shortly after quantification to minimize loss or changes to DNA concentration during storage.
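
The concentration arithmetic in this section (back-calculating the undiluted concentration from the qPCR reading, averaging across dilution levels, and diluting to 4 nM) can be sketched in a few lines of Python. This is a minimal illustration rather than part of the published workflow; the function names and the 50 μL final volume are assumptions chosen for the example.

```python
from statistics import mean


def stock_concentration_nM(measured_pM, dilution_factor):
    """Undiluted concentration (nM) from a qPCR reading (pM) on a diluted aliquot.

    dilution_factor is the fold dilution, e.g. 100_000 for a 1:100,000 dilution.
    """
    return measured_pM * dilution_factor / 1000.0


def average_stock_nM(readings):
    """Average the back-calculated stock over several (pM, fold-dilution) pairs."""
    return mean(stock_concentration_nM(pM, fold) for pM, fold in readings)


def dilution_volumes(stock_nM, target_nM=4.0, final_volume_ul=50.0):
    """Return (sample uL, buffer uL) that give target_nM in final_volume_ul."""
    if stock_nM < target_nM:
        raise ValueError("stock is already below the target concentration")
    sample_ul = final_volume_ul * target_nM / stock_nM
    return sample_ul, final_volume_ul - sample_ul


# Worked example from the NOTEs above: 2 pM measured at 1:100,000.
stock = stock_concentration_nM(2.0, 100_000)
print(stock)                    # 200 nM
print(dilution_volumes(stock))  # 1 uL sample + 49 uL TE, i.e. a 1:50 dilution
```

Readings that fall outside the range of the standard curve should be excluded before averaging, as noted in the protocol.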

7. Library Denaturation and Dilution, and Sequencing Run (perform on the same day)

Denature the 4 nM combined library from step 6.9 with NaOH.
NOTE: These final library preparation steps vary for each sequencing system. Refer to the sequencing system user guide for more detailed and updated protocols. New users will likely need to be trained on the usage of the sequencing platform or may send their libraries to a core sequencing facility.
Dilute the denatured library to the desired loading concentration using the buffer provided in the sequencing reagent kit (Table of Materials).
NOTE: Optimal loading concentrations vary by sequencing system and typically range from 1-250 pM²⁵.
Denature and dilute the sequencing control.
NOTE: Adding a sequencing control corrects for sequencing issues arising from low diversity libraries. Denature the sequencing control using NaOH and dilute to the same concentration as the library.
Combine the library and sequencing control.
NOTE: The ratio of sequencing control to combined library will depend on the library diversity and sequencing system used, but it is typically 1:1²⁶.
Load the combined library and sequencing control mixture onto the sequencing flow cell.
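
Because the library and the sequencing control are diluted to the same concentration before mixing, the control fraction in the loaded mixture is set by volume alone. The Python sketch below computes the two volumes for a chosen control fraction. The 600 μL total volume is a hypothetical example; the required volume should be taken from the sequencing system user guide.

```python
def spike_in_volumes(total_ul, control_fraction):
    """Return (library uL, control uL) for a given control fraction by volume.

    Assumes library and control were diluted to the same concentration.
    """
    if not 0.0 <= control_fraction <= 1.0:
        raise ValueError("control_fraction must be between 0 and 1")
    control_ul = total_ul * control_fraction
    return total_ul - control_ul, control_ul


# Hypothetical 600 uL load with a 25% control addition (as in Table 1).
print(spike_in_volumes(600.0, 0.25))

# A 1:1 mixture corresponds to a control fraction of 0.5.
print(spike_in_volumes(600.0, 0.5))
```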

8. Amplicon Sequence Analysis

Assess the overall success of the run by examining the run metrics on the cloud computing environment corresponding to the sequencing system used.
NOTE: The run metrics, such as number of sequencing reads, will vary based on sequencing platform and reagent kit used. Target values for common sequencing systems are listed in Table 1.
Download the raw sequencing data and the desired open-source bioinformatics software, such as Quantitative Insights into Microbial Ecology (QIIME)²⁷ or Mothur²⁸.
NOTE: Both QIIME and Mothur are open-source and free to download. Detailed instructions on using these programs can be found online and are sufficiently detailed for first-time users. The steps below provide a broad overview of a bioinformatics pipeline conducted in QIIME. Familiarity with Python is not necessary but will facilitate the implementation of the following scripts.
Create and validate the mapping file using the "validate_mapping_file.py" script in QIIME.
NOTE: The mapping file contains all the metadata necessary to perform data analysis, including sample ID, amplicon primer sequences, and sample description. The validation step checks that all necessary data have been entered in the proper format.
Demultiplex and filter sequences using the "split_libraries.py" script (Figure 1).
NOTE: This script assigns barcoded reads to samples based on the index primer combinations and sample IDs input in the mapping file. It also performs several quality filtering steps to control for sequencing error based on user-defined cut-offs for minimum quality score, sequence lengths, and end-trimming.
Assign operational taxonomic units (OTUs) to sequences using the "pick_open_reference_otus.py" script.
NOTE: In this step, sequences are clustered against a reference sequence collection based on a threshold of identity (typically 97%). Reads which do not match the reference sequence collection are clustered against one another. De novo and closed-reference OTU picking options are also available, but open-reference picking is recommended by QIIME developers.
Normalize the OTU table using the "alpha_rarefaction.py" or "normalize_table.py" script.
NOTE: This step corrects for variation in column sums, or total sequence reads per sample, that result from modern sequencing technologies. Normalization can be performed using traditional rarefaction, or through alternative methods such as cumulative sum scaling.
Perform several alpha-diversity, beta-diversity, and taxonomic composition diversity analyses at once using the "core_diversity_analyses.py" script.
NOTE: Alternatively, these analyses can be run separately using the individual scripts (e.g., "alpha_diversity.py").
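
The mapping file described in this section is a tab-separated table in which, for QIIME 1, the first three columns are #SampleID, BarcodeSequence, and LinkerPrimerSequence and the last column is Description. The following Python sketch builds such a file and runs a few basic checks before it is passed to "validate_mapping_file.py". It is not a substitute for that script. The sample IDs, barcode sequences, and metadata columns (Clutch, ExposureWeeks) are hypothetical.

```python
import csv
import io
import re

REQUIRED_FIRST = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence"]


def build_mapping(rows, metadata_columns):
    """Return mapping-file text: required columns, metadata, then Description."""
    header = REQUIRED_FIRST + list(metadata_columns) + ["Description"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        writer.writerow([row[column] for column in header])
    return buffer.getvalue()


def check_mapping(text):
    """Return a list of problems found by a few basic format checks."""
    lines = [line.split("\t") for line in text.splitlines()]
    header, body = lines[0], lines[1:]
    problems = []
    if header[:3] != REQUIRED_FIRST:
        problems.append("first three columns are not the required ones")
    if header[-1] != "Description":
        problems.append("last column is not Description")
    sample_ids = [fields[0] for fields in body]
    if len(set(sample_ids)) != len(sample_ids):
        problems.append("duplicate sample IDs")
    for sample_id in sample_ids:
        # Sample IDs may contain only letters, digits, and periods.
        if not re.fullmatch(r"[A-Za-z0-9.]+", sample_id):
            problems.append(f"invalid characters in sample ID: {sample_id}")
    return problems


# Hypothetical rows; barcodes are made up, the primer is the forward 16S core.
ROWS = [
    {"#SampleID": "C1.W0.T1", "BarcodeSequence": "ACGTACGT",
     "LinkerPrimerSequence": "CCTACGGGNGGCWGCAG", "Clutch": "1",
     "ExposureWeeks": "0", "Description": "clutch1_week0_tick1"},
    {"#SampleID": "C1.W2.T1", "BarcodeSequence": "TGCATGCA",
     "LinkerPrimerSequence": "CCTACGGGNGGCWGCAG", "Clutch": "1",
     "ExposureWeeks": "2", "Description": "clutch1_week2_tick1"},
]

MAPPING_TEXT = build_mapping(ROWS, ["Clutch", "ExposureWeeks"])
print(MAPPING_TEXT)
print(check_mapping(MAPPING_TEXT))
```

The metadata columns defined here (for example, clutch and exposure time) are the user-defined categories that the downstream diversity analyses group samples by.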

Representative Results

A total of 42 ticks from three separate egg clutches and two environmental exposure periods, 0 and 2 weeks in soil, were processed for microbiome sequencing. Each treatment group, considered to be a single clutch and exposure time, contained 6-8 replicate tick samples. These processed tick extracts were loaded onto a next-generation sequencer and yielded 12,885,713 paired-end reads passing filter. Included in this run were 3 negative controls from the extraction step, yielding a total of 211,214 reads (included in previous count). Further run quality metrics along with optimal values for each metric are detailed in Table 1. Rarefaction curves, which relate sequencing effort to number of OTUs per sample, indicated that a sequencing depth, or number of unique sequence reads per sample, of 2,129 reads would be sufficient to adequately capture the diversity of the microbial community (Figure 2). Rarefaction levels will vary based on sample type and must be determined individually for each sequencing run. After rarefying to this depth, 1,714 OTUs were identified across all samples with an average of 93.3 ± 4.3 OTUs per sample. To avoid downstream analysis issues arising from sparse matrices and to remove potential contaminants, all OTUs not found in at least one sample at ≥ 1% abundance were pooled into a rare general category. Further, the decontam package in R was utilized to identify OTUs over-represented in negative controls relative to real samples, and these OTUs were removed from downstream analysis.
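
The rule used here for rare OTUs (keep an OTU only if it reaches at least 1% relative abundance in at least one sample, and pool everything else) can be written out explicitly. The Python sketch below applies that rule to a small, made-up OTU table stored as nested dictionaries; the table layout, the function name, and the "Rare" label are assumptions for illustration and do not reproduce the authors' analysis code.

```python
def pool_rare_otus(table, threshold=0.01, rare_label="Rare"):
    """Pool OTUs that never reach `threshold` relative abundance in any sample.

    table maps sample -> {otu: read count}.
    """
    keep = set()
    for counts in table.values():
        total = sum(counts.values())
        for otu, n in counts.items():
            if total > 0 and n / total >= threshold:
                keep.add(otu)

    pooled = {}
    for sample, counts in table.items():
        kept = {otu: n for otu, n in counts.items() if otu in keep}
        kept[rare_label] = sum(n for otu, n in counts.items() if otu not in keep)
        pooled[sample] = kept
    return pooled


# Made-up counts: OTU2 reaches 1.5% in tickA, OTU3 never reaches 1%.
toy_table = {
    "tickA": {"OTU1": 980, "OTU2": 15, "OTU3": 5},
    "tickB": {"OTU1": 990, "OTU2": 4, "OTU3": 6},
}

print(pool_rare_otus(toy_table))
```

Note that OTU2 is retained in tickB even though it is below 1% there, because the rule is applied across samples rather than within each sample; per-sample read totals are unchanged by the pooling.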

Community ecology analyses were performed using the QIIME diversity analyses workflow to demonstrate the types of alpha diversity, beta diversity, and taxonomy composition diversity output generated using a simple bioinformatics pipeline. For example, weighted and unweighted Unifrac principal coordinates analyses were produced using the "core_diversity_analyses.py" script, and they revealed spatial clustering of larval ticks based on clutch identity (Figure 3). Boxplots of OTU counts at varying environmental exposure times were also generated through this script, demonstrating differences in microbiome species richness over time (Figure 4). These alpha and beta diversity analyses are performed based on the user-defined categories listed in the mapping file, but figures and statistics on general taxonomic information are also generated for all samples (Figure 5).
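
The OTU counts summarized in the alpha diversity boxplots are, per sample, simply the number of OTUs with at least one read, grouped by a category from the mapping file. The Python sketch below computes that quantity for a made-up table and grouping; it illustrates the metric only and is not the QIIME implementation. All sample names, counts, and group labels are hypothetical.

```python
from statistics import mean


def observed_otus(counts):
    """Number of OTUs with at least one read in a sample."""
    return sum(1 for n in counts.values() if n > 0)


def richness_by_group(table, groups):
    """Mean observed-OTU richness for each group of samples."""
    by_group = {}
    for sample, counts in table.items():
        by_group.setdefault(groups[sample], []).append(observed_otus(counts))
    return {group: mean(values) for group, values in by_group.items()}


# Made-up rarefied counts for four ticks and their exposure groups.
toy_table = {
    "t1": {"OTU1": 50, "OTU2": 0, "OTU3": 3},
    "t2": {"OTU1": 40, "OTU2": 1, "OTU3": 0},
    "t3": {"OTU1": 30, "OTU2": 5, "OTU3": 2},
    "t4": {"OTU1": 20, "OTU2": 7, "OTU3": 9},
}
toy_groups = {"t1": "0 weeks", "t2": "0 weeks", "t3": "2 weeks", "t4": "2 weeks"}

print(richness_by_group(toy_table, toy_groups))
```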

Figure 1: Microbiome sequencing workflow. Major steps of the microbiome sequencing workflow are displayed with call-outs for the library preparation and sequencing analysis steps.

Figure 2: Rarefaction curves for sequence count normalization. Rarefaction curves, shown for each cohort of ticks, relate observed OTU counts to sampling effort. Error bars denote one standard deviation across the replicate samples within each clutch. A sequencing depth should be selected at or beyond the point where the curve becomes stable to adequately capture the full diversity of OTUs. Here, a sequencing depth of 2,000 reads appears appropriate.

Figure 3: Principal coordinate analysis by clutch. Unweighted Unifrac principal coordinate analysis (PCoA) shows variation in microbiome composition between larval ticks from different clutches, and from adults. Each data point denotes an individual tick. This figure is automatically generated through the QIIME "core_diversity_analyses.py" script.

Figure 4: Alpha diversity boxplots by exposure time. Boxplots depicting OTU counts for environmental exposure groups show variation in microbial diversity over time. This figure is automatically generated through the QIIME "core_diversity_analyses.py" script.

Figure 5: Phylum-level microbial identification for clutch one. Microbiome composition for individual tick samples from clutch one is shown at the phylum level. This figure, as well as summary information at lower taxonomic levels, is automatically generated through the QIIME "core_diversity_analyses.py" script.

Metric	Definition	Our results (MiSeq V3)	Optimal for MiSeq V3	Optimal for HiSeq Rapid Mode
Reads PF	Number of sequencing reads passing the chastity filter. A read passes filter if no more than 1 base call has a chastity value below 0.6 in the first 25 cycles	12,885,713	14-16 million	600 million
Error Rate (%)	Percentage of base pairs called incorrectly during a cycle, based on reads aligned to PhiX control	3.28 ± 0.10	*Depends on values for other metrics	*Depends on values for other metrics
% ≥Q30	Percentage of bases with a Q score ≥ 30, indicating a base call accuracy of 99.9%	68.94	> 70%	> 80%
Cluster PF (%)	Percentage of clusters passing the chastity filter	85.61 ± 0.81	90%	90%
Density	Density of clusters on the flow cell, a key metric for data quality and output	552 ± 35 clusters/mm²	1200-1400 clusters/mm²	850-1000 clusters/mm²

Table 1: Key performance parameters for NGS output. User data and target values reported based on the sequencing platform and reagent kit used in this study (Table of Materials) and a 25% sequencing control addition.

Discussion

Next-generation sequencing of 16S rRNA has become a standard approach for microbial identification and enabled the study of how vector microbiomes affect pathogen transmission. The protocol outlined here details the use of this method to investigate microbial community assembly in I. pacificus, a vector species for Lyme disease; however, it can easily be applied to study other tick species or arthropod vector species.

Indeed, 16S rRNA sequencing for microbiome analysis has been used broadly to study the microbiomes of vectors including mosquitoes, psyllids, and tsetse flies²⁹,³⁰,³¹. Other methods available for microbial identification within vectors include microscopy, culturing, and histological staining. These methods may be more appropriate than sequencing if the goal is to identify and describe a novel microbe (microscopy), evaluate the effect of antimicrobial drugs (culture), or localize specific and known microbes within the vector (histological staining). However, these methods suffer from low specificity, detection, and scalability, and thus are less appropriate for identifying the full community of microbes within a vector or characterizing the vector microbiome under varying ecological and experimental conditions. Conversely, high-throughput sequencing of the 16S rRNA gene enables identification of low-abundance and non-culturable bacteria, provides high resolution and detection given the comprehensive reference databases, and can provide high replication depending on coverage needs.

While 16S rRNA sequencing is now widely used for microbial identification, this technique is not without limitations. Principally, microbial contamination can obscure interpretation of the sequencing results and confound biological meaning³². Furthermore, given the use of universal bacterial primers and sensitivity to low starting concentrations that are inherent to 16S rRNA sequencing, microbial contamination is common³². Sources of contamination include PCR reagents, DNA extraction kits, laboratory surfaces, and the skin and clothing of researchers³³,³⁴,³⁵. The effects of microbial contaminants can be minimized by working in a sterile lab environment, using negative controls and technical replicates, and keeping a record of all kits and reagents used³⁶.

In addition to microbial contamination, low-quality sequencing output can greatly hinder the usability of microbiome sequencing data. Data quality can be assessed by a number of run metrics including the number of reads passing filter, the percentage of reads above a Q-score (a measure of the predicted probability of error in base-calling) of 30, and cluster density. Values for these metrics will vary based on the number of samples run, the sequencing system, the reagent cartridge, and the percent sequencing control used, but optimal values based on run conditions are available online. In particular, cluster density, the number of library clusters on a given plane of the flow cell prior to sequencing, is a key parameter for optimizing data quality and yield. Both over and under-clustering can reduce data output and result in samples being excluded from analysis due to insufficient coverage. Poor cluster density often reflects inaccurate library quantification; thus, care should be taken to perform proper DNA clean-up, individual sample quantification, and whole library quantification.
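
The Q-score mentioned above is a Phred-scaled quality value, Q = -10·log10(p), where p is the predicted probability that a base call is wrong; Q30 therefore corresponds to p = 0.001, or 99.9% accuracy. The Python sketch below converts between the two and computes the percentage of bases at or above a cutoff for a made-up list of quality scores. The function names and example values are illustrative.

```python
def phred_to_error_probability(q):
    """Predicted base-calling error probability for a Phred quality score."""
    return 10 ** (-q / 10)


def percent_at_or_above(quality_scores, cutoff=30):
    """Percentage of bases whose quality score is at least `cutoff`."""
    if not quality_scores:
        raise ValueError("no quality scores given")
    passing = sum(1 for q in quality_scores if q >= cutoff)
    return 100.0 * passing / len(quality_scores)


print(phred_to_error_probability(30))        # about 0.001
print(percent_at_or_above([12, 30, 35, 38]))  # made-up scores
```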

Assuming high data quality and yield are achieved, divergent approaches in data analysis used to overcome statistical challenges of large datasets pose another limitation of this approach. For example, multiple methods exist to address variation in read counts between samples. Rarefaction, which involves sub-sampling reads to achieve an equal sequencing depth across all samples, is frequently used but has been subject to critique recently for wasting large amounts of data and the subjective selection of minimum sequencing depth³⁷,³⁸,³⁹. Cumulative sum scaling (CSS), which keeps all sequence reads but weights them based on the cumulative sum of counts within a given percentile, has been developed as an alternative technique. However, CSS has not been widely adopted due to its relative novelty and the confirmed utility of rarefaction for normalization prior to presence/absence analyses.
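
Rarefaction as described here can be illustrated by sub-sampling each sample's reads without replacement to a common depth and discarding samples that fall short of it. The following Python sketch does this for a single sample stored as a dictionary of OTU counts. It is a simplified illustration, not the QIIME implementation; the function name, the fixed random seed, and the example counts are assumptions, while the depth of 2,129 reads is the value reported in the Representative Results.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=0):
    """Sub-sample reads without replacement to `depth`.

    Returns a dict of OTU counts summing to `depth`, or None when the
    sample has fewer than `depth` reads and would be dropped.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    # One list entry per read, labeled by its OTU.
    reads = [otu for otu, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    return dict(Counter(rng.sample(reads, depth)))


# Made-up sample with 3,520 reads, rarefied to 2,129 reads.
original = {"OTU1": 3000, "OTU2": 500, "OTU3": 20}
rarefied = rarefy(original, depth=2129)
print(sum(rarefied.values()))

# A shallow sample is excluded rather than rarefied.
print(rarefy({"OTU1": 100}, depth=2129))
```

The sketch makes both criticisms concrete: reads above the chosen depth are discarded, and samples below it are lost entirely, so the choice of depth directly affects how much data is retained.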

Standard data analysis procedures are also lacking for handling sequence reads in negative controls and distinguishing these from low abundance reads. As mentioned, sequencing negative controls generated from the DNA extraction step is recommended to help differentiate true vector microbiome residents from microbial contaminants during analysis. Yet, a standard and statistically rigorous approach for identifying and filtering suspected contaminants is lacking. A common approach is to remove OTUs present in the negative controls from all samples. However, this method may be overly conservative since many of these microbes likely originate from real samples rather than kit reagents⁴⁰. Another common technique involves grouping microbes present at < 1% into a "rare" category under the assumption that microbial contaminants are rare in real samples⁸,⁴¹. However, this method may remove true vector microbes that are present at low abundances, particularly when the microbiome is dominated by an endosymbiont as seen in I. pacificus and I. holocyclus⁷,⁴². In these cases, detecting rare microbes would require deeper sequencing, developing primers that inhibit the amplification of the endosymbiont during the amplicon PCR⁴², or computationally removing the endosymbiont during sequence analysis⁷. Selecting among the various methods to address rare and contaminant OTUs creates the opportunity for subjectivity and bias in microbiome data analysis, which may limit the ability to compare results across studies.

The choice of reference database, necessary for assigning taxonomy and phylogenetic information to sequence reads, presents another opportunity for divergence and subjectivity. While the task of relating sequence reads from a variable gene region to an identified parent genome is inherently challenging, a reliable reference database is crucial for accurate phylogenetic assignment. Multiple databases have emerged to meet this challenge, such as SILVA⁴³, Greengenes⁴⁴, RDP⁴⁵, and NCBI⁴⁶. SILVA is the largest of the 16S-based taxonomies, but SILVA, as well as RDP, only provides taxonomic information down to the genus level. Greengenes provides species-level information and is included in metagenomic analysis packages like QIIME, but it has not been updated in over four years. NCBI, while not a primary source for taxonomic information, provides daily updated classifications from user-submitted sequences, but it is uncurated. While selection among the databases typically depends on the users' needs regarding resolution, coverage, and currency, it has a significant impact on phylogenetic assignment and downstream analysis⁴⁷,⁴⁸. Consensus among investigators regarding which reference database to use is thus imperative for avoiding additional bias in analyses.

Standard approaches to these data processing challenges will likely emerge as 16S rRNA sequencing becomes an increasingly common technique. Continued reductions in sequencing costs and time will further popularize this method. The increased usage of this technology will enable deeper investigations into the composition and ecology of vector microbiomes under diverse conditions. As knowledge of vector microbiome biology is still in its infancy, these types of descriptive studies are a critical first step preceding attempts to leverage the microbiome as a means of vector control. However, to truly understand the role of the microbiome in pathogen transmission, microbial identification must be coupled with knowledge of the functional role of these microbes. RNA sequencing and transcriptomic approaches, which involve mapping and quantifying gene expression, enable inferences into the functional role of microbes but require deep sequencing and fully assembled genomes of target species. Circumventing these challenges, computational tools to predict functional composition from 16S rRNA sequencing data have recently been developed but are not widely adopted yet⁴⁹. The development of such tools, as well as ecological theory, linking microbial identity and the functional role within vectors will increase the utility of 16S rRNA sequencing data in vector microbiome studies.

Disclosures

The authors have nothing to disclose.

Acknowledgments

This work was supported by National Science Foundation grants to A.S. (DEB #1427772, 1745411, 1750037).

Materials

Item	Name of Material/Equipment	Company	Catalog #
1	DNeasy Blood & Tissue Kit	Qiagen	69504
2	Qubit 4 Fluorometer	ThermoFisher Scientific	Q3326
3	NanoDrop 8000 Spectrophotometer	ThermoFisher Scientific	ND-8000-GL
4	2x KAPA HiFi HotStart ReadyMix	Kapa Biosystems	KK2501
5	AMPure XP beads	Agencourt	A63880
6	Magnetic Rack	ThermoFisher Scientific	MR02
7	TE buffer	Teknova	T0223
8	Nextera Index Kit	Illumina	FC-121-1011
9	KAPA Library Quantification Kit	Roche	KK4824
10	MiSeq System	Illumina	SY-410-1003
11	MiSeq Reagent Kit v3	Illumina	MS-102-3001
12	10 mM Tris-HCl with 0.1% Tween 20	Teknova	T7724