The availability of fast, reliable, and cost-effective deep sequencing has revolutionized many areas of the life sciences, allowing great depth in sequencing-based analyses. A remaining challenge lies in the innovative design and creation of representative sequencing libraries. Here we describe a protocol to capture nascent viral cDNA molecules, specifically the intermediates of the HIV-1 reverse transcription process.
The most critical step in this strategy is the ligation of an adaptor to the open 3'-termini in a quantitative and unbiased manner. Efficiencies of ligations between two ssDNA termini, both inter- and intramolecular, have been investigated and optimized for various applications [11,26,27,28,29]. The choice of a hairpin adaptor with T4 DNA ligase under the conditions described in step 3.3 is the result of empirical optimization in which we evaluated different ligases, adaptors, and reagents for the ligation of synthetic oligonucleotides representing HIV-1 sequences (Table 2) (data not shown). In these in vitro test reactions, we confirmed that the T4 DNA ligase-mediated ligation of the hairpin adaptor, as described by Kwok et al. [11], has a very low bias and achieves near-complete ligation of acceptor molecules when the adaptor is used in excess. The ligation efficiency was unaffected by the addition of nucleotide sequence to render the adaptor compatible with the multiplex primer system (see Figure 4). In comparison, we tested a thermostable 5' DNA/RNA ligase ("Ligase A"; see Table of Materials for the exact ligases compared here), an engineered RNA ligase that was developed in part to improve ligation efficiency with ssDNA as the acceptor [27]. This enzyme was indeed more effective at ligating two ssDNA molecules than RNA ligase ("Ligase B") but had a significant bias, with strong differences in ligation efficiency even between oligonucleotides that differ in length by a single base [Table 2; HTP con mid G (a) and (b)]. Furthermore, we found only a minimal bias in reactions with "Ligase C" combined with an adaptor carrying a randomized 5'-terminus (a strategy used to offset the known nucleotide bias of "Ligase C"; see for example Ding et al. [30]). However, the "Ligase C"-mediated intermolecular ligations were incomplete, rendering the T4 DNA ligase system the superior choice.
Several quality control steps over the course of the protocol, together with the inclusion of positive and negative controls, allow potential problems to be detected before the assay is continued and provide guidance for troubleshooting. The qPCR quantifications in steps 2.2.2 and 2.3.12 ensure that the quantity of the input material is sufficient. Typical cDNA copy numbers in the 200 µL elution (from step 2.1) range from around 10,000 to 300,000 per µL. The hybrid capture step can result in some loss of overall HIV-1 cDNA quantity but should strongly enrich specific HIV-1 cDNA over cellular DNA. This enrichment can be determined by qPCR, using appropriate primers to quantify genomic DNA before and after enrichment, or by measuring total DNA concentration. Recovered HIV-1 cDNA after the hybrid capture steps should be at least 10% of the input. Otherwise, low starting material may explain a situation in which the oligonucleotide positive control (see step 3.3.2) is successful but the samples yield only limited reads. Low read numbers overall could also be explained by overestimation of the library concentration due to the presence of irrelevant DNA species without MiSeq adaptors. This would result in low cluster density and can be improved by determining the concentration of HIV-1 sequences in the library by qPCR in addition to the total DNA amount by fluorometric assays. Due to the highly sensitive nature of the method, special care should be taken to avoid even low-level contamination, both from other samples (in particular, from the high-concentration control oligonucleotide stocks) and from laboratory equipment. Working in a UV-sterilizing PCR workstation is beneficial in this regard. The automated gel electrophoresis of the final library (step 6.1.2) is a further quality control measure. The nucleic acid size range typically observed is between 150 and 500 nt. Primers that can be detected in the optional control after the PCR and before purification (see note in step 5.2) should now be absent. In a representative result, the sample intensity curve has a peak around 160 to 170 nt and a second, sharper peak around 320 to 350 nt. This likely reflects the often-seen higher abundance of both relatively short reverse transcripts (1 to 20 nt insert length) and full-length strong-stop products (180 to 182 nt insert length) (Figure 3b).
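The hybrid capture recovery check described above amounts to simple arithmetic on the qPCR results. The following Python snippet is a minimal sketch of that calculation and is not part of the published protocol. The function names and the example copy numbers and volumes are hypothetical and chosen only for illustration; the 10% threshold is the one stated in the text.

```python
def total_copies(copies_per_ul, volume_ul):
    """Total cDNA copies in a tube from a qPCR result given in copies per µL."""
    return copies_per_ul * volume_ul


def capture_recovery(input_copies, recovered_copies):
    """Fraction of input HIV-1 cDNA that is recovered after hybrid capture."""
    if input_copies <= 0:
        raise ValueError("input_copies must be positive")
    return recovered_copies / input_copies


def passes_capture_qc(input_copies, recovered_copies, min_recovery=0.10):
    """True if recovery meets the 'at least 10% of the input' guideline."""
    return capture_recovery(input_copies, recovered_copies) >= min_recovery


# Hypothetical example values (not measured data):
# 50,000 copies/µL in a 200 µL elution before capture,
# 40,000 copies/µL in a 50 µL eluate after capture.
before = total_copies(50_000, 200)
after = total_copies(40_000, 50)
print(f"recovery = {capture_recovery(before, after):.0%}, "
      f"QC pass = {passes_capture_qc(before, after)}")
```

A sample that fails this check is a candidate for the low-input explanation discussed above, in which the positive control sequences well but the sample yields few reads.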
While the presented protocol and selected primers are specific for early HIV-1 reverse transcription products, the method is generally applicable to any study aiming to determine open 3'-termini of DNA. The main modifications required in other contexts will be the method for hybrid capture and the primer design strategy. For example, if the method is to be adapted to late HIV-1 transcripts, a larger number of different biotinylated capture oligonucleotides annealing across the length of the cDNA would be advisable and will likely decrease the loss in the hybrid capture step. As mentioned in the introduction, it is important to consider limitations when designing the range over which 3'-termini are to be detected, to avoid different sources of bias. First, there may be a bias in the PCR reactions if the adaptor-ligated templates vary greatly in length. Second, the sequencing platform used here (MiSeq) has a preferred insert length range for optimal clustering, and significantly shorter and longer products may not be sequenced with the same efficiency. In part, this can be addressed computationally, as was done by calculating a correction factor for linear length bias (see Figure 4, bottom graph). However, if the region in which 3'-termini are to be mapped is long (>1,000 nt), it is more advisable to split the reactions with the ligated transcripts and use multiple upstream primers to assess 3'-termini in sections.
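To illustrate what a correction factor for linear length bias could look like, the Python sketch below fits a straight line to read counts obtained from control templates of known length and equal input, then rescales sample counts relative to a reference length. This is one plausible approach under stated assumptions and is not necessarily the calculation behind Figure 4. All function names, the reference length, and the control values are hypothetical.

```python
def fit_line(lengths, read_counts):
    """Ordinary least-squares fit of read_counts = slope * length + intercept."""
    n = len(lengths)
    mean_x = sum(lengths) / n
    mean_y = sum(read_counts) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, read_counts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def correction_factor(length, slope, intercept, ref_length):
    """Factor that rescales a count at `length` to the efficiency at `ref_length`."""
    predicted = slope * length + intercept
    reference = slope * ref_length + intercept
    if predicted <= 0:
        raise ValueError("fitted efficiency is not positive at this length")
    return reference / predicted


def correct_counts(counts_by_length, slope, intercept, ref_length):
    """Apply the length-dependent correction factor to every observed count."""
    return {
        length: count * correction_factor(length, slope, intercept, ref_length)
        for length, count in counts_by_length.items()
    }


# Hypothetical controls: equal input, fewer reads for longer inserts.
slope, intercept = fit_line([20, 100, 180], [1000, 800, 600])
corrected = correct_counts({20: 500, 180: 300}, slope, intercept, ref_length=20)
print(corrected)
```

A linear model is only reasonable over a limited length range, which is consistent with the recommendation above to split long target regions into sections instead of relying on computational correction alone.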
The analysis program was written in-house for the specific purpose of analyzing both the last nucleotide of the HIV-1 sequence adjacent to the fixed adaptor sequence and the base variation at all positions, to identify any mutations. The individual steps are as follows. First, the adaptor sequences are trimmed using the FASTX-Toolkit (version 0.0.13). Then, any duplicated sequences (meaning identical sequences including the barcode) are removed. All remaining unique reads are then aligned to the HIV-1 sequence using Bowtie (http://bowtie-bio.sourceforge.net/index.shtml) with the maximum mismatch set at three bases. The template sequence comprises the first 635 nt of HIV-1 cDNA (NL4.3 strain), which includes the -sss sequence and the first strand transfer product up to the polypurine tract (U5-R-U3-PPT; see Figure 1). Therefore, the provided software and templates are only directly suitable if the method is used for the same application (detection of early reverse transcripts of HIV-1 NL4.3). Adjustments will have to be made for other target sequences. The position of the 3'-terminus of each read is determined by its position in the alignment. Base calls for each position are recorded, and mutation rates are calculated from the total coverage of each base, which varies because reads are of different lengths and long inserts may not be entirely covered by the 125-base sequencing in Read2.
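The logic of the steps above can be sketched in a few lines of Python. This is a simplified illustration and not the in-house program: it assumes that adaptors have already been trimmed, that each insert is oriented 5' to 3' along the template and starts at template position 1, and it replaces the Bowtie alignment with a plain ungapped comparison that tolerates up to three mismatches. All function names and the toy sequences are hypothetical.

```python
from collections import Counter


def deduplicate(reads):
    """Keep one copy of each (barcode, insert) pair, preserving input order."""
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


def align_anchored(insert, template, max_mismatches=3):
    """Ungapped comparison from the template start (a stand-in for Bowtie).

    Returns the 0-based mismatch positions, or None if the read is rejected.
    """
    if not insert or len(insert) > len(template):
        return None
    mismatches = [i for i, (a, b) in enumerate(zip(insert, template)) if a != b]
    return mismatches if len(mismatches) <= max_mismatches else None


def analyze(reads, template, max_mismatches=3):
    """Count 3'-termini and compute per-position mismatch rates (1-based)."""
    termini = Counter()     # position of the last insert base -> read count
    coverage = Counter()    # position -> number of reads covering it
    mismatched = Counter()  # position -> number of reads with a mismatch
    for _barcode, insert in deduplicate(reads):
        positions = align_anchored(insert, template, max_mismatches)
        if positions is None:
            continue
        termini[len(insert)] += 1
        for pos in range(1, len(insert) + 1):
            coverage[pos] += 1
        for i in positions:
            mismatched[i + 1] += 1
    rates = {pos: mismatched[pos] / coverage[pos] for pos in coverage}
    return termini, rates


# Toy data (not real HIV-1 sequence): (barcode, trimmed insert) pairs.
template = "ACGTACGTAC"
reads = [("AAA", "ACGT"), ("AAA", "ACGT"), ("CCC", "ACGT"),
         ("GGG", "ACGTAC"), ("TTT", "ACCTACGT"), ("TTT", "TTTTTTTT")]
termini, rates = analyze(reads, template)
print(dict(termini), rates[3])
```

Because coverage is counted per position, the denominator of each mutation rate shrinks toward the 3' end of the template, which mirrors the varying coverage described above.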
To conclude, we believe the described method to be a valuable tool for many types of studies. Obvious applications include investigations of the mechanisms underlying reverse transcription inhibition through antiretroviral drugs or cellular restriction factors. However, only relatively minor adjustments should be necessary to adapt the system to 3'-termini mapping within other single-stranded DNA viral intermediates, which are present, for example, in parvovirus replication. Furthermore, the principle of the method, particularly its optimized ligation step, can provide a core part of library preparation design for the characterization of any 3'-DNA extensions, including elongations catalyzed by cellular double-stranded DNA polymerases.