We present a strategic plan and protocol for identifying non-coding genetic variants affecting transcription factor (TF) DNA binding. A detailed experimental protocol is provided for electrophoretic mobility shift assay (EMSA) and DNA affinity precipitation assay (DAPA) analysis of genotype-dependent TF DNA binding.
Population and family-based genetic studies typically result in the identification of genetic variants that are statistically associated with a clinical disease or phenotype. For many diseases and traits, most variants are non-coding, and are thus likely to act by impacting subtle, comparatively hard to predict mechanisms controlling gene expression. Here, we describe a general strategic approach to prioritize non-coding variants, and screen them for their function. This approach involves computational prioritization using functional genomic databases followed by experimental analysis of differential binding of transcription factors (TFs) to risk and non-risk alleles. For both electrophoretic mobility shift assay (EMSA) and DNA affinity precipitation assay (DAPA) analysis of genetic variants, a synthetic DNA oligonucleotide (oligo) is used to identify factors in the nuclear lysate of disease or phenotype-relevant cells. For EMSA, the oligonucleotides with or without bound nuclear factors (often TFs) are analyzed by non-denaturing electrophoresis on a tris-borate-EDTA (TBE) polyacrylamide gel. For DAPA, the oligonucleotides are bound to a magnetic column and the nuclear factors that specifically bind the DNA sequence are eluted and analyzed through mass spectrometry or with a reducing sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE) followed by Western blot analysis. This general approach can be widely used to study the function of non-coding genetic variants associated with any disease, trait, or phenotype.
Sequencing and genotyping based studies, including Genome-Wide Association Studies (GWAS), candidate locus studies, and deep-sequencing studies, have identified many genetic variants that are statistically associated with a disease, trait, or phenotype. Contrary to early predictions, most of these variants (85-93%) are located in non-coding regions and do not change the amino acid sequence of proteins1,2. Interpreting the function of these non-coding variants and determining the biological mechanisms connecting them to the associated disease, trait, or phenotype has proven challenging3-6. We have developed a general strategy to identify the molecular mechanisms that link variants to an important intermediate phenotype – gene expression. This pipeline is specifically designed to identify modulation of TF binding by genetic variants. This strategy combines computational approaches and molecular biology techniques aimed to predict biological effects of candidate variants in silico, and verify these predictions empirically (Figure 1).
Figure 1: Strategic approach for the analysis of non-coding genetic variants. Steps that are not included in the detailed protocol associated with this manuscript are shaded in grey. Please click here to view a larger version of this figure.
In many cases, it is important to begin by expanding the list of variants to include all those in high linkage-disequilibrium (LD) with each statistically associated variant. LD is a measure of non-random association of alleles at two different chromosomal positions, which can be measured by the r2 statistic7. r2 is a measure of the linkage disequilibrium between two variants, with an r2 = 1 denoting perfect linkage between two variants. Alleles in high LD are found to co-segregate on the chromosome across ancestral populations. Current genotyping arrays do not include all known variants in the human genome. Instead, they exploit the LD within the human genome and include a subset of the known variants that act as proxies for other variants within a particular region of LD8. Thus, a variant without any biological consequence may be associated with a particular disease because it is in LD with the causal variant-the variant with a meaningful biological effect. Procedurally, it is recommended to convert the latest release of the 1,000 genomes project9 variant call files (vcf) into binary files compatible with PLINK10,11, an open-source tool for whole genome association analysis. Subsequently, all other genetic variants with LD r2 >0.8 with each input genetic variant can be identified as candidates. It is important to use the appropriate reference population for this step- e.g., if a variant was identified in subjects of European ancestry, data from subjects of similar ancestry should be used for LD expansion.
LD expansion often results in dozens of candidate variants, and it is likely that only a small fraction of these contribute to disease mechanism. Often, it is infeasible to experimentally examine each of these variants individually. It is therefore useful to leverage the thousands of publically available functional genomic datasets as a filter to prioritize the variants. For example, the ENCODE consortium12 has performed thousands of ChIP-seq experiments describing the binding of TFs and co-factors, and histone marks in a wide range of contexts, along with chromatin accessibility data from technologies such as DNase-seq13, ATAC-seq14, and FAIRE-seq15. Databases and web servers such as the UCSC Genome Browser16, Roadmap Epigenomics17, Blueprint Epigenome18, Cistrome19, and ReMap20 provide free access to data produced by these and other experimental techniques across a wide range of cell types and conditions. When there are too many variants to examine experimentally, these data can be used to prioritize those located within likely regulatory regions in relevant cell and tissue types. Further, in cases where a variant is within a ChIP-seq peak for a specific protein, these data can provide potential leads as to the specific TF(s) or co-factors whose binding might be affecting.
Next, the resulting prioritized variants are screened experimentally to validate predicted genotype-dependent protein binding using EMSA21,22. EMSA measures the change in the migration of the oligo on a non-reducing TBE gel. Fluorescently labeled oligo is incubated with the nuclear lysate, and binding of nuclear factors will retard the movement of the oligo on the gel. In this manner, oligo that has bound more nuclear factors will present as a stronger fluorescent signal upon scanning. Notably, EMSA does not require predictions about the specific proteins whose binding will be affected.
Once variants are identified that are located within predicted regulatory regions and are capable of differentially binding nuclear factors, computational methods are employed to predict the specific TF(s) whose binding they might affect. We prefer to use CIS-BP23,24, RegulomeDB25, UniProbe26, and JASPAR27. Once candidate TFs are identified, these predictions can be specifically tested using antibodies against these TFs (EMSA-supershifts and DAPA-Westerns). An EMSA-supershift involves the addition of a TF-specific antibody to the nuclear lysate and oligo. A positive result in an EMSA-supershift is represented as a further shift in the EMSA band, or a loss of the band (reviewed in reference28). In the complementary DAPA, a 5'-biotinylated oligo duplex containing the variant and the 20 base-pair flanking nucleotides are incubated with nuclear lysate from relevant cell type(s) to capture any nuclear factors specifically binding the oligos. The oligo duplex-nuclear factor complex is immobilized by streptavidin microbeads in a magnetic column. The bound nuclear factors are collected directly through elution29,48. Binding predictions can then be assessed by a Western blot using antibodies specific for the protein. In cases where there are no obvious predictions, or too many predictions, the elutions from variant pull-downs of the DAPA experiments can be sent to a proteomics core to identify candidate TFs using mass-spectrometry, which can subsequently be validated using these previously described methods.
In the remainder of the article, the detailed protocol for EMSA and DAPA analysis of genetic variants is provided.
1. Preparation of Solutions and Reagents
Name | Sequence |
rs76562819_REF_FOR | GTAATGCCTTAATGAGAGAGAGTTAGTCATCTTCTCACTTC |
rs76562819_REF_REV | GAAGTGAGAAGATGACTAACTCTCTCTCATTAAGGCATTAC |
rs76562819_NONREF_FOR | GTAATGCCTTAATGAGAGAGGGTTAGTCATCTTCTCACTTC |
rs76562819_NONREF_REV | GAAGTGAGAAGATGACTAACCCTCTCTCATTAAGGCATTAC |
Table 1: Example EMSA/DAPA oligonucleotide design to test a SNP for differential binding. "REF" stands for the reference allele, while "NONREF" stands for the non-reference allele. "FOR" stands for the forward strand, while "REV" indicates its complement. The SNP is seen in red.
2. Preparation of Nuclear Lysate from Cultured Cells
Note: This experimental protocol was optimized using B-lymphoblastoid cell lines, but has been tested in several other unrelated adherent/ suspension cell lines and works universally.
3. Electrophoretic Mobility Shift Assay (EMSA)
Reagent | Final Conc. | Rxn #1 | Rxn #2 | Rxn #3 | Rxn #4 |
Ultrapure Water | to 20 µl vol. | 13.5 µl | 11.98 µl | 13.5µl | 11.98 µl |
10x Binding Buffer | 1x | 2 μl | 2 μl | 2 μl | 2 µl |
DTT/TW-20 | 1x | 2 μl | 2 μl | 2 μl | 2 μl |
Salmon Sperm DNA | 500 ng/μl | 0.5 μl | 0.5 μl | 0.5 μl | 0.5 μl |
1μg/μl Poly d(I-C) | 1 μg | 1 μl | 1 μl | 1 μl | 1 μl |
Nuclear Extract (5.26 ug/µl) | 8 μg | – | 1.52 μl | – | 1.52 μl |
NE Buffer | 1.52 μl | – | 1.52 μl | – | |
Reference allele oligo | 50 fmol | 1 μl | 1 μl | – | – |
Non-Reference allele oligo | 50 fmol | – | – | 1 μl | 1 μl |
Table 2: Example EMSA reaction setup. The table illustrates an example EMSA to test the hypothesis that there is genotype-dependent binding of TFs to a specific SNP.
4. DNA Affinity Purification Assay (DAPA)
In this section, representative results of what to expect are provided when performing an EMSA or DAPA, and the variability with regards to the quality of lysate is characterized. For example, it has been suggested that freezing and thawing protein samples multiple times may result in denaturation. In order to explore the reproducibility of EMSA analysis in the context of these "freeze-thaw" cycles, two 35 bp oligos differing at one genetic variant were incubated with a single batch of nuclear lysate that was thawed and re-frozen for the indicated amount of times. Figure 2 demonstrates that freezing and thawing this particular B-lymphocyte nuclear extract up to 5 times has seemingly no effect on the integrity of the proteins; however, stability of nuclear protein varies by samples and should thus be tested on each individual cell line used. It is also possible to have variation from different batch preparations of nuclear lysates. Whether this variation is due to the stage of the cell cycle prior to lysis, cell passage number, or other factors, it is important to replicate EMSA results using different batches of lysate to ensure real results.
Additionally, it is important to optimize the signal-to-noise ratio for the EMSA. An important variable in this is the oligo concentration. The amount of oligo (5-300 fmol) was titrated to assess how different quantities of oligos affect the signal (Figure 3). An increase in the intensity of the band up to 100 fmol of oligo was observed. After this point, the signal plateaus suggesting 100 fmol is the optimal oligo amount for this EMSA reaction.
Lastly, a representative figure from one of our published studies using part of the strategy described in this manuscript is provided (Figure 4). In this study34, we showed that the lupus-associated risk allele of rs6590330 increases binding of STAT1, a transcription factor that participates in both synergistic activation and inhibition of gene expression downstream of the Type 1 IFN receptor complex35. In this example, STAT1 was first identified by DAPA followed by proteomic analysis using high performance liquid chromatography coupled to tandem mass spectroscopy (DAPA-MS). A DAPA-Western blot was used to confirm the TF identified from proteomic analysis (STAT1) and confirm that the phosphorylated form of STAT1 was binding to the non-reference (lupus-risk) oligo. This figure illustrates how the various assays in this strategy can be used to identify differential binding of transcription factor to a non-coding variant.
Figure 2: Analysis of reproducibility and the consequences of freeze-thaw cycles on EMSA results. Oligos containing the reference (R) or non-reference (NR) allele of a genetic variant were used to probe the same preparation of lysate after multiple cycles of freezing and thawing. (Lys: Lysate). Each lane contains the free probe at the bottom of the gel (blue arrow). Binding of the TFs to the oligo can be seen as a band in the top half of the gel (red arrow). Bands contained within the bottom half of the gel (see bracket) are non-specific. Please click here to view a larger version of this figure.
Figure 3: Assessment of the effect of different oligo concentrations. Various concentrations of oligos were used to probe a single preparation of nuclear lysate. Fluorescent signal increases with increased amounts oligo, indicating that the oligo is the limiting reagent. The signal plateaus at the 100-300 fmol lanes, where the amount of protein becomes the limiting reagent. Please click here to view a larger version of this figure.
Figure 4: The lupus-risk allele of rs6590330 increases STAT1 binding, as assessed by DAPA-MS. STAT1 and pSTAT1 exhibit higher binding to oligos containing the rs6590330 risk allele compared to the non-risk allele. Biotin-labeled oligos were incubated with Epstein-Barr virus-transformed B-cell nuclear extract. Proteins bound to the oligo were captured using DAPA. Proteins were then separated by SDS-polyacrylamide gel electrophoresis and detected using anti-pSTAT1 (A) or anti-STAT1 (B). M: Marker. NR: oligo containing the non-risk allele of rs6590330; R: oligo containing the risk allele of rs6590330; Mutant: oligo containing a disrupted putative STAT1 binding site downstream of rs6590330; Cell lysate: nuclear extract from B cells. The relative intensities of the bands are indicated below each band. Results are representative of four experiments; while all experiments demonstrated increased STAT1 binding to the probes with the risk allele, in 2/4 experiments no STAT1 or pSTAT1 was detected in the immunoprecipitate from the non-risk oligo, while both were detected in the immunoprecipitate from the risk oligo. This figure has been modified from reference34. Please click here to view a larger version of this figure.
Although advances in sequencing and genotyping technologies have greatly enhanced our capacity to identify genetic variants associated with disease, our ability to understand the functional mechanisms impacted by these variants is lagging. A major source of the problem is that many disease-associated variants are located in n on-coding regions of the genome, which likely affect harder-to-predict mechanisms controlling gene expression. Here, we present a protocol based on the EMSA and DAPA techniques, valuable molecular tools for identifying genotype-dependent TF binding events that likely contribute to the function of many non-coding variants. Although these two techniques have been used extensively in the past, they have only recently been adapted for genetic variant analysis of TF binding. Beyond TFs, EMSA can also be used to analyze the effect of genetic variants on RNA binding proteins with only minor adjustments to the protocol36.
The protocol presented herein is simple and easy to perform; nevertheless, certain parts require additional considerations prior to experimentation. First, it is important to generate an initial list of variants for consideration by performing rigorous statistical analyses. An error in this step can derail all downstream analyses, as many variants may bind nuclear lysate differentially without contributing to disease risk. Additionally, the use of functional genomics data performed in pertinent cell types is crucial, as irrelevant cell types may lead to the generation of false positive TF binding predictions. Currently, the most widely-used functional genomic resources include ENCODE12 and Roadmap Epigenomics17. Additional information regarding gene expression levels in cell-types relevant to a disease of interest can be obtained from other resources such as ImmGen37, BioGPS38,39, and SNPsea20. For example, such resources can be used to filter TF binding predictions to only include those TFs that are expressed in relevant cell types.
It is also important to consider the caveats of in vitro experiments such as EMSA and DAPA. In particular, it is necessary to repeat experiments with separate nuclear lysate preparations to reduce false negatives. Moreover, a successful EMSA-supershift can present as a further shift in the EMSA band, in which antibody binds to the TF, or a loss of band, in which antibody blocks the TF's DNA binding domain. In either scenario, including an isotype control antibody and/or an antibody towards a different TF is useful in confirming the specificity of the supershift. Another consideration in performing a supershift is whether to include DTT/polysorbate in the reaction. DTT/polysorbate stabilizes the loading dye allowing for more accurate quantification of unbound DNA; however, it may also reduce disulfide linkages of antibodies resulting in failure of EMSA-supershift. It is recommended to try reactions with and without DTT/polysorbate when attempting to supershift a complex. The optimal amount of Poly d(I-C) and salmon sperm DNA per reaction must be determined experimentally by titration. Generally, titrating a range of 1-6 µg Poly d(I-C) and 50-500 ng salmon sperm is sufficient. One useful positive control for DAPA involves using a consensus sequence for a TF with a well characterized binding motif (obtained from databases such as CIS-BP23 or Factorbook40) and running a Western with an antibody specific to that TF. For both assays, a scrambled oligo can be used to show that any observed binding is specific to the oligos of interest.
Both the EMSA and DAPA techniques have experimental limitations. For example, binding affinity between a TF and DNA is affected in large part by the buffer conditions. Ideally, buffer conditions should mimic the endogenous conditions of the nucleus to allow for optimal binding. For EMSA, improper buffer conditions may result in a weak band or the loss of a band entirely. For DAPA, non-ideal conditions may cause the TF(s) to be eluted during the wash steps. Therefore, each assay is only effective in identifying TF-oligo binding under certain buffer conditions. The most universally useful buffer conditions are presented in the protocol above. A second limitation is that the experimental results from EMSA and DAPA methods provide little information about the mechanisms through which TFs bind to the oligos. TFs could bind to oligos directly or be recruited by other factors. Accordingly, it is important to analyze the oligo sequences computationally to predict how TFs might bind to oligos and verify these predictions experimentally. For example, a specific binding sequence can be mutated or a binding partner can be experimentally depleted. Finally, the amount of extract used needs to be titrated for each experiment to acquire optimal results. Too much lysate may saturate the TF-oligo binding and obscure any differential binding between risk and non-risk alleles. For additional troubleshooting, the reader can refer to several excellent review and method manuscripts30,41,42.
In addition to the assays described here, there are a variety of in vivo assays available to further study the role of variants within living cells (Figure 1, bottom). Chromatin immunoprecipitation followed by allele-specific quantitative polymerase chain reactions (ChIP-qPCR) studies can be undertaken to identify if the differential binding observed within the in vitro assays is replicated in living cells34. For example, if cells heterozygous for a variant of interest are used, qPCR experiments can detect the difference in enrichment between the two alleles in TF immunoprecipitated chromatin. ChIP requires specific ChIP-grade antibodies; however, if ChIP-grade TF antibodies are unavailable, transfecting cells with a flag-tagged TF is an effective alternative43. Additionally, the effect of differential binding on target gene expression can be investigated through expression quantitative trait loci (eQTL) analysis in relevant cell types using locally collected expression data from genotyped cells or publically available resources such as those compiled by Genevar44. Luciferase reporter assays can also be used to explore the degree to which differential binding of the protein affects gene expression. Such assays can be modified to work regardless of whether the variants are located in a promoter, enhancer, or repressor region45-47. Finally, genome-editing technologies, such as CRISPR/Cas948, can be used to generate cell lines that differ only at a single variant, which is vital for confirming causation. Such technologies can substantially reduce the experimental variance observed between cell lines derived from genetically diverse subjects, since functional readouts analyzing gene expression or another disease intermediate phenotype can be performed on the edited cell-line, and compared to non-edited cell lines.
The main advantage of the strategy presented is that it enables easy and rapid detection of genotype-dependent TF binding. By prioritizing the key genetic variants, further experiments can be designed to identify their biological effect and demonstrate their causality. Notably, this protocol can be applied to investigate any disease or phenotype associated variant identified from GWAS or fine-mapping. A large, growing trove of genetic data and lists of statistically associated genetic variants is already available. In most cases, the biological mechanisms driving statistical association of these variants is not clear. The strategy outlined in this proposal allows for accurate functional interpretation of the many non-coding disease-associated variants. Such knowledge is vital for the full elucidation of the molecular mechanisms driving any genetic-based disease.
The authors have nothing to disclose.
We thank Erin Zoller, Jessica Bene, and Lindsey Hays for input and direction in protocol development. MTW was supported in part by NIH R21 HG008186 and a Trustee Award grant from the Cincinnati Children’s Hospital Research Foundation. ZHP was supported in part by T32 GM063483-13.
Custom DNA Oligonucleotides | Integrated DNA Technologies | http://www.idtdna.com/site/order/oligoentry | |
Potassium Chloride | Fisher Scientific | BP366-500 | KCl, for CE buffer |
HEPES (1M) | Fisher Scientific | 15630-080 | For CE and NE buffer |
EDTA (0.5M), pH 8.0 | Life Technologies | R1021 | For CE, NE, and annealing buffer |
Sodium Chloride | Fisher Scientific | BP358-1 | NaCl, for NE buffer |
Tris-HCl (1M), pH 8.0 | Invitrogen | BP1756-100 | For annealing buffer |
Phosphate Buffered Saline (1X) | Fisher Scientific | MT21040CM | PBS, for cell wash |
DL-Dithiothreitol solution (1M) | Sigma | 646563 | Reducing agent |
PMSF | Thermo Scientific | 36978 | Protease Inhibitor |
Phosphatase Inhibitor Cocktail | Thermo Scientific | 78420 | Prevents dephosphorylation of TFs |
Nonidet P-40 Substitute | IBI Scientific | IB01140 | NP-40, for nuclear extraction |
BCA Protein Assay Kit | Thermo Scientific | 23225 | For measuring protein concentration |
Odyssey EMSA Buffer Kit | Licor | 829-07910 | Contains all necessary EMSA buffers |
TBE Gels, 6%, 12 Wells | Invitrogen | EC6265BOX | For EMSA |
TBE Buffer (10X) | Thermo Scientific | B52 | For EMSA |
FactorFinder Starting Kit | Miltenyi Biotec | 130-092-318 | Contains all necessary DAPA buffers |
Licor Odyssey CLx | Licor | Recommended scanner for DAPA/EMSA | |
Antibiotic-Antimycotic | Gibco | 15240-062 | Contains 10,000 units/mL of penicillin, 10,000 µg/mL of streptomycin, and 25 µg/mL of Fungizone® Antimycotic |
Fetal Bovine Serum | Gibco | 26140-079 | FBS, for culture media |
RPMI 1640 Medium | Gibco | 22400-071 | Contains L-glutamine and 25mM HEPES |