
In JoVE (1)

Other Publications (106)

Articles by Michael Q. Zhang in JoVE

 JoVE Biology

A Novel Bayesian Change-point Algorithm for Genome-wide Analysis of Diverse ChIPseq Data Types

1Department of Applied Mathematics & Statistics, Stony Brook University, 2Computational Biology and Bioinformatics, Cold Spring Harbor Laboratory, 3Department of Molecular and Cell Biology, University of Texas at Dallas


JoVE 4273

Our Bayesian Change Point (BCP) algorithm builds on state-of-the-art advances in modeling change-points via Hidden Markov Models and applies them to chromatin immunoprecipitation sequencing (ChIPseq) data analysis. BCP performs well in both broad and punctate data types, but excels in accurately identifying robust, reproducible islands of diffuse histone enrichment.

Other articles by Michael Q. Zhang on PubMed

Functional Genomics As Applied to Mapping Transcription Regulatory Networks

The sequencing of the human genome and the entire genomes of many model organisms has resulted in the identification of many genes. Many large-scale experiments for generating gene disruptions and analyzing the phenotypes are underway to ascertain gene function. A future challenge will be to determine interaction and regulation of all the genes of an organism. Recent advances in functional genomic technology have begun to shine light on such gene network problems at both transcriptomic and proteomic levels. Functional genomics will not only elucidate what the genes do, but will also help determine when, where and how they are expressed as an orchestrated system. In this review, we discuss the functional genomics approaches to extract knowledge about transcription regulatory mechanisms from combinations of sequence data, microarray data and ChIP data. We focus in particular on the budding yeast Saccharomyces cerevisiae.

GFScan: a Gene Family Search Tool at Genomic DNA Level

We have developed GFScan (Gene Family Scan), a tool that identifies members of a gene family by searching genomic DNA sequences with genomic DNA motifs (or matrices) that are representative of the family. We have tested GFScan on four human gene families including the neurotransmitter-gated ion-channels (NGIC) family, the carbonic anhydrases (CA) family, the Dbl homology (DH) domain family, and the ETS-domain family. All known members of these families with motifs mapped to sequenced genomic DNA regions were found, whereas some novel genomic locations were also found to match the motifs, which may indicate new members in these families. Compared with other methods, GFScan recognized all true positives with far fewer false positives. We also showed that motifs constructed based on human genes could be used to search the mouse genome to identify orthologous family members in mouse. This program is available at http://www.cshl.org/mzhanglab/.

Computational Prediction of Eukaryotic Protein-coding Genes

The human genome sequence is the book of our life. Buried in this large volume are our genes, which are scattered as small DNA fragments throughout the genome and comprise a small percentage of the total text. Finding these indistinct 'needles' in a vast genomic 'haystack' can be extremely challenging. In response to this challenge, computational prediction approaches have proliferated in recent years that predict the location and structure of genes. Here, I discuss these approaches and explain why they have become essential for the analyses of newly sequenced genomes.

Extracting Functional Information from Microarrays: a Challenge for Functional Genomics

Gene Expression Profiling in Developing Human Hippocampus

The gene expression profile of developing human hippocampus is of particular interest and importance to neurobiologists devoted to development of the human brain and related diseases. To gain further molecular insight into the developmental and functional characteristics, we analyzed the expression profile of active genes in developing human hippocampus. Expressed sequence tags (ESTs) were selected by sequencing randomly selected clones from an original 3'-directed cDNA library of 150-day human fetal hippocampus, and a digital expression profile of 946 known genes that could be divided into 16 categories was generated. We also used for comparison 14 other expression profiles of related human neural cells/tissues, including human adult hippocampus. To yield more confidence regarding differential expression, a method was applied to attach normalized expression data to genes with a low false-positive rate (<0.05). Finally, hierarchical cluster analysis was used to exhibit related gene expression patterns. Our results are in accordance with anatomical and physiological observations made during the developmental process of the human hippocampus. Furthermore, some novel findings appeared to be unique to our results. The abundant expression of genes for cell surface components and disease-related genes drew our attention. Twenty-four genes are significantly different from adult, and 13 genes might be developing hippocampus-specific candidate genes, including wnt2b and some Alzheimer's disease-related genes. Our results could provide useful information on the ontogeny, development, and function of cells in the human hippocampus at the molecular level and underscore the utility of large-scale, parallel gene expression analyses in the study of complex biological phenomena.

Direct Coupling of the Cell Cycle and Cell Death Machinery by E2F

Unrestrained E2F activity forces S phase entry and promotes apoptosis through p53-dependent and -independent mechanisms. Here, we show that deregulation of E2F by adenovirus E1A, loss of Rb or enforced E2F-1 expression results in the accumulation of caspase proenzymes through a direct transcriptional mechanism. Increased caspase levels seem to potentiate cell death in the presence of p53-generated signals that trigger caspase activation. Our results demonstrate that mitogenic oncogenes engage a tumour suppressor network that functions at multiple levels to efficiently induce cell death. The data also underscore how cell cycle progression can be coupled to the apoptotic machinery.

The Argonaute Family: Tentacles That Reach into RNAi, Developmental Control, Stem Cell Maintenance, and Tumorigenesis

Computational Comparison of Two Mouse Draft Genomes and the Human Golden Path

The availability of both mouse and human draft genomes has marked the beginning of a new era of comparative mammalian genomics. The two available mouse genome assemblies, from the public mouse genome sequencing consortium and Celera Genomics, were obtained using different clone libraries and different assembly methods.

A Global Transcriptional Regulatory Role for c-Myc in Burkitt's Lymphoma Cells

Overexpression of c-Myc is one of the most common alterations in human cancers, yet it is not clear how this transcription factor acts to promote malignant transformation. To understand the molecular targets of c-Myc function, we have used an unbiased genome-wide location-analysis approach to examine the genomic binding sites of c-Myc in Burkitt's lymphoma cells. We find that c-Myc together with its heterodimeric partner, Max, occupy >15% of gene promoters tested in these cancer cells. The DNA binding of c-Myc and Max correlates extensively with gene expression throughout the genome, a hallmark attribute of general transcription factors. The c-Myc/Max heterodimer complexes also colocalize with transcription factor IID in these cells, further supporting a general role for overexpressed c-Myc in global gene regulation. In addition, transcription of a majority of c-Myc target genes exhibits changes correlated with levels of c-myc mRNA in a diverse set of tissues and cell lines, supporting the conclusion that c-Myc regulates them. Taken together, these results suggest a general role for overexpressed c-Myc in global transcriptional regulation in some cancer cells and point toward molecular mechanisms for c-Myc function in malignant transformation.

ESEfinder: A Web Resource to Identify Exonic Splicing Enhancers

Point mutations frequently cause genetic diseases by disrupting the correct pattern of pre-mRNA splicing. The effect of a point mutation within a coding sequence is traditionally attributed to the deduced change in the corresponding amino acid. However, some point mutations can have much more severe effects on the structure of the encoded protein, for example when they inactivate an exonic splicing enhancer (ESE), thereby resulting in exon skipping. ESEs also appear to be especially important in exons that normally undergo alternative splicing. Different classes of ESE consensus motifs have been described, but they are not always easily identified. ESEfinder (http://exon.cshl.edu/ESE/) is a web-based resource that facilitates rapid analysis of exon sequences to identify putative ESEs responsive to the human SR proteins SF2/ASF, SC35, SRp40 and SRp55, and to predict whether exonic mutations disrupt such elements.

Computer Software to Find Genes in Plant Genomic DNA

Gene finding is the most important phase of genome annotation. Eukaryotic genomes contain thousands of protein coding genes, and computational gene prediction would rapidly increase the pace of experimental confirmation of expressed genes at the bench. The purpose of this chapter is to discuss the use of different computer programs that identify protein-coding genes in large genomic sequences. We describe the most commonly used gene prediction programs that are available on the World Wide Web and demonstrate the use of some of these programs by an example. We provide a list of these programs along with their Web uniform resource locators (URLs) and suggest guidelines for successful gene finding.

Identifying Cooperativity Among Transcription Factors Controlling the Cell Cycle in Yeast

Transcription regulation in eukaryotes is known to occur through the coordinated action of multiple transcription factors (TFs). Recently, a few genome-wide transcription studies have begun to explore the combinatorial nature of TF interactions. We propose a novel approach that reveals how multiple TFs cooperate to regulate transcription in the yeast cell cycle. Our method integrates genome-wide gene expression data and chromatin immunoprecipitation (ChIP-chip) data to discover more biologically relevant synergistic interactions between different TFs and their target genes than previous studies. Given any pair of TFs A and B, we define a novel measure of cooperativity between the two TFs based on the expression patterns of sets of target genes of only A, only B, and both A and B. If the cooperativity measure is significant then there is reason to postulate that the presence of both TFs is needed to influence gene expression. Our results indicate that many cooperative TFs that were previously characterized experimentally indeed have high values of cooperativity measures in our analysis. In addition, we propose several novel, experimentally testable predictions of cooperative TFs that play a role in the cell cycle and other biological processes. Many of them hold interesting clues for cross talk between the cell cycle and other processes including metabolism, stress response and pseudohyphal differentiation. Finally, we have created a web tool where researchers can explore the exhaustive list of cooperative TFs and survey the graphical representation of the target genes' expression profiles. The interface includes a tool to dynamically draw a TF cooperativity network of 113 TFs with user-defined significance levels. This study is an example of how systematic combination of diverse data types along with new functional genomic approaches can provide a rigorous platform to map TF interactions more efficiently.

Transcription Factor Binding Element Detection Using Functional Clustering of Mutant Expression Data

As a powerful tool to reveal gene functions, gene mutation has been used extensively in molecular biology studies. With high throughput technologies, such as DNA microarray, genome-wide gene expression changes can be monitored in mutants. Here we present a simple approach to detect the transcription-factor-binding motif using microarray expression data from a mutant in which the relevant transcription factor is deleted. A core part of our approach is clustering of differentially expressed genes based on functional annotations, such as Gene Ontology (GO). We tested our method with eight microarray data sets from the Rosetta Compendium and were able to detect canonical binding motifs for at least four transcription factors. With the support of chromatin IP chip data, we also predict a possible variant of the Swi4 binding motif and recover a core motif for Arg80. Our approach should be readily applicable to microarray experiments using other types of molecular biology techniques, such as conditional knockout/overexpression or RNAi-mediated 'knockdown', to perturb the expression of a transcription factor. Functional clustering included in our approach may also provide new insights into the function of the relevant transcription factor.

Identifying Combinatorial Regulation of Transcription Factors and Binding Motifs

Combinatorial interaction of transcription factors (TFs) is important for gene regulation. Although various genomic datasets are relevant to this issue, each dataset provides relatively weak evidence on its own. Developing methods that can integrate different sequence, expression and localization data has become important.

Evidence and Characteristics of Putative Human Alpha Recombination Hotspots

Understanding recombination rate variation is very important for studying genome diversity and evolution, and for investigation of phenotypic association and genetic diseases. Recombination hotspots have been observed in many species and are well studied in yeast. A recent study demonstrated that recombination hotspots are also a ubiquitous feature of the human genome, but the nature of human hotspots remains largely unknown. We have developed and validated a novel computational method for testing the existence of hotspots as well as for localizing them with either unphased or phased genotyping data. To study the characteristics of hotspots within or close to genes, we scanned for unusually high levels of recombination using the European population samples in the SeattleSNPs database, and found evidence for the existence of human alpha hotspots similar to those of yeast. This type of hotspot, found at promoter regions, accounts for about half of the total detected and appears to depend on some specific transcription factor binding sites (such as CGCCCCCGC). These characteristics can explain the observed weak correlation between hotspots and GC-content, and their variation may contribute to the diversity of hotspot distribution among different individuals and species. These long-sought putative human alpha recombination hotspots deserve further experimental investigation.

Genome-wide Prediction and Analysis of Function-specific Transcription Factor Binding Sites

DNA-binding transcription factors play a central role in transcription regulation, and the annotation of transcription-factor binding sites in upstream regions of human genes is essential for building a genome-wide regulatory network. We describe a methodology to accurately predict the transcription-factor binding sites in the proximal-promoter region of function-specific genes. In order to increase the accuracy of transcription factor binding-site prediction, we rely on recent genome sequence data, known transcription factor binding-site matrices, and Gene Ontology biological-function-based gene classification. Using TRANSFAC position-frequency matrices, we detected individual and cooperating transcription-factor binding sites in proximal promoters of ENSEMBL annotated human genes. We used the overrepresentation of detected binding sites in the proximal promoters as compared to the second exons to control specificity. We confirmed the majority of transcription-factor binding sites predicted in proximal promoters of immune-response genes with evidence from existing literature. We validated the predicted cooperation between transcription factors NF-kappa B and IRF in the regulation of gene expression with microarray transcript profiling data and a literature-derived protein-protein interaction network. We also identified overrepresented individual transcription-factor binding sites and pairs of sites in the proximal promoters of each Gene Ontology biological-process gene group. Our tools and analysis provide a new resource for deciphering transcription regulation in different biological paradigms.
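As a rough illustration of how a position-frequency matrix can be used to scan a promoter sequence, the sketch below converts counts to log-odds scores and reports windows above a threshold. It is a minimal example under stated assumptions, not the method of the paper: the matrix `TOY_PFM` is made up (it is not a TRANSFAC entry), and the uniform background, pseudocount, threshold, and function names are all hypothetical choices.

```python
import math

def pfm_to_log_odds(pfm, background=0.25, pseudocount=1.0):
    """Convert a position-frequency matrix (one {base: count} dict per
    column) into per-column log2-odds scores against a uniform background."""
    log_odds = []
    for column in pfm:
        total = sum(column.get(b, 0) for b in "ACGT") + 4 * pseudocount
        log_odds.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return log_odds

def scan(sequence, log_odds, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(log_odds)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(log_odds[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Hypothetical 4-column matrix with made-up counts (favours GGAA).
TOY_PFM = [
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 9, "C": 0, "G": 1, "T": 0},
    {"A": 8, "C": 1, "G": 0, "T": 1},
]

hits = scan("TTGGAATTCCGGAAAC", pfm_to_log_odds(TOY_PFM), threshold=4.0)
hit_positions = [start for start, _ in hits]
```

In practice the threshold and background model strongly affect the false-positive rate, which is presumably why the abstract describes an additional specificity control (promoters versus second exons).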

Interacting Models of Cooperative Gene Regulation

Cooperativity between transcription factors is critical to gene regulation. Current computational methods do not take adequate account of this salient aspect. To address this issue, we present a computational method based on multivariate adaptive regression splines to correlate the occurrences of transcription factor binding motifs in the promoter DNA and their interactions to the logarithm of the ratio of gene expression levels. This allows us to discover both the individual motifs and synergistic pairs of motifs that are most likely to be functional, and enumerate their relative contributions at any arbitrary time point for which mRNA expression data are available. We present results of simulations and focus specifically on the yeast cell-cycle data. Inclusion of synergistic interactions can increase the prediction accuracy over linear regression by as much as 1.5- to 3.5-fold. Significant motifs and combinations of motifs are appropriately predicted at each stage of the cell cycle. We believe our multivariate adaptive regression splines-based approach will become more significant when applied to higher eukaryotes, especially mammals, where cooperative control of gene regulation is absolutely essential.

The Mouse Genome: Experimental Examination of Gene Predictions and Transcriptional Start Sites

The completion of the mouse and other mammalian genome sequences will provide necessary, but not sufficient, knowledge for an understanding of much of mouse biology at the molecular level. As a requisite next step in this process, the genes in mouse and their structure must be elucidated. In particular, knowledge of the transcriptional start site of these genes will be necessary for further study of their regulatory regions. To assess the current state of mouse genome annotation to support this activity, we identified several hundred gene predictions in mouse with varying levels of supporting evidence and tested them using RACE-PCR. Modifications were made to the procedure allowing pooling of RNA samples, resulting in a scalable procedure. The results illustrate potential errors or omissions in the current 5' end annotations in 58% of the genes detected. In testing experimentally unsupported gene predictions, we were able to identify 58 that are not usually annotated as genes but produced spliced transcripts (approximately 25% success rate). In addition, in many genes we were able to detect novel exons not predicted by any gene prediction algorithms. In 19.8% of the genes detected in this study, multiple transcript species were observed. These data show an urgent need to provide direct experimental validation of gene annotations. Moreover, these results show that direct validation using RACE-PCR can be an important component of genome-wide validation. This approach can be a useful tool in the ongoing efforts to increase the quality of gene annotations, especially transcriptional start sites, in complex genomes.

Similarity of Position Frequency Matrices for Transcription Factor Binding Sites

Transcription-factor binding sites (TFBS) in promoter sequences of higher eukaryotes are commonly modeled using position frequency matrices (PFM). The ability to compare PFMs representing binding sites is especially important for de novo sequence motif discovery, where it is desirable to compare putative matrices to one another and to known matrices.
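The abstract does not say which similarity measure is used, so the following is only an illustrative stand-in: a minimal sketch that compares two equal-width matrices by averaging the Pearson correlation of corresponding columns. The function names, the A/C/G/T column order, and the toy matrices are assumptions made for this example.

```python
import math

def column_correlation(col_a, col_b):
    """Pearson correlation between two columns of base counts or
    frequencies, each given as four numbers in A, C, G, T order."""
    mean_a = sum(col_a) / 4
    mean_b = sum(col_b) / 4
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(col_a, col_b))
    var_a = sum((a - mean_a) ** 2 for a in col_a)
    var_b = sum((b - mean_b) ** 2 for b in col_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat column expresses no base preference
    return cov / math.sqrt(var_a * var_b)

def pfm_similarity(pfm_a, pfm_b):
    """Mean column-wise correlation of two matrices of the same width."""
    if len(pfm_a) != len(pfm_b):
        raise ValueError("this sketch only compares matrices of equal width")
    total = sum(column_correlation(a, b) for a, b in zip(pfm_a, pfm_b))
    return total / len(pfm_a)

# Toy matrices with made-up counts (columns in A, C, G, T order).
motif_1 = [[8, 1, 1, 0], [0, 10, 0, 0]]
motif_2 = [[0, 10, 0, 0], [8, 1, 1, 0]]  # same columns, swapped order

same = pfm_similarity(motif_1, motif_1)
different = pfm_similarity(motif_1, motif_2)
```

A fuller comparison would also slide one matrix along the other and consider the reverse complement, since two putative motifs from de novo discovery rarely share the same alignment.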

DWE: Discriminating Word Enumerator

Tissue-specific transcription factor binding sites give insight into tissue-specific transcription regulation.

TRED: a Transcriptional Regulatory Element Database and a Platform for in Silico Gene Regulation Studies

In order to understand gene regulation, accurate and comprehensive knowledge of transcriptional regulatory elements is essential. Here, we report our efforts in building a mammalian Transcriptional Regulatory Element Database (TRED) with associated data analysis functions. It collects cis- and trans-regulatory elements and is dedicated to easy data access and analysis for both single-gene-based and genome-scale studies. Distinguishing features of TRED include: (i) relatively complete genome-wide promoter annotation for human, mouse and rat; (ii) availability of gene transcriptional regulation information including transcription factor binding sites and experimental evidence; (iii) data accuracy is ensured by hand curation; (iv) efficient user interface for easy and flexible data retrieval; and (v) implementation of on-the-fly sequence analysis tools. TRED can provide good training datasets for further genome-wide cis-regulatory element prediction and annotation, assist detailed functional studies and facilitate the deciphering of gene regulatory networks (http://rulai.cshl.edu/TRED).

From Worm to Human: Bioinformatics Approaches to Identify FOXO Target Genes

Longevity regulatory genes include the Forkhead transcription factor FOXO, in addition to NAD-dependent histone deacetylase silent information regulator 2 (Sir2). The FOXO/DAF-16 family of transcription factors constitute an evolutionarily conserved subgroup within a larger family known as winged helix or Forkhead transcriptional regulators. Here we demonstrate how to identify FOXO target genes and their potential cis-regulatory binding sites in the promoters via bioinformatics approaches. These results provide new testable hypotheses for further experimental verifications.

Identifying Tissue-selective Transcription Factor Binding Sites in Vertebrate Promoters

We present a computational method aimed at systematically identifying tissue-selective transcription factor binding sites. Our method focuses on the differences between sets of promoters that are associated with differentially expressed genes, and it is effective at identifying the highly degenerate motifs that characterize vertebrate transcription factor binding sites. Results on simulated data indicate that our method detects motifs with greater accuracy than the leading methods, and its detection of strongly overrepresented motifs is nearly perfect. We present motifs identified by our method as the most overrepresented in promoters of liver- and muscle-selective genes, demonstrating that our method accurately identifies known transcription factor binding sites and previously uncharacterized motifs.

Mining ChIP-chip Data for Transcription Factor and Cofactor Binding Sites

Identification of single motifs and motif pairs that can be used to predict transcription factor localization in ChIP-chip data, and gene expression in tissue-specific microarray data.

Genome-wide Promoter Extraction and Analysis in Human, Mouse, and Rat

Large-scale and high-throughput genomics research needs reliable and comprehensive genome-wide promoter annotation resources. We have conducted a systematic investigation on how to improve mammalian promoter prediction by incorporating both transcript and conservation information. This enabled us to build a better multispecies promoter annotation pipeline and hence to create CSHLmpd (Cold Spring Harbor Laboratory Mammalian Promoter Database) for the biomedical research community, which can act as a starting reference system for more refined functional annotations.

Distribution of SR Protein Exonic Splicing Enhancer Motifs in Human Protein-coding Genes

Exonic splicing enhancers (ESEs) are pre-mRNA cis-acting elements required for splice-site recognition. We previously developed a web-based program called ESEfinder that scores any sequence for the presence of ESE motifs recognized by the human SR proteins SF2/ASF, SRp40, SRp55 and SC35 (http://rulai.cshl.edu/tools/ESE/). Using ESEfinder, we have undertaken a large-scale analysis of ESE motif distribution in human protein-coding genes. Significantly higher frequencies of ESE motifs were observed in constitutive internal protein-coding exons, compared with both their flanking intronic regions and with pseudo exons. Statistical analysis of ESE motif frequency distributions revealed a complex relationship between splice-site strength and increased or decreased frequencies of particular SR protein motifs. Comparison of constitutively and alternatively spliced exons demonstrated slightly weaker splice-site scores, as well as significantly fewer ESE motifs, in the alternatively spliced group. Our results underline the importance of ESE-mediated SR protein function in the process of exon definition, in the context of both constitutive splicing and regulated alternative splicing.

Regulating Gene Expression Through RNA Nuclear Retention

Multiple mechanisms have evolved to regulate the eukaryotic genome. We have identified CTN-RNA, a mouse tissue-specific approximately 8 kb nuclear-retained poly(A)+ RNA that regulates the level of its protein-coding partner. CTN-RNA is transcribed from the protein-coding mouse cationic amino acid transporter 2 (mCAT2) gene through alternative promoter and poly(A) site usage. CTN-RNA is diffusely distributed in nuclei and is also localized to paraspeckles. The 3'UTR of CTN-RNA contains elements for adenosine-to-inosine editing, involved in its nuclear retention. Interestingly, knockdown of CTN-RNA also downregulates mCAT2 mRNA. Under stress, CTN-RNA is posttranscriptionally cleaved to produce protein-coding mCAT2 mRNA. Our findings reveal a role of the cell nucleus in harboring RNA molecules that are not immediately needed to produce proteins but whose cytoplasmic presence is rapidly required upon physiologic stress. This mechanism of action highlights an important paradigm for the role of a nuclear-retained stable RNA transcript in regulating gene expression.

Using CorePromoter to Find Human Core Promoters

The CorePromoter program is very useful for identification of transcriptional start sites (TSS) and core promoter regions when 5'-upstream genomic DNA sequences of human genes are available. It is very simple to use and can be accessed either through the Web or after downloading to a local computer. The protocols in this unit introduce its basic methodology and discuss how to apply it to a sample problem in conjunction with other gene-finding programs.

Large-scale Structure of Genomic Methylation Patterns

The mammalian genome depends on patterns of methylated cytosines for normal function, but the relationship between genomic methylation patterns and the underlying sequence is unclear. We have characterized the methylation landscape of the human genome by global analysis of patterns of CpG depletion and by direct sequencing of 3073 unmethylated domains and 2565 methylated domains from human brain DNA. The genome was found to consist of short (<4 kb) unmethylated domains embedded in a matrix of long methylated domains. Unmethylated domains were enriched in promoters, CpG islands, and first exons, while methylated domains comprised interspersed and tandem-repeated sequences, exons other than first exons, and non-annotated single-copy sequences that are depleted in the CpG dinucleotide. The enrichment of regulatory sequences in the relatively small unmethylated compartment suggests that cytosine methylation constrains the effective size of the genome through the selective exposure of regulatory sequences. This buffers regulatory networks against changes in total genome size and provides an explanation for the C value paradox, which concerns the wide variations in genome size that scale independently of gene number. This suggestion is compatible with the finding that cytosine methylation is universal among large-genome eukaryotes, while many eukaryotes with genome sizes <5 × 10^8 bp do not methylate their DNA.
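CpG depletion is commonly quantified with the observed/expected CpG ratio, (number of CpG × sequence length) / (number of C × number of G). The abstract does not state which statistic the authors used, so the sketch below is only an assumption about how such an analysis could be set up; the function names and window size are hypothetical.

```python
def cpg_obs_exp(sequence):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).
    Returns 0.0 when the sequence has no C or no G."""
    seq = sequence.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)

def windowed_cpg_obs_exp(sequence, window=200, step=100):
    """Yield (start, ratio) for fixed-size windows along the sequence."""
    for start in range(0, max(len(sequence) - window + 1, 1), step):
        yield start, cpg_obs_exp(sequence[start:start + window])

ratio_rich = cpg_obs_exp("CGCGCGCG")      # every C is followed by G
ratio_depleted = cpg_obs_exp("CATGCATG")  # C and G present, no CpG
```

A ratio well below 1 indicates CpG depletion, which is the pattern expected in methylated sequence, while ratios near or above 1 are typical of unmethylated CpG islands.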

DNA Motifs in Human and Mouse Proximal Promoters Predict Tissue-specific Expression

Comprehensive identification of cis-regulatory elements is necessary for accurately reconstructing gene regulatory networks. We studied proximal promoters of human and mouse genes with differential expression across 56 terminally differentiated tissues. Using in silico techniques to discover, evaluate, and model interactions among sequence elements, we systematically identified regulatory modules that distinguish elevated from inhibited expression in the corresponding transcripts. We used these putative regulatory modules to construct a single predictive model for each of the 56 tissues. These predictors distinguish tissue-specific elevated from inhibited expression with statistical significance in 80% of the tissues (45 of 56). The predictors also reveal synergy between cis-regulatory modules and explain large-scale tissue-specific differential expression. For testis and liver, the predictors include computationally predicted motifs. For most other tissues, the predictors reveal synergy between experimentally verified motifs and indicate genes that are regulated by similar tissue-specific machinery. The identification in proximal promoters of cis-regulatory modules with tissue-specific activity lays the groundwork for complete characterization and deciphering of cis-regulatory DNA code in mammalian genomes.

Profiling Alternatively Spliced mRNA Isoforms for Prostate Cancer Classification

Prostate cancer is one of the leading causes of cancer illness and death among men in the United States and worldwide. There is an urgent need to discover good biomarkers for early clinical diagnosis and treatment. Previously, we developed an exon-junction microarray-based assay and profiled 1532 mRNA splice isoforms from 364 potential prostate cancer related genes in 38 prostate tissues. Here, we investigate the advantage of using splice isoforms, which couple transcriptional and splicing regulation, for cancer classification.

A Clustering Property of Highly-degenerate Transcription Factor Binding Sites in the Mammalian Genome

Transcription factor binding sites (TFBSs) are short DNA sequences interacting with transcription factors (TFs), which regulate gene expression. Due to the relatively short length of such binding sites, it is largely unclear how the specificity of protein-DNA interaction is achieved. Here, we have performed a genome-wide analysis of TFBS-like sequences for the transcriptional repressor, RE1 Silencing Transcription Factor (REST), as well as for several other representative mammalian TFs (c-myc, p53, HNF-1 and CREB). We find a nonrandom distribution of inexact sites for these TFs, referred to as highly-degenerate TFBSs, that are enriched around the cognate binding sites. Comparisons among human, mouse and rat orthologous promoters reveal that these highly-degenerate sites are conserved significantly more than expected by random chance, suggesting their positive selection during evolution. We propose that this arrangement provides a favorable genomic landscape for functional target site selection.

Characterization of RNase R-digested Cellular RNA Source That Consists of Lariat and Circular RNAs from Pre-mRNA Splicing

Besides linear RNAs, pre-mRNA splicing generates three forms of RNAs: lariat introns, Y-structure introns from trans-splicing, and circular exons through exon skipping. To study the persistence of excised introns in total cellular RNA, we used three Escherichia coli 3' to 5' exoribonucleases. Ribonuclease R (RNase R) thoroughly degrades the abundant linear RNAs and the Y-structure RNA, while preserving the loop portion of a lariat RNA. Ribonuclease II (RNase II) and polynucleotide phosphorylase (PNPase) also preserve the lariat loop, but are less efficient in degrading linear RNAs. RNase R digestion of the total RNA from human skeletal muscle generates an RNA pool consisting of lariat and circular RNAs. RT-PCR across the branch sites confirmed lariat RNAs and circular RNAs in the pool generated by constitutive and alternative splicing of the dystrophin pre-mRNA. Our results indicate that RNase R treatment can be used to construct an intronic cDNA library, in which the majority of the intron lariats are represented. The highly specific activity of RNase R implies its ability to screen for rare intragenic trans-splicing in any target gene with a large background of cis-splicing. Further analysis of the intronic RNA pool from a specific tissue or cell will provide insights into the global profile of alternative splicing.

Adaptively Inferring Human Transcriptional Subnetworks

Although the human genome has been sequenced, progress in understanding gene regulation in humans has been particularly slow. Many computational approaches developed for lower eukaryotes to identify cis-regulatory elements and their associated target genes often do not generalize to mammals, largely due to the degenerate and interactive nature of such elements. Motivated by the switch-like behavior of transcriptional responses, we present a systematic approach that allows adaptive determination of active transcriptional subnetworks (cis-motif combinations, the direct target genes and physiological processes regulated by the corresponding transcription factors) from microarray data in mammals, with accuracy similar to that achieved in lower eukaryotes. Our analysis uncovered several new subnetworks active in human liver and in cell-cycle regulation, with functional characteristics similar to those of the known ones. We present biochemical evidence for our predictions, and show that the recently discovered G2/M-specific E2F pathway is wider than previously thought; in particular, E2F directly activates certain mitotic genes involved in hepatocellular carcinomas. Additionally, we demonstrate that this method can predict subnetworks in a condition-specific manner, as well as regulatory crosstalk across multiple tissues. Our approach allows systematic understanding of how phenotypic complexity is regulated at the transcription level in mammals and offers a marked advantage in systems where little or no prior knowledge of transcriptional regulation is available.

Computational Prediction of Methylation Status in Human Genomic Sequences

Epigenetic effects in mammals depend largely on heritable genomic methylation patterns. We describe a computational pattern recognition method that is used to predict the methylation landscape of human brain DNA. This method can be applied both to CpG islands and to non-CpG island regions. It computes the methylation propensity for an 800-bp region centered on a CpG dinucleotide based on specific sequence features within the region. We tested several classifiers for classification performance, including k-means clustering, linear discriminant analysis, logistic regression, and support vector machines. The best performing classifier used the support vector machine approach. Our program (called hdfinder) presently has a prediction accuracy of 86%, as validated with CpG regions for which methylation status has been experimentally determined. Using hdfinder, we have depicted the entire genomic methylation patterns for all 22 human autosomes.
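
As a rough illustration of the windowing step described above, the following Python sketch extracts two simple sequence features from a window centered on a CpG. The abstract does not list the features hdfinder actually uses, so GC content and the CpG observed/expected ratio are assumptions chosen only because they are common CpG-island descriptors; the function name `window_features` is hypothetical.

```python
def window_features(sequence, cpg_index, width=800):
    """Return simple features for a window centered on the CpG at cpg_index.

    The features (GC content, CpG observed/expected ratio) are illustrative
    assumptions, not the feature set of the published method.
    """
    half = width // 2
    # Clip the window at the sequence ends.
    start = max(0, cpg_index - half)
    end = min(len(sequence), cpg_index + half)
    window = sequence[start:end].upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c + g) / n if n else 0.0
    expected = (c * g) / n if n else 0.0
    cpg_obs_exp = cpg / expected if expected else 0.0
    return {"length": n, "gc_content": gc_content, "cpg_obs_exp": cpg_obs_exp}


# Toy input: a pure CG repeat and a small 8-bp window, for illustration only.
features = window_features("CG" * 10, cpg_index=10, width=8)
```

A feature dictionary of this kind could then be passed to any of the classifiers named in the abstract; which features carry predictive signal is an empirical question that this sketch does not address.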

An Increased Specificity Score Matrix for the Prediction of SF2/ASF-specific Exonic Splicing Enhancers

Numerous disease-associated point mutations exert their effects by disrupting the activity of exonic splicing enhancers (ESEs). We previously derived position weight matrices to predict putative ESEs specific for four human SR proteins. The score matrices are part of ESEfinder, an online resource to identify ESEs in query sequences. We have now carried out a refined functional SELEX screen for motifs that can act as ESEs in response to the human SR protein SF2/ASF. The test BRCA1 exon under selection was internal, rather than the 3'-terminal IGHM exon used in our earlier studies. A naturally occurring heptameric ESE in BRCA1 exon 18 was replaced with two libraries of random sequences, one seven nucleotides in length, the other 14. Following three rounds of selection for in vitro splicing via internal exon inclusion, new consensus motifs and score matrices were derived. Many winner sequences were demonstrated to be functional ESEs in S100-extract-complementation assays with recombinant SF2/ASF. Motif-score threshold values were derived from both experimental and statistical analyses. Motif scores were shown to correlate with levels of exon inclusion, both in vitro and in vivo. Our results confirm and extend our earlier data, as many of the same motifs are recognized as ESEs by both the original and our new score matrix, despite the different context used for selection. Finally, we have derived an increased specificity score matrix that incorporates information from both of our SF2/ASF-specific matrices and that accurately predicts the exon-skipping phenotypes of deleterious point mutations.
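
The general idea of scanning a query sequence with a score matrix and a threshold can be sketched in Python as follows. This is a minimal sketch, not the ESEfinder implementation: the matrix values, the threshold, and the function name `scan_with_matrix` are hypothetical, and the input is assumed to be uppercase DNA containing only A, C, G and T.

```python
def scan_with_matrix(sequence, matrix, threshold):
    """Slide a score matrix along the sequence and report windows at or above threshold.

    matrix: list of dicts (one per motif position) mapping base -> score.
    Returns a list of (start, window, score) tuples.
    """
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Hypothetical heptamer matrix: +1.0 for the favored base, -0.5 otherwise.
toy_matrix = [{b: (1.0 if b == c else -0.5) for b in "ACGT"} for c in "ACGTACG"]

wild_type = "TTACGTACGTT"
mutant = "TTACGTAAGTT"  # single C-to-A substitution inside the motif

wild_hits = scan_with_matrix(wild_type, toy_matrix, threshold=6.0)
mutant_hits = scan_with_matrix(mutant, toy_matrix, threshold=6.0)
```

With these toy values the wild-type sequence yields one hit and the point mutant yields none, which mirrors how a drop below the motif-score threshold is used to flag a mutation as potentially disrupting an ESE.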

Predicting Methylation Status of CpG Islands in the Human Brain

Over 50% of human genes contain CpG islands in their 5'-regions. Methylation patterns of CpG islands are involved in tissue-specific gene expression and regulation. Mis-epigenetic silencing associated with aberrant CpG island methylation is one mechanism leading to the loss of tumor suppressor functions in cancer cells. Large-scale experimental detection of DNA methylation is still both labor-intensive and time-consuming. Therefore, it is necessary to develop in silico approaches for predicting methylation status of CpG islands.

A New Method for Detecting Human Recombination Hotspots and Its Applications to the HapMap ENCODE Data

Computational detection of recombination hotspots from population polymorphism data is important both for understanding the nature of recombination and for applications such as association studies. We propose a new method for this task based on a multiple-hotspot model and an (approximate) log-likelihood ratio test. A truncated, weighted pairwise log-likelihood is introduced and applied to the calculation of the log-likelihood ratio, and a forward-selection procedure is adopted to search for the optimal hotspot predictions. The method shows a relatively high power with a low false-positive rate in detecting multiple hotspots in simulation data and has a performance comparable to the best results of leading computational methods in experimental data for which recombination hotspots have been characterized by sperm-typing experiments. The method can be applied to both phased and unphased data directly, with a very fast computational speed. We applied the method to the ten 500-kb regions of the HapMap ENCODE data and found 172 hotspots among the three populations, with an average hotspot width of 2.4 kb. By comparisons with the simulation data, we found some evidence that hotspots are not all identical across populations. The correlations between detected hotspots and several genomic characteristics were examined. In particular, we observed that DNaseI-hypersensitive sites are enriched in hotspots, suggesting the existence of human beta hotspots similar to those found in yeast.

Pan-genome Isolation of Low Abundance Transcripts Using SAGE Tag

The SAGE (serial analysis of gene expression) method is sensitive in detecting lower-abundance transcripts. More than a third of the human SAGE tags identified are novel, representing low-abundance unknown transcripts. Using the GLGI method (generation of longer 3' EST from SAGE tag for gene identification), we converted 1009 low-copy, human X chromosome-specific SAGE tags into 10210 3' ESTs. We identified 3418 unique 3' ESTs, 46% of which are novel and originated from lower-abundance transcripts. However, nearly all 3' ESTs mapped to various regions across the genome but not to the X chromosome. Detailed analysis indicates that those 3' ESTs were isolated by SAGE tag mis-priming to non-parent transcripts. Replacing SAGE tags with non-transcribed genomic DNA tags resulted in poor amplification, indicating that the sequence similarity between different transcripts contributed to the amplification. Our study shows the prevalence of novel low-abundance transcripts that can be isolated efficiently through SAGE tag mis-priming.

Computational Prediction of Novel Components of Lung Transcriptional Networks

Little is known regarding the transcriptional mechanisms involved in forming and maintaining epithelial cell lineages of the mammalian respiratory tract.

Tissue-specific Regulatory Elements in Mammalian Promoters

Transcription factor-binding sites and the cis-regulatory modules they compose are central determinants of gene expression. We previously showed that binding site motifs and modules in proximal promoters can be used to predict a significant portion of mammalian tissue-specific transcription. Here, we report on a systematic analysis of promoters controlling tissue-specific expression in heart, kidney, liver, pancreas, skeletal muscle, testis and CD4 T cells, for both human and mouse. We integrated multiple sources of expression data to compile sets of transcripts with strong evidence for tissue-specific regulation. The analysis of the promoters corresponding to these sets produced a catalog of predicted tissue-specific motifs and modules, and cis-regulatory elements. Predicted regulatory interactions are supported by statistical evidence, and provide a foundation for targeted experiments that will improve our understanding of tissue-specific regulatory networks. In a broader context, methods used to construct the catalog provide a model for the analysis of genomic regions that regulate differentially expressed genes.

Computing Exact P-values for DNA Motifs

Many heuristic algorithms have been designed to approximate P-values of DNA motifs described by position weight matrices, for evaluating their statistical significance. They often significantly deviate from the true P-value by orders of magnitude. Exact P-value computation is needed for ranking the motifs. Furthermore, surprisingly, the complexity of the problem is unknown.
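
To make concrete what an exact P-value means for a position weight matrix, the following Python sketch computes it by dynamic programming over attainable scores. It is a textbook approach, not the algorithm of the paper, and it assumes integer matrix scores and an independent, identically distributed background; the toy matrix and the function name `exact_pvalue` are hypothetical.

```python
from collections import defaultdict


def exact_pvalue(pwm, background, threshold):
    """Exact P(score >= threshold) for a random site under an i.i.d. background.

    pwm: list of dicts (one per motif position) mapping base -> integer score.
    background: dict mapping base -> probability (assumed to sum to 1).
    """
    # dist maps each attainable partial score to its total probability.
    dist = {0: 1.0}
    for column in pwm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for base, p_base in background.items():
                nxt[score + column[base]] += prob * p_base
        dist = nxt
    return sum(prob for score, prob in dist.items() if score >= threshold)


# Hypothetical 3-position matrix with integer scores (illustration only).
toy_pwm = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -1},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

p_top = exact_pvalue(toy_pwm, uniform, threshold=6)  # only "ACG" reaches 6
p_two = exact_pvalue(toy_pwm, uniform, threshold=3)  # two or three matches
```

The running time grows with the number of distinct attainable scores, which can become large for real-valued or finely discretized matrices; that growth is one reason heuristic approximations are common in practice.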

Statistical Significance of Cis-regulatory Modules

It is becoming increasingly important for researchers to be able to scan through large genomic regions for transcription factor binding sites or clusters of binding sites forming cis-regulatory modules. Correspondingly, there has been a push to develop algorithms for the rapid detection and assessment of cis-regulatory modules. While various algorithms for this purpose have been introduced, most are not well suited for rapid, genome scale scanning.

Boosting with Stumps for Predicting Transcription Start Sites

Promoter prediction is a difficult but important problem in gene finding, and it is critical for elucidating the regulation of gene expression. We introduce a new promoter prediction program, CoreBoost, which applies a boosting technique with stumps to select important small-scale as well as large-scale features. CoreBoost improves greatly on locating transcription start sites. We also demonstrate that by further utilizing some tissue-specific information, better accuracy can be achieved.

Critical Roles for Dicer in the Female Germline

Dicer is an essential component of RNA interference (RNAi) pathways, which have broad functions in gene regulation and genome organization. Probing the consequences of tissue-restricted Dicer loss in mice indicates a critical role for Dicer during meiosis in the female germline. Mouse oocytes lacking Dicer arrest in meiosis I with multiple disorganized spindles and severe chromosome congression defects. Oogenesis and early development are times of significant post-transcriptional regulation, with controlled mRNA storage, translation, and degradation. Our results suggest that Dicer is essential for turnover of a substantial subset of maternal transcripts that are normally lost during oocyte maturation. Furthermore, we find evidence that transposon-derived sequence elements may contribute to the metabolism of maternal transcripts through a Dicer-dependent pathway. Our studies identify Dicer as central to a regulatory network that controls oocyte gene expression programs and that promotes genomic integrity in a cell type notoriously susceptible to aneuploidy.

Analysis of the Vertebrate Insulator Protein CTCF-binding Sites in the Human Genome

Insulator elements affect gene expression by preventing the spread of heterochromatin and restricting transcriptional enhancers from activation of unrelated promoters. In vertebrates, insulator function requires association with the CCCTC-binding factor (CTCF), a protein that recognizes long and diverse nucleotide sequences. While insulators are critical in gene regulation, only a few have been reported. Here, we describe 13,804 CTCF-binding sites in potential insulators of the human genome, discovered experimentally in primary human fibroblasts. Most of these sequences are located far from the transcriptional start sites, with their distribution strongly correlated with genes. The majority of them fit to a consensus motif highly conserved and suitable for predicting possible insulators driven by CTCF in other vertebrate genomes. In addition, CTCF localization is largely invariant across different cell types. Our results provide a resource for investigating insulator function and possible other general and evolutionarily conserved activities of CTCF sites.

Predictive Models of Gene Regulation: Application of Regression Methods to Microarray Data

Eukaryotic transcription is a complex process. A myriad of biochemical signals cause activators and repressors to bind specific cis-elements on the promoter DNA, which help to recruit the basal transcription machinery that ultimately initiates transcription. In this chapter, we discuss how regression techniques can be effectively used to infer the functional cis-regulatory elements and their cooperativity from microarray data. Examples from yeast cell cycle are drawn to demonstrate the power of these techniques. Periodic regulation of the cell cycle, connection with underlying energetics, and the inference of combinatorial logic are also discussed. An implementation based on regression splines is discussed in detail.

A Highly Conserved Regulatory Element Controls Hematopoietic Expression of GATA-2 in Zebrafish

GATA-2 is a transcription factor required for hematopoietic stem cell survival as well as for neuronal development in vertebrates. It has been shown that specific expression of GATA-2 in blood progenitor cells requires distal cis-acting regulatory elements. Identification and characterization of these elements should help elucidate the transcriptional regulatory mechanisms of GATA-2 expression in the hematopoietic lineage.

Evolutionary Impact of Limited Splicing Fidelity in Mammalian Genes

The functional significance of most alternative splicing (AS) events, especially frame-shifting ones, has been controversial. Using human-mouse comparison, we demonstrate that frame-preserving AS events adapt and get fixed more rapidly than frame-shifting AS events; selection for smaller exon size is stronger in frame-preserving exons than in frame-shifting ones. These results suggest AS events introducing mild changes are generally favored during evolution and explain the excess of shorter, frame-preserving cassette exons in present mammalian genomes.

Dual-specificity Splice Sites Function Alternatively As 5' and 3' Splice Sites

As a result of large-scale sequencing projects and recent splicing-microarray studies, estimates of mammalian genes expressing multiple transcripts continue to increase. This expansion of transcript information makes it possible to better characterize alternative splicing events and gain insights into splicing mechanisms and regulation. Here, we describe a class of splice sites that we call dual-specificity splice sites, which we identified through genome-wide, high-quality alignment of mRNA/EST and genome sequences and experimentally verified by RT-PCR. These splice sites can be alternatively recognized as either 5' or 3' splice sites, and the dual splicing is conceptually similar to a pair of mutually exclusive exons separated by a zero-length intron. The dual-splice-site sequences are essentially a composite of canonical 5' and 3' splice-site consensus sequences, with a CAG|GURAG core. The relative use of a dual site as a 5' or 3' splice site can be accurately predicted by assuming competition for specific binding between spliceosomal components involved in recognition of 5' and 3' splice sites, respectively. Dual-specificity splice sites exist in human and mouse, and possibly in other vertebrate species, although most sites are not conserved, suggesting that their origin is recent. We discuss the implications of this unusual splicing pattern for the diverse mechanisms of exon recognition and for gene evolution.

Neural Potential of a Stem Cell Population in the Hair Follicle

The bulge region of the hair follicle serves as a repository for epithelial stem cells that can regenerate the follicle in each hair growth cycle and contribute to epidermis regeneration upon injury. Here we describe a population of multipotential stem cells in the hair follicle bulge region; these cells can be identified by fluorescence in transgenic nestin-GFP mice. The morphological features of these cells suggest that they maintain close associations with each other and with the surrounding niche. Upon explantation, these cells can give rise to neurosphere-like structures in vitro. When these cells are permitted to differentiate, they produce several cell types, including cells with neuronal, astrocytic, oligodendrocytic, smooth muscle, adipocytic, and other phenotypes. Furthermore, upon implantation into the developing nervous system of chick, these cells generate neuronal cells in vivo. We used transcriptional profiling to assess the relationship between these cells and embryonic and postnatal neural stem cells and to compare them with other stem cell populations of the bulge. Our results show that nestin-expressing cells in the bulge region of the hair follicle have stem cell-like properties, are multipotent, and can effectively generate cells of neural lineage in vitro and in vivo.

Computational Analyses of Eukaryotic Promoters

Computational analysis of eukaryotic promoters is one of the most difficult problems in computational genomics and is essential for understanding gene expression profiles and reverse-engineering gene regulation network circuits. Here I give a basic introduction to the problem and a recent update on both experimental and computational approaches. More details may be found in the extended references. This review is based on a summer lecture given at the Max Planck Institute in Berlin in 2005.

OSCAR: One-class SVM for Accurate Recognition of Cis-elements

Traditional methods to identify potential binding sites of known transcription factors still suffer from a large number of false predictions. They mostly use sequence information in a position-specific manner and neglect other types of information hidden in the proximal promoter regions. Recent biological and computational research, however, suggests that there exist not only locational preferences of binding, but also correlations between transcription factors.

Prediction of Transcription Start Sites Based on Feature Selection Using AMOSA

To understand the regulation of gene expression, the identification of transcription start sites (TSSs) is a primary and important step. With the aim of improving the computational prediction accuracy, we focus on the most challenging task, i.e., to identify the TSSs within 50 bp in non-CpG related promoter regions. Due to the diversity of non-CpG related promoters, a large number of features are extracted. Effective feature selection can minimize the noise, improve the prediction accuracy, and help discover biologically meaningful intrinsic properties. In this paper, a newly proposed multi-objective simulated annealing based optimization method, Archive Multi-Objective Simulated Annealing (AMOSA), is integrated with Linear Discriminant Analysis (LDA) to yield a combined feature selection and classification system. This system is found to be comparable to, and often better than, several existing methods in terms of different quantitative performance measures.

Putative Zinc Finger Protein Binding Sites Are Over-represented in the Boundaries of Methylation-resistant CpG Islands in the Human Genome

The majority of CpG dinucleotides in mammalian genomes tend to undergo DNA methylation, but most CpG islands are resistant to such epigenetic modification. The mechanisms that may lead to the methylation resistance of CpG islands are still very poorly understood.

Identification of Phylogenetically Conserved MicroRNA Cis-regulatory Elements Across 12 Drosophila Species

MicroRNAs are a class of endogenous small RNAs that play regulatory roles. Intergenic miRNAs are believed to be transcribed independently, but the transcriptional control of these crucial regulators is still poorly understood.

Genome-wide Mapping and Analysis of Active Promoters in Mouse Embryonic Stem Cells and Adult Organs

By integrating genome-wide maps of RNA polymerase II (Polr2a) binding with gene expression data and H3ac and H3K4me3 profiles, we characterized promoters with enriched activity in mouse embryonic stem cells (mES) as well as adult brain, heart, kidney, and liver. We identified approximately 24,000 promoters across these samples, including 16,976 annotated mRNA 5' ends and 5153 additional sites validating cap-analysis of gene expression (CAGE) 5' end data. We showed that promoters with CpG islands are typically non-tissue specific, with the majority associated with Polr2a and the active chromatin modifications in nearly all the tissues examined. By contrast, the promoters without CpG islands are generally associated with Polr2a and the active chromatin marks in a tissue-dependent way. We defined 4396 tissue-specific promoters by adapting a quantitative index of tissue-specificity based on Polr2a occupancy. While there is a general correspondence between Polr2a occupancy and active chromatin modifications at the tissue-specific promoters, a subset of them appear to be persistently marked by active chromatin modifications in the absence of detectable Polr2a binding, highlighting the complexity of the functional relationship between chromatin modification and gene expression. Our results provide a resource for exploring promoter Polr2a binding and epigenetic states across pluripotent and differentiated cell types in mammals.

Interferon Regulatory Factors Are Transcriptional Regulators of Adipogenesis

We have sought to identify transcriptional pathways in adipogenesis using an integrated experimental and computational approach. Here, we employ high-throughput DNase hypersensitivity analysis to find regions of altered chromatin structure surrounding key adipocyte genes. Regions that display differentiation-dependent changes in hypersensitivity were used to predict binding sites for proteins involved in adipogenesis. A high-scoring example was a binding motif for interferon regulatory factor (IRF) family members. Expression of all nine mammalian IRF mRNAs is regulated during adipogenesis, and several bind to the identified motifs in a differentiation-dependent manner. Furthermore, several IRF proteins repress differentiation. This analysis suggests an important role for IRF proteins in adipocyte biology and demonstrates the utility of this approach in identifying cis- and trans-acting factors not previously suspected to participate in adipogenesis.

Heat Shock Protein 90beta1 is Essential for Polyunsaturated Fatty Acid-induced Mitochondrial Ca2+ Efflux

Nonesterified fatty acids may influence mitochondrial function by alterations in gene expression, metabolism, and/or mitochondrial Ca(2+) ([Ca(2+)](m)) homeostasis. We have previously reported that polyunsaturated fatty acids induce Ca(2+) efflux from mitochondria, an action that may deplete [Ca(2+)](m) and thus contribute to nonesterified fatty acid-responsive mitochondrial dysfunction. Here we show that the chaperone protein heat shock protein 90 beta1 (hsp90beta1) is required for polyunsaturated fatty acid-induced mitochondrial Ca(2+) efflux (PIMCE). Retinoic acid induced differentiation of human teratocarcinoma NT2 cells in association with attenuation of PIMCE. Proteomic analysis of mitochondrial proteins revealed that hsp90beta1, among other proteins, was reduced in retinoic acid-differentiated cells. Blockade of PIMCE in NT2 cells by 17-(dimethylaminoethylamino)-17-demethoxygeldanamycin, a known inhibitor of the chaperone activity of hsp90, and hsp90beta1 RNA interference demonstrated that hsp90beta1 is essential for PIMCE. We also show localization of hsp90beta1 in mitochondria by Western blot and immunofluorescence. Distinctive effects of inhibitors binding to the N or C terminus of hsp90 on PIMCE in isolated mitochondria suggested that the C terminus of hsp90beta1 plays a critical role in PIMCE.

Using Quality Scores and Longer Reads Improves Accuracy of Solexa Read Mapping

Second-generation sequencing has the potential to revolutionize genomics and impact all areas of biomedical science. New technologies will make re-sequencing widely available for such applications as identifying genome variations or interrogating the oligonucleotide content of a large sample (e.g. ChIP-sequencing). The increase in speed, sensitivity and availability of sequencing technology brings demand for advances in computational technology to perform associated analysis tasks. The Solexa/Illumina 1G sequencer can produce tens of millions of reads, ranging in length from approximately 25-50 nt, in a single experiment. Accurately mapping the reads back to a reference genome is a critical task in almost all applications. Two sources of information that are often ignored when mapping reads from the Solexa technology are the 3' ends of longer reads, which contain a much higher frequency of sequencing errors, and the base-call quality scores.

CD36-dependent Regulation of Muscle FoxO1 and PDK4 in the PPAR Delta/beta-mediated Adaptation to Metabolic Stress

The transcription factor FoxO1 contributes to the metabolic adaptation to fasting by suppressing muscle oxidation of glucose, sparing it for glucose-dependent tissues. Previously, we reported that FoxO1 activation in C2C12 muscle cells recruits the fatty acid translocase CD36 to the plasma membrane and increases fatty acid uptake and oxidation. This, together with FoxO1 induction of lipoprotein lipase, would promote the reliance on fatty acid utilization characteristic of the fasted muscle. Here, we show that CD36-mediated fatty acid uptake, in turn, up-regulates protein levels and activity of FoxO1 as well as its target PDK4, the negative regulator of glucose oxidation. Increased fatty acid flux or enforced CD36 expression in C2C12 cells is sufficient to induce FoxO1 and PDK4, whereas CD36 knockdown has opposite effects. In vivo, CD36 loss blunts fasting induction of FoxO1 and PDK4 and the associated suppression of glucose oxidation. Importantly, CD36-dependent regulation of FoxO1 is mediated by the nuclear receptor PPARdelta/beta. Loss of PPARdelta/beta phenocopies CD36 deficiency in blunting fasting induction of muscle FoxO1 and PDK4 in vivo. Expression of PPARdelta/beta in C2C12 cells, like that of CD36, robustly induces FoxO1 and suppresses glucose oxidation, whereas co-expression of a dominant negative PPARdelta/beta compromises FoxO1 induction. Finally, several PPRE sites were identified in the FoxO1 promoter, which was responsive to PPARdelta/beta. Agonists of PPARdelta/beta were sufficient to confer responsiveness and transactivate the heterologous FoxO1 promoter but not in the presence of dominant negative PPARdelta/beta. Taken together, our findings suggest that CD36-dependent FA activation of PPARdelta/beta results in the transcriptional regulation of FoxO1 as well as PDK4, recently shown to be a direct PPARdelta/beta target. FoxO1 in turn can regulate CD36, lipoprotein lipase, and PDK4, reinforcing the action of PPARdelta/beta to increase muscle reliance on FA. The findings could have implications in the chronic abnormalities of fatty acid metabolism associated with obesity and diabetes.

RNA Landscape of Evolution for Optimal Exon and Intron Discrimination

Accurate pre-mRNA splicing requires primary splicing signals, including the splice sites, a polypyrimidine tract, and a branch site, as well as other splicing-regulatory elements (SREs). The SREs include exonic splicing enhancers (ESEs), exonic splicing silencers (ESSs), intronic splicing enhancers (ISEs), and intronic splicing silencers (ISSs), which are typically located near the splice sites. However, it is unclear to what extent splicing-driven selective pressure constrains exonic and intronic sequences, especially those distant from the splice sites. Here, we studied the distribution of SREs in human genes in terms of DNA strand-asymmetry patterns. Under a neutral evolution model, each mononucleotide or oligonucleotide should have a symmetric (Chargaff's second parity rule), or weakly asymmetric yet uniform, distribution throughout a pre-mRNA transcript. However, we found that large sets of unbiased, experimentally determined SREs show a distinct strand-asymmetry pattern that is inconsistent with the neutral evolution model, and reflects their functional roles in splicing. ESEs are selected in exons and depleted in introns and vice versa for ESSs. Surprisingly, this trend extends into deep intronic sequences, accounting for one third of the genome. Selection is detectable even at the mononucleotide level, so that the asymmetric base compositions of exons and introns are predictive of ESEs and ESSs. We developed a method that effectively predicts SREs based on strand asymmetry, expanding the current catalog of SREs. Our results suggest that human genes have been optimized for exon and intron discrimination through an RNA landscape shaped during evolution.
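
One simple way to quantify strand asymmetry for an oligonucleotide is to compare its count on the transcribed strand with the count of its reverse complement. The Python sketch below uses a normalized difference for this purpose; the abstract does not give the exact statistic used in the study, so this formula and the function names are assumptions for illustration, and the input is assumed to be uppercase DNA.

```python
def reverse_complement(kmer):
    """Return the reverse complement of an uppercase DNA string."""
    return kmer.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_overlapping(sequence, kmer):
    """Count occurrences of kmer in sequence, allowing overlaps."""
    k = len(kmer)
    return sum(1 for i in range(len(sequence) - k + 1) if sequence[i:i + k] == kmer)


def strand_asymmetry(sequence, kmer):
    """Normalized difference between kmer and reverse-complement counts.

    Returns a value in [-1, 1]; 0 means no asymmetry (or no occurrences).
    """
    fwd = count_overlapping(sequence, kmer)
    rev = count_overlapping(sequence, reverse_complement(kmer))
    total = fwd + rev
    return (fwd - rev) / total if total else 0.0


# Toy sequence for illustration only: three GAAG sites and one CTTC site.
skew = strand_asymmetry("GAAGAAGAAGCTTC", "GAAG")
```

Under Chargaff's second parity rule the value would be close to zero across a long neutral sequence, so a consistently positive value in exons and a negative one in introns would be the kind of pattern the abstract describes for ESEs.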

Network-based Global Inference of Human Disease Genes

Deciphering the genetic basis of human diseases is an important goal of biomedical research. On the basis of the assumption that phenotypically similar diseases are caused by functionally related genes, we propose a computational framework that integrates human protein-protein interactions, disease phenotype similarities, and known gene-phenotype associations to capture the complex relationships between phenotypes and genotypes. We develop a tool named CIPHER to predict and prioritize disease genes, and we show that the global concordance between the human protein network and the phenotype network reliably predicts disease genes. Our method is applicable to genetically uncharacterized phenotypes, effective in the genome-wide scan of disease genes, and also extendable to explore gene cooperativity in complex diseases. The predicted genetic landscape of over 1000 human phenotypes reveals the global modular organization of phenotype-genotype relationships. The genome-wide prioritization of candidate genes for over 5000 human phenotypes, including those with under-characterized disease loci or even those lacking known association, is publicly released to facilitate future discovery of disease genes.

Identification of Synaptic Targets of Drosophila Pumilio

Drosophila Pumilio (Pum) protein is a translational regulator involved in embryonic patterning and germline development. Recent findings demonstrate that Pum also plays an important role in the nervous system, both at the neuromuscular junction (NMJ) and in long-term memory formation. In neurons, Pum appears to play a role in homeostatic control of excitability via downregulation of para, a voltage-gated sodium channel, and may more generally modulate local protein synthesis in neurons via translational repression of eIF-4E. Aside from these, the biologically relevant targets of Pum in the nervous system remain largely unknown. We hypothesized that Pum might play a role in regulating the local translation underlying synapse-specific modifications during memory formation. To identify relevant translational targets, we used an informatics approach to predict Pum targets among mRNAs whose products have synaptic localization. We then used both in vitro binding and two in vivo assays to functionally confirm the fidelity of this informatics screening method. We find that Pum strongly and specifically binds to RNA sequences in the 3'UTR of four of the predicted target genes, demonstrating the validity of our method. We then demonstrate that one of these predicted target sequences, in the 3'UTR of discs large (dlg1), the Drosophila PSD95 ortholog, can functionally substitute for a canonical NRE (Nanos response element) in vivo in a heterologous functional assay. Finally, we show that the endogenous dlg1 mRNA can be regulated by Pumilio in a neuronal context, the adult mushroom bodies (MB), which is an anatomical site of memory storage.

Combinatorial Patterns of Histone Acetylations and Methylations in the Human Genome

Histones are characterized by numerous posttranslational modifications that influence gene transcription. However, because of the lack of global distribution data in higher eukaryotic systems, the extent to which gene-specific combinatorial patterns of histone modifications exist remains to be determined. Here, we report the patterns derived from the analysis of 39 histone modifications in human CD4(+) T cells. Our data indicate that a large number of patterns are associated with promoters and enhancers. In particular, we identify a common modification module consisting of 17 modifications detected at 3,286 promoters. These modifications tend to colocalize in the genome and correlate with each other at an individual nucleosome level. Genes associated with this module tend to have higher expression, and addition of more modifications to this module is associated with further increased expression. Our data suggest that these histone modifications may act cooperatively to prepare chromatin for transcriptional activation.

Histone Methylation Marks Play Important Roles in Predicting the Methylation Status of CpG Islands

The methylation status of CpG islands is highly correlated with gene expression. Current methods for computational prediction of DNA methylation only utilize DNA sequence features. In this study, besides 35 DNA sequence features, we added four histone methylation marks to predict the methylation status of CpG islands, and improved the accuracy to 89.94%. We also applied our model to predict the methylation pattern of all the CpG islands in the human genome, and the results are consistent with previous reports. Our results imply the important roles of histone methylation marks in affecting the methylation status of CpG islands. H3K4me enriched in the methylation-resistant CpG islands could disrupt the contacts between nucleosomes, unravel chromatin, and make DNA sequences accessible. The established open environment may be a prerequisite for, or a consequence of, the function of zinc finger proteins that could protect CpG islands from DNA methylation.

Poly A- Transcripts Expressed in HeLa Cells

Transcripts expressed in eukaryotes are classified as poly A+ transcripts or poly A- transcripts based on the presence or absence of the 3' poly A tail. Most transcripts identified so far are poly A+ transcripts, whereas the poly A- transcripts remain largely unknown.

Regulation of the PDK4 Isozyme by the Rb-E2F1 Complex

Loss of the transcription factor E2F1 elicits a complex metabolic phenotype in mice underscored by reduced adiposity and protection from high fat diet-induced diabetes. Here, we demonstrate that E2F1 directly regulates the gene encoding PDK4 (pyruvate dehydrogenase kinase 4), a key nutrient sensor and modulator of glucose homeostasis that is chronically elevated in obesity and diabetes and acutely induced under the metabolic stress of starvation or fasting. We show that loss of E2F1 in vivo blunts PDK4 expression and improves myocardial glucose oxidation. The absence of E2F1 also corresponds to lower blood glucose levels, improved plasma lipid profile, and increased sensitivity to insulin stimulation. Consistently, enforced E2F1 expression up-regulates PDK4 levels and suppresses glucose oxidation in C(2)C(12) myoblasts. Furthermore, inactivation of Rb, the repressor of E2F-dependent transcription, markedly induces PDK4 and triggers the enrichment of E2F1 occupancy onto the PDK4 promoter as detected by chromatin immunoprecipitation analysis. Two overlapping E2F binding sites were identified on this promoter. Transactivation assays later verified E2F1 responsiveness of this promoter element in C(2)C(12) myoblasts and IMR90 fibroblasts, an effect that was completely abrogated following mutation of the E2F sites. Taken together, our data illustrate how the E2F1 mitogen directly regulates PDK4 levels and influences cellular bioenergetics, namely mitochondrial glucose oxidation. These results are relevant to the pathophysiology of chronic diseases like obesity and diabetes, where PDK4 is dysregulated and could have implications pertinent to the etiology of tumor metabolism, especially in cancers with Rb pathway defects.

ZOOM! Zillions of Oligos Mapped

The next generation sequencing technologies are generating billions of short reads daily. Resequencing and personalized medicine need much faster software to map these deep sequencing reads to a reference genome, to identify SNPs or rare transcripts.

Defining the Regulatory Network of the Tissue-specific Splicing Factors Fox-1 and Fox-2

The precise regulation of many alternative splicing (AS) events by specific splicing factors is essential to determine tissue types and developmental stages. However, the molecular basis of tissue-specific AS regulation and the properties of splicing regulatory networks (SRNs) are poorly understood. Here we comprehensively predict the targets of the brain- and muscle-specific splicing factor Fox-1 (A2BP1) and its paralog Fox-2 (RBM9) and systematically define the corresponding SRNs genome-wide. Fox-1/2 are conserved from worm to human, and specifically recognize the RNA element UGCAUG. We integrate Fox-1/2-binding specificity with phylogenetic conservation, splicing microarray data, and additional computational and experimental characterization. We predict thousands of Fox-1/2 targets with conserved binding sites, at a false discovery rate (FDR) of approximately 24%, including many validated experimentally, suggesting a surprisingly extensive SRN. The preferred position of the binding sites differs according to AS pattern, and determines either activation or repression of exon recognition by Fox-1/2. Many predicted targets are important for neuromuscular functions, and have been implicated in several genetic diseases. We also identified instances of binding site creation or loss in different vertebrate lineages and human populations, which likely reflect fine-tuning of gene expression regulation during evolution.
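
The first step of such a prediction, locating candidate binding sites, can be sketched as a plain scan for the UGCAUG element named above. This is a minimal illustration only: the function name is hypothetical, and the sketch does not reproduce the conservation filtering, microarray integration, or FDR estimation described in the abstract.

```python
def find_motif_sites(rna, motif="UGCAUG"):
    """Return 0-based start positions of every (possibly overlapping) motif hit.

    DNA input is accepted and converted to RNA (T -> U) before scanning.
    """
    rna = rna.upper().replace("T", "U")
    sites = []
    start = rna.find(motif)
    while start != -1:
        sites.append(start)
        # Advance by one base so overlapping occurrences are not missed.
        start = rna.find(motif, start + 1)
    return sites
```

Because the abstract reports that the position of a site relative to the regulated exon determines activation or repression, a fuller analysis would record each hit's coordinates relative to the exon boundaries rather than just its offset in the scanned sequence.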

Integrative Bioinformatics Analysis of Transcriptional Regulatory Programs in Breast Cancer Cells

Microarray technology has unveiled transcriptomic differences among tumors of various phenotypes and, in particular, has brought great progress in the molecular understanding of the phenotypic diversity of breast tumors. However, compared with the massive knowledge about the transcriptome, we have surprisingly little knowledge about the regulatory mechanisms underlying transcriptomic diversity.

High-resolution Human Core-promoter Prediction with CoreBoost_HM

Correctly locating the gene transcription start site and the core-promoter is important for understanding transcriptional regulation mechanisms. Here we have integrated specific genome-wide histone modification and DNA sequence features together to predict RNA polymerase II core-promoters in the human genome. Our new predictor CoreBoost_HM outperforms existing promoter prediction algorithms by providing significantly higher sensitivity and specificity at high resolution. We demonstrated that even though the histone modification data used in this study are from a specific cell type (CD4+ T-cell), our method can be used to identify both active and repressed promoters. We have applied it to search the upstream regions of microRNA genes, and show that CoreBoost_HM can accurately identify the known promoters of the intergenic microRNAs. We also identified a few intronic microRNAs that may have their own promoters. This result suggests that our new method can help to identify and characterize the core-promoters of both coding and noncoding genes.

Multi-stage Analysis of Gene Expression and Transcription Regulation in C57/B6 Mouse Liver Development

The liver performs a number of essential functions for life. The development of such a complex organ relies on finely regulated gene expression profiles, which change over time during development and determine the phenotype and function of the liver. We used high-density oligonucleotide microarrays to study gene expression and transcription regulation at 14 time points across C57/B6 mouse liver development, which include E11.5 (embryonic day 11.5), E12.5, E13.5, E14.5, E15.5, E16.5, E17.5, E18.5, Day0 (the day of birth), Day3, Day7, Day14, Day21, and normal adult liver. With these data, we performed a comprehensive analysis of gene expression patterns, functional preferences, and transcriptional regulation during liver development. A group of uncharacterized genes that might be involved in fetal hematopoiesis was detected.

The Seventh Asia Pacific Bioinformatics Conference (APBC2009)

Gene Set-based Module Discovery in the Breast Cancer Transcriptome

Although microarray-based studies have revealed a global view of gene expression in cancer cells, we still have little knowledge about the regulatory mechanisms underlying the transcriptome. Several computational methods applied to yeast data have recently succeeded in identifying expression modules, which are defined as co-expressed gene sets under common regulatory mechanisms. However, such module discovery methods have not been applied to cancer transcriptome data.

Determination of Enriched Histone Modifications in Non-genic Portions of the Human Genome

Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) has recently been used to identify the modification patterns for the methylation and acetylation of many different histone tails in genes and enhancers.

The Transcriptome of Human CD34+ Hematopoietic Stem-progenitor Cells

Studying gene expression at different hematopoietic stages provides insights for understanding the genetic basis of hematopoiesis. We analyzed gene expression in human CD34(+) hematopoietic cells that represent the stem-progenitor population (CD34(+) cells). We collected >459,000 transcript signatures from CD34(+) cells, including the de novo-generated 3' ESTs and the existing sequences of full-length cDNAs, ESTs, and serial analysis of gene expression (SAGE) tags, and performed an extensive annotation on this large set of CD34(+) transcript sequences. We determined the genes expressed in CD34(+) cells, verified the known genes and identified the new genes of different functional categories involved in hematopoiesis, dissected the alternative gene expression including alternative transcription initiation, splicing, and adenylation, identified the antisense and noncoding transcripts, determined the CD34(+) cell-specific gene expression signature, and developed the CD34(+) cell-transcription map in the human genome. Our study provides a current view on gene expression in human CD34(+) cells and reveals that early hematopoiesis is an orchestrated process with the involvement of over half of the human genes distributed in various functions. The data generated from our study provide a comprehensive and uniform resource for studying hematopoiesis and stem cell biology.

An Integrative Genomics Approach Identifies Hypoxia Inducible Factor-1 (HIF-1)-target Genes That Form the Core Response to Hypoxia

The transcription factor hypoxia-inducible factor 1 (HIF-1) plays a central role in the transcriptional response to oxygen flux. To gain insight into the molecular pathways regulated by HIF-1, it is essential to identify the downstream-target genes. We report here a strategy to identify HIF-1-target genes based on an integrative genomic approach combining computational strategies and experimental validation. To identify HIF-1-target genes, microarray data sets were used to rank genes based on their differential response to hypoxia. The proximal promoters of these genes were then analyzed for the presence of conserved HIF-1-binding sites. Genes were scored and ranked based on their response to hypoxia and their HIF-binding site score. Using this strategy, we recovered 41% of the previously confirmed HIF-1-target genes that responded to hypoxia in the microarrays and provide a catalogue of predicted HIF-1 targets. We present experimental validation for ANKRD37 as a novel HIF-1-target gene. Together, these analyses demonstrate the potential to recover novel HIF-1-target genes and to discover mammalian regulatory elements operative in the context of microarray data sets.

High Definition Profiling of Mammalian DNA Methylation by Array Capture and Single Molecule Bisulfite Sequencing

DNA methylation stabilizes developmentally programmed gene expression states. Aberrant methylation is associated with disease progression and is a common feature of cancer genomes. Presently, few methods enable quantitative, large-scale, single-base resolution mapping of DNA methylation states in desired regions of a complex mammalian genome. Here, we present an approach that combines array-based hybrid selection and massively parallel bisulfite sequencing to profile DNA methylation in genomic regions spanning hundreds of thousands of bases. This single molecule strategy enables methylation variable positions to be quantitatively examined with high sampling precision. Using bisulfite capture, we assessed methylation patterns across 324 randomly selected CpG islands (CGI) representing more than 25,000 CpG sites. A single lane of Illumina sequencing permitted methylation states to be definitively called for >90% of target sites. The accuracy of the hybrid-selection approach was verified using conventional bisulfite capillary sequencing of cloned PCR products amplified from a subset of the selected regions. This confirmed that even partially methylated states could be successfully called. A comparison of human primary and cancer cells revealed multiple differentially methylated regions. More than 25% of islands showed complex methylation patterns either with partial methylation states defining the entire CGI or with contrasting methylation states appearing in specific regional blocks within the island. We observed that transitions in methylation state often correlate with genomic landmarks, including transcriptional start sites and intron-exon junctions. Methylation, along with specific histone marks, was enriched in exonic regions, suggesting that chromatin states can foreshadow the content of mature mRNAs.

Updates to the RMAP Short-read Mapping Software

We report on a major new version of the RMAP software for mapping reads from short-read sequencing technology. General improvements to accuracy and space requirements are included, along with novel functionality. Included in the RMAP software package are tools for mapping paired-end reads, mapping with more sophisticated use of quality scores, collecting ambiguous mapping locations, and mapping bisulfite-treated reads.

SFSSClass: an Integrated Approach for MiRNA Based Tumor Classification

MicroRNA (miRNA) expression profiling data has recently been found to be particularly important in cancer research and can be used as a diagnostic and prognostic tool. Current approaches of tumor classification using miRNA expression data do not integrate the experimental knowledge available in the literature. A judicious integration of such knowledge with effective miRNA and sample selection through a biclustering approach could be an important step in improving the accuracy of tumor classification.

Transcriptome Study for Early Hematopoiesis--achievement, Challenge and New Opportunity

Hematopoietic stem progenitor cells are the source for the entire hematopoietic system. Studying gene expression in hematopoietic stem progenitor cells will provide information to understand the genetic programs controlling early hematopoiesis, and to identify gene targets for intervening in hematopoietic disorders. Extensive efforts using cell biology, molecular biology, and genomics approaches have generated rich knowledge of the genes and functional pathways involved in early hematopoiesis. Challenges remain, however, including the rarity of hematopoietic stem progenitor cells, which sets a physical limitation on their study; the difficulty of reaching comprehensive transcriptome detection with conventional genomics technologies; and the difficulty of using conventional biological methods to identify, among the large number of expressed genes, the key genes controlling stem cell self-renewal and differentiation. The newly developed single-cell transcriptome method and next-generation DNA sequencing technology provide new opportunities for transcriptome studies of early hematopoiesis. Using a systems biology approach may provide insight into the genetic mechanisms controlling early hematopoiesis.

Development of the Human Cancer MicroRNA Network

MicroRNAs are a class of small noncoding RNAs that are abnormally expressed in different cancer cells. The molecular signatures of miRNAs in different malignancies suggest that they are not only actively involved in the pathogenesis of human cancer but also have a significant role in patient survival. The differential expression patterns of specific miRNAs in a specific cancer tissue type have been reported in hundreds of research articles. However, only limited attempts have been made to collate this multitude of information and obtain a global perspective of miRNA dysregulation in multiple cancer types.

A Long Nuclear-retained Non-coding RNA Regulates Synaptogenesis by Modulating Gene Expression

A growing number of long nuclear-retained non-coding RNAs (ncRNAs) have recently been described. However, few functions have been elucidated for these ncRNAs. Here, we have characterized the function of one such ncRNA, identified as metastasis-associated lung adenocarcinoma transcript 1 (Malat1). Malat1 RNA is expressed in numerous tissues and is highly abundant in neurons. It is enriched in nuclear speckles only when RNA polymerase II-dependent transcription is active. Knock-down studies revealed that Malat1 modulates the recruitment of SR family pre-mRNA-splicing factors to the transcription site of a transgene array. DNA microarray analysis in Malat1-depleted neuroblastoma cells indicates that Malat1 controls the expression of genes involved not only in nuclear processes, but also in synapse function. In cultured hippocampal neurons, knock-down of Malat1 decreases synaptic density, whereas its over-expression results in a cell-autonomous increase in synaptic density. Our results suggest that Malat1 regulates synapse formation by modulating the expression of genes involved in synapse formation and/or maintenance.

Comparison of Sequencing-based Methods to Profile DNA Methylation and Identification of Monoallelic Epigenetic Modifications

Analysis of DNA methylation patterns relies increasingly on sequencing-based profiling methods. The four most frequently used sequencing-based technologies are the bisulfite-based methods MethylC-seq and reduced representation bisulfite sequencing (RRBS), and the enrichment-based techniques methylated DNA immunoprecipitation sequencing (MeDIP-seq) and methylated DNA binding domain sequencing (MBD-seq). We applied all four methods to biological replicates of human embryonic stem cells to assess their genome-wide CpG coverage, resolution, cost, concordance and the influence of CpG density and genomic context. The methylation levels assessed by the two bisulfite methods were concordant (their difference did not exceed a given threshold) for 82% of CpGs and 99% of the non-CpG cytosines. Using binary methylation calls, the two enrichment methods were 99% concordant and regions assessed by all four methods were 97% concordant. We combined MeDIP-seq with methylation-sensitive restriction enzyme (MRE-seq) sequencing for comprehensive methylome coverage at lower cost. This, along with RNA-seq and ChIP-seq of the ES cells, enabled us to detect regions with allele-specific epigenetic states, identifying most known imprinted regions and new loci with monoallelic epigenetic marks and monoallelic expression.
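
The threshold-based notion of concordance used above can be sketched as follows. The function name, the dictionary layout (site identifier mapped to a methylation level between 0 and 1), and the default threshold of 0.25 are all assumptions for illustration; the abstract states only that the difference "did not exceed a given threshold".

```python
def concordance(levels_a, levels_b, max_diff=0.25):
    """Fraction of shared sites whose methylation levels agree within max_diff.

    levels_a and levels_b map a site identifier to a methylation level
    in [0, 1]. Only sites covered by both methods are compared.
    max_diff is a hypothetical threshold, not the value used in the study.
    """
    shared = levels_a.keys() & levels_b.keys()
    if not shared:
        raise ValueError("no sites covered by both methods")
    agree = sum(
        1 for site in shared
        if abs(levels_a[site] - levels_b[site]) <= max_diff
    )
    return agree / len(shared)
```

Restricting the comparison to sites covered by both methods matters because the four technologies differ in genome-wide CpG coverage, so a concordance figure is only meaningful over their intersection.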

Histone Modification Profiles Are Predictive for Tissue/cell-type Specific Expression of Both Protein-coding and MicroRNA Genes

Gene expression is regulated at both the DNA sequence level and through modification of chromatin. However, the effect of chromatin on tissue/cell-type specific gene regulation (TCSR) is largely unknown. In this paper, we present a method to elucidate the relationship between histone modification/variation (HMV) and TCSR.

ChIP-Array: Combinatory Analysis of ChIP-seq/chip and Microarray Gene Expression Data to Discover Direct/indirect Targets of a Transcription Factor

Chromatin immunoprecipitation (ChIP) coupled with high-throughput techniques (ChIP-X), such as next generation sequencing (ChIP-Seq) and microarray (ChIP-chip), has been successfully used to map active transcription factor binding sites (TFBS) of a transcription factor (TF). The targeted genes can be activated or suppressed by the TF, or are unresponsive to the TF. Microarray technology has been used to measure the actual expression changes of thousands of genes under the perturbation of a TF, but is unable to determine if the affected genes are direct or indirect targets of the TF. Furthermore, both ChIP-X and microarray methods produce a large number of false positives. Combining microarray expression profiling and ChIP-X data allows more effective TFBS analysis for studying the function of a TF. However, current web servers only provide tools to analyze either ChIP-X or expression data, but not both. Here, we present ChIP-Array, a web server that integrates ChIP-X and expression data from human, mouse, yeast, fruit fly and Arabidopsis. This server will assist biologists in detecting direct and indirect target genes regulated by a TF of interest and will aid in the functional characterization of the TF. ChIP-Array is available at http://jjwanglab.hku.hk/ChIP-Array, with free access for academic users.
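
The basic idea of combining the two data types can be sketched with set operations. This is a deliberately simplified illustration, not the ChIP-Array server's actual procedure: the function name and category labels are hypothetical, and treating every responsive but unbound gene as a candidate indirect target is an assumption.

```python
def classify_targets(bound_genes, de_genes):
    """Split genes by ChIP-X binding and expression response to a TF.

    bound_genes: genes with a TF binding site from ChIP-X data.
    de_genes: genes differentially expressed after perturbing the TF.
    The three categories are a simplification for illustration.
    """
    bound, de = set(bound_genes), set(de_genes)
    return {
        # Bound by the TF and responsive to it.
        "direct": sorted(bound & de),
        # Responsive but not bound: possibly regulated through another factor.
        "candidate_indirect": sorted(de - bound),
        # Bound but showing no expression change.
        "bound_unresponsive": sorted(bound - de),
    }
```

Requiring both lines of evidence for a direct target is one way to reduce the false positives that the abstract attributes to each method on its own.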

Functional Mutation of SMAC/DIABLO, Encoding a Mitochondrial Proapoptotic Protein, Causes Human Progressive Hearing Loss DFNA64

SMAC/DIABLO is a mitochondrial proapoptotic protein that is released from mitochondria during apoptosis and counters the inhibitory activities of inhibitor of apoptosis proteins, IAPs. By linkage analysis and candidate screening, we identified a heterozygous SMAC/DIABLO mutation, c.377C>T (p.Ser126Leu, refers to p.Ser71Leu in the mature protein) in a six-generation Chinese kindred characterized by dominant progressive nonsyndromic hearing loss, designated as DFNA64. SMAC/DIABLO is highly expressed in human embryonic ears and is enriched in the developing mouse inner-ear hair cells, suggesting it has a role in the development and homeostasis of hair cells. We used a functional study to demonstrate that the SMAC/DIABLO(S71L) mutant, while retaining the proapoptotic function, triggers significant degradation of both wild-type and mutant SMAC/DIABLO and renders host mitochondria susceptible to calcium-induced loss of the membrane potential. Our work identifies DFNA64 as the human genetic disorder associated with SMAC/DIABLO malfunction and suggests that mutant SMAC/DIABLO(S71L) might cause mitochondrial dysfunction.

Direct Cloning of Double-stranded RNAs from RNase Protection Analysis Reveals Processing Patterns of C/D Box SnoRNAs and Provides Evidence for Widespread Antisense Transcript Expression

We describe a new method that allows cloning of double-stranded RNAs (dsRNAs) that are generated in RNase protection experiments. We demonstrate that the mouse C/D box snoRNA MBII-85 (SNORD116) is processed into at least five shorter RNAs using processing sites near known functional elements of C/D box snoRNAs. Surprisingly, the majority of cloned RNAs from RNase protection experiments were derived from endogenous cellular RNA, indicating widespread antisense expression. The cloned dsRNAs could be mapped to genome areas that show RNA expression on both DNA strands and partially overlapped with experimentally determined argonaute-binding sites. The data suggest a conserved processing pattern for some C/D box snoRNAs and abundant expression of longer, non-coding RNAs in the cell that can potentially form dsRNAs.

Correlated Evolution of Transcription Factors and Their Binding Sites

The interaction between transcription factor (TF) and transcription factor binding site (TFBS) is essential for gene regulation. Mutation in either the TF or the TFBS may weaken their interaction and thus result in abnormalities. To maintain such vital interaction, a mutation in one of the interacting partners might be compensated by a corresponding mutation in its binding partner during the course of evolution. Confirming this co-evolutionary relationship will guide us in designing protein sequences to target a specific DNA sequence or in predicting TFBS for poorly studied proteins, or even correcting and rescuing disease mutations in clinical applications.

SpliceTrap: a Method to Quantify Alternative Splicing Under Single Cellular Conditions

Alternative splicing (AS) is a pre-mRNA maturation process leading to the expression of multiple mRNA variants from the same primary transcript. More than 90% of human genes are expressed via AS. Therefore, quantifying the inclusion level of every exon is crucial for generating accurate transcriptomic maps and studying the regulation of AS.

Study of FoxA Pioneer Factor at Silent Genes Reveals Rfx-repressed Enhancer at Cdx2 and a Potential Indicator of Esophageal Adenocarcinoma Development

Understanding how silent genes can be competent for activation provides insight into development as well as cellular reprogramming and pathogenesis. We performed genomic location analysis of the pioneer transcription factor FoxA in the adult mouse liver and found that about one-third of the FoxA bound sites are near silent genes, including genes without detectable RNA polymerase II. Virtually all of the FoxA-bound silent sites are within conserved sequences, suggesting possible function. Such sites are enriched in motifs for transcriptional repressors, including for Rfx1 and type II nuclear hormone receptors. We found one such target site at a cryptic "shadow" enhancer 7 kilobases (kb) downstream of the Cdx2 gene, where Rfx1 restricts transcriptional activation by FoxA. The Cdx2 shadow enhancer exhibits a subset of regulatory properties of the upstream Cdx2 promoter region. While Cdx2 is ectopically induced in the early metaplastic condition of Barrett's esophagus, its expression is not necessarily present in progressive Barrett's with dysplasia or adenocarcinoma. By contrast, we find that Rfx1 expression in the esophageal epithelium becomes gradually extinguished during progression to cancer, i.e., expression of Rfx1 decreased markedly in dysplasia and adenocarcinoma. We propose that this decreased expression of Rfx1 could be an indicator of progression from Barrett's esophagus to adenocarcinoma and that similar analyses of other transcription factors bound to silent genes can reveal unanticipated regulatory insights into oncogenic progression and cellular reprogramming.

EpiRegNet: Constructing Epigenetic Regulatory Network from High Throughput Gene Expression Data for Humans

The advances of high throughput profiling methods, such as microarray gene profiling and RNA-seq, have enabled researchers to identify thousands of differentially expressed genes under a certain perturbation. Much work has been done to understand the genetic factors that contribute to the expression changes by searching the over-represented regulatory motifs in the promoter regions of these genes. However, the changes could also be caused by epigenetic regulation, especially histone modifications, and no web server has been constructed to study the epigenetic factors responsible for gene expression changes. Here, we present a web tool for this purpose. Provided with different categories of genes (e.g., up-regulated, down-regulated or unchanged genes), the server will find epigenetic factors responsible for the difference among the categories and construct an epigenetic regulatory network. Furthermore, it will perform co-localization analyses between these epigenetic factors and transcription factors, which were collected from large scale experimental ChIP-seq or computationally predicted data. In addition, for users who want to analyze dynamic change of a histone modification mark under different cell conditions, the server will find direct and indirect target genes of this mark by integrative analysis of experimental data and computational prediction, and present a regulatory network around this mark. Both networks can be visualized by a user friendly interface and the data are downloadable in batch. The server currently supports 12 cell types in human, including ESC and CD4+ T cells, and will expand as more public data are available. It also allows users to create a self-defined cell type and to upload and analyze multiple ChIP-seq data sets. It is freely available to academic users at http://jjwanglab.org/EpiRegNet.

Identification of Tumor Suppressors and Oncogenes from Genomic and Epigenetic Features in Ovarian Cancer

The identification of genetic and epigenetic alterations from primary tumor cells has become a common method to identify genes critical to the development and progression of cancer. We seek to identify those genetic and epigenetic aberrations that have the most impact on gene function within the tumor. First, we perform a bioinformatic analysis of copy number variation (CNV) and DNA methylation covering the genetic landscape of ovarian cancer tumor cells. We separately examined CNV and DNA methylation for 42 primary serous ovarian cancer samples using MOMA-ROMA assays and 379 tumor samples analyzed by The Cancer Genome Atlas. We have identified 346 genes with significant deletions or amplifications among the tumor samples. Utilizing associated gene expression data, we predict 156 genes with altered copy number and correlated changes in expression. Among these genes, CCNE1, POP4, UQCRB, PHF20L1 and C19orf2 were identified within both data sets. We were specifically interested in copy number variation as our base genomic property in the prediction of tumor suppressors and oncogenes in the altered ovarian tumor. We therefore identify changes in DNA methylation and expression for all amplified and deleted genes. We statistically define tumor suppressor and oncogenic features for these modalities and perform a correlation analysis with expression. We predicted 611 potential oncogene and tumor suppressor candidates by integrating these data types. Genes with a strong correlation for methylation-dependent expression changes exhibited at varying copy number aberrations include CDCA8, ATAD2, CDKN2A, RAB25, AURKA, BOP1 and EIF2C3. We provide copy number variation and DNA methylation analysis for over 11,500 individual genes covering the genetic landscape of ovarian cancer tumors. We show the extent of genomic and epigenetic alterations for known tumor suppressors and oncogenes and also use these defined features to identify potential ovarian cancer gene candidates.

Identification of Novel Androgen-regulated Pathways and mRNA Isoforms Through Genome-wide Exon-specific Profiling of the LNCaP Transcriptome

Androgens drive the onset and progression of prostate cancer (PCa) by modulating androgen receptor (AR) transcriptional activity. Although several microarray-based studies have identified androgen-regulated genes, here we identify in-parallel global androgen-dependent changes in both gene and alternative mRNA isoform expression by exon-level analyses of the LNCaP transcriptome. While genome-wide gene expression changes correlated well with previously-published studies, we additionally uncovered a subset of 226 novel androgen-regulated genes. Gene expression pathway analysis of this subset revealed gene clusters associated with, and including the tyrosine kinase LYN, as well as components of the mTOR (mammalian target of rapamycin) pathway, which is commonly dysregulated in cancer. We also identified 1279 putative androgen-regulated alternative events, of which 325 (∼25%) mapped to known alternative splicing events or alternative first/last exons. We selected 30 androgen-dependent alternative events for RT-PCR validation, including mRNAs derived from genes encoding tumour suppressors and cell cycle regulators. Of seven positively-validating events (∼23%), five events involved transcripts derived from alternative promoters of known AR gene targets. In particular, we found a novel androgen-dependent mRNA isoform derived from an alternative internal promoter within the TSC2 tumour suppressor gene, which is predicted to encode a protein lacking an interaction domain required for mTOR inhibition. We confirmed that expression of this alternative TSC2 mRNA isoform was directly regulated by androgens, and chromatin immunoprecipitation indicated recruitment of AR to the alternative promoter region at early timepoints following androgen stimulation, which correlated with expression of alternative transcripts. Together, our data suggest that alternative mRNA isoform expression might mediate the cellular response to androgens, and may have roles in clinical PCa.

Inferring Haplotypes of Copy Number Variations from High-throughput Data with Uncertainty

Accurate information on haplotypes and diplotypes (haplotype pairs) is required for population-genetic analyses; however, microarrays do not provide data on a haplotype or diplotype at a copy number variation (CNV) locus; they only provide data on the total number of copies over a diplotype or an unphased sequence genotype (e.g., AAB, unlike AB of single nucleotide polymorphism). Moreover, such copy numbers or genotypes are often incorrectly determined when microarray signal intensities derived from different copy numbers or genotypes are not clearly separated due to noise. Here we report an algorithm to infer CNV haplotypes and individuals' diplotypes at multiple loci from noisy microarray data, utilizing the probability that a signal intensity may be derived from different underlying copy numbers or genotypes. Performing simulation studies based on known diplotypes and an error model obtained from real microarray data, we demonstrate that this probabilistic approach succeeds in accurate inference (error rate: 1-2%) from noisy data, whereas previous deterministic approaches failed (error rate: 12-18%). Applying this algorithm to real microarray data, we estimated haplotype frequencies and diplotypes in 1486 CNV regions for 100 individuals. Our algorithm will facilitate accurate population-genetic analyses and powerful disease association studies of CNVs.
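
The contrast between a deterministic call and a probabilistic one can be illustrated with a minimal sketch. It is not the published algorithm: it assumes, purely for illustration, that the signal intensity for each copy number is Gaussian with a known mean and a shared standard deviation, and the names `copy_number_posterior` and `hard_call` are hypothetical. It shows only how one noisy intensity can be kept as a probability over copy numbers instead of being forced into a single value.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def copy_number_posterior(intensity, means, sd, prior=None):
    """Posterior probability of each copy number given one signal intensity.

    `means` maps copy number -> expected intensity (an assumed error model).
    `prior` maps copy number -> prior probability (uniform if omitted).
    """
    copies = sorted(means)
    if prior is None:
        prior = {c: 1.0 / len(copies) for c in copies}
    weights = {c: prior[c] * gaussian_pdf(intensity, means[c], sd) for c in copies}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def hard_call(posterior):
    """Deterministic call: keep only the most probable copy number."""
    return max(posterior, key=posterior.get)


# Hypothetical error model: expected intensities for 1, 2 and 3 copies.
means = {1: 1.0, 2: 2.0, 3: 3.0}
clear = copy_number_posterior(2.0, means, sd=0.5)
ambiguous = copy_number_posterior(2.5, means, sd=0.5)
```

For the ambiguous intensity, the posterior splits its weight evenly between two and three copies. A hard call discards that uncertainty, whereas a downstream haplotype inference step could weigh both possibilities.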

Novel Markov Model of Induced Pluripotency Predicts Gene Expression Changes in Reprogramming

Somatic cells can be reprogrammed to induced pluripotent stem cells (iPSCs) by introducing a few reprogramming factors, which challenges the long-held view that cell differentiation is irreversible. However, the mechanism of induced pluripotency is still unknown.

A Highly Efficient and Effective Motif Discovery Method for ChIP-seq/ChIP-chip Data Using Positional Information

Identification of DNA motifs from ChIP-seq/ChIP-chip [chromatin immunoprecipitation (ChIP)] data is a powerful method for understanding the transcriptional regulatory network. However, most established methods are designed for small sample sizes and are inefficient for ChIP data. Here we propose a new k-mer occurrence model to reflect the fact that functional DNA k-mers often cluster around ChIP peak summits. With this model, we introduced a new measure to discover functional k-mers. Using simulation, we demonstrated that our method is more robust against noise in ChIP data than available methods. A novel word clustering method is also implemented to group similar k-mers into position weight matrices (PWMs). Our method was applied to a diverse set of ChIP experiments to demonstrate its high sensitivity and specificity. Importantly, our method is much faster than several other methods for large sample sizes. Thus, we have developed an efficient and effective motif discovery method for ChIP experiments.
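
The positional idea behind the abstract, that functional k-mers tend to cluster around peak summits, can be sketched in a few lines. This is an illustration only, not the published occurrence model or its scoring measure. It assumes each input sequence is centered on its peak summit, and the function names and the `window` parameter are hypothetical.

```python
from collections import Counter


def kmer_positions(seq, k):
    """Yield (kmer, offset of the k-mer midpoint from the sequence center)."""
    center = len(seq) / 2.0
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], (i + k / 2.0) - center


def summit_counts(seqs, k, window):
    """Count each k-mer near the summit (|offset| <= window) and elsewhere."""
    near, far = Counter(), Counter()
    for seq in seqs:
        for kmer, offset in kmer_positions(seq.upper(), k):
            if abs(offset) <= window:
                near[kmer] += 1
            else:
                far[kmer] += 1
    return near, far


def central_fraction(near, far, kmer):
    """Fraction of a k-mer's occurrences that fall near the summit."""
    total = near[kmer] + far[kmer]
    return near[kmer] / total if total else 0.0


# Toy summit-centered sequences with GATA placed at the center.
seqs = ["TTTTTTTTGATATTTTTTTT", "ttttttttgatatttttttt"]
near, far = summit_counts(seqs, k=4, window=2)
```

In this toy input, every occurrence of `GATA` falls within the central window while every `TTTT` falls outside it, so `central_fraction` separates the two. A real analysis would also need a background expectation and a statistical test, which this sketch omits.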

Cell-type-based Analysis of MicroRNA Profiles in the Mouse Brain

MicroRNAs (miRNAs) are implicated in brain development and function, but the underlying mechanisms have been difficult to study, in part due to the cellular heterogeneity in neural circuits. To systematically analyze miRNA expression in neurons, we have established a miRNA tagging and affinity-purification (miRAP) method that is targeted to cell types through the Cre-loxP binary system in mice. Our studies of the neocortex and cerebellum reveal the expression of a large fraction of known miRNAs with distinct profiles in glutamatergic and GABAergic neurons and subtypes of GABAergic neurons. We further detected putative novel miRNAs, tissue or cell type-specific strand selection of miRNAs, and miRNA editing. Our method thus will facilitate a systematic analysis of miRNA expression and regulation in specific neuron types in the context of neuronal development, physiology, plasticity, pathology, and disease models, and is generally applicable to other cell types and tissues.

H3K4 Demethylation by Jarid1a and Jarid1b Contributes to Retinoblastoma-mediated Gene Silencing During Cellular Senescence

Cellular senescence is a tumor-suppressive program that involves chromatin reorganization and specific changes in gene expression that trigger an irreversible cell-cycle arrest. Here we combine quantitative mass spectrometry, ChIP deep-sequencing, and functional studies to determine the role of histone modifications on chromatin structure and gene-expression alterations associated with senescence in primary human cells. We uncover distinct senescence-associated changes in histone-modification patterns consistent with a repressive chromatin environment and link the establishment of one of these patterns--loss of H3K4 methylation--to the retinoblastoma tumor suppressor and the H3K4 demethylases Jarid1a and Jarid1b. Our results show that Jarid1a/b-mediated H3K4 demethylation contributes to silencing of retinoblastoma target genes in senescent cells, suggesting a mechanism by which retinoblastoma triggers gene silencing. Therefore, we link the Jarid1a and Jarid1b demethylases to a tumor-suppressor network controlling cellular senescence.

Bivalent-like Chromatin Markers Are Predictive for Transcription Start Site Distribution in Human

Deep sequencing of 5' capped transcripts has revealed a variety of transcription initiation patterns, from narrow, focused promoters to wide, broad promoters. Attempts have already been made to model empirically classified patterns, but virtually no quantitative models for transcription initiation have been reported. Even though both genetic and epigenetic elements have been associated with such patterns, the organization of regulatory elements is largely unknown. Here, linear regression models were derived from a pool of regulatory elements, including genomic DNA features, nucleosome organization, and histone modifications, to predict the distribution of transcription start sites (TSS). Importantly, models including both active and repressive histone modification markers, e.g. H3K4me3 and H4K20me1, were consistently found to be much more predictive than models with only single-type histone modification markers, indicating the possibility of "bivalent-like" epigenetic control of transcription initiation. The nucleosome positions are proposed to be coded in the active component of such bivalent-like histone modification markers. Finally, we demonstrated that models trained on one cell type could successfully predict TSS distribution in other cell types, suggesting that these models may have a broader application range.

Chromatin State and MicroRNA Determine Different Gene Expression Dynamics Responsive to TNF Stimulation

Gene expression is a dynamic process, but the factors that influence gene expression changes upon external stimulus are not clearly understood. We studied gene expression profiles in human umbilical vein endothelial cells (HUVEC) after Tumor Necrosis Factor (TNF) stimulation and found that the promoters of fast-response up-regulated genes were enriched with several "active" chromatin markers such as H3K27ac and H3K4me3, and were also preferentially bound by Pol II and c-Myc; the core-promoter regions of slow-response up-regulated genes were frequently occupied by nucleosomes; and down-regulated genes were more intensively regulated by microRNAs. Moreover, Gene Ontology and motif analysis of the promoter regions revealed that gene clusters with different response behaviors had different functions and were regulated by different sets of transcription factors. Our observations suggest that the different gene expression patterns upon external stimulus are regulated by a combination of multi-layer regulators.

Genome-wide Localization of Protein-DNA Binding and Histone Modification by a Bayesian Change-point Method with ChIP-seq Data

Next-generation sequencing (NGS) technologies have matured considerably since their introduction and a focus has been placed on developing sophisticated analytical tools to deal with the amassing volumes of data. Chromatin immunoprecipitation sequencing (ChIP-seq), a major application of NGS, is a widely adopted technique for examining protein-DNA interactions and is commonly used to investigate epigenetic signatures of diffuse histone marks. These datasets have notoriously high variance and subtle levels of enrichment across large expanses, making them exceedingly difficult to define. Windows-based, heuristic models and finite-state hidden Markov models (HMMs) have been used with some success in analyzing ChIP-seq data but with lingering limitations. To improve the ability to detect broad regions of enrichment, we developed a stochastic Bayesian Change-Point (BCP) method, which addresses some of these unresolved issues. BCP makes use of recent advances in infinite-state HMMs by obtaining explicit formulas for posterior means of read densities. These posterior means can be used to categorize the genome into enriched and unenriched segments, as is customarily done, or examined for more detailed relationships since the underlying subpeaks are preserved rather than simplified into a binary classification. BCP performs a near exhaustive search of all possible change points between different posterior means at high-resolution to minimize the subjectivity of window sizes and is computationally efficient, due to a speed-up algorithm and the explicit formulas it employs. In the absence of a well-established "gold standard" for diffuse histone mark enrichment, we corroborated BCP's island detection accuracy and reproducibility using various forms of empirical evidence. We show that BCP is especially suited for analysis of diffuse histone ChIP-seq data but also effective in analyzing punctate transcription factor ChIP datasets, making it widely applicable for numerous experiment types.

Regulatory Elements of Caenorhabditis elegans Ribosomal Protein Genes

ABSTRACT: BACKGROUND: Ribosomal protein genes (RPGs) are essential, tightly regulated, and highly expressed during embryonic development and cell growth. Even though their protein sequences are strongly conserved, their mechanism of regulation is not conserved across yeast, Drosophila, and vertebrates. A recent investigation of genomic sequences conserved across both nematode species and associated with different gene groups indicated the existence of several elements in the upstream regions of C. elegans RPGs, providing a new insight regarding the regulation of these genes in C. elegans. RESULTS: In this study, we performed an in-depth examination of C. elegans RPG regulation and found nine highly conserved motifs in the upstream regions of C. elegans RPGs using the motif discovery algorithm DME. Four motifs were partially similar to transcription factor binding sites from C. elegans, Drosophila, yeast, and human. One pair of these motifs was found to co-occur in the upstream regions of 250 transcripts including 22 RPGs. The distance between the two motifs displayed a complex frequency pattern that was related to their relative orientation. We tested the impact of three of these motifs on the expression of rpl-2 using a series of reporter gene constructs and showed that all three motifs are necessary to maintain the high natural expression level of this gene. One of the motifs was similar to the binding site of an orthologue of POP-1, and we showed that RNAi knockdown of pop-1 impacts the expression of rpl-2. We further determined the transcription start site of rpl-2 by 5' RACE and found that the motifs lie 40-90 bases upstream of the start site. We also found evidence that a noncoding RNA, contained within the outron of rpl-2, is co-transcribed with rpl-2 and cleaved during trans-splicing. CONCLUSIONS: Our results indicate that C. elegans RPGs are regulated by a complex novel series of regulatory elements that is evolutionarily distinct from those of all other species examined up until now.

Novel Foxo1-dependent Transcriptional Programs Control T(reg) Cell Function

Regulatory T (T(reg)) cells, characterized by expression of the transcription factor forkhead box P3 (Foxp3), maintain immune homeostasis by suppressing self-destructive immune responses. Foxp3 operates as a late-acting differentiation factor controlling T(reg) cell homeostasis and function, whereas the early T(reg)-cell-lineage commitment is regulated by the Akt kinase and the forkhead box O (Foxo) family of transcription factors. However, whether Foxo proteins act beyond the T(reg)-cell-commitment stage to control T(reg) cell homeostasis and function remains largely unexplored. Here we show that Foxo1 is a pivotal regulator of T(reg) cell function. T(reg) cells express high amounts of Foxo1 and display reduced T-cell-receptor-induced Akt activation, Foxo1 phosphorylation and Foxo1 nuclear exclusion. Mice with T(reg)-cell-specific deletion of Foxo1 develop a fatal inflammatory disorder similar in severity to that seen in Foxp3-deficient mice, but without the loss of T(reg) cells. Genome-wide analysis of Foxo1 binding sites reveals ~300 Foxo1-bound target genes, including the pro-inflammatory cytokine Ifng, that do not seem to be directly regulated by Foxp3. These findings show that the evolutionarily ancient Akt-Foxo1 signalling module controls a novel genetic program indispensable for T(reg) cell function.
