In an isogenic cell population, phenotypic heterogeneity among individual cells is common and critical for survival of the population under different environmental conditions. DNA modification is an important epigenetic factor that can regulate phenotypic heterogeneity. The single molecule real-time (SMRT) sequencing technology provides a unique platform for detecting a wide range of DNA modifications, including N6-methyladenine (6-mA), N4-methylcytosine (4-mC) and 5-methylcytosine (5-mC). Here we present qDNAmod, a novel bioinformatic tool for genome-wide quantitative profiling of intercellular heterogeneity of DNA modification from SMRT sequencing data. It is capable of estimating the proportion of isogenic haploid cells in which the same loci of the genome are differentially modified. We tested the reliability of qDNAmod with the SMRT sequencing data of Streptococcus pneumoniae strain ST556. qDNAmod detected extensive intercellular heterogeneity of DNA methylation (6-mA) in a clonal population of ST556. Subsequent biochemical analyses revealed that the recognition sequences of two type I restriction-modification (R-M) systems are responsible for the intercellular heterogeneity of DNA methylation initially identified by qDNAmod. qDNAmod thus represents a valuable tool for studying intercellular phenotypic heterogeneity from genome-wide DNA modification.
Drosophila Dscam1 is a cell-surface protein that plays important roles in neural development and axon tiling of neurons. It is known that thousands of isoforms bind themselves through specific homophilic interactions, a process which provides the basis for cellular self-recognition. Detailed biochemical studies of specific isoforms strongly suggest that homophilic binding, i.e. the formation of homodimers by identical Dscam1 isoforms, is of great importance for the self-avoidance of neurons. Due to experimental limitations, it is currently impossible to measure the homophilic binding affinities for all 19,000 potential isoforms.
Autosomal dominant types of nonsyndromic hearing loss (ADNSHL) are typically postlingual in onset and progressive. High genetic heterogeneity, late age of onset, and possible confounding due to nongenetic factors hinder timely molecular diagnoses for most patients. In this study, exome sequencing was applied to investigate a large Chinese family segregating ADNSHL in which we initially failed to find strong evidence of linkage to any locus by whole-genome linkage analysis. Two affected family members were selected for sequencing. We identified two novel mutations disrupting known ADNSHL genes and shared by the sequenced samples: c.328C>A in COCH (DFNA9) resulting in a p.Q110K substitution and a deletion c.2814_2815delAA in MYO6 (DFNA22) causing a frameshift alteration p.R939Tfs*2. The pathogenicity of novel coding variants in ADNSHL genes was carefully evaluated by analysis of co-segregation with phenotype in the pedigree and in light of established genotype-phenotype correlations. The frameshift deletion in MYO6 was confirmed as the causative variant for this pedigree, whereas the missense mutation in COCH had no clinical significance. The results allowed us to retrospectively identify the phenocopy in one patient that contributed to the negative finding in the linkage scan. Our clinical data also supported the emerging genotype-phenotype correlation for DFNA22.
We describe a large four-generation Chinese pedigree segregating MYH9-related disease caused by a V1516L mutation. The clinical findings supported previously established genotype-phenotype correlations, and also demonstrated interindividual variability of disease manifestations even within the same family. The same mutation was previously reported in another Chinese pedigree, where it resulted from a different DNA substitution. Analyzing the patterns of previously reported mutations revealed a limited spectrum of pathogenic variants. The implications of this finding are discussed.
Here, we report an unconventional Chinese pedigree consisting of three branches all segregating prelingual hearing loss (HL) with an unclear inheritance pattern. After identifying the cause of one branch as maternally inherited aminoglycoside-induced HL, targeted next generation sequencing (NGS) was applied to identify the genetic causes for the other two branches. One affected subject from each branch underwent targeted NGS, with genomic DNA enriched either by whole-exome capture (Agilent SureSelect All Exon 50 Mb) or by candidate gene capture (Agilent SureSelect custom kit). By NGS analysis, we identified that patients from Branch A were compound heterozygous for p.E1006K and p.D1663V in the CDH23 (DFNB12) gene, and patients from Branch B were homozygous for IVS7-2A>G in the SLC26A4 (DFNB4) gene. Both CDH23 mutations altered conserved calcium binding sites of the extracellular cadherin domains. The co-occurrence of three different genetic causes in this family was exceedingly rare but fully compatible with the mutation spectrum of HL. Our study has also raised several technical and analytical issues when applying the NGS technique to genetic testing. Journal of Human Genetics advance online publication, 18 September 2014; doi:10.1038/jhg.2014.78.
With the advances of RNA sequencing technologies, scientists need new tools to analyze transcriptome data. We introduce RNAseqViewer, a new visualization tool dedicated to RNA-Seq data. The program offers innovative ways to represent transcriptome data for single or multiple samples. It is a handy tool for scientists who use RNA-Seq data to compare multiple transcriptomes, for example, to compare gene expression and alternative splicing of cancer samples or of different development stages. Availability and implementation: RNAseqViewer is freely available for academic use at http://bioinfo.au.tsinghua.edu.cn/software/RNAseqViewer/ CONTACT: firstname.lastname@example.org SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Rhodococcus opacus strain PD630 (R. opacus PD630) is an oleaginous bacterium and one of the few prokaryotic organisms that contain lipid droplets (LDs). The LD is an important organelle for lipid storage and also for intercellular communication regarding energy metabolism, yet it remains poorly understood. To understand the dynamics of LDs using a simple model organism, we conducted a series of comprehensive omics studies of R. opacus PD630 including complete genome, transcriptome and proteome analysis. The genome of R. opacus PD630 encodes 8947 genes that are significantly enriched in lipid transport, synthesis and metabolism, indicating a superior ability for carbon source biosynthesis and catabolism. The comparative transcriptome analysis of three culture conditions revealed the landscape of altered gene expression responsible for lipid accumulation. The LD proteomes further identified the proteins that mediate lipid synthesis, storage and other biological functions. Integrating these three omics datasets uncovered 177 proteins that may be involved in lipid metabolism and LD dynamics. An LD structure-like protein, LPD06283, was further verified to affect LD morphology. Our omics studies provide not only the first integrated omics study of the prokaryotic LD organelle, but also a systematic platform for facilitating further prokaryotic LD research and biofuel development.
Recent studies have found many antisense non-coding transcripts at the opposite strand of some protein-coding genes. In yeast, it was reported that such antisense transcripts play regulatory roles for their partner genes by forming a feedback loop with the protein-coding genes. Since not all coding genes have accompanying antisense transcripts, it would be interesting to know whether there are sequence signatures in a coding gene that are decisive or associated with the existence of such antisense partners. We collected all the annotated antisense transcripts in the yeast Saccharomyces cerevisiae, analyzed sequence motifs around the genes with antisense partners, and classified genes with and without accompanying antisense transcripts by using machine learning methods. Some weak but statistically significant sequence features are detected, which indicates that there are sequence signatures around the protein-coding genes that may be decisive or indicative for the existence of accompanying antisense transcripts.
Cancer is a genomic disease associated with a plethora of gene mutations resulting in a loss of control over vital cellular functions. Among these mutated genes, driver genes are defined as being causally linked to oncogenesis, while passenger genes are thought to be irrelevant for cancer development. With increasing numbers of large-scale genomic datasets available, integrating these genomic data to identify driver genes from aberration regions of cancer genomes becomes an important goal of cancer genome analysis and investigations into mechanisms responsible for cancer development. A computational method, MAXDRIVER, is proposed here to identify potential driver genes on the basis of copy number aberration (CNA) regions of cancer genomes, by integrating publicly available human genomic data. MAXDRIVER employs several optimization strategies to construct a heterogeneous network, by means of combining a fused gene functional similarity network, gene-disease associations and a disease phenotypic similarity network. MAXDRIVER was validated to effectively recall known associations among genes and cancers. Previously identified as well as novel driver genes were detected by scanning CNAs of breast cancer, melanoma and liver carcinoma. Three predicted driver genes (CDKN2A, AKT1, RNF139) were found common in these three cancers by comparative analysis.
Mapping short reads to a reference genome is very often the prerequisite for applications utilizing next-generation sequencing technologies. A dozen software tools developed for this purpose have been widely used, but many practical issues remain when using them to build a computational pipeline for downstream analyses. In this chapter, we describe the read mapping procedures adopted in our lab for exome sequencing studies as an example to illustrate those practical details.
Differential gene expression (DGE) analysis is commonly used to reveal the deregulated molecular mechanisms of complex diseases. However, traditional DGE analysis (e.g., the t test or the rank sum test) tests each gene independently without considering interactions between them. Top-ranked differentially regulated genes prioritized by the analysis may not directly relate to the coherent molecular changes underlying complex diseases. Joint analyses of co-expression and DGE have been applied to reveal the deregulated molecular modules underlying complex diseases. Most of these methods consist of separate steps: first identifying gene-gene relationships under the studied phenotype and then integrating them with gene expression changes to prioritize signature genes, or vice versa. This warrants a method that can simultaneously consider gene-gene co-expression strength and the corresponding expression level changes, so that both types of information can be leveraged optimally.
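To make concrete what "tests each gene independently" means in the abstract above, the following minimal sketch computes a Welch t statistic per gene and ranks genes by its absolute value. The gene names and expression values are hypothetical, and Welch's unequal-variance form is an assumption; the abstract only mentions "the t test".

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    std_err = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / std_err


# Hypothetical expression values: gene -> (control samples, disease samples).
expression = {
    "geneA": ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
    "geneB": ([2.0, 3.0, 4.0], [2.5, 3.0, 3.5]),
}

# Each gene is tested on its own; no gene-gene relationship enters the statistic.
t_stats = {gene: welch_t(ctrl, case) for gene, (ctrl, case) in expression.items()}
ranked = sorted(t_stats, key=lambda gene: abs(t_stats[gene]), reverse=True)
```

Because the statistic for one gene never looks at any other gene, co-expression structure is invisible to this ranking, which is the limitation the abstract points out.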
RNA-Seq technology has been used widely in transcriptome studies, and one of its most important applications is to estimate the expression level of genes and their alternative splicing isoforms. Several algorithms have been published to estimate expression based on different models. Recently, Wu et al. published a method that can accurately estimate isoform-level expression by considering position-related sequencing biases using nonparametric models. The method has advantages in handling different read distributions, but there has not been an efficient program to implement this algorithm.
Though most of the transcripts are long non-coding RNAs (lncRNAs), little is known about their functions. lncRNAs usually function through interactions with proteins, which implies the importance of identifying the binding proteins of lncRNAs in understanding the molecular mechanisms underlying the functions of lncRNAs. Only a few approaches are available for predicting interactions between lncRNAs and proteins. In this study, we introduce a new method lncPro.
The genetic make-up of humans and other mammals (such as mice) affects their resistance to influenza virus infection. Considering the complexity and moral issues associated with experiments on human subjects, we have only acquired partial knowledge regarding the underlying molecular mechanisms. Although influenza resistance in inbred mice has been mapped to several quantitative trait loci (QTLs), which have greatly narrowed down the search for host resistance genes, only a few underlying genes have been identified.
Comparison and classification of metagenome samples is one of the major tasks in the study of microbial communities of natural environments or niches on human bodies. Bioinformatics methods play important roles in this task, including 16S rRNA gene analysis and some alignment-based or alignment-free methods on metagenomic data. Alignment-free methods have the advantage of not depending on known genome annotations and therefore have high potential in studying complicated microbiomes. However, the existing alignment-free methods are all based on unsupervised learning strategies (e.g., PCA or hierarchical clustering). These types of methods are powerful in revealing major similarities and grouping relations between microbiome samples, but cannot be applied for discriminating predefined classes of interest which might not be the dominating assortment in the data. Supervised classification is needed in the latter scenario, with the goal of classifying samples into predefined classes and finding the features that can discriminate the classes. The effectiveness of supervised classification with alignment-based features on metagenomic data has been shown in some recent studies. The application of alignment-free supervised classification methods to metagenome data has not yet been well explored.
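As an illustration of what an alignment-free feature can look like, the sketch below turns a set of reads into a normalized k-mer frequency vector. Using k-mer frequencies is an assumption here, since the abstract above does not name a specific feature; the function name and the example read are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_profile(reads, k=2):
    """Return normalized k-mer frequencies over A/C/G/T in a fixed order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    valid = set("ACGT")
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) <= valid:
                counts[kmer] += 1
    total = sum(counts.values())
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


# Hypothetical sample with a single read; real samples hold many reads.
profile = kmer_profile(["ACGT"], k=2)
```

Each sample becomes a fixed-length vector (16 entries for k = 2) without any alignment or annotation, so vectors from labelled samples could be passed to any supervised classifier.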
DNA modifications such as methylation and DNA damage can play critical regulatory roles in biological systems. Single molecule, real time (SMRT) sequencing technology generates DNA sequences as well as DNA polymerase kinetic information that can be used for the direct detection of DNA modifications. We demonstrate that local sequence context has a strong impact on DNA polymerase kinetics in the neighborhood of the incorporation site during the DNA synthesis reaction, allowing for the possibility of estimating the expected kinetic rate of the enzyme at the incorporation site using kinetic rate information collected from existing SMRT sequencing data (historical data) covering the same local sequence contexts of interest. We develop an empirical Bayesian hierarchical model for incorporating historical data. Our results show that the model could greatly increase DNA modification detection accuracy and reduce the required coverage of control data. For some DNA modifications that have a strong signal, a control sample is not even needed when historical data are used as an alternative to the control. Thus, sequencing costs can be greatly reduced by using the model. We implemented the model in an R package named seqPatch, which is available at https://github.com/zhixingfeng/seqPatch.
Chromatin immunoprecipitation combined with next-generation DNA sequencing technologies (ChIP-seq) has become a key approach for detecting genome-wide sets of genomic sites bound by proteins, such as transcription factors (TFs). Several methods and open-source tools have been developed to analyze ChIP-seq data. However, most of them are designed for detecting TF binding regions instead of accurately locating transcription factor binding sites (TFBSs). It is still challenging to pinpoint TFBSs directly from ChIP-seq data, especially in regions with closely spaced binding events.
Lysine acetylation is a well-studied post-translational modification on both histone and nonhistone proteins. More than 2000 acetylated proteins and 4000 lysine acetylation sites have been identified by large scale mass spectrometry or traditional experimental methods. Although over 20 lysine (K)-acetyl-transferases (KATs) have been characterized, which KAT is responsible for a given protein or lysine site acetylation is mostly unknown. In this work, we collected KAT-specific acetylation sites manually and analyzed sequence features surrounding the acetylated lysine of substrates from three main KAT families (CBP/p300, GCN5/PCAF, and the MYST family). We found that each of the three KAT families acetylates lysines with different sequence features. Based on these differences, we developed a computer program, Acetylation Set Enrichment Based method to predict which KAT-families are responsible for acetylation of a given protein or lysine site. Finally, we evaluated the efficiency of our method, and experimentally detected four proteins that were predicted to be acetylated by two KAT families when one representative member of the KAT family is over expressed. We conclude that our approach, combined with more traditional experimental methods, may be useful for identifying KAT families responsible for acetylated substrates proteome-wide.
Analysis of expression quantitative trait loci (eQTL) aims to identify the genetic loci associated with the expression level of genes. Penalized regression with a proper penalty is suitable for the high-dimensional biological data. Its performance should be enhanced when we incorporate biological knowledge of gene expression network and linkage disequilibrium (LD) structure between loci in high-noise background.
Sequence alignment depends on the scoring function that defines similarity between pairs of letters. For local alignment, the computational algorithm searches for the most similar segments in the sequences according to the scoring function. The choice of this scoring function is important for correctly detecting segments of interest. We formulate sequence alignment as a hypothesis testing problem, and conduct extensive simulation experiments to study the relationship between the scoring function and the distribution of aligned pairs within the aligned segment under this framework. We cut through the many ways to construct scoring functions and showed that any scoring function with negative expectation used in local alignment corresponds to a hypothesis test between the background distribution of sequence letters and a statistical distribution of letter pairs determined by the scoring function. The results indicate that the log-likelihood ratio scoring function is statistically most powerful and has the highest accuracy for detecting the segments of interest that are defined by the statistical distribution of aligned letter pairs.
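The log-likelihood ratio scoring function mentioned above can be sketched as follows: given a background letter distribution p and a target distribution q over aligned letter pairs, the score is s(a, b) = log(q(a, b) / (p(a) p(b))). The two-letter alphabet and all probabilities below are hypothetical values chosen for illustration, not figures from the study.

```python
import math


def llr_scores(background, pair_target):
    """Log-likelihood ratio score for each letter pair: log(q(a,b) / (p(a) p(b)))."""
    return {
        (a, b): math.log(q / (background[a] * background[b]))
        for (a, b), q in pair_target.items()
    }


def expected_background_score(background, scores):
    """Expected score when both letters are drawn independently from the background."""
    return sum(background[a] * background[b] * s for (a, b), s in scores.items())


# Hypothetical two-letter alphabet; matches are enriched in the target distribution.
background = {"A": 0.5, "B": 0.5}
pair_target = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

scores = llr_scores(background, pair_target)
mean_score = expected_background_score(background, scores)
```

In this example matches receive positive scores, mismatches negative ones, and the expected score under the background is negative, which is the condition on local alignment scoring functions that the abstract refers to.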
High-throughput RNA sequencing (RNA-seq) technology provides a revolutionary approach to studying splicing events de novo. However, identifying splice junctions with high sensitivity and specificity remains a challenge. In the present study, we proposed a new tool named SeqSaw to detect splice junctions with or without the canonical GT-AG splicing signal. SeqSaw was applied to two ENCODE RNA-seq datasets and also compared with two existing methods. It was shown that the proposed method obtained better results on finding novel splice junctions. Experiments also revealed that the current sequencing depth has not yet reached saturation to detect novel transcripts. Moreover, by comparing the number of supporting reads, we demonstrated that many un-annotated splicing events can be tissue specific.
We conducted a meta-analysis of genome-wide association studies of systolic (SBP) and diastolic (DBP) blood pressure in 19,608 subjects of east Asian ancestry from the AGEN-BP consortium followed up with de novo genotyping (n = 10,518) and further replication (n = 20,247) in east Asian samples. We identified genome-wide significant (P < 5 × 10^-8) associations with SBP or DBP, which included variants at four new loci (ST7L-CAPZA1, FIGN-GRB14, ENPEP and NPR3) and a newly discovered variant near TBX3. Among the five newly discovered variants, we obtained significant replication in the independent samples for all of these loci except NPR3. We also confirmed seven loci previously identified in populations of European descent. Moreover, at 12q24.13 near ALDH2, we observed strong association signals (P = 7.9 × 10^-31 and P = 1.3 × 10^-35 for SBP and DBP, respectively) with ethnic specificity. These findings provide new insights into blood pressure regulation and potential targets for intervention.
RNA-Seq technology based on next-generation sequencing provides an unprecedented ability to study transcriptomes at high resolution and accuracy, and the potential to measure the expression of multiple isoforms from the same gene at high precision. Isoform expression can be inferred from RNA-Seq data by maximum likelihood estimation, using statistical models based on the assumption that sequenced reads are distributed uniformly along transcripts. Modification of the model is needed when RNA-Seq data do not follow a uniform distribution.
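The maximum likelihood idea under the uniform-read assumption can be illustrated with a small expectation-maximization (EM) loop. EM is used here as one common way to reach the maximum likelihood estimate and is not necessarily the solver used in the work above; the two isoforms, their lengths and the read counts are hypothetical.

```python
def em_isoform_fractions(read_classes, lengths, n_iter=200):
    """Estimate the fraction of reads from each isoform.

    read_classes: list of (tuple of compatible isoforms, read count).
    lengths: isoform -> length in bp; reads are assumed uniform along each isoform.
    """
    isoforms = list(lengths)
    alpha = {iso: 1.0 / len(isoforms) for iso in isoforms}
    total_reads = sum(count for _, count in read_classes)
    for _ in range(n_iter):
        expected = {iso: 0.0 for iso in isoforms}
        # E-step: split each read class among compatible isoforms,
        # weighting by alpha / length (uniform per-position probability).
        for compatible, count in read_classes:
            weights = {iso: alpha[iso] / lengths[iso] for iso in compatible}
            norm = sum(weights.values())
            for iso, weight in weights.items():
                expected[iso] += count * weight / norm
        # M-step: re-estimate read fractions from the expected assignments.
        alpha = {iso: expected[iso] / total_reads for iso in isoforms}
    return alpha


# Hypothetical gene: isoform X = exon1 + exon2 (1000 bp), isoform Y = exon1 only (500 bp).
lengths = {"X": 1000, "Y": 500}
read_classes = [(("X", "Y"), 400), (("X",), 100)]  # reads in exon1, reads in exon2
alpha = em_isoform_fractions(read_classes, lengths)
```

With these inputs the 100 exon-2 reads imply about 100 exon-1 reads from X under uniformity, so the estimate settles near 40% of reads from X and 60% from Y. A non-uniform read distribution would replace the alpha / length weight with a position-dependent one.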
Due to its unprecedented high resolution and detailed information, RNA-seq technology based on next-generation high-throughput sequencing significantly boosts the ability to study transcriptomes. The estimation of gene transcript abundance levels, or gene expression levels, has always been an important question in research on transcriptional regulation and gene functions. On the basis of the concept of Reads Per Kilobase per Million reads (RPKM), taking the union-intersection genes (UI-based) and summing up inferred isoform abundance (isoform-based) are the two current strategies to estimate gene expression levels, but they produce different estimates. In this paper, we made the first attempt to compare the performance of the two strategies through a series of simulation studies. Our results showed that the isoform-based method not only gives more accurate estimates but also has less uncertainty than the UI-based strategy. If the non-uniformity of read distribution is taken into account, the isoform-based method can further reduce estimation errors. We applied both strategies to real RNA-seq datasets of technical replicates, and found that the isoform-based strategy also displays better performance. For a more accurate estimation of gene expression levels from RNA-seq data, even if the abundance levels of isoforms are not of interest, it is still better to first infer the isoform abundances and sum them up to get the expression level of a gene as a whole.
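For reference, the RPKM measure that both strategies build on can be written as RPKM = 10^9 × C / (N × L), where C is the number of reads mapped to a region, L is the region length in base pairs and N is the total number of mapped reads. The sketch below implements this formula and the "sum of isoforms" step of the isoform-based strategy; the numbers are hypothetical, and the isoform RPKM values are assumed to come from a separate inference step.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase per million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


def gene_expression_isoform_based(isoform_rpkms):
    """Isoform-based gene level: sum of the inferred isoform abundances."""
    return sum(isoform_rpkms)


# Hypothetical region: 500 reads on 2000 bp in a library of 10 million mapped reads.
region_rpkm = rpkm(500, 2000, 10_000_000)

# Hypothetical inferred isoform abundances for one gene.
gene_level = gene_expression_isoform_based([12.5, 7.5])
```

The UI-based strategy would instead apply rpkm to the reads and length of the union-intersection region of the gene, which is why the two strategies can disagree when isoforms differ in length or abundance.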
Lipid rafts are specialized cholesterol-enriched microdomains in the cell membrane. They are known as a platform for protein-protein interactions and take part in multiple biological processes. Nevertheless, how lipid rafts influence protein properties at the proteomic level is still an open question for researchers using traditional biochemical approaches. Here, by annotating the lipid raft localization of proteins in human protein-protein interaction networks, we performed a systematic analysis of the function of proteins related to lipid rafts. Our results demonstrated that lipid raft proteins and their interactions were critical for the structure and stability of the whole network, and that the interactions between them were significantly enriched. Furthermore, for each protein in the network, we calculated its "lipid raft dependency (LRD)," which indicates how closely it is topologically associated with lipid rafts, and we then uncovered the connection between LRD and protein functions. Proteins with high LRD tended to be essential for mammalian development, and malfunction of these proteins tended to cause human diseases. Coordinated with their neighbors, high-LRD proteins participated in multiple biological processes and targeted many pathways in disease pathogenesis. High-LRD proteins were also found to have tissue specificity of expression. In summary, our network-based analysis indicates that lipid raft proteins have higher centrality in the network, and that lipid-raft-related proteins have multiple functions and are probably involved in many biological processes in disease development.
Proliferation of liver cells can be observed in hepatocarcinogenesis, at different stages of liver development, and during liver regeneration after an injury. Does it imply that they share similar molecular mechanisms? Here, the transcriptional profiles of hepatocellular carcinoma (HCC), liver development, and liver regeneration were systematically compared as a preliminary attempt to answer this question. From the comparison, we found that advanced HCC mimics early development in terms of deprived normal liver functions and activated cellular proliferation, but advanced HCC and early development differ in expressions of cancer-related genes and their transcriptional controls. HCC and liver regeneration demonstrate different expression patterns as a whole, but regeneration is similar to dysplasia (pre-stage of HCC) in terms of their proximity to the normal state. In summary, of these three important processes, the carcinogenic progress carries the highest variance in expression; HCC pre-stage shares some resemblance with liver regeneration; and advanced HCC stage displays similarity with early development.
Gene expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of these predictions has not been established. In the MAQC-II project, 36 independent teams analyzed six microarray data sets to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. In total, >30,000 models were built using many combinations of analytical methods. The teams generated predictive models without knowing the biological meaning of some of the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training. We found that model performance depended largely on the endpoint and team proficiency and that different approaches generated models of similar performance. The conclusions and recommendations from MAQC-II should be useful for regulatory agencies, study committees and independent investigators that evaluate methods for global gene expression analysis.
High-throughput RNA sequencing (RNA-seq) is rapidly emerging as a major quantitative transcriptome profiling platform. Here, we present DEGseq, an R package to identify differentially expressed genes or isoforms for RNA-seq data from different samples. In this package, we integrated three existing methods, and introduced two novel methods based on MA-plot to detect and visualize gene expression difference.
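The MA-plot coordinates underlying such methods follow the standard definition: for a gene with counts C1 and C2 in two samples, M = log2(C1) - log2(C2) is the log fold change and A = (log2(C1) + log2(C2)) / 2 is the average log intensity. The sketch below computes both for hypothetical counts; it illustrates the general definition rather than the DEGseq API, and it assumes strictly positive counts.

```python
import math


def ma_values(count1, count2):
    """Return (M, A) for one gene; both counts must be positive."""
    log1, log2_ = math.log2(count1), math.log2(count2)
    m = log1 - log2_          # log2 fold change between the two samples
    a = 0.5 * (log1 + log2_)  # average log2 expression
    return m, a


# Hypothetical read counts for one gene in two samples.
m, a = ma_values(8, 2)
```

Plotting M against A for all genes gives the MA-plot; genes with zero counts in either sample need separate handling because the logarithm is undefined there.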
The liver performs a number of essential functions for life. The development of such a complex organ relies on finely regulated gene expression profiles which change over time in the development and determine the phenotype and function of the liver. We used high-density oligonucleotide microarrays to study the gene expression and transcription regulation at 14 time points across the C57/B6 mouse liver development, which include E11.5 (embryonic day 11.5), E12.5, E13.5, E14.5, E15.5, E16.5, E17.5, E18.5, Day0 (the day of birth), Day3, Day7, Day14, Day21, and normal adult liver. With these data, we made a comprehensive analysis on gene expression patterns, functional preferences and transcriptional regulations during the liver development. A group of uncharacterized genes which might be involved in the fetal hematopoiesis were detected.
MicroRNAs (miRNAs) are important post-transcriptional regulators that repress gene expression by binding to the 3' UTRs of their target mRNAs. There are two main outcomes for the transcripts targeted by miRNAs: mRNA degradation and translational repression. It is still unclear what factors determine whether a target transcript is degraded or translationally repressed. In this study, we collected two classes of genes that are targeted by miR-1, miR-155, miR-16, miR-30a, and let-7b and built new computational models with machine-learning methods to predict the fates of target genes based on sequence features. The prediction results indicate that the sequence context of the miRNA binding site at the 3' UTR of a target gene plays an important role in determining how an miRNA regulates the expression of its target. Further analysis shows that four out of the five studied miRNAs probably share similar regulatory mechanisms on their target genes.
Increasing evidence points to the importance of aberrant O-glycosylated immunoglobulin A1 (IgA1) in the pathogenesis of IgA nephropathy (IgAN), a disease widely considered to be a polygenic disorder. We earlier found that haplotypes in two key glycosyltransferase genes, C1GALT1 and ST6GALNAC2, were associated with susceptibility to IgAN. Here we measured the genetic interaction of variants in C1GALT1 and ST6GALNAC2 by applying FAMHAP software to analyze haplotype-haplotype interaction in IgAN. As confirmation, we also used a novel divergence-based multi-locus algorithm (DBMA) approach to determine interactions between single-nucleotide polymorphisms. Haplotype-haplotype combinations in C1GALT1 and ST6GALNAC2 were significantly associated with a predisposition for IgAN and with the estimated glomerular filtration rate (eGFR) of patients. Analogously, results from DBMA found a five-locus combination, two in ST6GALNAC2 and three in C1GALT1, which was associated with IgAN predisposition, eGFR, and renal outcome of patients with IgAN. In addition, patients with a high risk had significantly more exposed N-acetylgalactosamine on their IgA1 than did patients with a lower risk of developing this disease. Our findings suggest that potential genetic interactions of C1GALT1 and ST6GALNAC2 variants influence IgA1 O-glycosylation, disease predisposition, and disease severity, and may contribute to the polygenic nature of IgAN.
The study of DNA methylation patterns in different human tissues is attracting increasing interest, but a systematic comparison of CpG island methylation patterns between somatic tissues and gametes is still lacking. In this work, we analyzed the CpG island methylation data of sperm and 11 somatic tissues from the Human Epigenome Project, and found that the CpG island methylation profiles are highly correlated between somatic tissues, while the methylation profile in sperm is quite distinct. Furthermore, we observed that in the six tissues investigated, there is no obvious correlation between the methylation level of promoter CpG islands and corresponding gene expression across different tissues.
Transcriptional regulation plays key roles in many biological processes. The regulation is dynamic in time and space. Identifying transcription factors that play major roles in a developmental time course is very important for understanding the regulation. This cannot be realized by studying the relation between the expression of individual genes. We developed a gene-set analysis approach to study master regulators and their actively regulated targets during a time course from gene expression data. We applied the method to a mouse liver development dataset and a mouse embryonic stem cell (mESC) development dataset, and identified 14 and 9 transcription factors that play major regulatory roles in the two development courses, respectively. Some transcription factors could not be identified as active in the process by studying their correlation with individual targets. The method was also extended for studying other regulation factors or pathways from time-course expression data.
Hepatocellular carcinoma (HCC) is one of the most deadly malignancies worldwide. Scientists have been studying the molecular mechanism of HCC for years, but the understanding of it remains incomplete and scattered across the literature at different molecular levels. Chromosomal aberrations, epigenetic abnormality and changes of gene expression have been reported in HCC. High-throughput omics technologies have been widely applied, aiming at the discovery of candidate biomarkers for cancer staging, prediction of recurrence and prognosis, and treatment selection. Large amounts of data on genetic and epigenetic abnormalities, gene expression profiles, microRNA expression profiles and proteomics have been accumulating, and bioinformatics is playing a more and more important role. In this paper, we review the current omics-based studies on HCC at the levels of genomics, transcriptomics and proteomics. Integrating observations from multiple aspects is an essential step toward the systematic understanding of the disease.
Genotoxicity models are extremely important to assess retroviral vector biosafety before gene therapy. We have developed an in utero model that demonstrates that hepatocellular carcinoma (HCC) development is restricted to mice receiving nonprimate (np) lentiviral vectors (LV) and does not occur when a primate (p) LV is used, regardless of woodchuck hepatitis virus post-transcriptional regulatory element (WPRE) mutations to prevent truncated X gene expression. Analysis of 839 npLV and 244 pLV integrations in the liver genomes of vector-treated mice revealed clear differences between vector insertions in gene dense regions and highly expressed genes, suggestive of vector preference for insertion or clonal outgrowth. In npLV-associated clonal tumors, 56% of insertions occurred in oncogenes or genes associated with oncogenesis or tumor suppression and, surprisingly, most genes examined (11/12) had reduced expression as compared with control livers and tumors. Two examples of vector-inserted genes were the Park 7 oncogene and Uvrag tumor suppressor gene. Both these genes and their known interactive partners had differential expression profiles. Interactive partners were assigned to networks specific to liver disease and HCC via ingenuity pathway analysis. The fetal mouse model not only exposes the genotoxic potential of vectors intended for gene therapy but can also reveal genes associated with liver oncogenesis.
Sequence signatures, as defined by the frequencies of k-tuples (or k-mers, k-grams), have been used extensively to compare genomic sequences of individual organisms, to identify cis-regulatory modules, and to study the evolution of regulatory sequences. Recently many next-generation sequencing (NGS) read data sets of metagenomic samples from a variety of different environments have been generated. The assembly of these reads can be difficult and analysis methods based on mapping reads to genes or pathways are also restricted by the availability and completeness of existing databases. Sequence-signature-based methods, however, do not need the complete genomes or existing databases and thus, can potentially be very useful for the comparison of metagenomic samples using NGS read data. Still, the applications of sequence signature methods for the comparison of metagenomic samples have not been well studied.
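As a rough illustration of what a k-tuple sequence signature looks like in practice, the sketch below counts k-tuples directly from reads and compares two samples without any assembly or reference database. The function names (`kmer_frequencies`, `euclidean_dissimilarity`) and the choice of Euclidean distance are assumptions made for this example only; the abstract does not say which dissimilarity measures were studied.

```python
from collections import Counter
from math import sqrt

def kmer_frequencies(reads, k):
    """Return relative frequencies of k-tuples pooled over all reads.

    Hypothetical helper for illustration. K-tuples containing characters
    other than A, C, G, T (for example N) are skipped.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}

def euclidean_dissimilarity(sig_a, sig_b):
    """Euclidean distance between two signatures (an assumed, simple choice)."""
    keys = set(sig_a) | set(sig_b)
    return sqrt(sum((sig_a.get(x, 0.0) - sig_b.get(x, 0.0)) ** 2 for x in keys))

# Toy usage: two tiny "samples" of reads compared by their 2-tuple signatures.
sample_a = ["ACGTACGT", "ACGTTT"]
sample_b = ["GGGCCC", "GGCCGG"]
distance = euclidean_dissimilarity(kmer_frequencies(sample_a, 2),
                                   kmer_frequencies(sample_b, 2))
```

Because the signature is built from raw reads, the comparison depends only on k and on the chosen dissimilarity measure, which is why such methods can be applied where assembly or database mapping is impractical.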
Recent studies revealed that most human genes undergo alternative splicing and can produce multiple transcript isoforms. Differences in the relative abundance of the isoforms of a gene can have significant biological consequences. Identifying genes that are differentially spliced between two groups of RNA-sequencing samples is an important basic task in the study of transcriptomes with next-generation sequencing technology. We use the negative binomial (NB) distribution to model sequencing reads on exons, and propose an NB-statistic to detect differentially spliced genes between two groups of samples by comparing read counts on all exons. The method opens a new exon-based approach, instead of an isoform-based approach, for the task. It does not require information about isoform composition, nor does it need the estimation of isoform expression. Experiments on simulated data and real RNA-seq data of human kidney and liver samples illustrated the method's good performance and applicability. It can also detect previously unknown alternative splicing events, and highlight exons that are most likely differentially spliced between the compared samples. We developed an NB-statistic method that can detect differentially spliced genes between two groups of samples without using prior knowledge of the annotation of alternative splicing. It does not need to infer isoform structure or to estimate isoform expression. It is a useful method designed for comparing two groups of RNA-seq samples. Besides identifying differentially spliced genes, the method can highlight the exons that contribute the most to the differential splicing. We developed a software tool called DSGseq for the presented method, available at http://bioinfo.au.tsinghua.edu.cn/software/DSGseq.
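To make the negative binomial assumption more concrete, the following minimal sketch estimates an NB dispersion for one exon from its read counts across replicate samples, using the standard NB mean-variance relation Var = mu + phi * mu^2. This is not the NB-statistic of DSGseq, which the abstract does not define; the function name `nb_dispersion` and the method-of-moments estimator are assumptions chosen purely for illustration.

```python
from statistics import mean, variance

def nb_dispersion(counts):
    """Method-of-moments NB dispersion for one exon's read counts.

    Hypothetical helper, not part of DSGseq. Under a negative binomial
    model with mean mu and dispersion phi, Var = mu + phi * mu**2, so
    phi = (Var - mu) / mu**2. Negative estimates are truncated to 0,
    which corresponds to the Poisson limit.
    """
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance, needs at least two counts
    return max(0.0, (var - mu) / mu ** 2)

# Toy usage: read counts on a single exon in three replicate samples.
exon_counts = [10, 20, 30]
phi = nb_dispersion(exon_counts)
```

A dispersion above zero indicates more variability between samples than a Poisson model would allow, which is the usual motivation for modeling read counts with an NB distribution.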
Tongue diagnosis is a unique method in traditional Chinese medicine (TCM). This is the first investigation on the association between traditional tongue diagnosis and the tongue coating microbiome using next-generation sequencing. The study included 19 gastritis patients with a typical white-greasy or yellow-dense tongue coating corresponding to TCM Cold or Hot Syndrome respectively, as well as eight healthy volunteers. An Illumina paired-end, double-barcode 16S rRNA sequencing protocol was designed to profile the tongue-coating microbiome, from which approximately 3.7 million V6 tags for each sample were obtained. We identified 123 and 258 species-level OTUs that were enriched in patients with Cold/Hot Syndromes, respectively, representing "Cold Microbiota" and "Hot Microbiota". We further constructed the tongue microbiota-imbalanced networks associated with Cold/Hot Syndromes. The results reveal an important connection between the tongue-coating microbiome and traditional tongue diagnosis, and illustrate the potential of the tongue-coating microbiome as a novel holistic biomarker for characterizing patient subtypes.
The biogenesis, development and metastasis of cancer are associated with many variations in the transcriptome. Alternative splicing of genes is a major post-transcriptional regulation mechanism that is involved in many types of cancer. Next-generation sequencing applied to RNA (RNA-Seq) provides a new technology for studying transcriptomes and an unprecedented opportunity for studying alternative splicing quantitatively and systematically. This mini-review summarizes the current RNA-Seq studies on cancer transcriptomes, especially studies on cancer-related alternative splicing, and discusses the strategy for quantitative study of alternative splicing in cancers with RNA-Seq, the bioinformatics methods available and the open questions.
Current generation DNA sequencing instruments are moving closer to seamlessly sequencing genomes of entire populations as a routine part of scientific investigation. However, while significant inroads have been made identifying small nucleotide variation and structural variations in DNA that impact phenotypes of interest, progress has not been as dramatic regarding epigenetic changes and base-level damage to DNA, largely due to technological limitations in assaying all known and unknown types of modifications at genome scale. Recently, single-molecule real time (SMRT) sequencing has been reported to identify kinetic variation (KV) events that have been demonstrated to reflect epigenetic changes of every known type, providing a path forward for detecting base modifications as a routine part of sequencing. However, to date no statistical framework has been proposed to enhance the power to detect these events while also controlling for false-positive events. By modeling enzyme kinetics in the neighborhood of an arbitrary location in a genomic region of interest as a conditional random field, we provide a statistical framework for incorporating kinetic information at a test position of interest as well as at neighboring sites that help enhance the power to detect KV events. The performance of this and related models is explored, with the best-performing model applied to plasmid DNA isolated from Escherichia coli and mitochondrial DNA isolated from human brain tissue. We highlight widespread kinetic variation events, some of which strongly associate with known modification events, while others represent putative chemically modified sites of unknown types.
Genome-wide association studies have identified the ATP2B1 gene as associated with blood pressure (BP), but evidence from large-scale Chinese populations is still rare. We performed the current replication study to test the association of the ATP2B1 gene with hypertension and BP in two unrelated Chinese cohorts including 2,831 unrelated individuals with hypertension and 1,987 controls in total. We also examined the influence of the ATP2B1 gene on arterial stiffness through evaluation of carotid-femoral pulse wave velocities (cf-PWV) in 164 untreated hypertensives. The major finding of this study was that four loci (rs10858911, rs2681472, rs17249754 and rs1401982) were associated with any or all of four traits: hypertension (P = 0.001-4.6E-05; odds ratio, 0.83-0.87), systolic BP (P = 0.003-0.004), diastolic BP (P = 0.002-0.003) and cf-PWV (P = 0.002-0.004). All the comparisons were adjusted for sex, age, age² and body mass index. We validated the association of the ATP2B1 gene with susceptibility to hypertension, BP traits and cf-PWV in the Chinese population. In addition, further genetic and functional research is warranted to identify the specific locus in the ATP2B1 gene that influences the manifestation of BP and vascular function.
To understand the roles they play in complex diseases, genes need to be investigated in the networks they are involved in. Integration of gene expression and network data is a promising approach to prioritize disease-associated genes. Some methods have been developed in this field, but the problem is still far from being solved.