The recently developed latest version of sequence-directed single-base resolution nucleosome mapping reveals the existence of strong nucleosomes and chromatin columnar structures (columns). Broad application of this simple technique in further studies of chromatin and chromosome structure requires a basic understanding of how it works and what information it affords. The paper provides such an introduction to the method. The oscillating maps of singular nucleosomes and of short and long oligonucleosome columns are explained, as well as maps of chromatin on satellite DNA and occurrences of counter-phase (antiparallel) nucleosome neighbors.
Recently discovered strong nucleosomes (SNs), characterized by visibly periodical DNA sequences, have been found to concentrate in centromeres of Arabidopsis thaliana and in transient meiotic centromeres of Caenorhabditis elegans. To find out whether such affiliation of SNs with centromeres is a more general phenomenon, we studied SNs of Mus musculus. The publicly available genome sequences of mouse, as well as of practically all other eukaryotes, do not include the centromere regions, which are difficult to assemble because of the large amount of repeat sequences in the centromeres and pericentromeric regions. We recovered those missing sequences using the data from MNase-seq experiments in mouse embryonic stem cells, where the sequence of DNA inside nucleosomes, including missing regions, was determined by 100-bp paired-end sequencing. Those nucleosome sequences that do not match the published genome sequence would largely belong to the centromeres. By evaluating SN densities in centromeres and in non-centromeric regions, we conclude that mouse SNs concentrate in the centromeres of telocentric mouse chromosomes, with ~3.9 times excess compared to their density in the rest of the genome. The remaining non-centromeric SNs are harbored mainly by introns and intergenic regions, and by retrotransposons in particular. The centromeric involvement of the SNs opens new horizons for the chromosome and centromere structure studies.
For the computational sequence-directed mapping of the nucleosomes, the knowledge of the nucleosome positioning motifs - 10-11 base long sequences - and respective matrices of bendability, is not sufficient, since there is no justified way to fuse these motifs in one continuous nucleosome DNA sequence. Discovery of the strong nucleosome (SN) DNA sequences, with visible sequence periodicity allows derivation of the full-length nucleosome DNA bendability pattern as matrix or consensus sequence. The SN sequences of three species (A. thaliana, C. elegans, and H. sapiens) are aligned (512 sequences for each species), and long (115 dinucleotides) matrices of bendability derived for the species. The matrices have strong common property - alternation of runs of purine-purine (RR) and pyrimidine-pyrimidine (YY) dinucleotides, with average period 10.4 bases. On this basis the universal [R,Y] consensus of the nucleosome DNA sequence is derived, with exactly defined positions of respective penta- and hexamers RRRRR, RRRRRR, YYYYY, and YYYYYY.
Recently discovered strong nucleosomes (SNs) are characterized by strongly periodical DNA sequence, with visible rather than hidden sequence periodicity. In a quest for possible functions of the SNs, it has been found that the SNs concentrate within centromere regions of A. thaliana chromosomes. They have, however, been detected in Caenorhabditis elegans as well, although the holocentric chromosomes of this species do not have centromeres. Scrutinizing the SNs of C. elegans and their distributions along the DNA sequences of the chromosomes, we have discovered that the SNs are located mainly at the ends of the chromosomes of C. elegans. This suggests that, perhaps, the ends of the chromosomes fulfill some function(s) of centromeres in this species, as also indicated by the cytogenetic studies on meiotic chromosomes in spermatocytes of C. elegans, where the end-to-end association is observed. The centromeric involvement of the SNs, also found in A. thaliana, opens new horizons for the chromosome and centromere structure studies.
Earlier identified strongest nucleosome DNA sequences of A. thaliana, those with visible 10-11 base sequence periodicity, are mapped along chromosomes. Resulting positional distributions reveal distinct maxima, one per chromosome, located in the centromere regions. Sequence-directed nucleosome mapping demonstrates that the strong nucleosomes (SNs) make tight arrays, several parallel nucleosomes each, suggesting a columnar chromatin structure. The SNs represent a new class of centromeric nucleosomes, presumably, participating in synapsis of chromatids and securing the centromere architecture.
Fifteen years ago, Lowary and Widom assembled nucleosomes on synthetic random sequence DNA molecules, selected the strongest nucleosomes and discovered that the TA dinucleotides in these strong nucleosome sequences often appear at 10-11 bases from one another or at distances which are multiples of this period. We repeated this experiment computationally, on large ensembles of natural genomic sequences, by selecting the strongest nucleosomes - i.e. those with such distances between like-named dinucleotides, multiples of 10.4 bases, the structural and sequence period of nucleosome DNA. The analysis confirmed the periodicity of TA dinucleotides in the strong nucleosomes, and revealed as well other periodic sequence elements, notably classical AA and TT dinucleotides. The matrices of DNA bendability and their simple linear forms - nucleosome positioning motifs - are calculated from the strong nucleosome DNA sequences. The motifs are in full accord with nucleosome positioning sequences derived earlier, thus confirming that the new technique, indeed, detects strong nucleosomes. Species- and isochore-specific variations of the matrices and of the positioning motifs are demonstrated. The strong nucleosome DNA sequences manifest the highest hitherto nucleosome positioning sequence signals, showing the dinucleotide periodicities in directly observable rather than in hidden form.
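The selection criterion described in this abstract (like-named dinucleotides separated by multiples of about 10.4 bases) can be illustrated with a minimal sketch. It is not the authors' procedure: the function names, the 1-base tolerance, and the 60-base distance cutoff are assumptions chosen for illustration only.

```python
from collections import Counter

def dinucleotide_distance_counts(seq, dinuc="TA", max_dist=60):
    """Count distances between all pairs of occurrences of `dinuc` (up to max_dist)."""
    seq = seq.upper()
    positions = [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc]
    counts = Counter()
    for idx, a in enumerate(positions):
        for b in positions[idx + 1:]:
            d = b - a
            if d > max_dist:  # positions are sorted, so later ones are farther still
                break
            counts[d] += 1
    return counts

def periodicity_score(counts, period=10.4, tolerance=1.0):
    """Fraction of pair distances lying within `tolerance` bases of a
    non-zero multiple of `period` (tolerance is an assumed parameter)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    in_phase = 0
    for d, c in counts.items():
        k = max(1, round(d / period))  # nearest non-zero multiple of the period
        if abs(d - k * period) <= tolerance:
            in_phase += c
    return in_phase / total

# Toy sequences: TA every 10 bases versus TA every 7 bases.
periodic = dinucleotide_distance_counts("TACCCCCCCC" * 5)
aperiodic = dinucleotide_distance_counts("TACCCCC" * 5)
print(periodicity_score(periodic), periodicity_score(aperiodic))
```

A real analysis would rank many genomic fragments by such a score and keep the top of the list; the scoring actually used in the paper may differ.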
Recent progress in abiotic syntheses, especially self-catalytic syntheses, as well as theoretical breakthroughs such as reconstruction of events of early molecular evolution and tracing repeat expansions in contemporary genomes, converge to a rather simple possible scenario of the origin of life, notwithstanding the enormity of the problem. The scenario includes self-replicating RNA duplexes, supplemented by monomers and high-energy compounds that, as demonstrated or assumed, can all be synthesized abiotically. The self-replication would proceed with occasional mutational changes, propagated in later cycles. This audacious, as it may seem, walk toward the origin of life already involves many laboratories, each exploring its own scenario. The one suggested in this outline seems to the authors well justified to engage in, while bypassing a few steps to deal with later.
We have shown, in a previous paper, that tandem repeating sequences, especially triplet repeats, play a very important role in gene evolution. This result led to the formulation of the following hypothesis: most of the genomic sequences evolved through everlasting acts of tandem repeat expansions with subsequent accumulation of changes. In order to estimate how much of the observed sequences have the repeat origin, we describe the adaptation of a text segmentation algorithm, based on dynamic programming, to the mapping of the ancient expansion events. The algorithm maximizes the segmentation cost, calculated as the similarity of obtained fragments to the putative repeat sequence. In the first application of the algorithm to segmentations of genomic sequences, a significant difference between the natural sequences and the corresponding shuffled sequences is detected. The natural fragments are longer and more similar to the putative repeat sequences. As our analysis shows, the coding sequences allow for repeats only when the size of the repeated words is divisible by three. In contrast, in the non-coding sequences, all repeated word sizes are present. It was estimated that in the Escherichia coli K12 genome, about 35.5% of the sequence can be detectably traced to original simple repeat ancestors. The results shed light on the genomic sequence organization, and strongly confirm the hypothesis about the crucial role of triplet expansions in gene origin and evolution.
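The general idea of a dynamic-programming segmentation that maximizes similarity to putative repeats can be sketched as follows. This is a toy illustration, not the published algorithm: the scoring rule (+1 per match and -1 per mismatch against a tandem repeat of the fragment's first word, with the first word itself scoring nothing), the function names, and the allowed word sizes are all assumptions.

```python
def segment_score(seq, i, j, word_sizes=(2, 3)):
    """Score fragment seq[i:j] against a perfect tandem repeat of its own
    first k letters; the best word size k wins (assumed scoring rule)."""
    best = 0
    for k in word_sizes:
        if j - i <= k:
            continue  # too short to contain more than one repeat unit
        unit = seq[i:i + k]
        score = sum(1 if seq[p] == unit[(p - i) % k] else -1
                    for p in range(i + k, j))
        best = max(best, score)
    return best

def segment(seq, word_sizes=(2, 3)):
    """Return (total score, fragment boundaries) of the best segmentation."""
    n = len(seq)
    best = [0] + [float("-inf")] * n   # best[j]: best score for prefix seq[:j]
    back = [0] * (n + 1)               # back[j]: start of the last fragment
    for j in range(1, n + 1):
        for i in range(j):
            cand = best[i] + segment_score(seq, i, j, word_sizes)
            if cand > best[j]:
                best[j], back[j] = cand, i
    bounds = [n]
    while bounds[-1] > 0:
        bounds.append(back[bounds[-1]])
    return best[n], bounds[::-1]

# Toy input: four CAG repeats followed by five AT repeats.
print(segment("CAGCAGCAGCAGATATATATAT"))
```

The naive recurrence is cubic in sequence length, so it suits only short examples; comparing scores for natural and shuffled sequences, as the abstract describes, would need a more efficient formulation.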
Transcription is known to be affected by the rotational setting of the transcription response elements within nucleosomes. We studied the rotational positioning of the TATA box, the most universal promoter motif. We applied a bioinformatic high-resolution nucleosome mapping technique to eukaryotic promoters. Our results show that the nucleosome DNA sequence harboring the TATA box encodes alternative rotational positions for the same piece of DNA. This may serve for switching the gene activity on and off.
This communication reports on the nucleosome positioning patterns (bendability matrices) for the human genome, derived from over 8 million nucleosome DNA sequences obtained from apoptotically digested lymphocytes. This digestion procedure is used here for the first time for the purpose of extraction and sequencing of the nucleosome DNA fragments. The dominant motifs suggested by the matrices of DNA bendability calculated for light and heavy isochores are significantly different. Both, however, are in full agreement with the linear description YRRRRRYYYYYR, and with earlier derivations by N-gram extensions. Thus, the choice of the nucleosome positioning patterns crucially depends on the G + C composition of the analyzed sequences.
Analysis of the vocabulary of 123 tabulated definitions of life reveals nine groups of defining terms (definientia) of which the groups (self-)reproduction and evolution (variation) appear as the minimal set for a concise and inclusive definition: Life is self-reproduction with variations.
In recent developments in chemistry and genetic engineering, the humble researcher dealing with the origin of life finds her(him)self in a grey area of tackling something that even does not yet have a clear definition agreed upon. A series of chemical steps is described to be considered as the life-nonlife transition, if one adheres to the minimalistic definition: life is self-reproduction with variations. The fully artificial RNA system chosen for the exploration corresponds sequence-wise to the reconstructed initial triplet repeats, presumably corresponding to the earliest protein-coding molecules. The demonstrated occurrence of the mismatches (variations) in otherwise complementary syntheses ("self-reproduction"), in this RNA system, opens an experimental and conceptual perspective to explore the origin of life (and its definition), on the apparent edge of the origin.
An overview is presented on the status of studies on multiple codes in genetic sequences. Indirectly, the existence of multiple codes is recognized in the form of several rediscoveries of Second Genetic Code that is different each time. A due credit is given to earlier seminal work related to the codes often neglected in literature. The latest developments in the field of chromatin code are discussed, as well as perspectives of single-base resolution studies of nucleosome positioning, including rotational setting of DNA on the surface of the histone octamers.
The periodical occurrence of dinucleotides with a period of 10.4 bases now is undeniably a hallmark of nucleosome positioning. Whereas many eukaryotic genomes contain visible and even strong signals for periodic distribution of dinucleotides, the human genome is rather featureless in this respect. The exact sequence features in the human genome that govern the nucleosome positioning remain largely unknown.
Linguistic (word count) analysis of prokaryotic genome sequences, by Shannon N-gram extension, reveals that the dominant hidden motifs in A+T rich genomes are T(A)(T)A and G(A)(T)C with uncertain number of repeating A and T. Since prokaryotic sequences are largely protein-coding, the motifs would correspond to amphipathic alpha-helices with alternating lysine and phenylalanine as preferential polar and non-polar residues. The motifs are also known in eukaryotes, as nucleosome positioning patterns. Their existence in prokaryotes as well may serve for binding of histone-like proteins to DNA. In this case the above patterns in prokaryotes may be considered as "anticipated" nucleosome positioning patterns which, quite likely, existed in prokaryotic genomes before the evolutionary separation between eukaryotes and prokaryotes.
High resolution sequence-directed nucleosome mapping is applied to 36,000 sequences containing splice junctions, from five different species. As it has been also shown in previous studies, the junctions are found to be preferentially located within nucleosomes. Moreover, the orientation of guanine residues at the GT- and AG-ends of introns within the nucleosomes is such that the guanines are positioned nearest to the surface of histone octamers, 3 and 4 bases upstream from the local DNA pseudo-dyads passing through minor grooves oriented outwards. Since the guanine residues are the most vulnerable to spontaneous damage within the cell (primarily, depurination and oxidation) such positioning of the splice junctions minimizes the damage that is caused by free radicals and highly reactive metabolites.
Various aspects of packaging DNA in eukaryotic cells are outlined in physical rather than biological terms. The informational and physical nature of packaging instructions encoded in DNA sequences is discussed with the emphasis on signal processing difficulties: very low signal-to-noise ratio and high degeneracy of the nucleosome positioning signal. As the author has been contributing to the field from its very onset in 1980, the review is mostly focused on the works of the author and his colleagues. The leading concept of the overview is the role of deformational properties of DNA in the nucleosome positioning. The target of the studies is to derive the DNA bendability matrix describing where along the DNA various dinucleotide elements should be positioned, to facilitate its bending in the nucleosome. Three different approaches are described leading to derivation of the DNA deformability sequence pattern, which is a simplified linear presentation of the bendability matrix. All three approaches converge to the same unique sequence motif CGRAAATTTYCG or, in binary form, YRRRRRYYYYYR, both representing the chromatin code.
Horizontal transfer (HT) is the event of a DNA sequence being transferred between species not by inheritance. This phenomenon violates the tree-like evolution of the species under study turning the trees into networks. At the sequence level, HT offers basic characteristics that enable not only clear identification and distinguishing from other sequence similarity cases but also the possibility of dating the events. We developed a novel, self-contained technique to identify relatively recent horizontal transfer elements (HTEs) in the sequences. Appropriate formalism allows one to obtain confidence values for the events detected. The technique does not rely on such problematic prerequisites as reliable phylogeny and/or statistically justified pairwise sequence alignment. In conjunction with the unique properties of HT, it gives rise to a two-level sequence similarity algorithm that, to the best of our knowledge, has not been explored. From evolutionary perspective, the novelty of the work is in the combination of small scale and large scale mutational events. The technique is employed on both simulated and real biological data. The simulation results show high capability of discriminating between HT and conserved regions. On the biological data, the method detected documented HTEs along with their exact locations in the recipient genomes. Supplementary Material is available online at www.libertonline.com/cmb.
Analysis of occurrence of simple amino acid repeats in a large ensemble of prokaryotic and eukaryotic sequences reveals that nearly all amino acids found in the repeats belong to those which have in their codon repertoires aggressively expanding triplets, all of three known pathologically expanding classes GCU (GCU, CUG, UGC, AGC, GCA, CAG), GCC (GCC, CCG, CGC, GGC, GCG, CGG), and AAG (AAG, AGA, GAA, CTT, TTC, TCT). This is observed especially clearly in the first exons of proteins of higher eukaryotes. The data are interpreted as manifestation of everlasting triplet expansions, which, presumably, started from the very origin of the triplet code. The spontaneous expansions continued to occur all the way during evolution, leaving their footprints in the protein-coding sequences as still visible simple amino acid repeats, as preferred triplets encoding the repeats, and as preferred codons in the codon usage tables.
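Each of the three triplet classes listed in this abstract consists of the cyclic rotations of one triplet together with the rotations of its reverse complement. A minimal sketch makes this explicit; the function names are illustrative, and the output uses the DNA alphabet (U written as T).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def rotations(triplet):
    """All cyclic rotations of a triplet."""
    return {triplet[i:] + triplet[:i] for i in range(len(triplet))}

def triplet_class(triplet):
    """Rotations of a triplet plus rotations of its reverse complement."""
    t = triplet.upper().replace("U", "T")
    return rotations(t) | rotations(reverse_complement(t))

for name in ("GCU", "GCC", "AAG"):
    print(name, sorted(triplet_class(name)))
```

The grouping reflects the fact that a tandem repeat of a triplet reads as any of its rotations depending on the frame, and as the reverse complements on the opposite strand.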
All major suggestions about the nucleosome positioning sequence pattern(s) are overviewed. Two basic binary periodical patterns are well established: YRRRRRYYYYYR in the purine/pyrimidine alphabet and SWWWWWSSSSSW in the strong/weak alphabet. Their merger in a four-letter alphabet sequence coincides with the first ever complete matrix of nucleosome DNA bendability, derived from a very large database of nucleosome DNA sequences. Its simplified linear form is CGGAAATTTCCG. Several independent ways of derivation of the same pattern are described. It appears that the pattern represents an ultimate solution of the long-standing problem of nucleosome positioning, and provides simple means for nucleosome mapping on sequences with single-base resolution.
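The relation between the linear form CGGAAATTTCCG and the binary purine/pyrimidine pattern can be checked directly, and the binary pattern can be slid along a sequence to see where it fits best. The sketch below is only an illustration of that idea, not the published mapping procedure (which works with a full bendability matrix); the function names and the simple match count are assumptions.

```python
# Purine (R) / pyrimidine (Y) classes of the four bases.
RY = {"A": "R", "G": "R", "C": "Y", "T": "Y"}
MOTIF = "YRRRRRYYYYYR"

def to_ry(seq):
    """Translate a DNA string into the binary R/Y alphabet (N for other symbols)."""
    return "".join(RY.get(base, "N") for base in seq.upper())

def motif_match_profile(seq, motif=MOTIF):
    """Number of positions agreeing with the motif at every offset of seq."""
    ry = to_ry(seq)
    width = len(motif)
    return [sum(a == b for a, b in zip(ry[i:i + width], motif))
            for i in range(len(ry) - width + 1)]

print(to_ry("CGGAAATTTCCG"))                     # binary form of the linear pattern
print(motif_match_profile("AACGGAAATTTCCGAA"))   # best fit where the pattern sits
```

In a mapping setting the profile would be computed with the motif repeated at the 10.4-base period over the length of nucleosome DNA, so that peaks indicate preferred rotational settings.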
The DNA in eukaryotic cells is packed into the chromatin that is composed of nucleosomes. Positioning of the nucleosome core particles on the sequence is a problem of great interest because of the role nucleosomes play in different cellular processes including gene regulation. Using the sequence structure of 10.4 base DNA repeat presented in our previous works and nucleosome core DNA sequences database, we have derived the complete nucleosome DNA bendability matrix of Caenorhabditis elegans. We have developed a web server named FineStr that allows users to upload genomic sequences in FASTA format and to perform a single-base-resolution nucleosome mapping on them.
DNA deformation in the nucleosome involves partial unstacking between bases and base pairs. By adjusting orientations of different base-pair stacks relative to the histone octamer surface, the optimal set of stacks and their positions is derived, resulting in a sequence pattern, theoretically best suitable for nucleosome DNA. The sequence is very much consistent with available experimental data, thus, suggesting a common eukaryotic nucleosome DNA bendability sequence pattern based exclusively on the very basics of DNA.
It is generally accepted that the organization of eukaryotic DNA into chromatin is strongly governed by a code inherent in the genomic DNA sequence. This code, as well as other codes, is superposed on the triplets coding for amino acids. The history of the chromatin code started three decades ago with the discovery of the periodic appearance of certain dinucleotides, with AA/TT and RR/YY giving the strongest signals, all with a period of 10.4 bases. Every base-pair stack in the DNA duplex has specific deformation properties, thus favoring DNA bending in a specific direction. The appearance of the corresponding dinucleotide at the distance 10.4 × n bases will facilitate DNA bending in that direction, which corresponds to the minimum energy of DNA folding in the nucleosome. We have analyzed the periodic appearances of all 16 dinucleotides in the genomes of thirteen different eukaryotic organisms. Our data show that a large variety of dinucleotides (if not all) are, apparently, contributing to the nucleosome positioning code. The choice of the periodical dinucleotides differs considerably from one organism to another. Among other 10.4 base periodicities, a strong and very regular 10.4 base signal was observed for CG dinucleotides in the genome of the honey bee A. mellifera. Also, the dinucleotide CG appears as the only periodical component in the human genome. This observation seems especially relevant since CpG methylation is well known to modulate chromatin packing and regularity. Thus, the selection of the dinucleotides contributing to the chromatin code is species specific, and may differ from region to region, depending on the sequence context.
A novel approach for evaluation of sequence relatedness via a network over the sequence space is presented. This relatedness is quantified by graph theoretical techniques. The graph is perceived as a flow network, and flow algorithms are applied. The number of independent pathways between nodes in the network is shown to reflect structural similarity of corresponding protein fragments. These results provide an appropriate parameter for quantitative estimation of such relatedness, as well as reliability of the prediction. They also demonstrate a new potential for sequence analysis and comparison by means of the flow network in the sequence space.
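Counting independent pathways between two nodes is a standard maximum-flow computation when every edge has unit capacity. The sketch below uses breadth-first augmenting paths (Edmonds-Karp) on an undirected graph and counts edge-disjoint paths. It is an illustration under assumptions: the abstract does not say whether edge- or vertex-independence is meant, and the toy graph and function name are invented for the example.

```python
from collections import defaultdict, deque

def count_edge_disjoint_paths(edges, source, target):
    """Maximum number of edge-disjoint paths between two nodes of an
    undirected graph, found as a unit-capacity maximum flow."""
    if source == target:
        raise ValueError("source and target must differ")
    capacity = defaultdict(int)
    neighbors = defaultdict(set)
    for u, v in edges:
        capacity[(u, v)] += 1   # an undirected edge is usable in either direction
        capacity[(v, u)] += 1
        neighbors[u].add(v)
        neighbors[v].add(u)
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and target not in parent:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in parent and capacity[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if target not in parent:
            return flow
        # Push one unit of flow back along the path found.
        v = target
        while parent[v] is not None:
            u = parent[v]
            capacity[(u, v)] -= 1
            capacity[(v, u)] += 1
            v = u
        flow += 1

# Hypothetical toy network: two routes from A to D, a single link from D to E.
toy_edges = [("A", "B"), ("B", "D"), ("A", "C"), ("C", "D"), ("D", "E")]
print(count_edge_disjoint_paths(toy_edges, "A", "D"))
print(count_edge_disjoint_paths(toy_edges, "A", "E"))
```

In the sequence-space setting, nodes would be sequence fragments and edges would link fragments above some similarity threshold; how the network is built is not specified here.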
Reconstruction of the earliest proteins in the ancient binary alphabet [glycine family G, alanine family A] leads to repeats of G alternating with repeats of A. In addition, omnipresent motifs can be assembled in two of the earliest genes involved in energy supply, crucial for Life, i.e. ATP/GTP binding and ATPase activity. They are an almost perfect match to the alternating G and A and are complementary to each other.
Proteins in their evolution appear to follow several discrete stages, which is reflected in their modular organization. The sequences of the protein modules are highly variable while their functions and structures are rather conserved. The relatedness of the variable sequences is well represented by the networks in natural protein sequence space that also suggests evolutionary connections.
The second parity rule of Chargaff (A≈T and G≈C within one strand) holds all over the living world with minor exceptions. It is maintained with higher accuracy for long sequences. The question addressed in the article is how different sequence types, with different biases from the parity, contribute to the general effect. It appears that the sequence segments with biases of opposite sign are intermingled, so that with sufficient sequence lengths the parity is established. The parity rule seems to be a cumulative result of a number of independent processes in the genome evolution, with the parity as their intrinsic property. Symmetrical appearance of simple repeats and of Alu sequences in the human DNA strands, and other contributions to the Chargaff parity II rule are discussed.
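The deviation from the second parity rule within one strand can be expressed as a simple skew, and a toy example shows how segments with opposite biases cancel when joined. This is a minimal sketch; the normalized skew measure and the toy sequences are assumptions made for illustration.

```python
from collections import Counter

def strand_skews(seq):
    """Normalized single-strand skews (A-T)/(A+T) and (G-C)/(G+C);
    both are 0 when the second parity rule holds exactly."""
    counts = Counter(seq.upper())

    def skew(x, y):
        total = counts[x] + counts[y]
        return (counts[x] - counts[y]) / total if total else 0.0

    return {"AT": skew("A", "T"), "GC": skew("G", "C")}

biased = "AAATGGGC"      # toy segment rich in A and G
opposite = "TTTACCCG"    # toy segment with the opposite bias
print(strand_skews(biased))
print(strand_skews(biased + opposite))
```

Applied in windows along a genome, the same measure would show the intermingling of positively and negatively biased segments that the abstract describes.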
If we define a genetic code as a widespread DNA sequence pattern that carries a message with an impact on biology, then there are multiple genetic codes. Sequences involved in these codes overlap and, thus, both interact with and constrain each other, such as for the triplet code, the intron-splicing code, the code for amphipathic alpha helices, and the chromatin code. Nucleosomes preferentially are located at the ends of exons, thus protecting splice junctions, with the N9 positions of guanines of the GT and AG junctions oriented toward the histones. Analysis of protein-coding sequences reveals numerous traces of tandem repeats, apparently formed by triplet expansion, which in effect is a genome inflation "code". Our data are consistent with the hypothesis that expansion of simple tandem repetition of certain aggressive triplets has been a characteristic of life from its emergence. Such expanding triplets appear to be the major factor underlying observed codon usage biases.
Apoptotic digestion of human lymphocyte chromatin results in the appearance of large amounts of nucleosome size DNA fragments. Sequencing of these fragments and analysis of the distribution of bases around the apoptotic nucleases cutting sites revealed a rather strong consensus sequence, not observed earlier. The consensus TAAAgTAcTTTA is characterized by complementary symmetry, resembling prokaryotic restriction sites. This consensus also possesses three TA dinucleotide steps, separated by five bases (corresponding to a half-period of the DNA double helix), suggesting strong bending of the DNA at the cut sites which is perhaps required for cutting.
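Both properties of the consensus stated in this abstract, its complementary symmetry and the spacing of its TA steps, can be verified in a few lines. The helper names below are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string (case-insensitive input)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def is_self_complementary(seq):
    """True if the sequence equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)

def step_positions(seq, step="TA"):
    """0-based start positions of a dinucleotide step."""
    s = seq.upper()
    return [i for i in range(len(s) - 1) if s[i:i + 2] == step]

consensus = "TAAAgTAcTTTA"
print(is_self_complementary(consensus))  # complementary symmetry
print(step_positions(consensus))         # TA steps spaced five bases apart
```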
A novel concept on mechanisms of evolution of genes and genomes is suggested: the sequences evolve largely by local events of triplet expansion and subsequent mutational changes in the repeats. The immediate memory about the earlier expansion events still resides in the sequences, in the form of the frequently occurring segments of tandemly repeating codons. Other predicted fossils of the original repeats are: (I) the expanding triplets should be accompanied by their point mutation derivatives and (II) the remaining excess of codons formerly belonging to the tandem repeats should be reflected in overall codon usage biases. Both predictions are confirmed by analysis of the largest available database of non-redundant protein coding sequences, of total size ~5 × 10^9 codons. One important conclusion also follows from the results. Life which, presumably, started with replication of expanding triplets and their subsequent mutational changes, is continuing to emerge within the genes and genomes, in the form of new events of triplet expansion.