15.15: Genome Annotation and Assembly
The genome is all of the genetic material in an organism. Genome size ranges from a few million base pairs in microbial cells to several billion base pairs in many eukaryotic organisms. Genome assembly is the process of taking DNA sequencing data and piecing the sequenced fragments back together in the correct order to create a close representation of the original genome. This is followed by the identification of functional elements on the newly assembled genome, a process called genome annotation.
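The idea of piecing reads back together in the correct order can be illustrated with a minimal sketch of greedy overlap merging, written here in Python. This is a toy under stated assumptions (error-free reads, a single strand, no repeats), not how production assemblers work; the function names `overlap` and `greedy_assemble` and the `min_len` parameter are hypothetical and chosen only for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Assumes error-free, same-strand reads (an illustrative simplification).
    Returns the list of contigs left when no overlaps remain.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing left to merge; remaining reads stay separate
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping reads drawn from the made-up sequence ATGGCGTACGTTAGC
contigs = greedy_assemble(["ATGGCGTA", "GCGTACGT", "ACGTTAGC"])
print(contigs)
```

With real data, sequencing errors and repeated sequences make the longest overlap ambiguous or misleading, which is one reason greedy merging alone is not sufficient in practice.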
Genome assembly is a complicated process. Human genomes within a population can differ in gene copy number and in repeated sequences, which complicates assembly, but the physical location of the genes remains constant. In contrast, bacterial genes are not always in the same location, and multiple copies of the same gene may appear at different positions in the genome, which further complicates the assembly of bacterial genomes. As a result, a single genome assembly from one organism cannot represent all the diversity within the population of a species.
Furthermore, technological and algorithmic errors add further complexity to genome assembly. As a result, many published genomes are continually updated as sequencing technologies and assembly and annotation tools advance. For example, build 37 of the human reference genome (GRCh37) was released in 2009, and an updated version, build 38 (GRCh38), was made available in 2013.
Additionally, genome annotation tools have evolved over the last few decades, increasing the resolution of annotation. They have progressed from annotating only long protein-coding genes and regulatory elements to annotating single nucleotides within a population.
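The coarse end of this range, locating candidate protein-coding regions, can be sketched as a simple open reading frame (ORF) scan. The Python below is a minimal illustration, assuming the standard start codon ATG and the stop codons TAA, TAG, and TGA, and scanning only the forward strand; real annotation tools use far richer evidence. The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    `start` is the 0-based index of the ATG, `end` is exclusive and
    includes the stop codon. ORFs shorter than `min_codons` codons
    (counting the stop codon) are skipped.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # resume searching after the stop codon
    return orfs


# Made-up example sequence containing one short ORF
example = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(example):
    print(start, end, frame, example[start:end])
```

A scan like this only proposes candidates; it says nothing about regulatory elements or single-nucleotide variation, which require comparison against other genomes or experimental data.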
Both genome assembly and annotation are essential tools for genome analysis that lead to precise insights into the biology of species, populations, and individuals.