Genomics
Inferring genome compositional organization
is one of the most important questions in genomics. It has been claimed that animal genomes are composed of a mosaic of sequence stretches of
variable lengths that differ widely in their GC compositions. In all animals studied
so far, the length distribution of GC-content domains (also known as isochores) was found to follow a heavy-tailed distribution with power-law decay
exponents ranging from -1.12 to -1.15. One of the most common ways to describe a genome is by means of the nucleotide distribution, particularly the distribution of GC content. If
complete genomic data are unavailable, the genomic composition can be deduced from the GC content distributions along short scaffolds of genes and their
flanking regions.
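As a rough illustration of what a GC content distribution along a sequence looks like in practice, the sketch below computes GC content in consecutive fixed-size windows. The function names and the window size are illustrative assumptions; fixed windows are only a simple stand-in and are not the segmentation approach discussed on this page.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Sequences made only of ambiguous symbols (e.g. N) have no defined GC content.
    return gc / total if total else float("nan")


def windowed_gc(seq, window=1000):
    """GC content of consecutive, non-overlapping windows (window size is an assumption)."""
    return [gc_content(seq[i:i + window]) for i in range(0, len(seq), window)]
```

For example, `windowed_gc(scaffold, window=1000)` would return one GC value per kilobase of a hypothetical `scaffold` string, and a histogram of those values gives a first look at the compositional distribution.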
Many of the theories concerning the evolution of isochores are also based on studies that used GC3 as a predictor for isochore composition or that simply assumed the existence of isochores. I showed that these theories cannot be discussed without further analysis of genomic data (Elhaik et al. 2009). Moreover, approaches to the issue of compositional and structural organization should not assume the existence of isochores in a genome but should rather be based on a model that has been tested on complete genomic sequences. These and other findings, as well as the relationship between genomic composition and methylation, appeared in Elhaik and Tatarinova (2012).
Because different segmentation algorithms yield different results (Elhaik et al. 2010a), largely because of user input that drastically affects the outcome, it is necessary to use a parameter-free algorithm that does not depend on the user and considers only the sequence. Such an algorithm was developed by Elhaik et al. (2010b) (see IsoPlotter) and was used to develop a new model for compositional genomics that would explain the findings that Bernardi's isochore model failed to explain.
In Elhaik and Graur (2014), we carried out an extensive analysis of genome composition. In this paper, we:
1. Refuted the isochore theory (hopefully for the last time).
2. Provided a detailed description of the mammalian genome landscape.
3. Invalidated the clade Euarchontoglires (Murids are closer to Primates).
Our findings depict the mammalian genome as a tapestry of mostly short homogeneous and nonhomogeneous domains and a few long ones, thus providing strong evidence in favor of the compositional domain model.
Analyzing genome compositional organization
is an essential step before analyzing any new genome.
We modeled the genome organization and developed phylogenetic applications that can help in deciding the position of a species in a tree. The Compositional Domain Model we developed was found to describe the genome organization of eutherian and other species consistently and was also useful for genomic comparisons. Shortly after its introduction, the Compositional Domain Model became the model of choice for genomic analyses and has been applied in numerous genome sequencing projects, such as the honeybee (Sodergren et al. 2006a) and its newest build (Sodergren et al. 2006a), sea urchin (Sodergren et al. 2006b), red flour beetle (Richards et al. 2008), cow (Gibbs et al. 2009), nasonia (Warren et al. 2010), body louse (Kirkness et al. 2010), and many ant genomes, such as the red harvester ant (Smith et al. 2011), the invasive Argentine ant (Smith et al. 2011), the leaf-cutter ant (Suen et al. 2011), and others (Suen et al. 2011). This work was widely covered by the media (see Press).
Analyzing base composition of DNA
is important to the understanding of genome organization.
The nucleotide composition of genomes varies dramatically between and among taxa. The GC content is the
primary measure to characterize genomic regions in terms of homogeneity, compositional bias, and compositional constraints.
Zhang and Zhang (1991) proposed the Z-curve, an extension to the GC content measure, based on a three-coordinate system of x, y, and z
and a derived measure, namely the "genome order index", defined as S = a^2 + c^2 + t^2 + g^2, where a, c, t, and g are the
nucleotide frequencies of A, C, T, and G, respectively. The fact that the numerical value of S is smaller than 1/3 for
almost all DNA sequences of 809 genomes has been erroneously interpreted as supporting evidence for the existence of
genome-specific constraints on nucleotide composition of naturally occurring DNA, i.e., isochores.
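To make the definition of S concrete, here is a minimal sketch that computes the genome order index from a sequence string. The function name is hypothetical, and ignoring symbols other than A, C, G, and T is an assumption made for the example rather than part of the published definition.

```python
from collections import Counter


def genome_order_index(seq):
    """S = a^2 + c^2 + t^2 + g^2, using frequencies over the A, C, G, T bases only."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return sum((counts[b] / total) ** 2 for b in "ACGT")
```

By construction, S equals 1/4 when the four nucleotides are equally frequent and reaches 1 when the sequence consists of a single nucleotide, so values below 1/3 simply indicate frequencies that are not far from uniform.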
We studied the Z-curve method and the "genome order index", which their developers purported to be useful measures. In two consecutive papers, Elhaik et al. (2008) and Elhaik et al. (2010) showed that these calculations are in error and that the Z-curve suffers from overdimensionality, as the Z dimension, which stands for GC content, suffices to represent any given genome. This work established the importance of utilizing the GC content to study genome composition and organization.
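The link between the Z dimension and GC content can be sketched in a few lines. The example assumes the standard Z-curve formulation, in which the z coordinate after n bases is the cumulative count of A and T minus the cumulative count of G and C; that formula is not spelled out in the text above, and the function name is hypothetical.

```python
def z_component(seq):
    """Cumulative z coordinate: +1 for A or T, -1 for G or C, unchanged otherwise."""
    z = 0
    curve = []
    for base in seq.upper():
        if base in "AT":
            z += 1
        elif base in "GC":
            z -= 1
        # Ambiguous symbols such as N leave z unchanged (an assumption of this sketch).
        curve.append(z)
    return curve
```

For a sequence of N unambiguous bases, the final value divided by N equals 1 minus twice the GC content, so under this formulation the z coordinate carries the same information as GC content.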