Interpreting correlations in biosequences
Understanding the complex organization of genomes, predicting the location of genes, and predicting the possible structure of the gene products are some of the most important problems in current molecular biology. Many statistical techniques are used to address these issues, and correlation functions play a central role among them. This paper is based on an analysis of the decay of the entire 4×4 covariance matrix of DNA sequences. We apply this covariance analysis to human chromosomal regions, yeast DNA, and bacterial genomes and interpret the three most pronounced statistical features (long-range correlations, a period of 3, and a period of 10–11) using known biological facts about the structure of genomes. For example, we relate the slowly decaying long-range G+C correlations to dispersed repeats and CpG islands. We show quantitatively that the 3-basepair periodicity is due to the nonuniformity of codon usage in protein-coding segments. Finally, we show that periodicities of 10–11 basepairs in yeast DNA originate from an alternation of hydrophobic and hydrophilic amino acids in protein sequences.
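As an illustration of the 4×4 covariance matrix the abstract refers to, the sketch below uses one common definition, C_ab(k) = P_ab(k) − p_a·p_b, where P_ab(k) is the frequency of finding base a followed by base b at distance k. This is an assumption: the record does not state the paper's exact estimator, and the names `ALPHABET` and `covariance_matrix` are hypothetical.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed base ordering; any fixed order works


def covariance_matrix(seq, k):
    """Return the 4x4 matrix C[a, b] = P_ab(k) - p_a * p_b for distance k >= 1.

    Assumes seq contains only the letters A, C, G, T.
    """
    idx = np.array([ALPHABET.index(ch) for ch in seq])
    left, right = idx[:-k], idx[k:]          # all base pairs separated by k

    # Joint frequencies of (base at i, base at i + k).
    joint = np.zeros((4, 4))
    np.add.at(joint, (left, right), 1.0)
    joint /= joint.sum()

    # Marginals taken from the same pairs, so every row and column of C sums to 0.
    p_left = joint.sum(axis=1)
    p_right = joint.sum(axis=0)
    return joint - np.outer(p_left, p_right)


# Toy sequence with an exact period of 3.
toy = "ACG" * 200
c1 = covariance_matrix(toy, 1)
c3 = covariance_matrix(toy, 3)
```

On this toy sequence the diagonal entry for A is positive at k = 3 and negative at k = 1, which is the kind of period-3 signature the abstract attributes to nonuniform codon usage in real coding DNA.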
Year of publication: | 1998 |
---|---|
Authors: | Herzel, H ; Trifonov, E.N ; Weiss, O ; Große, I |
Published in: | Physica A: Statistical Mechanics and its Applications. - Elsevier, ISSN 0378-4371. - Vol. 249.1998, 1, p. 449-459 |
Publisher: | Elsevier |
Subject: | Correlation function ; DNA sequence ; Genetic code ; Protein sequence ; Hydrophobicity |
Similar items by subject
- A novel method for similarity/dissimilarity analysis of protein sequences. Mu, Zengchao, (2013)
- 3D graphical representation of protein sequences and their statistical characterization. Abo el Maaty, Moheb I., (2010)
- Primordial synthesis machines and the origin of the genetic code. Aldana, M., (1998)