Estimating and evaluating the statistics of gapped local-alignment scores
We present a novel maximum-likelihood-based algorithm for estimating the distribution of alignment scores from the scores of unrelated sequences in a database search. Using a new method for measuring the accuracy of p-values, we show that our maximum-likelihood-based algorithm is more accurate than existing regression-based and lookup table methods. We explore a more sophisticated way of modeling and estimating the score distributions (using a two-component mixture model and expectation maximization), but conclude that this does not improve significantly over simply ignoring scores with small E-values during estimation. Finally, we measure the classification accuracy of p-values estimated in different ways and observe that inaccurate p-values can, somewhat paradoxically, lead to higher classification accuracy. We explain this paradox and argue that statistical accuracy, not classification accuracy, should be the primary criterion in comparisons of similarity search methods that return p-values that adjust for target sequence length.
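As an illustration of the kind of estimation the abstract describes, the following is a minimal sketch of maximum-likelihood fitting of a score distribution. It assumes, as is common for local-alignment scores but is not stated in this record, that scores of unrelated sequences follow an extreme value (Gumbel) distribution with location `mu` and decay rate `lam`. The function names `fit_gumbel_ml` and `gumbel_pvalue` are hypothetical. This is not the authors' algorithm, and it omits their handling of scores with small E-values.

```python
import math


def fit_gumbel_ml(scores, tol=1e-10):
    """Maximum-likelihood fit of a Gumbel (max) distribution.

    Assumption (not from the source): scores ~ Gumbel(mu, 1/lam).
    Returns (lam, mu). Solves the standard likelihood equation
        1/lam - mean(x) + sum(x*exp(-lam*x)) / sum(exp(-lam*x)) = 0
    by bisection; the left-hand side is strictly decreasing in lam.
    """
    n = len(scores)
    x_min = min(scores)
    if n < 2 or max(scores) == x_min:
        raise ValueError("need at least two distinct scores")
    mean = sum(scores) / n

    def weights(lam):
        # Shift by x_min so the largest weight is exp(0) = 1 (no overflow).
        return [math.exp(-lam * (x - x_min)) for x in scores]

    def f(lam):
        w = weights(lam)
        weighted_mean = sum(x * wi for x, wi in zip(scores, w)) / sum(w)
        return 1.0 / lam - mean + weighted_mean

    # Bracket the root: f > 0 near zero, f < 0 for large lam.
    lo, hi = 1e-8, 1.0
    while f(hi) > 0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)

    # Closed-form location parameter given lam.
    mu = x_min - math.log(sum(weights(lam)) / n) / lam
    return lam, mu


def gumbel_pvalue(score, lam, mu):
    """P(S >= score) under the fitted Gumbel distribution."""
    return -math.expm1(-math.exp(-lam * (score - mu)))
```

Bisection is used here instead of Newton's method because the likelihood equation is monotone in `lam`, so convergence does not depend on a starting guess.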
Year of publication: | 2002-01-01 |
---|---|
Authors: | Bailey, T. L. ; Gribskov, M. |
Other Persons: | M.S. Waterman (contributor) ; S. Istrail (contributor) |
Publisher: | Mary Ann Liebert, Inc. Publishers |
Subject: | Mathematics | Interdisciplinary Applications | Biochemical Research Methods | Biotechnology & Applied Microbiology | Computer Science | Statistics & Probability | Statistics | Sequence Alignment | Homology Search | Evaluation | Sequence | Database |
freely available
Similar items by subject
- Modelling High-Dimensional Data by Mixtures of Factor Analyzers
  McLachlan, G. J. (2003)
- On modifications to the long-term survival mixture model in the presence of competing risks
  Ng, SK (1998)
- Product form approximations for highly linear loss networks with trunk reservation
  Bebbington, M. (2003)