Data mining has become a commonly used method for the analysis of organisational data, for purposes of summarizing data in useful ways and identifying non-trivial patterns and relationships in the data. Given the large volumes of data that are collected by business, government, non-government and scientific research organizations, a major challenge for data mining researchers and practitioners is how to select relevant data for analysis in sufficient quantities, in order to meet the objectives of a data mining task. This thesis addresses the problem of dataset selection for predictive data mining. Dataset selection was studied in the context of aggregate modeling for classification.
The central argument of this thesis is that, for predictive data mining, it is possible to systematically select many dataset samples and employ different approaches (different from current practice) to feature selection, training dataset selection, and model construction. When a large amount of information in a large dataset is utilised in the modeling process, the resulting models will have a high level of predictive performance and should be more reliable. Aggregate classification models, also known as ensemble classifiers, have been shown to provide a high level of predictive accuracy on small datasets. Such models are known to achieve a reduction in the bias and variance components of the prediction error of a model. The research for this thesis was aimed at the design of aggregate models and the selection of training datasets from large amounts of available data. The objectives for the model design and dataset selection were to reduce the bias and variance components of the prediction error for the aggregate models.
Design science research was adopted as the paradigm for the research. Large datasets obtained from the UCI KDD Archive were used in the experiments. Two classification algorithms were used in the experiments: See5 for classification tree modeling, and K-Nearest Neighbour. The two methods of aggregate modeling that were studied are One-Vs-All (OVA) and positive-Vs-negative (pVn) modeling. While OVA is an existing method that has been used for small datasets, pVn is a new method of aggregate modeling, proposed in this thesis. Methods for feature selection from large datasets, and methods for training dataset selection from large datasets, for OVA and pVn aggregate modeling, were studied.
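As an illustration of the general OVA idea only, the following minimal Python sketch trains one binary base model per class and combines them by taking the class whose base model gives the highest score. The nearest-centroid base learner, the scoring rule, and the names `train_ova` and `predict_ova` are assumptions made for this sketch; they stand in for See5 and K-Nearest Neighbour and are not the base models or the combination algorithm designed in the thesis.

```python
import math

def centroid(rows):
    """Component-wise mean of a list of numeric feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def train_ova(X, y):
    """Train one binary base model per class (hypothetical stand-in learner).

    Each base model stores the centroid of its own class (positive)
    and the centroid of all other classes (negative).
    """
    models = {}
    for cls in sorted(set(y)):
        pos = [x for x, label in zip(X, y) if label == cls]
        neg = [x for x, label in zip(X, y) if label != cls]
        models[cls] = (centroid(pos), centroid(neg))
    return models

def predict_ova(models, x):
    """Combine base models: return the class with the highest score.

    Assumed score: distance to the negative centroid minus distance
    to the positive centroid (larger means 'more like this class').
    """
    def score(cls):
        pos_c, neg_c = models[cls]
        return math.dist(x, neg_c) - math.dist(x, pos_c)
    return max(models, key=score)
```

With three well-separated classes, for example, `predict_ova(train_ova(X, y), x)` returns the label of the class whose region contains `x`. The pVn method is not sketched here because this abstract does not describe its construction.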
The feature selection experiments revealed that the use of many samples, robust measures of correlation, and validation procedures results in the reliable selection of relevant features for classification. A new algorithm for feature subset search, based on the decision rule-based approach to heuristic search, was designed, and the performance of this algorithm was compared to two existing algorithms for feature subset search. The experimental results revealed that the new algorithm makes better decisions for feature subset search. The information provided by a confusion matrix was used as a basis for the design of OVA and pVn base models, which are combined into one aggregate model. A new construct called a confusion graph was used in conjunction with new algorithms for the design of pVn base models. A new algorithm for combining base model predictions and resolving conflicting predictions was designed and implemented. Experiments to study the performance of the OVA and pVn aggregate models revealed that the aggregate models provide a high level of predictive accuracy compared to single models. Finally, theoretical models to depict the relationships between the factors that influence feature selection and training dataset selection for aggregate models are proposed, based on the experimental results.
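To make the role of the confusion matrix concrete, the following minimal Python sketch counts (actual, predicted) label pairs and lists the misclassified pairs in order of frequency. It shows only the kind of information a confusion matrix provides; the function names `confusion_matrix` and `confused_pairs` are hypothetical, and the sketch does not reproduce the confusion graph construct or the base model design algorithms of the thesis, which this abstract does not define.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def confused_pairs(matrix):
    """Return off-diagonal entries as (actual, predicted, count),
    most frequent first, with ties broken alphabetically."""
    errors = [(a, p, n) for (a, p), n in matrix.items() if a != p]
    return sorted(errors, key=lambda e: (-e[2], e[0], e[1]))
```

Class pairs that appear near the top of such a list are the ones a single model confuses most often, which is one plausible reading of why this information is useful when deciding how to design base models.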
© 2010, University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
Please cite as follows:
Lutu, PEN 2010, Dataset selection of aggregate model implementation in predictive data mining, PhD thesis, University of Pretoria, Pretoria, viewed yymmdd < http://upetd.up.ac.za/thesis/available/etd-11152010-203041/ >