Data quality awareness: A case study for cost optimal association rule mining

Research output: Contribution to journal › Article

14 Citations (Scopus)

Abstract

The quality of discovered association rules is usually evaluated with interestingness measures (most often support and confidence), which give the user indicators for understanding and using the newly discovered knowledge. Low-quality datasets severely degrade the quality of the discovered association rules, and one might legitimately wonder whether a so-called "interesting" rule, denoted LHS → RHS, is meaningful when 30% of the LHS data are no longer up to date, 20% of the RHS data are inaccurate, and 15% of the LHS data come from a source known for its poor credibility. This paper presents an overview of data quality characterization and management techniques that can be used to improve the quality awareness of knowledge discovery and data mining processes. We propose integrating data quality indicators into association rule mining to make it quality-aware, together with a cost-based probabilistic model for selecting legitimately interesting rules. Experiments on the challenging KDD-Cup-98 datasets show that variations in data quality have a great impact on the cost and quality of discovered association rules. They also support our approach of managing data quality indicators as an integrated part of the KDD process in order to ensure the quality of data mining results.
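As a minimal illustration of the two interestingness measures named in the abstract, the sketch below computes support and confidence for a rule LHS → RHS over a toy transaction list. The `rule_quality` helper and its per-item `freshness` scores are hypothetical, added only to show how a quality indicator could sit alongside those measures; this is not the paper's cost-based probabilistic model, and the data are invented for the example.

```python
from typing import Dict, FrozenSet, List

Itemset = FrozenSet[str]


def support(transactions: List[Itemset], itemset: Itemset) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions: List[Itemset], lhs: Itemset, rhs: Itemset) -> float:
    """support(LHS ∪ RHS) / support(LHS); 0.0 when LHS never occurs."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0.0:
        return 0.0
    return support(transactions, lhs | rhs) / lhs_support


def rule_quality(scores: Dict[str, float], lhs: Itemset, rhs: Itemset) -> float:
    """Hypothetical indicator: the worst per-item quality score in the rule.

    Items without a score are treated as fully trusted (1.0); that default
    is an assumption of this sketch, not something taken from the paper.
    """
    return min(scores.get(item, 1.0) for item in lhs | rhs)


# Toy data, invented for illustration only.
transactions = [
    frozenset({"a", "b"}),
    frozenset({"a", "b", "c"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
]
freshness = {"a": 0.7, "b": 0.9}  # hypothetical per-item quality scores in [0, 1]

lhs, rhs = frozenset({"a"}), frozenset({"b"})
rule_support = support(transactions, lhs | rhs)
rule_confidence = confidence(transactions, lhs, rhs)
rule_min_quality = rule_quality(freshness, lhs, rhs)
```

Here a rule can score well on support and confidence while still carrying a low quality score, which is the situation the abstract warns about.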

Original language: English
Pages (from-to): 191-215
Number of pages: 25
Journal: Knowledge and Information Systems
Volume: 11
Issue number: 2
Publication status: Published - 1 Feb 2007
Externally published: Yes

Keywords

  • Association rule mining
  • Cost model
  • Data quality management
  • Data quality metadata
  • Quality awareness mining

ASJC Scopus subject areas

  • Information Systems