Data quality awareness: A case study for cost optimal association rule mining

Research output: Contribution to journal, Article

12 Citations (Scopus)

Abstract

The quality of discovered association rules is commonly evaluated by interestingness measures (typically support and confidence), which give the user indicators for understanding and using the newly discovered knowledge. Low-quality datasets severely degrade the quality of the discovered association rules, and one might legitimately wonder whether a so-called "interesting" rule, noted LHS → RHS, is meaningful when 30% of the LHS data are no longer up to date, 20% of the RHS data are inaccurate, and 15% of the LHS data come from a source known for its poor credibility. This paper presents an overview of data quality characterization and management techniques that can be employed to improve the quality awareness of knowledge discovery and data mining processes. We propose integrating data quality indicators for quality-aware association rule mining, together with a cost-based probabilistic model for selecting legitimately interesting rules. Experiments on the challenging KDD-Cup-98 datasets show that variations in data quality have a great impact on the cost and quality of discovered association rules, and they support our approach of integrating the management of data quality indicators into the KDD process to ensure the quality of data mining results.
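For readers unfamiliar with the two interestingness measures named in the abstract, the following is a minimal sketch of how support and confidence are conventionally computed for a rule LHS → RHS. It is not the paper's cost-based probabilistic model. The toy transactions, the per-item quality scores, and the helper `rule_quality` are hypothetical and serve only to show how a quality indicator might sit alongside the classical measures.

```python
from typing import Dict, FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[Transaction], lhs: Iterable[str], rhs: Iterable[str]
) -> float:
    """support(LHS and RHS together) divided by support(LHS)."""
    lhs_items = frozenset(lhs)
    lhs_support = support(transactions, lhs_items)
    if lhs_support == 0.0:
        return 0.0
    return support(transactions, lhs_items | frozenset(rhs)) / lhs_support


def rule_quality(
    item_quality: Dict[str, float], lhs: Iterable[str], rhs: Iterable[str]
) -> float:
    """Hypothetical indicator: the worst quality score among the rule's items.

    This is an illustrative assumption, not the model proposed in the paper.
    """
    return min(item_quality[i] for i in frozenset(lhs) | frozenset(rhs))


# Toy data (invented for illustration only).
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
item_quality = {"bread": 0.7, "milk": 0.85, "butter": 0.9}

rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
rule_q = rule_quality(item_quality, {"bread"}, {"milk"})
print(rule_support, rule_confidence, rule_q)
```

In this sketch, a rule that passes the support and confidence thresholds could still be flagged when its lowest item quality score is poor, which is the kind of situation the abstract warns about.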

Original language: English
Pages (from-to): 191-215
Number of pages: 25
Journal: Knowledge and Information Systems
Volume: 11
Issue number: 2
DOI: 10.1007/s10115-006-0006-x
Publication status: Published - 1 Feb 2007
Externally published: Yes

Fingerprint

Association rules
Data mining
Costs
Experiments

Keywords

  • Association rule mining
  • Cost model
  • Data quality management
  • Data quality metadata
  • Quality awareness mining

ASJC Scopus subject areas

  • Information Systems

Cite this

Data quality awareness: A case study for cost optimal association rule mining. / Berti-Equille, Laure.

In: Knowledge and Information Systems, Vol. 11, No. 2, 01.02.2007, p. 191-215.

@article{3f926cbfd3604697b82c69bc34af7aa1,
title = "Data quality awareness: A case study for cost optimal association rule mining",
abstract = "The quality of discovered association rules is commonly evaluated by interestingness measures (commonly support and confidence) with the purpose of supplying indicators to the user in the understanding and use of the new discovered knowledge. Low-quality datasets have a very bad impact over the quality of the discovered association rules, and one might legitimately wonder if a so-called {"}interesting{"} rule noted LHS → RHS is meaningful when 30{\%} of the LHS data are not up-to-date anymore, 20{\%} of the RHS data are not accurate, and 15{\%} of the LHS data come from a data source that is well-known for its bad credibility. This paper presents an overview of data quality characterization and management techniques that can be advantageously employed for improving the quality awareness of the knowledge discovery and data mining processes. We propose to integrate data quality indicators for quality aware association rule mining. We propose a cost-based probabilistic model for selecting legitimately interesting rules. Experiments on the challenging KDD-Cup-98 datasets show that variations on data quality have a great impact on the cost and quality of discovered association rules and confirm our approach for the integrated management of data quality indicators into the KDD process that ensure the quality of data mining results.",
keywords = "Association rule mining, Cost model, Data quality management, Data quality metadata, Quality awareness mining",
author = "Laure Berti-Equille",
year = "2007",
month = "2",
day = "1",
doi = "10.1007/s10115-006-0006-x",
language = "English",
volume = "11",
pages = "191--215",
journal = "Knowledge and Information Systems",
issn = "0219-1377",
publisher = "Springer London",
number = "2",

}

TY - JOUR

T1 - Data quality awareness

T2 - A case study for cost optimal association rule mining

AU - Berti-Equille, Laure

PY - 2007/2/1

Y1 - 2007/2/1

N2 - The quality of discovered association rules is commonly evaluated by interestingness measures (commonly support and confidence) with the purpose of supplying indicators to the user in the understanding and use of the new discovered knowledge. Low-quality datasets have a very bad impact over the quality of the discovered association rules, and one might legitimately wonder if a so-called "interesting" rule noted LHS → RHS is meaningful when 30% of the LHS data are not up-to-date anymore, 20% of the RHS data are not accurate, and 15% of the LHS data come from a data source that is well-known for its bad credibility. This paper presents an overview of data quality characterization and management techniques that can be advantageously employed for improving the quality awareness of the knowledge discovery and data mining processes. We propose to integrate data quality indicators for quality aware association rule mining. We propose a cost-based probabilistic model for selecting legitimately interesting rules. Experiments on the challenging KDD-Cup-98 datasets show that variations on data quality have a great impact on the cost and quality of discovered association rules and confirm our approach for the integrated management of data quality indicators into the KDD process that ensure the quality of data mining results.

AB - The quality of discovered association rules is commonly evaluated by interestingness measures (commonly support and confidence) with the purpose of supplying indicators to the user in the understanding and use of the new discovered knowledge. Low-quality datasets have a very bad impact over the quality of the discovered association rules, and one might legitimately wonder if a so-called "interesting" rule noted LHS → RHS is meaningful when 30% of the LHS data are not up-to-date anymore, 20% of the RHS data are not accurate, and 15% of the LHS data come from a data source that is well-known for its bad credibility. This paper presents an overview of data quality characterization and management techniques that can be advantageously employed for improving the quality awareness of the knowledge discovery and data mining processes. We propose to integrate data quality indicators for quality aware association rule mining. We propose a cost-based probabilistic model for selecting legitimately interesting rules. Experiments on the challenging KDD-Cup-98 datasets show that variations on data quality have a great impact on the cost and quality of discovered association rules and confirm our approach for the integrated management of data quality indicators into the KDD process that ensure the quality of data mining results.

KW - Association rule mining

KW - Cost model

KW - Data quality management

KW - Data quality metadata

KW - Quality awareness mining

UR - http://www.scopus.com/inward/record.url?scp=33947395207&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33947395207&partnerID=8YFLogxK

U2 - 10.1007/s10115-006-0006-x

DO - 10.1007/s10115-006-0006-x

M3 - Article

AN - SCOPUS:33947395207

VL - 11

SP - 191

EP - 215

JO - Knowledge and Information Systems

JF - Knowledge and Information Systems

SN - 0219-1377

IS - 2

ER -