Temporal rules discovery for web data cleaning

Ziawasch Abedjan, Cuneyt G. Akcora, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker

Research output: Chapter in Book/Report/Conference proceeding › Chapter

21 Citations (Scopus)

Abstract

Declarative rules, such as functional dependencies, are widely used for cleaning data. Several systems take them as input for detecting errors and computing a "clean" version of the data. To support domain experts in specifying these rules, several tools have been proposed to profile the data and mine rules. However, existing discovery techniques have traditionally ignored the time dimension. Recurrent events, such as persons reported in locations, have a duration in which they are valid, and this duration should be part of the rules or the cleaning process would simply fail. In this work, we study the rule discovery problem for temporal web data. Such a discovery process is challenging because of the nature of web data; extracted facts are (i) sparse over time, (ii) reported with delays, and (iii) often reported with errors over the values because of inaccurate sources or non-robust extractors. We handle these challenges with a new discovery approach that is more robust to noise. Our solution uses machine learning methods, such as association measures and outlier detection, for the discovery of the rules, together with an aggressive repair of the data in the mining step itself. Our experimental evaluation over real-world data from Recorded Future, an intelligence company that monitors over 700K Web sources, shows that temporal rules improve the quality of the data with an increase of the average precision in the cleaning process from 0.37 to 0.84, and a 40% relative increase in the average F-measure.
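
As a rough illustration of why a duration matters in such rules, the sketch below checks one hypothetical temporal rule: a person reported in one location should not be reported in a different location within a fixed time window. The `Fact` record, the `find_violations` helper, the sample data, and the 2-hour window are all assumptions made for this example and are not taken from the paper. The check is a plain pairwise comparison, not the discovery approach described in the abstract.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import combinations


@dataclass(frozen=True)
class Fact:
    """One extracted fact: a person reported at a location at a given time (hypothetical schema)."""
    person: str
    location: str
    reported_at: datetime


def find_violations(facts, window):
    """Return pairs of facts about the same person with different locations
    whose report times are less than `window` apart."""
    violations = []
    ordered = sorted(facts, key=lambda f: f.reported_at)
    for a, b in combinations(ordered, 2):
        if (a.person == b.person
                and a.location != b.location
                and b.reported_at - a.reported_at < window):
            violations.append((a, b))
    return violations


# Made-up sample data for illustration only.
facts = [
    Fact("Alice", "Delhi", datetime(2016, 9, 5, 9, 0)),
    Fact("Alice", "Doha", datetime(2016, 9, 5, 10, 0)),  # 1 hour after the Delhi report
    Fact("Alice", "Doha", datetime(2016, 9, 6, 9, 0)),   # 24 hours after the Delhi report
    Fact("Bob", "Boston", datetime(2016, 9, 5, 9, 30)),
]

# Assumed duration for the rule; in the paper's setting it would be discovered from the data.
violations = find_violations(facts, window=timedelta(hours=2))
for a, b in violations:
    print(f"{a.person}: {a.location} vs {b.location}")
```

With an unbounded window (for example `timedelta.max`), the rule degenerates into a plain functional dependency from person to location and also flags the next-day report, which is the kind of over-flagging a duration is meant to avoid.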

Original language: English
Title of host publication: Proceedings of the VLDB Endowment
Publisher: Association for Computing Machinery
Pages: 336-347
Number of pages: 12
Volume: 9
Edition: 4
Publication status: Published - 2016
Event: 42nd International Conference on Very Large Data Bases, VLDB 2016 - Delhi, India
Duration: 5 Sep 2016 to 9 Sep 2016

Other

Other: 42nd International Conference on Very Large Data Bases, VLDB 2016
Country: India
City: Delhi
Period: 5/9/16 to 9/9/16

Fingerprint

Cleaning
Learning systems
Repair
Industry

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Abedjan, Z., Akcora, C. G., Ouzzani, M., Papotti, P., & Stonebraker, M. (2016). Temporal rules discovery for web data cleaning. In Proceedings of the VLDB Endowment (4 ed., Vol. 9, pp. 336-347). Association for Computing Machinery.

Temporal rules discovery for web data cleaning. / Abedjan, Ziawasch; Akcora, Cuneyt G.; Ouzzani, Mourad; Papotti, Paolo; Stonebraker, Michael.

Proceedings of the VLDB Endowment. Vol. 9 4. ed. Association for Computing Machinery, 2016. p. 336-347.

Abedjan, Z, Akcora, CG, Ouzzani, M, Papotti, P & Stonebraker, M 2016, Temporal rules discovery for web data cleaning. in Proceedings of the VLDB Endowment. 4 edn, vol. 9, Association for Computing Machinery, pp. 336-347, 42nd International Conference on Very Large Data Bases, VLDB 2016, Delhi, India, 5/9/16.
Abedjan Z, Akcora CG, Ouzzani M, Papotti P, Stonebraker M. Temporal rules discovery for web data cleaning. In Proceedings of the VLDB Endowment. 4 ed. Vol. 9. Association for Computing Machinery. 2016. p. 336-347
Abedjan, Ziawasch ; Akcora, Cuneyt G. ; Ouzzani, Mourad ; Papotti, Paolo ; Stonebraker, Michael. / Temporal rules discovery for web data cleaning. Proceedings of the VLDB Endowment. Vol. 9 4. ed. Association for Computing Machinery, 2016. pp. 336-347
@inbook{b779946dde534f858a8b1f415393a204,
title = "Temporal rules discovery for web data cleaning",
abstract = "Declarative rules, such as functional dependencies, are widely used for cleaning data. Several systems take them as input for detecting errors and computing a {"}clean{"} version of the data. To support domain experts in specifying these rules, several tools have been proposed to profile the data and mine rules. However, existing discovery techniques have traditionally ignored the time dimension. Recurrent events, such as persons reported in locations, have a duration in which they are valid, and this duration should be part of the rules or the cleaning process would simply fail. In this work, we study the rule discovery problem for temporal web data. Such a discovery process is challenging because of the nature of web data; extracted facts are (i) sparse over time, (ii) reported with delays, and (iii) often reported with errors over the values because of inaccurate sources or non-robust extractors. We handle these challenges with a new discovery approach that is more robust to noise. Our solution uses machine learning methods, such as association measures and outlier detection, for the discovery of the rules, together with an aggressive repair of the data in the mining step itself. Our experimental evaluation over real-world data from Recorded Future, an intelligence company that monitors over 700K Web sources, shows that temporal rules improve the quality of the data with an increase of the average precision in the cleaning process from 0.37 to 0.84, and a 40{\%} relative increase in the average F-measure.",
author = "Ziawasch Abedjan and Akcora, {Cuneyt G.} and Mourad Ouzzani and Paolo Papotti and Michael Stonebraker",
year = "2016",
language = "English",
volume = "9",
pages = "336--347",
booktitle = "Proceedings of the VLDB Endowment",
publisher = "Association for Computing Machinery",
edition = "4",

}

TY - CHAP

T1 - Temporal rules discovery for web data cleaning

AU - Abedjan, Ziawasch

AU - Akcora, Cuneyt G.

AU - Ouzzani, Mourad

AU - Papotti, Paolo

AU - Stonebraker, Michael

PY - 2016

Y1 - 2016

AB - Declarative rules, such as functional dependencies, are widely used for cleaning data. Several systems take them as input for detecting errors and computing a "clean" version of the data. To support domain experts in specifying these rules, several tools have been proposed to profile the data and mine rules. However, existing discovery techniques have traditionally ignored the time dimension. Recurrent events, such as persons reported in locations, have a duration in which they are valid, and this duration should be part of the rules or the cleaning process would simply fail. In this work, we study the rule discovery problem for temporal web data. Such a discovery process is challenging because of the nature of web data; extracted facts are (i) sparse over time, (ii) reported with delays, and (iii) often reported with errors over the values because of inaccurate sources or non-robust extractors. We handle these challenges with a new discovery approach that is more robust to noise. Our solution uses machine learning methods, such as association measures and outlier detection, for the discovery of the rules, together with an aggressive repair of the data in the mining step itself. Our experimental evaluation over real-world data from Recorded Future, an intelligence company that monitors over 700K Web sources, shows that temporal rules improve the quality of the data with an increase of the average precision in the cleaning process from 0.37 to 0.84, and a 40% relative increase in the average F-measure.

UR - http://www.scopus.com/inward/record.url?scp=84976512265&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84976512265&partnerID=8YFLogxK

M3 - Chapter

VL - 9

SP - 336

EP - 347

BT - Proceedings of the VLDB Endowment

PB - Association for Computing Machinery

ER -