Learn2Clean: Optimizing the sequence of tasks for web data preparation

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

Data cleaning and preparation have been a long-standing challenge in data science, needed to avoid incorrect results and misleading conclusions obtained from dirty data. For a given dataset and a given machine learning-based task, a plethora of data preprocessing techniques and alternative data curation strategies may lead to dramatically different outputs with unequal quality performance. Most current work on data cleaning and automated machine learning, however, focuses on developing either cleaning algorithms or user-guided systems, or argues for relying on a principled method to select the sequence of data preprocessing steps that can lead to the optimal quality performance. In this paper, we propose Learn2Clean, a method based on Q-Learning, a model-free reinforcement learning technique, that selects, for a given dataset, an ML model, and a quality performance metric, the optimal sequence of tasks for preprocessing the data such that the quality of the ML model result is maximized. As a preliminary validation of our approach in the context of Web data analytics, we present some promising results on data preparation for clustering, regression, and classification on real-world data.
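
To make the idea in the abstract concrete, the following is a minimal sketch of tabular Q-Learning choosing an ordered sequence of preprocessing steps. It is not the paper's implementation. The step names, the `quality` function (a toy stand-in for "apply the pipeline, fit the ML model, read the metric"), the state encoding (the tuple of steps applied so far), and all hyperparameters are assumptions made purely for illustration.

```python
import random
from collections import defaultdict

# Hypothetical preprocessing steps; a real system would wrap actual cleaners.
STEPS = ["impute", "deduplicate", "normalize", "remove_outliers"]
STOP = "fit_model"   # terminal action: train the model and observe the metric
MAX_STEPS = 3        # assumed cap on pipeline length


def quality(pipeline):
    """Toy stand-in for the quality metric of the model after `pipeline`."""
    score = 0.50
    if "impute" in pipeline:
        score += 0.20
    if "normalize" in pipeline:
        # In this toy, order matters: normalizing only helps after imputing.
        if "impute" in pipeline and pipeline.index("impute") < pipeline.index("normalize"):
            score += 0.15
        else:
            score -= 0.05
    return score - 0.05 * len(pipeline)  # small cost per preprocessing step


def actions(state):
    """Actions available in `state`: any unused step, or stop and fit."""
    if len(state) >= MAX_STEPS:
        return [STOP]
    return [s for s in STEPS if s not in state] + [STOP]


def learn(episodes=20000, alpha=0.5, gamma=1.0, epsilon=0.5, seed=0):
    """Tabular Q-Learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = defaultdict(float)  # maps (state, action) to estimated value
    for _ in range(episodes):
        state = ()
        while True:
            acts = actions(state)
            if rng.random() < epsilon:
                action = rng.choice(acts)
            else:
                action = max(acts, key=lambda a: q[(state, a)])
            if action == STOP:
                # Terminal transition: the reward is the observed quality.
                q[(state, action)] += alpha * (quality(state) - q[(state, action)])
                break
            nxt = state + (action,)
            # Intermediate steps earn no reward; bootstrap from the next state.
            target = gamma * max(q[(nxt, a)] for a in actions(nxt))
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = nxt
    return q


def best_pipeline(q):
    """Follow the greedy policy from the empty pipeline until STOP."""
    state = ()
    while True:
        action = max(actions(state), key=lambda a: q[(state, a)])
        if action == STOP:
            return list(state)
        state = state + (action,)


if __name__ == "__main__":
    print(best_pipeline(learn()))
```

In this sketch the reward arrives only when the agent stops and "fits the model", so the learned values propagate backwards from complete pipelines to their prefixes. A high exploration rate is used because the toy state space is tiny; a realistic setting, where each reward requires actually training a model, would need a more careful exploration budget.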

Original language: English
Title of host publication: The Web Conference 2019 - Proceedings of the World Wide Web Conference, WWW 2019
Publisher: Association for Computing Machinery, Inc
Pages: 2580-2586
Number of pages: 7
ISBN (Electronic): 9781450366748
DOIs: https://doi.org/10.1145/3308558.3313602
Publication status: Published - 13 May 2019
Event: 2019 World Wide Web Conference, WWW 2019 - San Francisco, United States
Duration: 13 May 2019 → 17 May 2019

Publication series

Name: The Web Conference 2019 - Proceedings of the World Wide Web Conference, WWW 2019

Conference

Conference: 2019 World Wide Web Conference, WWW 2019
Country: United States
City: San Francisco
Period: 13/5/19 → 17/5/19

Fingerprint

Cleaning
Learning systems
Reinforcement learning

Keywords

  • Data cleaning
  • Principled data preprocessing
  • Q-Learning
  • Reinforcement learning

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software

Cite this

Berti-Equille, L. (2019). Learn2Clean: Optimizing the sequence of tasks for web data preparation. In The Web Conference 2019 - Proceedings of the World Wide Web Conference, WWW 2019 (pp. 2580-2586). (The Web Conference 2019 - Proceedings of the World Wide Web Conference, WWW 2019). Association for Computing Machinery, Inc. https://doi.org/10.1145/3308558.3313602

@inproceedings{755e34057a3e4eb7813f9686a1fc08bb,
title = "Learn2Clean: Optimizing the sequence of tasks for web data preparation",
keywords = "Data cleaning, Principled data preprocessing, Q-Learning, Reinforcement learning",
author = "Laure Berti-Equille",
year = "2019",
month = "5",
day = "13",
doi = "10.1145/3308558.3313602",
language = "English",
series = "The Web Conference 2019 - Proceedings of the World Wide Web Conference, WWW 2019",
publisher = "Association for Computing Machinery, Inc",
pages = "2580--2586",
booktitle = "The Web Conference 2019 - Proceedings of the World Wide Web Conference, WWW 2019",

}

TY - GEN

T1 - Learn2Clean

T2 - Optimizing the sequence of tasks for web data preparation

AU - Berti-Equille, Laure

PY - 2019/5/13

Y1 - 2019/5/13

KW - Data cleaning

KW - Principled data preprocessing

KW - Q-Learning

KW - Reinforcement learning

UR - http://www.scopus.com/inward/record.url?scp=85066883046&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85066883046&partnerID=8YFLogxK

U2 - 10.1145/3308558.3313602

DO - 10.1145/3308558.3313602

M3 - Conference contribution

AN - SCOPUS:85066883046

T3 - The Web Conference 2019 - Proceedings of the World Wide Web Conference, WWW 2019

SP - 2580

EP - 2586

BT - The Web Conference 2019 - Proceedings of the World Wide Web Conference, WWW 2019

PB - Association for Computing Machinery, Inc

ER -