Principled data preprocessing: Application to biological aquatic indicators of water pollution

Eva C. Serrano Balderas, Laure Berti-Equille, Ma Aurora Armienta Hernández, Corinne Grac

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

In many biological studies, statistical and data mining methods are used extensively to analyze data and discover actionable knowledge. However, poor data quality can produce incorrect analysis results and wrong interpretations, which in turn lead to misleading conclusions and inadequate decisions. To ensure the validity of the results and to avoid bias and data misuse, it is necessary to control not only the whole analytical pipeline but, most importantly, the quality of the data through appropriate preprocessing choices. Because different preprocessing techniques and alternative strategies can lead to dramatically different outputs, it is crucial to rely on a principled and rigorous method for selecting the optimal set of preprocessing steps, one that depends both on the distributional characteristics of the input data and on the inherent characteristics of the targeted statistical or data mining methods. In this paper, we propose a method that, given a dataset, selects the optimal set of preprocessing tasks to apply so that the preprocessed output maximizes the quality of the analytical results for various clustering, regression, and classification techniques. We present promising results that validate our approach on biomonitoring data preparation.
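The selection problem described in the abstract can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not detail: it assumes a plain exhaustive search over ordered subsets of candidate steps, and every name in it (the three preprocessing functions, `select_pipeline`, `toy_quality`, the sample values, and the reference mean) is hypothetical. The quality function is a stand-in for the quality of a downstream clustering, regression, or classification result.

```python
from itertools import permutations
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def drop_missing(values):
    """Discard missing entries (None)."""
    return [v for v in values if v is not None]


def clip_outliers(values, k=1.0):
    """Clip observed values to mean +/- k population standard deviations."""
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), pstdev(observed)
    lo, hi = mu - k * sigma, mu + k * sigma
    return [v if v is None else min(max(v, lo), hi) for v in values]


# Hypothetical candidate preprocessing steps, keyed by name.
CANDIDATE_STEPS = {
    "impute_median": impute_median,
    "drop_missing": drop_missing,
    "clip_outliers": clip_outliers,
}


def apply_pipeline(values, steps, names):
    """Apply the named steps in order and return the transformed data."""
    data = list(values)
    for name in names:
        data = steps[name](data)
    return data


def select_pipeline(values, steps, quality):
    """Exhaustively score every ordered subset of steps; return the best one."""
    best_names, best_score = (), float("-inf")
    for r in range(len(steps) + 1):
        for names in permutations(steps, r):
            score = quality(apply_pipeline(values, steps, names))
            if score > best_score:
                best_names, best_score = names, score
    return best_names, best_score


REFERENCE_MEAN = 2.0  # assumed "true" value, used only by the toy quality score


def toy_quality(data):
    """Stand-in quality score: higher is better; unusable data scores -inf."""
    if not data or any(v is None for v in data):
        return float("-inf")
    return -abs(mean(data) - REFERENCE_MEAN)


# Made-up measurements with one missing value and one extreme value.
sample = [1.0, 2.0, None, 3.0, 100.0]
best_names, best_score = select_pipeline(sample, CANDIDATE_STEPS, toy_quality)
print(best_names, best_score)
```

Exhaustive enumeration is only practical for a handful of candidate steps, since the number of ordered subsets grows factorially. In a realistic setting the score would come from the targeted analysis itself, for example a cross-validated error or a cluster validity index.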

Original language: English
Title of host publication: Proceedings - 28th International Workshop on Database and Expert Systems Applications, DEXA 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 52-56
Number of pages: 5
Volume: 2017-August
ISBN (Electronic): 9781538610510
DOIs: https://doi.org/10.1109/DEXA.2017.27
Publication status: Published - 25 Sep 2017
Event: 28th International Workshop on Database and Expert Systems Applications, DEXA 2017 - Lyon, France
Duration: 28 Aug 2017 to 31 Aug 2017

Other

Other: 28th International Workshop on Database and Expert Systems Applications, DEXA 2017
Country: France
City: Lyon
Period: 28/8/17 to 31/8/17

Fingerprint

Water pollution
Data mining
Pipelines

Keywords

  • Biological data preprocessing
  • Biomonitoring data
  • Data cleaning

ASJC Scopus subject areas

  • Engineering (all)

Cite this

Balderas, E. C. S., Berti-Equille, L., Hernández, M. A. A., & Grac, C. (2017). Principled data preprocessing: Application to biological aquatic indicators of water pollution. In Proceedings - 28th International Workshop on Database and Expert Systems Applications, DEXA 2017 (Vol. 2017-August, pp. 52-56). [8049685] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/DEXA.2017.27
