Data integration via constrained clustering: An application to enzyme clustering

Elisa Boari De Lima, Raquel Cardoso De Melo Minardi, Mohammed Javeed Zaki, Wagner Meira

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

When multiple data sources are available for clustering, an a priori data integration process is usually required. This process may be costly and may not lead to good clusterings, since important information is likely to be discarded. In this paper we propose constrained clustering as a strategy for integrating data sources without losing any information. It basically consists of adding the complementary data sources as constraints that the algorithm must satisfy. As a concrete application of our approach, we focus on the problem of enzyme function prediction, which is a hard task usually performed by intensive experimental work. We use constrained clustering as a means of integrating information from diverse sources as constraints, and analyze how this additional information impacts clustering quality in an enzyme clustering application scenario. Our results show that constraints generally improve the clustering quality when compared to an unconstrained clustering algorithm.
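The abstract does not say which constrained clustering algorithm the paper uses, so the following is only a minimal sketch of the general idea, written in the style of COP-KMeans. It assumes that a complementary data source has already been turned into pairwise must-link and cannot-link constraints over point indices. The names `cop_kmeans`, `violates`, `must_link`, `cannot_link`, and `init`, as well as the toy points, are hypothetical and chosen for illustration.

```python
import math


def violates(i, c, assign, must_link, cannot_link):
    """Return True if placing point i in cluster c breaks a constraint
    against a point that has already been assigned in this pass."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        if other is not None and assign[other] not in (None, c):
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        if other is not None and assign[other] == c:
            return True
    return False


def cop_kmeans(points, k, must_link=(), cannot_link=(), init=None, iters=20):
    """k-means in which every assignment must respect the given constraints.

    init: indices of the points used as initial centers (assumption: the
    first k points when omitted).
    """
    seeds = init if init is not None else range(k)
    centers = [list(points[i]) for i in seeds]
    assign = [None] * len(points)
    for _ in range(iters):
        new_assign = [None] * len(points)
        for i, p in enumerate(points):
            # Try clusters from nearest to farthest; take the first feasible one.
            order = sorted(range(k), key=lambda c: math.dist(p, centers[c]))
            for c in order:
                if not violates(i, c, new_assign, must_link, cannot_link):
                    new_assign[i] = c
                    break
            else:
                raise ValueError(f"no feasible cluster for point {i}")
        if new_assign == assign:
            break  # assignments are stable
        assign = new_assign
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centers[c] = [sum(x) / len(members) for x in zip(*members)]
    return assign


# Toy data: two tight groups plus one ambiguous point (index 4).
points = [(0, 0), (0, 1), (5, 0), (5, 1), (2.4, 0.5)]

unconstrained = cop_kmeans(points, 2, init=[0, 2])
# Hypothetical second source says point 4 belongs with point 2.
constrained = cop_kmeans(points, 2, must_link=[(4, 2)], init=[0, 2])
```

In this toy setup the feature space alone places the ambiguous point with the first group, while the must-link constraint moves it to the second. The primary data source is left untouched and the complementary source enters only through the constraints, which is the integration strategy the abstract describes.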

Original language: English
Title of host publication: Proceedings of the 11th SIAM International Conference on Data Mining, SDM 2011
Pages: 83-94
Number of pages: 12
Publication status: Published - 1 Dec 2011
Externally published: Yes
Event: 11th SIAM International Conference on Data Mining, SDM 2011 - Mesa, AZ, United States
Duration: 28 Apr 2011 to 30 Apr 2011

Other

Other: 11th SIAM International Conference on Data Mining, SDM 2011
Country: United States
City: Mesa, AZ
Period: 28/4/11 to 30/4/11

Keywords

  • Constrained clustering
  • Data integration
  • Enzyme clustering

ASJC Scopus subject areas

  • Software

Cite this

De Lima, E. B., De Melo Minardi, R. C., Zaki, M. J., & Meira, W. (2011). Data integration via constrained clustering: An application to enzyme clustering. In Proceedings of the 11th SIAM International Conference on Data Mining, SDM 2011 (pp. 83-94).