Improving fuzzy clustering of biological data by metric learning with side information

Michele Ceccarelli, Antonio Maratea

Research output: Contribution to journal (Article)

19 Citations (Scopus)

Abstract

Semi Supervised methods use a small amount of auxiliary information to guide the learning process in the presence of unlabeled data. In clustering, this auxiliary information takes the form of Side Information, that is, a list of co-clustered points. Recent literature shows that these methods outperform totally unsupervised ones even with a small amount of Side Information. This suggests that Semi Supervised methods may be especially useful in very difficult and noisy tasks where little a priori information is available, as is the case for data deriving from biological experiments. The two most frequently used paradigms for including Side Information in clustering are Constrained Clustering and Metric Learning. In this paper we use a Metric Learning approach to improve classical fuzzy c-means clustering through a two-step procedure: first, a series of metrics (one for each cluster) that satisfy a randomly generated set of constraints is learnt from the data; then a generalized version of fuzzy c-means (with the metrics computed in the previous step) is executed. We show the benefits and the limitations of this method using real-world datasets and a modified version of the Partition Entropy index.
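The second step of the procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-cluster metrics are assumed to be already available as symmetric positive-definite matrices (the learning step that fits them to the constraints is not shown), and the function names, parameters, and the use of NumPy are all hypothetical choices made for this sketch. The validity index shown is the classical Partition Entropy; the abstract mentions a modified version whose definition is not given on this page.

```python
import numpy as np

def fuzzy_c_means_with_metrics(X, metrics, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Fuzzy c-means in which cluster k uses the squared distance
    (x - v_k)^T A_k (x - v_k), with A_k = metrics[k].

    Assumption: each A_k is a symmetric positive-definite matrix that was
    learnt beforehand; passing identity matrices gives ordinary fuzzy c-means.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    c = len(metrics)
    # Random initial membership matrix; each row sums to 1.
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Cluster centres: membership-weighted means of the points.
        V = (W.T @ X) / W.sum(axis=0)[:, None]
        # Squared distance of every point to every centre, one metric per cluster.
        D = np.empty((n, c))
        for k, A in enumerate(metrics):
            diff = X - V[k]
            D[:, k] = np.einsum("ij,jk,ik->i", diff, A, diff)
        D = np.maximum(D, 1e-12)  # avoid division by zero
        # Standard membership update, written for squared distances.
        inv = D ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, V

def partition_entropy(U):
    """Classical Partition Entropy (natural log); lower means a crisper partition."""
    U = np.clip(U, 1e-12, 1.0)
    return float(-(U * np.log(U)).sum() / U.shape[0])
```

With identity matrices for every cluster the routine reduces to ordinary fuzzy c-means, which gives a convenient baseline against which learnt metrics could be compared.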

Original language: English
Pages (from-to): 45-57
Number of pages: 13
Journal: International Journal of Approximate Reasoning
Volume: 47
Issue number: 1
DOIs: 10.1016/j.ijar.2007.03.008
Publication status: Published - 1 Jan 2008
Externally published: Yes

Fingerprint

Side Information
Fuzzy clustering
Clustering algorithms
Entropy
Metric
Auxiliary Information
Experiments
Clustering
Fuzzy C-means Clustering
Fuzzy C-means
Learning Process
Clustering Algorithm
Paradigm
Partition
Series
Learning
Experiment

Keywords

  • Adaptive metric
  • Fuzzy clustering
  • Semi Supervised learning
  • Simulated annealing
  • Validity index

ASJC Scopus subject areas

  • Statistics and Probability
  • Electrical and Electronic Engineering
  • Statistics, Probability and Uncertainty
  • Information Systems and Management
  • Information Systems
  • Computer Science Applications
  • Artificial Intelligence

Cite this

Improving fuzzy clustering of biological data by metric learning with side information. / Ceccarelli, Michele; Maratea, Antonio.

In: International Journal of Approximate Reasoning, Vol. 47, No. 1, 01.01.2008, p. 45-57.

@article{4bbd72dcb5fe4559bc4605df673a7ead,
title = "Improving fuzzy clustering of biological data by metric learning with side information",
abstract = "Semi Supervised methods use a small amount of auxiliary information to guide the learning process in the presence of unlabeled data. In clustering, this auxiliary information takes the form of Side Information, that is, a list of co-clustered points. Recent literature shows that these methods outperform totally unsupervised ones even with a small amount of Side Information. This suggests that Semi Supervised methods may be especially useful in very difficult and noisy tasks where little a priori information is available, as is the case for data deriving from biological experiments. The two most frequently used paradigms for including Side Information in clustering are Constrained Clustering and Metric Learning. In this paper we use a Metric Learning approach to improve classical fuzzy c-means clustering through a two-step procedure: first, a series of metrics (one for each cluster) that satisfy a randomly generated set of constraints is learnt from the data; then a generalized version of fuzzy c-means (with the metrics computed in the previous step) is executed. We show the benefits and the limitations of this method using real-world datasets and a modified version of the Partition Entropy index.",
keywords = "Adaptive metric, Fuzzy clustering, Semi Supervised learning, Simulated annealing, Validity index",
author = "Michele Ceccarelli and Antonio Maratea",
year = "2008",
month = "1",
day = "1",
doi = "10.1016/j.ijar.2007.03.008",
language = "English",
volume = "47",
pages = "45--57",
journal = "International Journal of Approximate Reasoning",
issn = "0888-613X",
publisher = "Elsevier Inc.",
number = "1",
}

TY  - JOUR
T1  - Improving fuzzy clustering of biological data by metric learning with side information
AU  - Ceccarelli, Michele
AU  - Maratea, Antonio
PY  - 2008/1/1
Y1  - 2008/1/1
N2  - Semi Supervised methods use a small amount of auxiliary information to guide the learning process in the presence of unlabeled data. In clustering, this auxiliary information takes the form of Side Information, that is, a list of co-clustered points. Recent literature shows that these methods outperform totally unsupervised ones even with a small amount of Side Information. This suggests that Semi Supervised methods may be especially useful in very difficult and noisy tasks where little a priori information is available, as is the case for data deriving from biological experiments. The two most frequently used paradigms for including Side Information in clustering are Constrained Clustering and Metric Learning. In this paper we use a Metric Learning approach to improve classical fuzzy c-means clustering through a two-step procedure: first, a series of metrics (one for each cluster) that satisfy a randomly generated set of constraints is learnt from the data; then a generalized version of fuzzy c-means (with the metrics computed in the previous step) is executed. We show the benefits and the limitations of this method using real-world datasets and a modified version of the Partition Entropy index.
KW  - Adaptive metric
KW  - Fuzzy clustering
KW  - Semi Supervised learning
KW  - Simulated annealing
KW  - Validity index
UR  - http://www.scopus.com/inward/record.url?scp=36249016903&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=36249016903&partnerID=8YFLogxK
U2  - 10.1016/j.ijar.2007.03.008
DO  - 10.1016/j.ijar.2007.03.008
M3  - Article
VL  - 47
SP  - 45
EP  - 57
JO  - International Journal of Approximate Reasoning
JF  - International Journal of Approximate Reasoning
SN  - 0888-613X
IS  - 1
ER  -