Nearly homogeneous multi-partitioning with a deterministic generator

Research output: Contribution to journal › Article

10 Citations (Scopus)

Abstract

The need for homogeneous partitions, where all parts have the same distribution, is ubiquitous in machine learning and in other fields of scientific study, especially when only a few partitions can be generated. In that case, validation sets need to be distributed the same way as training sets to get good estimates of models' complexities. And when standard data analysis tools cannot handle very large data sets, the analysis can be performed on a smaller subset, as long as its homogeneity with the larger set is good enough to yield relevant results. However, pseudo-random generators may produce partitions whose parts have very different distributions because the geometry of the data is ignored. In this work, we propose an algorithm that deterministically generates partitions whose parts have empirically greater homogeneity on average than parts arising from pseudo-random partitions. The data to partition are seriated based on a nearest-neighbor rule and assigned to a part of the partition according to their rank in this seriation. We demonstrate the efficiency of this algorithm on toy and real data sets. Since this algorithm is deterministic, it also provides a way to make reproducible machine learning experiments that are usually based on pseudo-random partitions.
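The idea described in the abstract can be sketched as follows: greedily seriate the points by a nearest-neighbor rule, then deal them out to the parts round-robin by rank, so neighboring points land in different parts and each part covers the data geometry. This is a minimal illustration, not the paper's implementation; the function names are ours, and the paper's exact seriation and assignment rules may differ.

```python
import numpy as np

def nn_seriate(X):
    """Greedy nearest-neighbor seriation: start from the first point and
    repeatedly visit the nearest not-yet-visited point."""
    n = len(X)
    order = [0]
    remaining = set(range(1, n))
    while remaining:
        last = X[order[-1]]
        # ties are broken by index, so the result is deterministic
        nxt = min(remaining, key=lambda i: float(np.linalg.norm(X[i] - last)))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def deterministic_partition(X, k):
    """Split the data set into k parts by assigning each point to part
    (rank mod k), where rank is its position in the seriation."""
    order = nn_seriate(np.asarray(X, dtype=float))
    parts = [[] for _ in range(k)]
    for rank, idx in enumerate(order):
        parts[rank % k].append(idx)
    return parts

# On six points along a line, the seriation is 0..5 and the two parts
# interleave, so both parts span the whole range of the data.
parts = deterministic_partition([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]], 2)
print(parts)  # [[0, 2, 4], [1, 3, 5]]
```

Because every step is deterministic, rerunning the procedure on the same data always yields the same partition, which is the reproducibility property the abstract emphasizes.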

Original language: English
Pages (from-to): 1379-1389
Number of pages: 11
Journal: Neurocomputing
Volume: 72
Issue number: 7-9
DOI: 10.1016/j.neucom.2008.12.024
Publication status: Published - Mar 2009
Externally published: Yes


Keywords

  • Deterministic sampling
  • Distribution
  • Divergence
  • Homogeneity measure
  • Homogeneous partition
  • Multi-partition
  • Nearest neighbor
  • Random sampling
  • Reproducibility
  • Seriation

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Cognitive Neuroscience

Cite this

Nearly homogeneous multi-partitioning with a deterministic generator. / Aupetit, Michael.

In: Neurocomputing, Vol. 72, No. 7-9, 03.2009, p. 1379-1389.

Research output: Contribution to journal › Article

@article{0868888bfc2548338fa97024e442eb9f,
title = "Nearly homogeneous multi-partitioning with a deterministic generator",
abstract = "The need for homogeneous partitions, where all parts have the same distribution, is ubiquitous in machine learning and in other fields of scientific study, especially when only a few partitions can be generated. In that case, validation sets need to be distributed the same way as training sets to get good estimates of models' complexities. And when standard data analysis tools cannot handle very large data sets, the analysis can be performed on a smaller subset, as long as its homogeneity with the larger set is good enough to yield relevant results. However, pseudo-random generators may produce partitions whose parts have very different distributions because the geometry of the data is ignored. In this work, we propose an algorithm that deterministically generates partitions whose parts have empirically greater homogeneity on average than parts arising from pseudo-random partitions. The data to partition are seriated based on a nearest-neighbor rule and assigned to a part of the partition according to their rank in this seriation. We demonstrate the efficiency of this algorithm on toy and real data sets. Since this algorithm is deterministic, it also provides a way to make reproducible machine learning experiments that are usually based on pseudo-random partitions.",
keywords = "Deterministic sampling, Distribution, Divergence, Homogeneity measure, Homogeneous partition, Multi-partition, Nearest neighbor, Random sampling, Reproducibility, Seriation",
author = "Michael Aupetit",
year = "2009",
month = "3",
doi = "10.1016/j.neucom.2008.12.024",
language = "English",
volume = "72",
pages = "1379--1389",
journal = "Neurocomputing",
issn = "0925-2312",
publisher = "Elsevier",
number = "7-9",
}

TY - JOUR

T1 - Nearly homogeneous multi-partitioning with a deterministic generator

AU - Aupetit, Michael

PY - 2009/3

Y1 - 2009/3

N2 - The need for homogeneous partitions, where all parts have the same distribution, is ubiquitous in machine learning and in other fields of scientific study, especially when only a few partitions can be generated. In that case, validation sets need to be distributed the same way as training sets to get good estimates of models' complexities. And when standard data analysis tools cannot handle very large data sets, the analysis can be performed on a smaller subset, as long as its homogeneity with the larger set is good enough to yield relevant results. However, pseudo-random generators may produce partitions whose parts have very different distributions because the geometry of the data is ignored. In this work, we propose an algorithm that deterministically generates partitions whose parts have empirically greater homogeneity on average than parts arising from pseudo-random partitions. The data to partition are seriated based on a nearest-neighbor rule and assigned to a part of the partition according to their rank in this seriation. We demonstrate the efficiency of this algorithm on toy and real data sets. Since this algorithm is deterministic, it also provides a way to make reproducible machine learning experiments that are usually based on pseudo-random partitions.

AB - The need for homogeneous partitions, where all parts have the same distribution, is ubiquitous in machine learning and in other fields of scientific study, especially when only a few partitions can be generated. In that case, validation sets need to be distributed the same way as training sets to get good estimates of models' complexities. And when standard data analysis tools cannot handle very large data sets, the analysis can be performed on a smaller subset, as long as its homogeneity with the larger set is good enough to yield relevant results. However, pseudo-random generators may produce partitions whose parts have very different distributions because the geometry of the data is ignored. In this work, we propose an algorithm that deterministically generates partitions whose parts have empirically greater homogeneity on average than parts arising from pseudo-random partitions. The data to partition are seriated based on a nearest-neighbor rule and assigned to a part of the partition according to their rank in this seriation. We demonstrate the efficiency of this algorithm on toy and real data sets. Since this algorithm is deterministic, it also provides a way to make reproducible machine learning experiments that are usually based on pseudo-random partitions.

KW - Deterministic sampling

KW - Distribution

KW - Divergence

KW - Homogeneity measure

KW - Homogeneous partition

KW - Multi-partition

KW - Nearest neighbor

KW - Random sampling

KW - Reproducibility

KW - Seriation

UR - http://www.scopus.com/inward/record.url?scp=61849142212&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=61849142212&partnerID=8YFLogxK

U2 - 10.1016/j.neucom.2008.12.024

DO - 10.1016/j.neucom.2008.12.024

M3 - Article

VL - 72

SP - 1379

EP - 1389

JO - Neurocomputing

JF - Neurocomputing

SN - 0925-2312

IS - 7-9

ER -