Combining multiple hypothesis testing and affinity propagation clustering leads to accurate, robust and sample size independent classification on gene expression data

Argiris Sakellariou, Despina Sanoudou, George Spyrou

Research output: Contribution to journalArticle

10 Citations (Scopus)

Abstract

Background: A feature selection method in microarray gene expression data should be independent of platform, disease and dataset size. Our hypothesis is that among the statistically significant ranked genes in a gene list, there should be clusters of genes that share similar biological functions related to the investigated disease. Thus, instead of keeping N top ranked genes, it would be more appropriate to define and keep a number of gene cluster exemplars.Results: We propose a hybrid FS method (mAP-KL), which combines multiple hypothesis testing and affinity propagation (AP)-clustering algorithm along with the Krzanowski & Lai cluster quality index, to select a small yet informative subset of genes. We applied mAP-KL on real microarray data, as well as on simulated data, and compared its performance against 13 other feature selection approaches. Across a variety of diseases and number of samples, mAP-KL presents competitive classification results, particularly in neuromuscular diseases, where its overall AUC score was 0.91. Furthermore, mAP-KL generates concise yet biologically relevant and informative N-gene expression signatures, which can serve as a valuable tool for diagnostic and prognostic purposes, as well as a source of potential disease biomarkers in a broad range of diseases.Conclusions: mAP-KL is a data-driven and classifier-independent hybrid feature selection method, which applies to any disease classification problem based on microarray data, regardless of the available samples. Combining multiple hypothesis testing and AP leads to subsets of genes, which classify unknown samples from both, small and large patient cohorts with high accuracy.

Original languageEnglish
Article number270
JournalBMC Bioinformatics
Volume13
Issue number1
DOIs
Publication statusPublished - 17 Oct 2012
Externally publishedYes

Fingerprint

Multiple Hypothesis Testing
Gene Expression Data
Gene expression
Sample Size
Affine transformation
Cluster Analysis
Genes
Clustering
Propagation
Gene
Gene Expression
Testing
Microarrays
Microarray Data
Feature Selection
Feature extraction
Multigene Family
Neuromuscular Diseases
Subset
Transcriptome

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics
  • Structural Biology

Cite this

Combining multiple hypothesis testing and affinity propagation clustering leads to accurate, robust and sample size independent classification on gene expression data. / Sakellariou, Argiris; Sanoudou, Despina; Spyrou, George.

In: BMC Bioinformatics, Vol. 13, No. 1, 270, 17.10.2012.

Research output: Contribution to journalArticle

@article{be81756efac2450091c3ab49b01c4bee,
title = "Combining multiple hypothesis testing and affinity propagation clustering leads to accurate, robust and sample size independent classification on gene expression data",
abstract = "Background: A feature selection method in microarray gene expression data should be independent of platform, disease and dataset size. Our hypothesis is that among the statistically significant ranked genes in a gene list, there should be clusters of genes that share similar biological functions related to the investigated disease. Thus, instead of keeping N top ranked genes, it would be more appropriate to define and keep a number of gene cluster exemplars.Results: We propose a hybrid FS method (mAP-KL), which combines multiple hypothesis testing and affinity propagation (AP)-clustering algorithm along with the Krzanowski & Lai cluster quality index, to select a small yet informative subset of genes. We applied mAP-KL on real microarray data, as well as on simulated data, and compared its performance against 13 other feature selection approaches. Across a variety of diseases and number of samples, mAP-KL presents competitive classification results, particularly in neuromuscular diseases, where its overall AUC score was 0.91. Furthermore, mAP-KL generates concise yet biologically relevant and informative N-gene expression signatures, which can serve as a valuable tool for diagnostic and prognostic purposes, as well as a source of potential disease biomarkers in a broad range of diseases.Conclusions: mAP-KL is a data-driven and classifier-independent hybrid feature selection method, which applies to any disease classification problem based on microarray data, regardless of the available samples. Combining multiple hypothesis testing and AP leads to subsets of genes, which classify unknown samples from both, small and large patient cohorts with high accuracy.",
author = "Argiris Sakellariou and Despina Sanoudou and George Spyrou",
year = "2012",
month = "10",
day = "17",
doi = "10.1186/1471-2105-13-270",
language = "English",
volume = "13",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - Combining multiple hypothesis testing and affinity propagation clustering leads to accurate, robust and sample size independent classification on gene expression data

AU - Sakellariou, Argiris

AU - Sanoudou, Despina

AU - Spyrou, George

PY - 2012/10/17

Y1 - 2012/10/17

N2 - Background: A feature selection method in microarray gene expression data should be independent of platform, disease and dataset size. Our hypothesis is that among the statistically significant ranked genes in a gene list, there should be clusters of genes that share similar biological functions related to the investigated disease. Thus, instead of keeping N top ranked genes, it would be more appropriate to define and keep a number of gene cluster exemplars.Results: We propose a hybrid FS method (mAP-KL), which combines multiple hypothesis testing and affinity propagation (AP)-clustering algorithm along with the Krzanowski & Lai cluster quality index, to select a small yet informative subset of genes. We applied mAP-KL on real microarray data, as well as on simulated data, and compared its performance against 13 other feature selection approaches. Across a variety of diseases and number of samples, mAP-KL presents competitive classification results, particularly in neuromuscular diseases, where its overall AUC score was 0.91. Furthermore, mAP-KL generates concise yet biologically relevant and informative N-gene expression signatures, which can serve as a valuable tool for diagnostic and prognostic purposes, as well as a source of potential disease biomarkers in a broad range of diseases.Conclusions: mAP-KL is a data-driven and classifier-independent hybrid feature selection method, which applies to any disease classification problem based on microarray data, regardless of the available samples. Combining multiple hypothesis testing and AP leads to subsets of genes, which classify unknown samples from both, small and large patient cohorts with high accuracy.

AB - Background: A feature selection method in microarray gene expression data should be independent of platform, disease and dataset size. Our hypothesis is that among the statistically significant ranked genes in a gene list, there should be clusters of genes that share similar biological functions related to the investigated disease. Thus, instead of keeping N top ranked genes, it would be more appropriate to define and keep a number of gene cluster exemplars.Results: We propose a hybrid FS method (mAP-KL), which combines multiple hypothesis testing and affinity propagation (AP)-clustering algorithm along with the Krzanowski & Lai cluster quality index, to select a small yet informative subset of genes. We applied mAP-KL on real microarray data, as well as on simulated data, and compared its performance against 13 other feature selection approaches. Across a variety of diseases and number of samples, mAP-KL presents competitive classification results, particularly in neuromuscular diseases, where its overall AUC score was 0.91. Furthermore, mAP-KL generates concise yet biologically relevant and informative N-gene expression signatures, which can serve as a valuable tool for diagnostic and prognostic purposes, as well as a source of potential disease biomarkers in a broad range of diseases.Conclusions: mAP-KL is a data-driven and classifier-independent hybrid feature selection method, which applies to any disease classification problem based on microarray data, regardless of the available samples. Combining multiple hypothesis testing and AP leads to subsets of genes, which classify unknown samples from both, small and large patient cohorts with high accuracy.

UR - http://www.scopus.com/inward/record.url?scp=84867437124&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84867437124&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-13-270

DO - 10.1186/1471-2105-13-270

M3 - Article

VL - 13

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - 1

M1 - 270

ER -