A scale space approach for unsupervised feature selection in mass spectra classification for ovarian cancer detection

Michele Ceccarelli, Antonio d'Acierno, Angelo Facchiano

Research output: Contribution to journal › Article

6 Citations (Scopus)

Abstract

Background: Mass spectrometry spectra, widely used in proteomics studies as a screening tool for protein profiling and for detecting discriminatory signals, are high-dimensional data. A large number of local maxima (peaks) must be analyzed as part of computational pipelines aimed at efficient predictive and screening protocols. With data of this dimensionality and such sample sizes, the risk of over-fitting and selection bias is pervasive. The development of bioinformatics methods based on unsupervised feature extraction can therefore lead to general tools applicable to several fields of predictive proteomics.

Results: We propose a method for feature selection and extraction, grounded in the theory of multi-scale spaces, for high-resolution spectra derived from the analysis of serum. We then use support vector machines for classification. In particular, we use a database containing 216 sample spectra, divided into 115 cancer and 91 control samples. The overall accuracy, averaged over a large cross-validation study, is 98.18%. The area under the ROC curve of the best selected model is 0.9962.

Conclusion: We improved on previously known results for the same problem and data, with the advantage that the proposed method has an unsupervised feature selection phase. All the developed code, as MATLAB scripts, can be downloaded from http://medeaserver.isa.cnr.it/dacierno/spectracode.htm.
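To make the idea of unsupervised, scale-space peak selection more concrete, the following is a minimal Python sketch. It is not the authors' MATLAB code and does not reproduce their method: it only illustrates one simple reading of the idea, namely smoothing a spectrum with Gaussians of increasing width and keeping the local maxima that survive at every scale. The function names, the scale values, the tolerance `tol`, and the synthetic spectrum are all assumptions made for illustration. No class labels are used, and the support vector machine step is not shown.

```python
import math


def gaussian_kernel(sigma):
    """Discrete Gaussian truncated at 3*sigma and normalised to sum to 1."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma):
    """Convolve the signal with a Gaussian, clamping indices at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - radius, 0), n - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out


def local_maxima(signal):
    """Indices of interior samples that rise from the left and do not rise to the right."""
    return [i for i in range(1, len(signal) - 1)
            if signal[i - 1] < signal[i] >= signal[i + 1]]


def persistent_peaks(signal, sigmas=(1.0, 2.0, 4.0), tol=3):
    """Fine-scale peaks that have a maximum within `tol` samples at every coarser scale.

    `sigmas` and `tol` are illustrative choices, not values from the paper.
    """
    maxima_per_scale = [local_maxima(smooth(signal, s)) for s in sigmas]
    fine = maxima_per_scale[0]
    return [p for p in fine
            if all(any(abs(p - q) <= tol for q in coarser)
                   for coarser in maxima_per_scale[1:])]


def demo_spectrum(n=130):
    """Synthetic spectrum (hypothetical): two broad peaks plus one narrow spike."""
    spectrum = [math.exp(-0.5 * ((i - 40) / 8.0) ** 2)
                + 0.7 * math.exp(-0.5 * ((i - 90) / 8.0) ** 2)
                for i in range(n)]
    spectrum[50] += 0.8  # single-sample spike on the flank of the first peak
    return spectrum


if __name__ == "__main__":
    spectrum = demo_spectrum()
    print("maxima at the finest scale:", local_maxima(smooth(spectrum, 1.0)))
    print("maxima kept across scales: ", persistent_peaks(spectrum))
```

In this toy example the narrow spike is a local maximum at the finest scale but merges into the flank of the broad peak once the smoothing widens, so it is not kept. The positions that remain could then serve as features for a classifier of the reader's choice.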

Original language: English
Article number: 1471
Journal: BMC Bioinformatics
Volume: 10
Issue number: SUPPL. 12
Publication status: Published - 15 Oct 2009
Externally published: Yes

Fingerprint

Ovarian Cancer
Scale Space
Ovarian Neoplasms
Feature Selection
Feature extraction
Proteomics
Screening
Selection Bias
Validation Studies
Overfitting
Receiver Operating Characteristic Curve
Mass Spectrometry
High-dimensional Data
Profiling
Computational Biology
Cross-validation
ROC Curve
Sample Size
Area Under Curve

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Structural Biology
  • Applied Mathematics

Cite this

A scale space approach for unsupervised feature selection in mass spectra classification for ovarian cancer detection. / Ceccarelli, Michele; d'Acierno, Antonio; Facchiano, Angelo.

In: BMC Bioinformatics, Vol. 10, No. SUPPL. 12, 1471, 15.10.2009.

@article{4dda3562155d4547a1f5aa72ef425cd0,
title = "A scale space approach for unsupervised feature selection in mass spectra classification for ovarian cancer detection",
abstract = "Background: Mass spectrometry spectra, widely used in proteomics studies as a screening tool for protein profiling and to detect discriminatory signals, are high dimensional data. A large number of local maxima (a.k.a. peaks) have to be analyzed as part of computational pipelines aimed at the realization of efficient predictive and screening protocols. With this kind of data dimensions and samples size the risk of over-fitting and selection bias is pervasive. Therefore the development of bio-informatics methods based on unsupervised feature extraction can lead to general tools which can be applied to several fields of predictive proteomics. Results: We propose a method for feature selection and extraction grounded on the theory of multi-scale spaces for high resolution spectra derived from analysis of serum. Then we use support vector machines for classification. In particular we use a database containing 216 samples spectra divided in 115 cancer and 91 control samples. The overall accuracy averaged over a large cross validation study is 98.18. The area under the ROC curve of the best selected model is 0.9962. Conclusion: We improved previous known results on the problem on the same data, with the advantage that the proposed method has an unsupervised feature selection phase. All the developed code, as MATLAB scripts, can be downloaded from http://medeaserver.isa.cnr.it/dacierno/spectracode.htm.",
author = "Michele Ceccarelli and Antonio d'Acierno and Angelo Facchiano",
year = "2009",
month = "10",
day = "15",
language = "English",
volume = "10",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "SUPPL. 12",
}

TY - JOUR

T1 - A scale space approach for unsupervised feature selection in mass spectra classification for ovarian cancer detection

AU - Ceccarelli, Michele

AU - d'Acierno, Antonio

AU - Facchiano, Angelo

PY - 2009/10/15

Y1 - 2009/10/15

AB - Background: Mass spectrometry spectra, widely used in proteomics studies as a screening tool for protein profiling and to detect discriminatory signals, are high dimensional data. A large number of local maxima (a.k.a. peaks) have to be analyzed as part of computational pipelines aimed at the realization of efficient predictive and screening protocols. With this kind of data dimensions and samples size the risk of over-fitting and selection bias is pervasive. Therefore the development of bio-informatics methods based on unsupervised feature extraction can lead to general tools which can be applied to several fields of predictive proteomics. Results: We propose a method for feature selection and extraction grounded on the theory of multi-scale spaces for high resolution spectra derived from analysis of serum. Then we use support vector machines for classification. In particular we use a database containing 216 samples spectra divided in 115 cancer and 91 control samples. The overall accuracy averaged over a large cross validation study is 98.18. The area under the ROC curve of the best selected model is 0.9962. Conclusion: We improved previous known results on the problem on the same data, with the advantage that the proposed method has an unsupervised feature selection phase. All the developed code, as MATLAB scripts, can be downloaded from http://medeaserver.isa.cnr.it/dacierno/spectracode.htm.

UR - http://www.scopus.com/inward/record.url?scp=70449504251&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70449504251&partnerID=8YFLogxK

M3 - Article

VL - 10

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - SUPPL. 12

M1 - 1471

ER -