Efficient feature selection and multiclass classification with integrated instance and model based learning

Zhenqiu Liu, Halima Bensmail, Ming Tan

Research output: Chapter in Book/Report/Conference proceeding › Chapter

6 Citations (Scopus)

Abstract

Multiclass classification and feature (variable) selection are commonly encountered in biological and medical applications. However, extending binary classification approaches to multiclass problems is not trivial. Instance-based methods such as k-nearest neighbors (KNN) extend naturally to multiclass problems and usually perform well with unbalanced data, but they suffer from the curse of dimensionality: their performance degrades on high-dimensional data. Model-based methods such as logistic regression, on the other hand, require decomposing the multiclass problem into several binary problems with one-vs.-one or one-vs.-rest schemes. Although they can be applied to high-dimensional data with L1 or Lp penalized methods, such approaches can only select independent features, and the features selected for different binary problems usually differ. The one-vs.-rest scheme also produces unbalanced binary classification problems even if the original multiclass problem is balanced. By combining instance-based and model-based learning, we propose an efficient learning method that integrates KNN and constrained logistic regression (KNNLog) for simultaneous multiclass classification and feature selection. The proposed method simultaneously minimizes the intra-class distance and maximizes the inter-class distance with fewer estimated parameters. It is very efficient for problems with small sample sizes and unbalanced classes, a case common in many real applications. In addition, our model-based feature selection method can identify highly correlated features simultaneously, avoiding the multiplicity problem due to multiple tests. The proposed method is evaluated on simulated and real data, including one unbalanced microRNA dataset for leukemia and one multiclass metagenomic dataset from the Human Microbiome Project (HMP). It performs well in these limited computational experiments.
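
The abstract's remark that one-vs.-rest decomposition yields unbalanced binary problems even when the multiclass problem is balanced can be checked by simple counting. The following is a minimal Python sketch of that arithmetic only, not the authors' KNNLog method; the helper names and the 4-class dataset with 25 samples per class are hypothetical.

```python
from collections import Counter


def one_vs_rest_balance(labels):
    """Return {class: (positives, negatives)} for each one-vs.-rest binary problem."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: (n, total - n) for cls, n in counts.items()}


def n_binary_problems(n_classes):
    """Number of binary subproblems required by each decomposition scheme."""
    return {
        "one_vs_rest": n_classes,
        "one_vs_one": n_classes * (n_classes - 1) // 2,
    }


# Hypothetical balanced dataset: 4 classes, 25 samples each.
labels = [c for c in "ABCD" for _ in range(25)]
balance = one_vs_rest_balance(labels)

print(balance["A"])          # (25, 75): a 1:3 imbalance from balanced data
print(n_binary_problems(4))  # {'one_vs_rest': 4, 'one_vs_one': 6}
```

With K balanced classes, every one-vs.-rest problem has a 1:(K-1) class ratio, so the imbalance grows with the number of classes, while one-vs.-one keeps each problem balanced at the cost of K(K-1)/2 separate fits.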

Original language: English
Title of host publication: Evolutionary Bioinformatics
Pages: 197-205
Number of pages: 9
Volume: 2012
Edition: 8
DOI: 10.4137/EBO.S9407
Publication status: Published - 15 May 2012

Keywords

  • Feature selection
  • High-dimensional data
  • Multiclass classification
  • Statistical learning

ASJC Scopus subject areas

  • Ecology, Evolution, Behavior and Systematics
  • Computer Science Applications
  • Genetics

Cite this

Efficient feature selection and multiclass classification with integrated instance and model based learning. / Liu, Zhenqiu; Bensmail, Halima; Tan, Ming.

Evolutionary Bioinformatics. Vol. 2012 8. ed. 2012. p. 197-205.

Liu, Zhenqiu ; Bensmail, Halima ; Tan, Ming. / Efficient feature selection and multiclass classification with integrated instance and model based learning. Evolutionary Bioinformatics. Vol. 2012 8. ed. 2012. pp. 197-205.
@inbook{04e227c9a58b4b9fa4fda3d9e6f2eafa,
title = "Efficient feature selection and multiclass classification with integrated instance and model based learning",
keywords = "Feature selection, High-dimensional data, Multiclass classification, Statistical learning",
author = "Zhenqiu Liu and Halima Bensmail and Ming Tan",
year = "2012",
month = "5",
day = "15",
doi = "10.4137/EBO.S9407",
language = "English",
volume = "2012",
pages = "197--205",
booktitle = "Evolutionary Bioinformatics",
edition = "8",

}

TY - CHAP

T1 - Efficient feature selection and multiclass classification with integrated instance and model based learning

AU - Liu, Zhenqiu

AU - Bensmail, Halima

AU - Tan, Ming

PY - 2012/5/15

Y1 - 2012/5/15

KW - Feature selection

KW - High-dimensional data

KW - Multiclass classification

KW - Statistical learning

UR - http://www.scopus.com/inward/record.url?scp=84860809676&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84860809676&partnerID=8YFLogxK

U2 - 10.4137/EBO.S9407

DO - 10.4137/EBO.S9407

M3 - Chapter

C2 - 22577297

AN - SCOPUS:84860809676

VL - 2012

SP - 197

EP - 205

BT - Evolutionary Bioinformatics

ER -