A quadratic mean based supervised learning model for managing data skewness

Wei Liu, Sanjay Chawla

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

12 Citations (Scopus)

Abstract

In this paper, we study the problem of data skewness. A data set is skewed (imbalanced) if its dependent variable is asymmetrically distributed. Dealing with skewed data sets has been identified as one of the ten most challenging problems in data mining research. We address the problem of class skewness for supervised learning models that are based on optimizing a regularized empirical risk function. These include both classification and regression models, for discrete and continuous dependent variables respectively. Classical empirical risk minimization is akin to minimizing the arithmetic mean of prediction errors, an approach in which the induction process is biased towards the majority class when the data are skewed. To overcome this drawback, we propose a quadratic mean based learning framework (QMLearn) that is robust and insensitive to class skewness. We note that minimizing the quadratic mean is a convex optimization problem and hence can be solved efficiently for large and high-dimensional data. Comprehensive experiments demonstrate that the QMLearn model significantly outperforms existing statistical learners, including logistic regression, support vector machines, linear regression, support vector regression and quantile regression.
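The contrast between the two means can be illustrated with a small sketch. It is not the QMLearn objective from the paper: the abstract does not give the exact formulation, so the code assumes a 0-1 loss and takes the quadratic mean (root mean square) over per-class error rates. The function names and the toy data are hypothetical.

```python
import math

def per_class_error(y_true, y_pred, label):
    """Fraction of the examples with true class `label` that are misclassified."""
    idx = [i for i, y in enumerate(y_true) if y == label]
    if not idx:
        raise ValueError(f"no examples with label {label!r}")
    return sum(y_pred[i] != label for i in idx) / len(idx)

def arithmetic_mean_error(y_true, y_pred):
    """Arithmetic mean of the 0-1 losses, i.e. the overall error rate."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def quadratic_mean_error(y_true, y_pred, labels=(0, 1)):
    """Quadratic mean of the per-class error rates (assumed formulation)."""
    errs = [per_class_error(y_true, y_pred, c) for c in labels]
    return math.sqrt(sum(e * e for e in errs) / len(errs))

# Skewed toy data: 95 negatives (label 0) followed by 5 positives (label 1).
y_true = [0] * 95 + [1] * 5

# Predictor A ignores the minority class entirely.
always_majority = [0] * 100

# Predictor B misclassifies 10 of the negatives and 1 of the positives.
balanced = [0] * 85 + [1] * 10 + [1] * 4 + [0]

for name, pred in [("always_majority", always_majority), ("balanced", balanced)]:
    am = arithmetic_mean_error(y_true, pred)
    qm = quadratic_mean_error(y_true, pred)
    print(f"{name}: arithmetic mean = {am:.3f}, quadratic mean = {qm:.3f}")
```

On this toy data the arithmetic mean prefers the predictor that never outputs the minority class, while the quadratic mean of per-class errors penalizes it heavily, because a large error on either class dominates the root mean square.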

Original language: English
Title of host publication: Proceedings of the 11th SIAM International Conference on Data Mining, SDM 2011
Pages: 188-198
Number of pages: 11
Publication status: Published - 2011
Externally published: Yes
Event: 11th SIAM International Conference on Data Mining, SDM 2011 - Mesa, AZ, United States
Duration: 28 Apr 2011 to 30 Apr 2011

Other

Other: 11th SIAM International Conference on Data Mining, SDM 2011
Country: United States
City: Mesa, AZ
Period: 28/4/11 to 30/4/11

Fingerprint

Supervised learning
Convex optimization
Linear regression
Support vector machines
Data mining
Logistics
Experiments

Keywords

  • Convex optimization
  • Data skewness
  • Quadratic mean

ASJC Scopus subject areas

  • Software

Cite this

Liu, W., & Chawla, S. (2011). A quadratic mean based supervised learning model for managing data skewness. In Proceedings of the 11th SIAM International Conference on Data Mining, SDM 2011 (pp. 188-198)

A quadratic mean based supervised learning model for managing data skewness. / Liu, Wei; Chawla, Sanjay.

Proceedings of the 11th SIAM International Conference on Data Mining, SDM 2011. 2011. p. 188-198.

Liu, W & Chawla, S 2011, A quadratic mean based supervised learning model for managing data skewness. in Proceedings of the 11th SIAM International Conference on Data Mining, SDM 2011. pp. 188-198, 11th SIAM International Conference on Data Mining, SDM 2011, Mesa, AZ, United States, 28/4/11.
Liu W, Chawla S. A quadratic mean based supervised learning model for managing data skewness. In Proceedings of the 11th SIAM International Conference on Data Mining, SDM 2011. 2011. p. 188-198
Liu, Wei ; Chawla, Sanjay. / A quadratic mean based supervised learning model for managing data skewness. Proceedings of the 11th SIAM International Conference on Data Mining, SDM 2011. 2011. pp. 188-198
@inproceedings{d788785f25b84083974501a6778a45b1,
title = "A quadratic mean based supervised learning model for managing data skewness",
abstract = "In this paper, we study the problem of data skewness. A data set is skewed/imbalanced if its dependent variable is asymmetrically distributed. Dealing with skewed data sets has been identified as one of the ten most challenging problems in data mining research. We address the problem of class skewness for supervised learning models which are based on optimizing a regularized empirical risk function. These include both classification and regression models for discrete and continuous dependent variables. Classical empirical risk minimization is akin to minimizing the arithmetic mean of prediction errors, in which approach the induction process is biased towards the majority class for skewed data. To overcome this drawback, we propose a quadratic mean based learning framework (QMLearn) that is robust and insensitive to class skewness. We will note that minimizing the quadratic mean is a convex optimization problem and hence can be efficiently solved for large and high dimensional data. Comprehensive experiments demonstrate that the QMLearn model significantly outperforms existing statistical learners including logistic regression, support vector machines, linear regression, support vector regression and quantile regression etc.",
keywords = "Convex optimization, Data skewness, Quadratic mean",
author = "Wei Liu and Sanjay Chawla",
year = "2011",
language = "English",
isbn = "9780898719925",
pages = "188--198",
booktitle = "Proceedings of the 11th SIAM International Conference on Data Mining, SDM 2011",

}

TY - GEN

T1 - A quadratic mean based supervised learning model for managing data skewness

AU - Liu, Wei

AU - Chawla, Sanjay

PY - 2011

Y1 - 2011

AB - In this paper, we study the problem of data skewness. A data set is skewed/imbalanced if its dependent variable is asymmetrically distributed. Dealing with skewed data sets has been identified as one of the ten most challenging problems in data mining research. We address the problem of class skewness for supervised learning models which are based on optimizing a regularized empirical risk function. These include both classification and regression models for discrete and continuous dependent variables. Classical empirical risk minimization is akin to minimizing the arithmetic mean of prediction errors, in which approach the induction process is biased towards the majority class for skewed data. To overcome this drawback, we propose a quadratic mean based learning framework (QMLearn) that is robust and insensitive to class skewness. We will note that minimizing the quadratic mean is a convex optimization problem and hence can be efficiently solved for large and high dimensional data. Comprehensive experiments demonstrate that the QMLearn model significantly outperforms existing statistical learners including logistic regression, support vector machines, linear regression, support vector regression and quantile regression etc.

KW - Convex optimization

KW - Data skewness

KW - Quadratic mean

UR - http://www.scopus.com/inward/record.url?scp=84857182086&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84857182086&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84857182086

SN - 9780898719925

SP - 188

EP - 198

BT - Proceedings of the 11th SIAM International Conference on Data Mining, SDM 2011

ER -