A robust decision tree algorithm for imbalanced data sets

Wei Liu, Sanjay Chawla, David A. Cieslak, Nitesh V. Chawla

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

82 Citations (Scopus)

Abstract

We propose a new decision tree algorithm, Class Confidence Proportion Decision Tree (CCPDT), which is robust and insensitive to class sizes and generates rules that are statistically significant. To make decision trees robust, we begin by expressing Information Gain, the metric used in C4.5, in terms of the confidence of a rule. This allows us to explain immediately why Information Gain, like confidence, results in rules that are biased towards the majority class. To overcome this bias, we introduce a new measure, Class Confidence Proportion (CCP), which forms the basis of CCPDT. To generate statistically significant rules, we design a novel and efficient top-down and bottom-up approach that uses Fisher's exact test to prune branches of the tree that are not statistically significant. Together these two changes yield a classifier that performs statistically better than not only traditional decision trees but also trees learned from data balanced by well-known sampling techniques. Our claims are confirmed through extensive experiments and comparisons against C4.5, CART, HDDT and SPARCCC.
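The two ingredients the abstract names, the CCP measure and Fisher's exact test pruning, can be sketched from a 2x2 contingency table. This is an illustrative sketch, not the paper's code: the function names and the table labels (a, b, c, d) are notational choices, and the definitions assume the usual formulation of class confidence as support of the rule normalised by class size rather than by antecedent size.

```python
from math import comb

def ccp(a, b, c, d):
    """Class Confidence Proportion for the rule X -> y, given the 2x2
    contingency table a = |X, y|, b = |X, not-y|, c = |not-X, y|,
    d = |not-X, not-y|.  Class confidence CC(X -> y) = a / (a + c)
    normalises by the class size rather than the antecedent's support,
    so the score is not dominated by the majority class; CCP then
    rescales it against the competing rule X -> not-y."""
    cc_y = a / (a + c)        # class confidence toward y
    cc_not_y = b / (b + d)    # class confidence toward not-y
    return cc_y / (cc_y + cc_not_y)

def fisher_exact_p(a, b, c, d):
    """One-sided Fisher's exact test p-value: the hypergeometric
    probability of a table at least as extreme as (a, b, c, d) with
    the same margins.  A test of this kind is what CCPDT uses to prune
    branches that are not statistically significant."""
    n = a + b + c + d
    p = 0.0
    for x in range(a, min(a + b, a + c) + 1):
        p += comb(a + b, x) * comb(c + d, (a + c) - x) / comb(n, a + c)
    return p
```

For example, with a = 40, b = 10, c = 60, d = 890 (a roughly 10:1 imbalance), the plain confidence of X -> y is 40/50 = 0.8, while CCP(X -> y) = 36/37 ≈ 0.973, because only about 1% of the negative class falls under X; this is the sense in which the measure is insensitive to class sizes.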

Original language: English
Title of host publication: Proceedings of the 10th SIAM International Conference on Data Mining, SDM 2010
Pages: 766-777
Number of pages: 12
Publication status: Published - 2010
Externally published: Yes
Event: 10th SIAM International Conference on Data Mining, SDM 2010 - Columbus, OH
Duration: 29 Apr 2010 - 1 May 2010


Fingerprint

  • Decision trees
  • Classifiers
  • Sampling
  • Experiments

ASJC Scopus subject areas

  • Software

Cite this

Liu, W., Chawla, S., Cieslak, D. A., & Chawla, N. V. (2010). A robust decision tree algorithm for imbalanced data sets. In Proceedings of the 10th SIAM International Conference on Data Mining, SDM 2010 (pp. 766-777).

