Estimating missed actual positives using independent classifiers

Sandeep Mane, Jaideep Srivastava, San Yih Hwang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

Data mining is increasingly being applied in environments with a very high rate of data generation, such as network intrusion detection [7], where routers generate about 300,000 to 500,000 connections every minute. In such rare-class data domains, the cost of missing a rare-class instance is much higher than that of missing an instance of other classes. However, the high cost of manually labeling instances, the high rate at which data is collected, and real-time response constraints do not always allow one to determine the actual classes for the collected unlabeled datasets. In our previous work [9], this problem of missed false negatives was explained in the context of two different domains: "network intrusion detection" and "business opportunity classification". In such cases, an estimate of the number of such missed high-cost, rare instances will aid in evaluating the performance of the modeling technique (e.g. classification) used. A capture-recapture method was used for estimating false negatives, using two or more learning methods (i.e. classifiers). This paper focuses on the dependence between the class labels assigned by such learners. We define conditional independence for classifiers given a class label and show its relation to the conditional independence of the feature sets (used by the classifiers) given a class label. The latter is a computationally expensive problem; hence, a heuristic algorithm is proposed for obtaining conditionally independent (or less dependent) feature sets for the classifiers. Initial results of this algorithm on synthetic datasets are promising, and further research is being pursued.
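
The capture-recapture idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which estimator the paper uses, so the code below assumes the classic two-sample Lincoln-Petersen estimator; the function name and its parameters are hypothetical. It also assumes that the instances flagged by each classifier have been verified as actual positives and that the two classifiers are conditionally independent given the positive class.

```python
def estimate_missed_positives(n_a: int, n_b: int, n_both: int) -> float:
    """Estimate the number of actual positives missed by both classifiers.

    Assumption (not taken from the paper): the two-sample Lincoln-Petersen
    estimator, N ~= n_a * n_b / n_both, applied to verified positives.

    n_a    -- actual positives detected by classifier A
    n_b    -- actual positives detected by classifier B
    n_both -- actual positives detected by both A and B
    """
    if n_both <= 0:
        raise ValueError("n_both must be positive for the estimate to exist")
    if n_both > min(n_a, n_b):
        raise ValueError("n_both cannot exceed n_a or n_b")

    # Estimated total number of actual positives in the dataset.
    estimated_total = n_a * n_b / n_both
    # Positives detected by at least one classifier (inclusion-exclusion).
    detected = n_a + n_b - n_both
    # Whatever remains is the estimated count missed by both classifiers.
    return estimated_total - detected


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    print(estimate_missed_positives(n_a=80, n_b=60, n_both=48))
```

With these illustrative counts the estimated total is 80 × 60 / 48 = 100 positives, of which 92 were detected by at least one classifier, leaving an estimate of 8 missed. If the classifiers' labels are positively dependent given the class, the overlap is inflated and this estimate tends to be too low, which is why the dependence between classifiers matters.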

Original language: English
Title of host publication: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Editors: R.L. Grossman, R. Bayardo, K. Bennett, J. Vaidya
Pages: 648-653
Number of pages: 6
DOIs: https://doi.org/10.1145/1081870.1081951
Publication status: Published - 2005
Externally published: Yes
Event: KDD-2005: 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - Chicago, IL, United States
Duration: 21 Aug 2005 to 24 Aug 2005

Other

Other: KDD-2005: 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Country: United States
City: Chicago, IL
Period: 21/8/05 to 24/8/05

Fingerprint

Classifiers
Labels
Intrusion detection
Costs
Heuristic algorithms
Routers
Labeling
Data mining
Industry

Keywords

  • Capture-recapture method
  • Conditional independence of classifiers given class label
  • Conditional independence of features given class label
  • Conditional mutual information
  • False negative

ASJC Scopus subject areas

  • Information Systems

Cite this

Mane, S., Srivastava, J., & Hwang, S. Y. (2005). Estimating missed actual positives using independent classifiers. In R. L. Grossman, R. Bayardo, K. Bennett, & J. Vaidya (Eds.), Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 648-653). https://doi.org/10.1145/1081870.1081951

Estimating missed actual positives using independent classifiers. / Mane, Sandeep; Srivastava, Jaideep; Hwang, San Yih.

Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ed. / R.L. Grossman; R. Bayardo; K. Bennett; J. Vaidya. 2005. p. 648-653.

Mane, S, Srivastava, J & Hwang, SY 2005, Estimating missed actual positives using independent classifiers. in RL Grossman, R Bayardo, K Bennett & J Vaidya (eds), Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 648-653, KDD-2005: 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, United States, 21/8/05. https://doi.org/10.1145/1081870.1081951
Mane S, Srivastava J, Hwang SY. Estimating missed actual positives using independent classifiers. In Grossman RL, Bayardo R, Bennett K, Vaidya J, editors, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2005. p. 648-653. https://doi.org/10.1145/1081870.1081951
Mane, Sandeep ; Srivastava, Jaideep ; Hwang, San Yih. / Estimating missed actual positives using independent classifiers. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. editor / R.L. Grossman ; R. Bayardo ; K. Bennett ; J. Vaidya. 2005. pp. 648-653
@inproceedings{3ea5946128984fd38648d3905a4e62d3,
title = "Estimating missed actual positives using independent classifiers",
abstract = "Data mining is increasingly being applied in environments with a very high rate of data generation, such as network intrusion detection [7], where routers generate about 300,000 to 500,000 connections every minute. In such rare-class data domains, the cost of missing a rare-class instance is much higher than that of missing an instance of other classes. However, the high cost of manually labeling instances, the high rate at which data is collected, and real-time response constraints do not always allow one to determine the actual classes for the collected unlabeled datasets. In our previous work [9], this problem of missed false negatives was explained in the context of two different domains: {"}network intrusion detection{"} and {"}business opportunity classification{"}. In such cases, an estimate of the number of such missed high-cost, rare instances will aid in evaluating the performance of the modeling technique (e.g. classification) used. A capture-recapture method was used for estimating false negatives, using two or more learning methods (i.e. classifiers). This paper focuses on the dependence between the class labels assigned by such learners. We define conditional independence for classifiers given a class label and show its relation to the conditional independence of the feature sets (used by the classifiers) given a class label. The latter is a computationally expensive problem; hence, a heuristic algorithm is proposed for obtaining conditionally independent (or less dependent) feature sets for the classifiers. Initial results of this algorithm on synthetic datasets are promising, and further research is being pursued.",
keywords = "Capture-recapture method, Conditional independence of classifiers given class label, Conditional independence of features given class label, Conditional mutual information, False negative",
author = "Sandeep Mane and Jaideep Srivastava and Hwang, {San Yih}",
year = "2005",
doi = "10.1145/1081870.1081951",
language = "English",
pages = "648--653",
editor = "R.L. Grossman and R. Bayardo and K. Bennett and J. Vaidya",
booktitle = "Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining",

}

TY - GEN

T1 - Estimating missed actual positives using independent classifiers

AU - Mane, Sandeep

AU - Srivastava, Jaideep

AU - Hwang, San Yih

PY - 2005

Y1 - 2005

N2 - Data mining is increasingly being applied in environments with a very high rate of data generation, such as network intrusion detection [7], where routers generate about 300,000 to 500,000 connections every minute. In such rare-class data domains, the cost of missing a rare-class instance is much higher than that of missing an instance of other classes. However, the high cost of manually labeling instances, the high rate at which data is collected, and real-time response constraints do not always allow one to determine the actual classes for the collected unlabeled datasets. In our previous work [9], this problem of missed false negatives was explained in the context of two different domains: "network intrusion detection" and "business opportunity classification". In such cases, an estimate of the number of such missed high-cost, rare instances will aid in evaluating the performance of the modeling technique (e.g. classification) used. A capture-recapture method was used for estimating false negatives, using two or more learning methods (i.e. classifiers). This paper focuses on the dependence between the class labels assigned by such learners. We define conditional independence for classifiers given a class label and show its relation to the conditional independence of the feature sets (used by the classifiers) given a class label. The latter is a computationally expensive problem; hence, a heuristic algorithm is proposed for obtaining conditionally independent (or less dependent) feature sets for the classifiers. Initial results of this algorithm on synthetic datasets are promising, and further research is being pursued.

AB - Data mining is increasingly being applied in environments with a very high rate of data generation, such as network intrusion detection [7], where routers generate about 300,000 to 500,000 connections every minute. In such rare-class data domains, the cost of missing a rare-class instance is much higher than that of missing an instance of other classes. However, the high cost of manually labeling instances, the high rate at which data is collected, and real-time response constraints do not always allow one to determine the actual classes for the collected unlabeled datasets. In our previous work [9], this problem of missed false negatives was explained in the context of two different domains: "network intrusion detection" and "business opportunity classification". In such cases, an estimate of the number of such missed high-cost, rare instances will aid in evaluating the performance of the modeling technique (e.g. classification) used. A capture-recapture method was used for estimating false negatives, using two or more learning methods (i.e. classifiers). This paper focuses on the dependence between the class labels assigned by such learners. We define conditional independence for classifiers given a class label and show its relation to the conditional independence of the feature sets (used by the classifiers) given a class label. The latter is a computationally expensive problem; hence, a heuristic algorithm is proposed for obtaining conditionally independent (or less dependent) feature sets for the classifiers. Initial results of this algorithm on synthetic datasets are promising, and further research is being pursued.

KW - Capture-recapture method

KW - Conditional independence of classifiers given class label

KW - Conditional independence of features given class label

KW - Conditional mutual information

KW - False negative

UR - http://www.scopus.com/inward/record.url?scp=32344437828&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=32344437828&partnerID=8YFLogxK

U2 - 10.1145/1081870.1081951

DO - 10.1145/1081870.1081951

M3 - Conference contribution

AN - SCOPUS:32344437828

SP - 648

EP - 653

BT - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

A2 - Grossman, R.L.

A2 - Bayardo, R.

A2 - Bennett, K.

A2 - Vaidya, J.

ER -