Efficient online evaluation of big data stream classifiers

Albert Bifet, Gianmarco De Francisci Morales, Jesse Read, Geoff Holmes, Bernhard Pfahringer

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

46 Citations (Scopus)

Abstract

The evaluation of classifiers in data streams is fundamental so that poorly performing models can be identified and either improved or replaced by better-performing models. This task is increasingly relevant as stream data is generated from more sources, in real time, in large quantities, and is now considered the largest source of big data. Both researchers and practitioners need to be able to evaluate effectively the performance of the methods they employ. However, evaluation in a stream poses major challenges. Instances arriving in a data stream are usually time-dependent, and the underlying concept that they represent may evolve over time. Furthermore, the massive quantity of data also tends to exacerbate issues such as class imbalance. Current frameworks for evaluating streaming and online algorithms can give predictions in real time, but because they use a prequential setting they build only one model, and are thus unable to compute the statistical significance of results in real time. In this paper we propose a new evaluation methodology for big data streams. This methodology addresses unbalanced data streams, data where change occurs on different time scales, and the question of how to split the data between training and testing, over multiple models.
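
For readers unfamiliar with the prequential setting the abstract contrasts against, the sketch below illustrates the idea: each instance is first used to test the current model and then to train it, so a single pass over the stream yields a performance estimate but only one model. This is a minimal illustration in Python; the OnlineClassifier interface and function names are assumptions made for this sketch, not the paper's implementation or any specific framework's API.

from typing import Iterable, Protocol, Tuple

class OnlineClassifier(Protocol):
    """Hypothetical interface for an incremental learner."""
    def predict(self, x) -> int: ...
    def learn(self, x, y: int) -> None: ...

def prequential_accuracy(model: OnlineClassifier,
                         stream: Iterable[Tuple[object, int]]) -> float:
    """Test-then-train: evaluate on each instance before learning from it."""
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)  # test on the incoming instance first
        model.learn(x, y)                      # then train on that same instance
        total += 1
    return correct / max(total, 1)

Because this setting builds a single model, it yields a single performance number and no distribution over which statistical significance can be computed, which is the limitation the abstract highlights. One plausible way to obtain several models from one pass, in the spirit of splitting the data between training and testing over multiple models, is to let each arriving instance test one of k models and train the remaining k-1. The sketch below illustrates that idea only; it is one reading of the abstract, not the authors' exact procedure.

def k_model_evaluation(models: list, stream: Iterable[Tuple[object, int]]) -> list:
    """Each instance tests exactly one model and trains the others,
    giving k accuracy estimates from a single pass over the stream."""
    k = len(models)
    correct = [0] * k
    tested = [0] * k
    for i, (x, y) in enumerate(stream):
        test_idx = i % k  # round-robin assignment; random assignment also works
        for j, model in enumerate(models):
            if j == test_idx:
                correct[j] += int(model.predict(x) == y)
                tested[j] += 1
            else:
                model.learn(x, y)
    return [c / max(t, 1) for c, t in zip(correct, tested)]

With two competing classifiers evaluated this way on the same stream, the k paired accuracy differences can be fed to a standard paired test (a sign test, for example) to judge significance; the methodology the authors actually propose is described in the full text.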

Original language: English
Title of host publication: KDD 2015 - Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Publisher: Association for Computing Machinery
Pages: 59-68
Number of pages: 10
Volume: 2015-August
ISBN (Electronic): 9781450336642
DOI: https://doi.org/10.1145/2783258.2783372
Publication status: Published - 10 Aug 2015
Externally published: Yes
Event: 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2015 - Sydney, Australia
Duration: 10 Aug 2015 – 13 Aug 2015

Keywords

  • Classification
  • Data streams
  • Evaluation
  • Online learning

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Bifet, A., De Francisci Morales, G., Read, J., Holmes, G., & Pfahringer, B. (2015). Efficient online evaluation of big data stream classifiers. In KDD 2015 - Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Vol. 2015-August, pp. 59-68). Association for Computing Machinery. https://doi.org/10.1145/2783258.2783372
