D-sieve: A novel data processing engine for efficient handling of crises-related social messages

Soudip Roy Chowdhury, Hemant Purohit, Muhammad Imran

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Citations (Scopus)

Abstract

Existing literature demonstrates the usefulness of system-mediated algorithms, such as supervised machine learning, for detecting classes of messages in the social-data stream (e.g., topically relevant vs. irrelevant). The classification accuracies of these algorithms largely depend upon the size of the labeled samples that are provided during the learning phase. Other factors, such as class distribution and term distribution in the training set, also play an important role in a classifier's accuracy. However, for several reasons (money/time constraints, limited number of skilled labelers, etc.), a large sample of labeled messages is often not immediately available for learning an efficient classification model. Consequently, a classifier trained on a poor model often misclassifies data, and hence the applicability of such learning techniques (especially in the online setting) during ongoing crisis response remains limited. In this paper, we propose a post-classification processing step that leverages two additional content features, stable hashtag association and stable named entity association, to improve the classification accuracy of a classifier in realistic settings. We have tested our algorithms on two crisis datasets from Twitter (Hurricane Sandy 2012 and Queensland Floods 2013) and compared our results against those produced by a "best-in-class" baseline online classifier. By showing consistently better-quality results than the baseline algorithm, i.e., by correctly classifying the data points misclassified in the prior step (false negatives to true positives and false positives to true negatives), we demonstrate the applicability of our approach in practice.
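The abstract does not define how a "stable" hashtag association is computed, so the following is only a minimal sketch of what a post-classification correction step of this kind might look like, not the paper's algorithm. It assumes that a hashtag counts as stable when it has been seen often enough in labeled messages and almost always with the same class. The function names `stable_hashtags` and `post_process` and the thresholds `min_count` and `min_purity` are hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict


def stable_hashtags(labeled, min_count=3, min_purity=0.9):
    """Return {hashtag: label} for hashtags tied almost exclusively to one class.

    `labeled` is an iterable of (hashtags, label) pairs. The thresholds are
    illustrative assumptions, not values taken from the paper.
    """
    counts = defaultdict(Counter)
    for hashtags, label in labeled:
        # Count each hashtag once per message, ignoring case.
        for tag in {h.lower() for h in hashtags}:
            counts[tag][label] += 1

    stable = {}
    for tag, by_label in counts.items():
        label, top = by_label.most_common(1)[0]
        total = sum(by_label.values())
        if total >= min_count and top / total >= min_purity:
            stable[tag] = label
    return stable


def post_process(hashtags, predicted, stable):
    """Override the classifier's label when stable hashtags clearly disagree."""
    votes = Counter(
        stable[tag] for tag in {h.lower() for h in hashtags} if tag in stable
    )
    if not votes:
        return predicted  # no stable evidence: keep the classifier's label
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return predicted  # conflicting evidence: keep the classifier's label
    return ranked[0][0]


# Toy labeled sample (made up for illustration only).
labeled = (
    [(["#sandy", "#nyc"], "relevant")] * 4
    + [(["#nowplaying"], "irrelevant")] * 4
    + [(["#nyc"], "irrelevant")] * 4
)
stable = stable_hashtags(labeled)

# A message the baseline classifier marked irrelevant (a false negative)
# is relabeled because "#sandy" is stably associated with the relevant class.
corrected = post_process(["#Sandy", "#NYC"], "irrelevant", stable)
```

In this toy sample, `#nyc` appears equally often with both classes, so it is not treated as stable and contributes no vote. A named-entity variant could follow the same pattern, with entity strings in place of hashtags.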

Original language: English
Title of host publication: WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web
Publisher: Association for Computing Machinery, Inc
Pages: 1227-1232
Number of pages: 6
ISBN (Print): 9781450334730
DOIs: https://doi.org/10.1145/2740908.2741731
Publication status: Published - 18 May 2015
Event: 24th International Conference on World Wide Web, WWW 2015 - Florence, Italy
Duration: 18 May 2015 to 22 May 2015

Other

Other: 24th International Conference on World Wide Web, WWW 2015
Country: Italy
City: Florence
Period: 18/5/15 to 22/5/15

Fingerprint

Sieves
Engines
Classifiers
Hurricanes
Learning systems
Positive ions
Processing

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software

Cite this

Chowdhury, S. R., Purohit, H., & Imran, M. (2015). D-sieve: A novel data processing engine for efficient handling of crises-related social messages. In WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web (pp. 1227-1232). Association for Computing Machinery, Inc. https://doi.org/10.1145/2740908.2741731

@inproceedings{a2822b417a2e4485bee86848666425ff,
title = "D-sieve: A novel data processing engine for efficient handling of crises-related social messages",
abstract = "Existing literature demonstrates the usefulness of system-mediated algorithms, such as supervised machine learning, for detecting classes of messages in the social-data stream (e.g., topically relevant vs. irrelevant). The classification accuracies of these algorithms largely depend upon the size of the labeled samples that are provided during the learning phase. Other factors, such as class distribution and term distribution in the training set, also play an important role in a classifier's accuracy. However, for several reasons (money/time constraints, limited number of skilled labelers, etc.), a large sample of labeled messages is often not immediately available for learning an efficient classification model. Consequently, a classifier trained on a poor model often misclassifies data, and hence the applicability of such learning techniques (especially in the online setting) during ongoing crisis response remains limited. In this paper, we propose a post-classification processing step that leverages two additional content features, stable hashtag association and stable named entity association, to improve the classification accuracy of a classifier in realistic settings. We have tested our algorithms on two crisis datasets from Twitter (Hurricane Sandy 2012 and Queensland Floods 2013) and compared our results against those produced by a {"}best-in-class{"} baseline online classifier. By showing consistently better-quality results than the baseline algorithm, i.e., by correctly classifying the data points misclassified in the prior step (false negatives to true positives and false positives to true negatives), we demonstrate the applicability of our approach in practice.",
author = "Chowdhury, {Soudip Roy} and Hemant Purohit and Muhammad Imran",
year = "2015",
month = "5",
day = "18",
doi = "10.1145/2740908.2741731",
language = "English",
isbn = "9781450334730",
pages = "1227--1232",
booktitle = "WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web",
publisher = "Association for Computing Machinery, Inc",

}

TY - GEN

T1 - D-sieve

T2 - A novel data processing engine for efficient handling of crises-related social messages

AU - Chowdhury, Soudip Roy

AU - Purohit, Hemant

AU - Imran, Muhammad

PY - 2015/5/18

Y1 - 2015/5/18

AB - Existing literature demonstrates the usefulness of system-mediated algorithms, such as supervised machine learning, for detecting classes of messages in the social-data stream (e.g., topically relevant vs. irrelevant). The classification accuracies of these algorithms largely depend upon the size of the labeled samples that are provided during the learning phase. Other factors, such as class distribution and term distribution in the training set, also play an important role in a classifier's accuracy. However, for several reasons (money/time constraints, limited number of skilled labelers, etc.), a large sample of labeled messages is often not immediately available for learning an efficient classification model. Consequently, a classifier trained on a poor model often misclassifies data, and hence the applicability of such learning techniques (especially in the online setting) during ongoing crisis response remains limited. In this paper, we propose a post-classification processing step that leverages two additional content features, stable hashtag association and stable named entity association, to improve the classification accuracy of a classifier in realistic settings. We have tested our algorithms on two crisis datasets from Twitter (Hurricane Sandy 2012 and Queensland Floods 2013) and compared our results against those produced by a "best-in-class" baseline online classifier. By showing consistently better-quality results than the baseline algorithm, i.e., by correctly classifying the data points misclassified in the prior step (false negatives to true positives and false positives to true negatives), we demonstrate the applicability of our approach in practice.

UR - http://www.scopus.com/inward/record.url?scp=84968639139&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84968639139&partnerID=8YFLogxK

U2 - 10.1145/2740908.2741731

DO - 10.1145/2740908.2741731

M3 - Conference contribution

AN - SCOPUS:84968639139

SN - 9781450334730

SP - 1227

EP - 1232

BT - WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web

PB - Association for Computing Machinery, Inc

ER -