Semi-automated text classification for sensitivity identification

Giacomo Berardi, Andrea Esuli, Craig Macdonald, Iadh Ounis, Fabrizio Sebastiani

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Citations (Scopus)

Abstract

Sensitive documents are those that cannot be made public, e.g., for personal or organizational privacy reasons. For instance, documents requested through Freedom of Information mechanisms must be manually reviewed for the presence of sensitive information before their actual release. Hence, tools that can assist human reviewers in spotting sensitive information are of great value to government organizations subject to Freedom of Information laws. We look at sensitivity identification in terms of semi-automated text classification (SATC), the task of ranking automatically classified documents so as to optimize the cost-effectiveness of human post-checking work. We use a recently proposed utility-theoretic approach to SATC that explicitly optimizes the chosen effectiveness function when ranking the documents by sensitivity; this is especially useful in our case, since sensitivity identification is a recall-oriented task, thus requiring the use of a recall-oriented evaluation measure such as F2. We show the validity of this approach by running experiments on a multi-label multi-class dataset of government documents manually annotated according to different types of sensitivity.
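
To make the "recall-oriented evaluation measure such as F2" concrete, here is a minimal sketch of the standard F-beta measure computed from contingency counts, with beta = 2. The function name `f_beta` and the example counts are illustrative assumptions, not taken from the paper, and this sketch does not implement the paper's utility-theoretic ranking method.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from true positives, false positives and false negatives.

    beta > 1 weights recall more heavily than precision; beta = 2 gives F2.
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Assumption: when there are no true or predicted positives at all,
    # the score is defined as 1.0 (one common convention, not the only one).
    if denom == 0:
        return 1.0
    return (1 + b2) * tp / denom


# Two hypothetical classifiers evaluated on 10 truly sensitive documents.
high_recall = dict(tp=9, fp=6, fn=1)     # finds most sensitive documents, many false alarms
high_precision = dict(tp=6, fp=1, fn=4)  # few false alarms, misses more sensitive documents

for name, counts in (("high recall", high_recall), ("high precision", high_precision)):
    f1 = f_beta(beta=1.0, **counts)
    f2 = f_beta(beta=2.0, **counts)
    print(f"{name}: F1={f1:.3f} F2={f2:.3f}")
```

Under F2 the gap between the two hypothetical classifiers widens in favor of the high-recall one, which is why such a measure suits a task where missing a sensitive document is costlier than a false alarm.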

Original language: English
Title of host publication: International Conference on Information and Knowledge Management, Proceedings
Publisher: Association for Computing Machinery
Pages: 1711-1714
Number of pages: 4
Volume: 19-23-Oct-2015
ISBN (Print): 9781450337946
DOIs: https://doi.org/10.1145/2806416.2806597
Publication status: Published - 17 Oct 2015
Event: 24th ACM International Conference on Information and Knowledge Management, CIKM 2015 - Melbourne, Australia
Duration: 19 Oct 2015 to 23 Oct 2015

Other

Other: 24th ACM International Conference on Information and Knowledge Management, CIKM 2015
Country: Australia
City: Melbourne
Period: 19/10/15 to 23/10/15

Fingerprint

Text classification
Government
Freedom of information
Evaluation
Cost-effectiveness
Ranking function
Ranking
Experiment
Privacy

ASJC Scopus subject areas

  • Business, Management and Accounting (all)
  • Decision Sciences (all)

Cite this

Berardi, G., Esuli, A., Macdonald, C., Ounis, I., & Sebastiani, F. (2015). Semi-automated text classification for sensitivity identification. In International Conference on Information and Knowledge Management, Proceedings (Vol. 19-23-Oct-2015, pp. 1711-1714). Association for Computing Machinery. https://doi.org/10.1145/2806416.2806597

Semi-automated text classification for sensitivity identification. / Berardi, Giacomo; Esuli, Andrea; Macdonald, Craig; Ounis, Iadh; Sebastiani, Fabrizio.

International Conference on Information and Knowledge Management, Proceedings. Vol. 19-23-Oct-2015 Association for Computing Machinery, 2015. p. 1711-1714.

Berardi, G, Esuli, A, Macdonald, C, Ounis, I & Sebastiani, F 2015, Semi-automated text classification for sensitivity identification. in International Conference on Information and Knowledge Management, Proceedings. vol. 19-23-Oct-2015, Association for Computing Machinery, pp. 1711-1714, 24th ACM International Conference on Information and Knowledge Management, CIKM 2015, Melbourne, Australia, 19/10/15. https://doi.org/10.1145/2806416.2806597
Berardi G, Esuli A, Macdonald C, Ounis I, Sebastiani F. Semi-automated text classification for sensitivity identification. In International Conference on Information and Knowledge Management, Proceedings. Vol. 19-23-Oct-2015. Association for Computing Machinery. 2015. p. 1711-1714 https://doi.org/10.1145/2806416.2806597
Berardi, Giacomo ; Esuli, Andrea ; Macdonald, Craig ; Ounis, Iadh ; Sebastiani, Fabrizio. / Semi-automated text classification for sensitivity identification. International Conference on Information and Knowledge Management, Proceedings. Vol. 19-23-Oct-2015 Association for Computing Machinery, 2015. pp. 1711-1714
@inproceedings{dee62e9dfe1b4969b7132a7ad66ad776,
title = "Semi-automated text classification for sensitivity identification",
abstract = "Sensitive documents are those that cannot be made public, e.g., for personal or organizational privacy reasons. For instance, documents requested through Freedom of Information mechanisms must be manually reviewed for the presence of sensitive information before their actual release. Hence, tools that can assist human reviewers in spotting sensitive information are of great value to government organizations subject to Freedom of Information laws. We look at sensitivity identification in terms of semi-automated text classification (SATC), the task of ranking automatically classified documents so as to optimize the cost-effectiveness of human post-checking work. We use a recently proposed utility-theoretic approach to SATC that explicitly optimizes the chosen effectiveness function when ranking the documents by sensitivity; this is especially useful in our case, since sensitivity identification is a recall-oriented task, thus requiring the use of a recall-oriented evaluation measure such as F2. We show the validity of this approach by running experiments on a multi-label multi-class dataset of government documents manually annotated according to different types of sensitivity.",
author = "Giacomo Berardi and Andrea Esuli and Craig Macdonald and Iadh Ounis and Fabrizio Sebastiani",
year = "2015",
month = "10",
day = "17",
doi = "10.1145/2806416.2806597",
language = "English",
isbn = "9781450337946",
volume = "19-23-Oct-2015",
pages = "1711--1714",
booktitle = "International Conference on Information and Knowledge Management, Proceedings",
publisher = "Association for Computing Machinery",
}

TY - GEN

T1 - Semi-automated text classification for sensitivity identification

AU - Berardi, Giacomo

AU - Esuli, Andrea

AU - Macdonald, Craig

AU - Ounis, Iadh

AU - Sebastiani, Fabrizio

PY - 2015/10/17

Y1 - 2015/10/17

AB - Sensitive documents are those that cannot be made public, e.g., for personal or organizational privacy reasons. For instance, documents requested through Freedom of Information mechanisms must be manually reviewed for the presence of sensitive information before their actual release. Hence, tools that can assist human reviewers in spotting sensitive information are of great value to government organizations subject to Freedom of Information laws. We look at sensitivity identification in terms of semi-automated text classification (SATC), the task of ranking automatically classified documents so as to optimize the cost-effectiveness of human post-checking work. We use a recently proposed utility-theoretic approach to SATC that explicitly optimizes the chosen effectiveness function when ranking the documents by sensitivity; this is especially useful in our case, since sensitivity identification is a recall-oriented task, thus requiring the use of a recall-oriented evaluation measure such as F2. We show the validity of this approach by running experiments on a multi-label multi-class dataset of government documents manually annotated according to different types of sensitivity.

UR - http://www.scopus.com/inward/record.url?scp=84959291529&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84959291529&partnerID=8YFLogxK

U2 - 10.1145/2806416.2806597

DO - 10.1145/2806416.2806597

M3 - Conference contribution

AN - SCOPUS:84959291529

SN - 9781450337946

VL - 19-23-Oct-2015

SP - 1711

EP - 1714

BT - International Conference on Information and Knowledge Management, Proceedings

PB - Association for Computing Machinery

ER -