The Case for being average

A mediocrity approach to style masking and author obfuscation: (Best of the Labs Track at CLEF-2017)

Georgi Karadzhov, Tsvetomila Mihaylova, Yasen Kiprov, Georgi Georgiev, Ivan Koychev, Preslav Nakov

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

3 Citations (Scopus)

Abstract

Users posting online expect to remain anonymous unless they have logged in, and this anonymity is often what allows them to discuss various topics freely. Preserving the anonymity of a text’s writer can also be important in other contexts, e.g., in witness protection or anonymity programs. However, each person has his/her own style of writing, which can be analyzed using stylometry, and as a result, the true identity of the author of a piece of text can be revealed even if s/he has tried to hide it. Thus, it could be helpful to design automatic tools that help a person obfuscate his/her identity when writing text. In particular, here we propose an approach that changes the text so that it is pushed towards average values for some general stylometric characteristics, thus making these characteristics less discriminative. The approach consists of three main steps: first, we calculate the values for some popular stylometric metrics that can indicate authorship; then we apply various transformations to the text, so that these metrics are adjusted towards the average level, while preserving the semantics and the soundness of the text; and finally, we add random noise. This approach turned out to be very efficient, and yielded the best performance on the Author Obfuscation task at the PAN-2016 competition.
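The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two metrics, the target values in `TARGETS`, the tolerance, the single sentence-splitting transformation, and the modelling of the noise step as a jitter on the targets are all assumptions made for illustration, since the abstract does not specify any of them.

```python
import random
import re

# Hypothetical "average" values; the abstract does not give the real ones.
TARGETS = {"avg_sentence_length": 20.0, "type_token_ratio": 0.5}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(text):
    """Lower-cased alphabetic tokens (apostrophes kept)."""
    return re.findall(r"[A-Za-z']+", text.lower())


def metrics(text):
    """Step 1: measure a few illustrative stylometric characteristics."""
    sents = split_sentences(text)
    toks = words(text)
    return {
        "avg_sentence_length": len(toks) / len(sents),
        "type_token_ratio": len(set(toks)) / len(toks),
    }


def plan(measured, targets, tolerance=0.1):
    """Decide, per metric, which direction moves it towards its target."""
    out = {}
    for name, target in targets.items():
        value = measured[name]
        if value > target * (1 + tolerance):
            out[name] = "decrease"
        elif value < target * (1 - tolerance):
            out[name] = "increase"
        else:
            out[name] = "keep"
    return out


def shorten_sentences(text, max_words):
    """Step 2 (toy transformation): split over-long sentences at the first ', and '."""
    result = []
    for sent in split_sentences(text):
        if len(words(sent)) > max_words and ", and " in sent:
            head, tail = sent.split(", and ", 1)
            result.append(head + ".")
            result.append(tail[0].upper() + tail[1:])
        else:
            result.append(sent)
    return " ".join(result)


def jitter(targets, rng, scale=0.05):
    """Step 3 (assumed form of the noise): perturb each target by up to +/- scale."""
    return {k: v * (1 + rng.uniform(-scale, scale)) for k, v in targets.items()}


text = ("I went to the market early in the morning because the bread sells "
        "out fast, and I stayed there for hours talking to every vendor I could find.")
noisy_targets = jitter(TARGETS, random.Random(0))
before = metrics(text)
decision = plan(before, noisy_targets)
if decision["avg_sentence_length"] == "decrease":
    obfuscated = shorten_sentences(text, max_words=int(noisy_targets["avg_sentence_length"]))
else:
    obfuscated = text
after = metrics(obfuscated)
```

Here the example sentence is longer than the assumed target, so it is split in two and its average sentence length drops towards the target. A fuller system would need transformations in both directions for every metric, plus checks that meaning is preserved, which this sketch does not attempt.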

Original language: English
Title of host publication: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Proceedings
Publisher: Springer Verlag
Pages: 173-185
Number of pages: 13
ISBN (Print): 9783319658124
DOI: https://doi.org/10.1007/978-3-319-65813-1_18
Publication status: Published - 1 Jan 2017
Event: 8th International Conference of the CLEF Association, CLEF 2017 - Dublin, Ireland
Duration: 11 Sep 2017 to 14 Sep 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10456 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 8th International Conference of the CLEF Association, CLEF 2017
Country: Ireland
City: Dublin
Period: 11/9/17 to 14/9/17

Fingerprint

Obfuscation
Masking
Semantics
Anonymity
Person
Metric
Random Noise
Soundness
Text
Style
Calculate

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Karadzhov, G., Mihaylova, T., Kiprov, Y., Georgiev, G., Koychev, I., & Nakov, P. (2017). The Case for being average: A mediocrity approach to style masking and author obfuscation: (Best of the Labs Track at CLEF-2017). In Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Proceedings (pp. 173-185). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10456 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-65813-1_18

@inproceedings{b6b12140c8294fc883c7f0dcfa1858b9,
title = "The Case for being average: A mediocrity approach to style masking and author obfuscation: (Best of the Labs Track at CLEF-2017)",
author = "Georgi Karadzhov and Tsvetomila Mihaylova and Yasen Kiprov and Georgi Georgiev and Ivan Koychev and Preslav Nakov",
year = "2017",
month = "1",
day = "1",
doi = "10.1007/978-3-319-65813-1_18",
language = "English",
isbn = "9783319658124",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "173--185",
booktitle = "Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Proceedings",

}

TY - GEN

T1 - The Case for being average

T2 - A mediocrity approach to style masking and author obfuscation: (Best of the Labs Track at CLEF-2017)

AU - Karadzhov, Georgi

AU - Mihaylova, Tsvetomila

AU - Kiprov, Yasen

AU - Georgiev, Georgi

AU - Koychev, Ivan

AU - Nakov, Preslav

PY - 2017/1/1

Y1 - 2017/1/1

UR - http://www.scopus.com/inward/record.url?scp=85029426209&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85029426209&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-65813-1_18

DO - 10.1007/978-3-319-65813-1_18

M3 - Conference contribution

SN - 9783319658124

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 173

EP - 185

BT - Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Proceedings

PB - Springer Verlag

ER -