Secure similar document detection with simhash

Sahin Buyrukbilen, Spiridon Bakiras

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

10 Citations (Scopus)

Abstract

Similar document detection is a well-studied problem with important application domains, such as plagiarism detection, document archiving, and patent/copyright protection. Recently, the research focus has shifted towards the privacy-preserving version of the problem, in which two parties want to identify similar documents within their respective datasets. These methods apply to scenarios such as patent protection or intelligence collaboration, where the contents of the documents at both parties should be kept secret. Nevertheless, existing protocols on secure similar document detection suffer from high computational and/or communication costs, which renders them impractical for large datasets. In this work, we introduce a solution based on simhash document fingerprints, which essentially reduce the problem to a secure XOR computation between two bit vectors. Our experimental results demonstrate that the proposed method improves the computational and communication costs by at least one order of magnitude compared to the current state-of-the-art protocol. Moreover, it achieves a high level of precision and recall.
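To make the fingerprint idea in the abstract concrete, the following is a minimal, non-secure sketch of how simhash fingerprints turn document similarity into an XOR between two bit vectors. It is not the authors' protocol: the function names, the 64-bit fingerprint width, SHA-256 as the per-token hash, and 3-word shingles are all assumptions made for illustration. The paper's contribution is computing this XOR securely between two parties, which the sketch does not attempt.

```python
import hashlib


def shingles(text, k=3):
    """Split text into overlapping k-word shingles (k=3 is an assumed choice)."""
    words = text.lower().split()
    return [" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))]


def simhash(tokens, bits=64):
    """Plain simhash: every token's hash votes +1 or -1 on each bit position.

    bits must be a multiple of 8 and at most 256, because the token hash
    here is a truncated SHA-256 digest (an assumption of this sketch).
    """
    counts = [0] * bits
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:  # a positive vote total sets this bit of the fingerprint
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits, obtained by XOR-ing the two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = "similar document detection is a well studied problem"
    doc_b = "similar document detection is a widely studied problem"
    fa = simhash(shingles(doc_a))
    fb = simhash(shingles(doc_b))
    # A small distance suggests similar documents; the cutoff is application-specific.
    print(hamming_distance(fa, fb))
```

In this plaintext setting, two documents would be flagged as similar when the Hamming distance of their fingerprints falls below a chosen threshold. In the privacy-preserving setting described above, neither party would be able to see the other's fingerprint in the clear.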

Original language: English
Title of host publication: Secure Data Management - 10th VLDB Workshop, SDM 2013, Proceedings
Publisher: Springer Verlag
Pages: 61-75
Number of pages: 15
Volume: 8425 LNCS
ISBN (Print): 9783319068107
DOIs: https://doi.org/10.1007/978-3-319-06811-4_12
Publication status: Published - 2014
Externally published: Yes
Event: 10th VLDB Workshop on Secure Data Management, SDM 2013 - Trento, Italy
Duration: 30 Aug 2013 → 30 Aug 2013

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8425 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 10th VLDB Workshop on Secure Data Management, SDM 2013
Country: Italy
City: Trento
Period: 30/8/13 → 30/8/13

Fingerprint

Patents
Communication Cost
Communication
Secure Computation
Copyright Protection
Costs
Privacy Preserving
Fingerprint
Large Data Sets
Computational Cost
Scenarios
Experimental Results
Demonstrate
Intelligence
Collaboration

ASJC Scopus subject areas

  • Computer Science(all)
  • Theoretical Computer Science

Cite this

Buyrukbilen, S., & Bakiras, S. (2014). Secure similar document detection with simhash. In Secure Data Management - 10th VLDB Workshop, SDM 2013, Proceedings (Vol. 8425 LNCS, pp. 61-75). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 8425 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-06811-4_12

@inproceedings{78075b8c045a42a68f7d3ffb1001bf61,
title = "Secure similar document detection with simhash",
author = "Sahin Buyrukbilen and Spiridon Bakiras",
year = "2014",
doi = "10.1007/978-3-319-06811-4_12",
language = "English",
isbn = "9783319068107",
volume = "8425 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "61--75",
booktitle = "Secure Data Management - 10th VLDB Workshop, SDM 2013, Proceedings",

}

TY - GEN

T1 - Secure similar document detection with simhash

AU - Buyrukbilen, Sahin

AU - Bakiras, Spiridon

PY - 2014

Y1 - 2014

N2 - Similar document detection is a well-studied problem with important application domains, such as plagiarism detection, document archiving, and patent/copyright protection. Recently, the research focus has shifted towards the privacy-preserving version of the problem, in which two parties want to identify similar documents within their respective datasets. These methods apply to scenarios such as patent protection or intelligence collaboration, where the contents of the documents at both parties should be kept secret. Nevertheless, existing protocols on secure similar document detection suffer from high computational and/or communication costs, which renders them impractical for large datasets. In this work, we introduce a solution based on simhash document fingerprints, which essentially reduce the problem to a secure XOR computation between two bit vectors. Our experimental results demonstrate that the proposed method improves the computational and communication costs by at least one order of magnitude compared to the current state-of-the-art protocol. Moreover, it achieves a high level of precision and recall.

UR - http://www.scopus.com/inward/record.url?scp=84902436811&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84902436811&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-06811-4_12

DO - 10.1007/978-3-319-06811-4_12

M3 - Conference contribution

AN - SCOPUS:84902436811

SN - 9783319068107

VL - 8425 LNCS

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 61

EP - 75

BT - Secure Data Management - 10th VLDB Workshop, SDM 2013, Proceedings

PB - Springer Verlag

ER -