Cross-language high similarity search using a conceptual thesaurus

Parth Gupta, Alberto Barron, Paolo Rosso

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

14 Citations (Scopus)

Abstract

This work addresses cross-language high-similarity and near-duplicate search: given a document, a highly similar one is to be identified in a large cross-language collection of documents. We propose a concept-based similarity model for the problem that is very light in both computation and memory. We evaluate the model on three corpora of different nature and on two language pairs, English-German and English-Spanish, using the Eurovoc conceptual thesaurus. We compare our model with two state-of-the-art models and find that, although the proposed model is very generic, it produces competitive results and is significantly stable and consistent across the corpora.
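
The abstract does not spell out the internals of the model, so the following is only a minimal sketch of what a concept-based cross-language similarity could look like, not the authors' method. It assumes that each document is mapped to language-independent concept identifiers by exact matching of tokens against thesaurus labels, and that two documents are compared by cosine similarity over their concept counts. The thesaurus entries, the concept identifiers (`C1`, `C2`, `C3`), and the function names are all hypothetical and are not taken from Eurovoc.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus: language -> {label: concept id}.
# These entries are illustrative and are not real Eurovoc data.
TOY_THESAURUS = {
    "en": {"fishing": "C1", "agriculture": "C2", "trade": "C3"},
    "es": {"pesca": "C1", "agricultura": "C2", "comercio": "C3"},
}


def concept_vector(text, lang):
    """Count the concept ids whose labels occur as tokens in the text."""
    labels = TOY_THESAURUS[lang]
    tokens = text.lower().split()
    return Counter(labels[t] for t in tokens if t in labels)


def cosine(u, v):
    """Cosine similarity between two sparse concept-count vectors."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, query_lang, collection, coll_lang):
    """Return (index, score) of the best match in a non-empty collection."""
    qv = concept_vector(query, query_lang)
    scores = [cosine(qv, concept_vector(doc, coll_lang)) for doc in collection]
    best = max(range(len(collection)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    docs_es = ["agricultura moderna", "pesca y comercio"]
    print(most_similar("fishing and trade policy", "en", docs_es, "es"))
```

Because both documents are reduced to a small set of shared concept identifiers, no translation step is needed at query time and each document is stored as a short sparse vector, which is one way a model of this kind can stay light in computation and memory.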

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 67-75
Number of pages: 9
Volume: 7488 LNCS
DOI: https://doi.org/10.1007/978-3-642-33247-0_8
Publication status: Published - 2012
Externally published: Yes
Event: 3rd International Conference of the CLEF Initiative, CLEF 2012 - Rome
Duration: 17 Sep 2012 - 20 Sep 2012

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 7488 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 3rd International Conference of the CLEF Initiative, CLEF 2012
City: Rome
Period: 17/9/12 - 20/9/12

Fingerprint

Thesauri
Thesaurus
Similarity Search
Model
Language
Data storage equipment
Evaluate

ASJC Scopus subject areas

  • Computer Science(all)
  • Theoretical Computer Science

Cite this

Gupta, P., Barron, A., & Rosso, P. (2012). Cross-language high similarity search using a conceptual thesaurus. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7488 LNCS, pp. 67-75). https://doi.org/10.1007/978-3-642-33247-0_8

@inproceedings{e87fb885a3c34e558822272457e9bdea,
title = "Cross-language high similarity search using a conceptual thesaurus",
abstract = "This work addresses the issue of cross-language high similarity and near-duplicates search, where, for the given document, a highly similar one is to be identified from a large cross-language collection of documents. We propose a concept-based similarity model for the problem which is very light in computation and memory. We evaluate the model on three corpora of different nature and two language pairs English-German and English-Spanish using the Eurovoc conceptual thesaurus. Our model is compared with two state-of-the-art models and we find, though the proposed model is very generic, it produces competitive results and is significantly stable and consistent across the corpora.",
author = "Parth Gupta and Alberto Barron and Paolo Rosso",
year = "2012",
doi = "10.1007/978-3-642-33247-0_8",
language = "English",
isbn = "9783642332463",
volume = "7488 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "67--75",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN
T1 - Cross-language high similarity search using a conceptual thesaurus
AU - Gupta, Parth
AU - Barron, Alberto
AU - Rosso, Paolo
PY - 2012
Y1 - 2012
AB - This work addresses the issue of cross-language high similarity and near-duplicates search, where, for the given document, a highly similar one is to be identified from a large cross-language collection of documents. We propose a concept-based similarity model for the problem which is very light in computation and memory. We evaluate the model on three corpora of different nature and two language pairs English-German and English-Spanish using the Eurovoc conceptual thesaurus. Our model is compared with two state-of-the-art models and we find, though the proposed model is very generic, it produces competitive results and is significantly stable and consistent across the corpora.
UR - http://www.scopus.com/inward/record.url?scp=84867660265&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84867660265&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-33247-0_8
DO - 10.1007/978-3-642-33247-0_8
M3 - Conference contribution
AN - SCOPUS:84867660265
SN - 9783642332463
VL - 7488 LNCS
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 67
EP - 75
BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ER -