Semantic-based multilingual document clustering via tensor modeling

Salvatore Romeo, Andrea Tagarelli, Dino Ienco

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Citations (Scopus)

Abstract

A major challenge in document clustering research arises from the growing amount of text data written in different languages. Previous approaches depend on language-specific solutions (e.g., bilingual dictionaries, sequential machine translation) to evaluate document similarities, and the required transformations may alter the original document semantics. To address this issue, we propose a new document clustering approach for multilingual corpora that (i) exploits a large-scale multilingual knowledge base, (ii) takes advantage of the multi-topic nature of the text documents, and (iii) employs a tensor-based model to deal with high dimensionality and sparseness. Results show the significance of our approach and its better performance compared with classic document clustering approaches, on both a balanced and an unbalanced corpus.

Original language: English
Title of host publication: EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 600-609
Number of pages: 10
ISBN (Electronic): 9781937284961
Publication status: Published - 1 Jan 2014
Externally published: Yes
Event: 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014 - Doha, Qatar
Duration: 25 Oct 2014 to 29 Oct 2014

Other

Other: 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014
Country: Qatar
City: Doha
Period: 25/10/14 to 29/10/14

Fingerprint

Sequential machines
Glossaries
Tensors
Semantics

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Vision and Pattern Recognition
  • Information Systems

Cite this

Romeo, S., Tagarelli, A., & Ienco, D. (2014). Semantic-based multilingual document clustering via tensor modeling. In EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 600-609). Association for Computational Linguistics (ACL).

@inproceedings{5f425c5d1cf0410f95c9bf80f4618090,
title = "Semantic-based multilingual document clustering via tensor modeling",
author = "Salvatore Romeo and Andrea Tagarelli and Dino Ienco",
year = "2014",
month = "1",
day = "1",
language = "English",
pages = "600--609",
booktitle = "EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference",
publisher = "Association for Computational Linguistics (ACL)",

}

TY - GEN

T1 - Semantic-based multilingual document clustering via tensor modeling

AU - Romeo, Salvatore

AU - Tagarelli, Andrea

AU - Ienco, Dino

PY - 2014/1/1

Y1 - 2014/1/1

UR - http://www.scopus.com/inward/record.url?scp=84925417669&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84925417669&partnerID=8YFLogxK

M3 - Conference contribution

SP - 600

EP - 609

BT - EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

PB - Association for Computational Linguistics (ACL)

ER -