Topic segmentation with shared topic detection and alignment of multiple documents

Bingjun Sun, Prasenjit Mitra, C. Lee Giles, John Yen, Hongyuan Zha

Research output: Chapter in Book/Report/Conference proceeding › Chapter

32 Citations (Scopus)

Abstract

Topic detection and tracking and topic segmentation play an important role in capturing the local and sequential information of documents. Previous work in this area usually focuses on single documents, although similar multiple documents are available in many domains. In this paper, we introduce a novel unsupervised method for shared topic detection and topic segmentation of multiple similar documents based on mutual information (MI) and weighted mutual information (WMI) that is a combination of MI and term weights. The basic idea is that the optimal segmentation maximizes MI (or WMI). Our approach can detect shared topics among documents. It can find the optimal boundaries in a document, and align segments among documents at the same time. It also can handle single-document segmentation as a special case of the multi-document segmentation and alignment. Our methods can identify and strengthen cue terms that can be used for segmentation and partially remove stop words by using term weights based on entropy learned from multiple documents. Our experimental results show that our algorithm works well for the tasks of single-document segmentation, shared topic detection, and multi-document segmentation. Utilizing information from multiple documents can tremendously improve the performance of topic segmentation, and using WMI is even better than using MI for the multi-document segmentation.
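As a rough illustration of the idea that the optimal segmentation maximizes MI, the following minimal sketch estimates the mutual information between terms and segments from raw term counts and tries every position for a single boundary in one toy document. It is not the authors' algorithm: the function names, the count-based probability estimates, and the toy data are assumptions made for this example, and the sketch leaves out the multi-document alignment and the entropy-based term weights (WMI) described in the abstract.

```python
import math
from collections import Counter


def mutual_information(segments):
    """MI (in bits) between terms and segments.

    `segments` is a list of token lists. The joint distribution p(t, s)
    is estimated as count(term t in segment s) / total token count
    (an assumption of this sketch, not taken from the paper).
    """
    counts = [Counter(seg) for seg in segments]
    total = sum(sum(c.values()) for c in counts)
    term_totals = Counter()
    for c in counts:
        term_totals.update(c)

    mi = 0.0
    for c in counts:
        p_s = sum(c.values()) / total
        for term, n in c.items():
            p_ts = n / total
            p_t = term_totals[term] / total
            mi += p_ts * math.log2(p_ts / (p_t * p_s))
    return mi


def best_single_boundary(sentences):
    """Try every split of the sentence list into two segments.

    Returns (k, score): the boundary falls before sentence k, and score
    is the MI of the resulting two-segment split.
    """
    best = None
    for k in range(1, len(sentences)):
        left = [tok for s in sentences[:k] for tok in s]
        right = [tok for s in sentences[k:] for tok in s]
        score = mutual_information([left, right])
        if best is None or score > best[1]:
            best = (k, score)
    return best


# Hypothetical toy document: two sentences about pets, two about finance.
toy = [["cat", "dog"], ["dog", "cat"], ["stock", "bond"], ["bond", "stock"]]
boundary, score = best_single_boundary(toy)
print(boundary, score)
```

On this toy input the highest-scoring boundary is the one between the two vocabularies. The number of segments is fixed at two here so that the scores of different splits are comparable; choosing the number of segments would need additional machinery that this sketch does not attempt.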

Original language: English
Title of host publication: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07
Pages: 199-206
Number of pages: 8
DOIs: https://doi.org/10.1145/1277741.1277778
Publication status: Published - 2007
Externally published: Yes
Event: 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07 - Amsterdam
Duration: 23 Jul 2007 → 27 Jul 2007

Keywords

  • Multiple documents
  • Mutual information
  • Shared topic detection
  • Term weight
  • Topic alignment
  • Topic segmentation

ASJC Scopus subject areas

  • Information Systems
  • Software
  • Applied Mathematics

Cite this

Sun, B., Mitra, P., Giles, C. L., Yen, J., & Zha, H. (2007). Topic segmentation with shared topic detection and alignment of multiple documents. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07 (pp. 199-206) https://doi.org/10.1145/1277741.1277778