Topic segmentation with shared topic detection and alignment of multiple documents

Bingjun Sun, Prasenjit Mitra, C. Lee Giles, John Yen, Hongyuan Zha

Research output: Chapter in Book/Report/Conference proceeding › Chapter

32 Citations (Scopus)

Abstract

Topic detection and tracking and topic segmentation play an important role in capturing the local and sequential information of documents. Previous work in this area usually focuses on single documents, although multiple similar documents are available in many domains. In this paper, we introduce a novel unsupervised method for shared topic detection and topic segmentation of multiple similar documents based on mutual information (MI) and weighted mutual information (WMI), which combines MI with term weights. The basic idea is that the optimal segmentation maximizes MI (or WMI). Our approach can detect shared topics among documents, find the optimal boundaries in a document, and align segments among documents at the same time. It can also handle single-document segmentation as a special case of multi-document segmentation and alignment. By using term weights based on entropy learned from multiple documents, our methods can identify and strengthen cue terms that are useful for segmentation and partially remove stop words. Our experimental results show that our algorithm works well for the tasks of single-document segmentation, shared topic detection, and multi-document segmentation. Utilizing information from multiple documents can tremendously improve the performance of topic segmentation, and WMI performs even better than MI for multi-document segmentation.
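
The idea that a good segmentation maximizes MI between terms and segments can be illustrated with a small sketch. This is not the paper's algorithm: the function names (`mutual_information`, `entropy_weights`, `best_boundary`), the brute-force search over a single boundary, and the particular weight formula (one minus the normalized entropy of a term's distribution across documents) are all assumptions made for illustration. The abstract states only that WMI combines MI with entropy-based term weights.

```python
import math
from collections import Counter


def mutual_information(segments, weights=None):
    """I(T;S) from term-segment counts; optional per-term weights scale each term's contribution."""
    counts = [Counter(seg) for seg in segments]
    total = sum(sum(c.values()) for c in counts)
    if total == 0:
        return 0.0
    term_totals = Counter()
    for c in counts:
        term_totals.update(c)
    mi = 0.0
    for c in counts:
        p_s = sum(c.values()) / total          # p(segment)
        for term, n in c.items():
            p_ts = n / total                   # p(term, segment)
            p_t = term_totals[term] / total    # p(term)
            w = 1.0 if weights is None else weights.get(term, 1.0)
            mi += w * p_ts * math.log(p_ts / (p_t * p_s))
    return mi


def entropy_weights(documents):
    """Hypothetical weight: 1 - H(doc | term) / log(N). Terms spread evenly over all documents get 0."""
    n_docs = len(documents)
    per_doc = [Counter(doc) for doc in documents]
    vocab = set().union(*per_doc) if per_doc else set()
    if n_docs < 2:
        return {term: 1.0 for term in vocab}
    weights = {}
    for term in vocab:
        freqs = [c[term] for c in per_doc if c[term] > 0]
        total = sum(freqs)
        entropy = -sum((f / total) * math.log(f / total) for f in freqs)
        weights[term] = 1.0 - entropy / math.log(n_docs)
    return weights


def best_boundary(sentences, weights=None):
    """Try every single split point (needs at least two sentences) and keep the one with the highest (W)MI."""
    def split(k):
        left = [t for s in sentences[:k] for t in s]
        right = [t for s in sentences[k:] for t in s]
        return [left, right]
    return max(range(1, len(sentences)),
               key=lambda k: mutual_information(split(k), weights))


sentences = [["cat", "cat"], ["cat", "purr"], ["dog", "bark"], ["dog", "dog"]]
boundary = best_boundary(sentences)  # index of the first sentence of the second segment
```

Passing the dictionary returned by `entropy_weights` as `weights` turns the plain MI score into a weighted one, which is the general shape of the MI versus WMI comparison the abstract describes. A real implementation would also need to choose the number of segments and align segments across documents, which this sketch does not attempt.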

Original language: English
Title of host publication: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07
Pages: 199-206
Number of pages: 8
DOI: https://doi.org/10.1145/1277741.1277778
Publication status: Published - 2007
Externally published: Yes
Event: 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07 - Amsterdam
Duration: 23 Jul 2007 to 27 Jul 2007

Keywords

  • Multiple documents
  • Mutual information
  • Shared topic detection
  • Term weight
  • Topic alignment
  • Topic segmentation

ASJC Scopus subject areas

  • Information Systems
  • Software
  • Applied Mathematics

Cite this

Sun, B., Mitra, P., Giles, C. L., Yen, J., & Zha, H. (2007). Topic segmentation with shared topic detection and alignment of multiple documents. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07 (pp. 199-206) https://doi.org/10.1145/1277741.1277778

@inbook{03947197e362417a90bbf35ba4bb037a,
title = "Topic segmentation with shared topic detection and alignment of multiple documents",
keywords = "Multiple documents, Mutual information, Shared topic detection, Term weight, Topic alignment, Topic segmentation",
author = "Bingjun Sun and Prasenjit Mitra and Giles, {C. Lee} and John Yen and Hongyuan Zha",
year = "2007",
doi = "10.1145/1277741.1277778",
language = "English",
isbn = "1595935975",
pages = "199--206",
booktitle = "Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07",
}

TY  - CHAP
T1  - Topic segmentation with shared topic detection and alignment of multiple documents
AU  - Sun, Bingjun
AU  - Mitra, Prasenjit
AU  - Giles, C. Lee
AU  - Yen, John
AU  - Zha, Hongyuan
PY  - 2007
Y1  - 2007
KW  - Multiple documents
KW  - Mutual information
KW  - Shared topic detection
KW  - Term weight
KW  - Topic alignment
KW  - Topic segmentation
UR  - http://www.scopus.com/inward/record.url?scp=36448956401&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=36448956401&partnerID=8YFLogxK
U2  - 10.1145/1277741.1277778
DO  - 10.1145/1277741.1277778
M3  - Chapter
SN  - 1595935975
SN  - 9781595935977
SP  - 199
EP  - 206
BT  - Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07
ER  -