Filling the gaps: Improving wikipedia stubs

Siddhartha Banerjee, Prasenjit Mitra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

The availability of only a limited number of contributors on Wikipedia cannot ensure consistent growth and improvement of the online encyclopedia. With information being scattered on the web, our goal is to automate the process of generation of content for Wikipedia. In this work, we propose a technique for improving stubs on Wikipedia that do not contain comprehensive information. A classifier learns features from the existing comprehensive articles on Wikipedia and recommends content that can be added to the stubs to improve their completeness. We conduct experiments using several classifiers: a Latent Dirichlet Allocation (LDA) based model, a deep learning based architecture (deep belief network) and a TFIDF based classifier. Our experiments reveal that the LDA based model outperforms the other models (6% F-score). Our generation approach shows that this technique is capable of generating comprehensive articles. ROUGE-2 scores of the articles generated by our system outperform those of the articles generated using the baseline. Content generated by our system has been appended to several stubs and successfully retained in Wikipedia.
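To make the approach concrete, the sketch below shows one plausible reading of the LDA-based recommendation step described above: topics are learned from sections of comprehensive Wikipedia articles, and a candidate web excerpt is routed to the section whose topic profile it most resembles. It is a minimal illustration in Python using gensim; the toy section corpus, tokenizer, topic count and similarity threshold are assumptions for demonstration, not the authors' actual data or parameters.

# A minimal sketch, assuming an LDA model over sections of comprehensive
# articles; all data and parameters below are illustrative, not the paper's.
import numpy as np
from gensim import corpora, models, matutils

# Toy training corpus: section name -> example sentences standing in for
# text drawn from comprehensive Wikipedia articles.
training_sections = {
    "Early life": ["born in a small town and attended the local school",
                   "grew up with two siblings and studied history"],
    "Career":     ["joined the company as an engineer and led several projects",
                   "published research on document engineering and retrieval"],
}

def tokenize(text):
    return [w.lower().strip(".,") for w in text.split()]

docs   = [tokenize(t) for texts in training_sections.values() for t in texts]
labels = [name for name, texts in training_sections.items() for _ in texts]

dictionary = corpora.Dictionary(docs)
bows = [dictionary.doc2bow(d) for d in docs]
lda  = models.LdaModel(bows, id2word=dictionary, num_topics=4,
                       passes=20, random_state=0)

def topic_vector(bow):
    # Dense topic distribution for one bag-of-words document.
    return matutils.sparse2full(
        lda.get_document_topics(bow, minimum_probability=0.0), lda.num_topics)

# The average topic distribution per section acts as that section's profile.
profiles = {name: np.mean([topic_vector(b)
                           for b, lab in zip(bows, labels) if lab == name], axis=0)
            for name in training_sections}

def recommend_section(excerpt, threshold=0.5):
    """Return the section a web excerpt could be appended to, or None."""
    vec = topic_vector(dictionary.doc2bow(tokenize(excerpt)))
    best, score = max(
        ((name, float(np.dot(vec, prof) /
                      (np.linalg.norm(vec) * np.linalg.norm(prof) + 1e-12)))
         for name, prof in profiles.items()),
        key=lambda pair: pair[1])
    return best if score >= threshold else None

print(recommend_section("she led an engineering team on a new retrieval project"))

A TFIDF-based variant would swap the topic vectors for tf-idf vectors, and the article assembled from the recommended excerpts could then be scored against a reference article with ROUGE-2, in the spirit of the evaluation described above.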

Original language: English
Title of host publication: DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering
Publisher: Association for Computing Machinery, Inc
Pages: 117-120
Number of pages: 4
ISBN (Print): 9781450333078
DOIs: https://doi.org/10.1145/2682571.2797073
Publication status: Published - 8 Sep 2015
Event: ACM Symposium on Document Engineering, DocEng 2015 - Lausanne, Switzerland
Duration: 8 Sep 2015 – 11 Sep 2015

Other

Other: ACM Symposium on Document Engineering, DocEng 2015
Country: Switzerland
City: Lausanne
Period: 8/9/15 – 11/9/15

Fingerprint

  • Classifiers
  • Bayesian networks
  • Experiments
  • Availability
  • Deep learning

Keywords

  • Text summarization
  • Topic modeling
  • Wikipedia generation

ASJC Scopus subject areas

  • Information Systems
  • Software

Cite this

Banerjee, S., & Mitra, P. (2015). Filling the gaps: Improving wikipedia stubs. In DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering (pp. 117-120). Association for Computing Machinery, Inc. https://doi.org/10.1145/2682571.2797073

Filling the gaps: Improving wikipedia stubs. / Banerjee, Siddhartha; Mitra, Prasenjit.

DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering. Association for Computing Machinery, Inc, 2015. p. 117-120.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Banerjee, S & Mitra, P 2015, Filling the gaps: Improving wikipedia stubs. in DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering. Association for Computing Machinery, Inc, pp. 117-120, ACM Symposium on Document Engineering, DocEng 2015, Lausanne, Switzerland, 8/9/15. https://doi.org/10.1145/2682571.2797073
Banerjee S, Mitra P. Filling the gaps: Improving wikipedia stubs. In DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering. Association for Computing Machinery, Inc. 2015. p. 117-120 https://doi.org/10.1145/2682571.2797073
Banerjee, Siddhartha; Mitra, Prasenjit. / Filling the gaps: Improving wikipedia stubs. DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering. Association for Computing Machinery, Inc, 2015. pp. 117-120
@inproceedings{c70266dada07441c8020714eb44e3c22,
title = "Filling the gaps: Improving wikipedia stubs",
abstract = "The availability of only a limited number of contributors on Wikipedia cannot ensure consistent growth and improvement of the online encyclopedia. With information being scattered on the web, our goal is to automate the process of generation of content for Wikipedia. In this work, we propose a technique for improving stubs on Wikipedia that do not contain comprehensive information. A classifier learns features from the existing comprehensive articles on Wikipedia and recommends content that can be added to the stubs to improve their completeness. We conduct experiments using several classifiers: a Latent Dirichlet Allocation (LDA) based model, a deep learning based architecture (deep belief network) and a TFIDF based classifier. Our experiments reveal that the LDA based model outperforms the other models (6{\%} F-score). Our generation approach shows that this technique is capable of generating comprehensive articles. ROUGE-2 scores of the articles generated by our system outperform those of the articles generated using the baseline. Content generated by our system has been appended to several stubs and successfully retained in Wikipedia.",
keywords = "Text summarization, Topic modeling, Wikipedia generation",
author = "Siddhartha Banerjee and Prasenjit Mitra",
year = "2015",
month = "9",
day = "8",
doi = "10.1145/2682571.2797073",
language = "English",
isbn = "9781450333078",
pages = "117--120",
booktitle = "DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering",
publisher = "Association for Computing Machinery, Inc",

}

TY - GEN

T1 - Filling the gaps

T2 - Improving wikipedia stubs

AU - Banerjee, Siddhartha

AU - Mitra, Prasenjit

PY - 2015/9/8

Y1 - 2015/9/8

N2 - The availability of only a limited number of contributors on Wikipedia cannot ensure consistent growth and improvement of the online encyclopedia. With information being scattered on the web, our goal is to automate the process of generation of content for Wikipedia. In this work, we propose a technique for improving stubs on Wikipedia that do not contain comprehensive information. A classifier learns features from the existing comprehensive articles on Wikipedia and recommends content that can be added to the stubs to improve their completeness. We conduct experiments using several classifiers: a Latent Dirichlet Allocation (LDA) based model, a deep learning based architecture (deep belief network) and a TFIDF based classifier. Our experiments reveal that the LDA based model outperforms the other models (6% F-score). Our generation approach shows that this technique is capable of generating comprehensive articles. ROUGE-2 scores of the articles generated by our system outperform those of the articles generated using the baseline. Content generated by our system has been appended to several stubs and successfully retained in Wikipedia.

AB - The availability of only a limited number of contributors on Wikipedia cannot ensure consistent growth and improvement of the online encyclopedia. With information being scattered on the web, our goal is to automate the process of generation of content for Wikipedia. In this work, we propose a technique for improving stubs on Wikipedia that do not contain comprehensive information. A classifier learns features from the existing comprehensive articles on Wikipedia and recommends content that can be added to the stubs to improve their completeness. We conduct experiments using several classifiers: a Latent Dirichlet Allocation (LDA) based model, a deep learning based architecture (deep belief network) and a TFIDF based classifier. Our experiments reveal that the LDA based model outperforms the other models (6% F-score). Our generation approach shows that this technique is capable of generating comprehensive articles. ROUGE-2 scores of the articles generated by our system outperform those of the articles generated using the baseline. Content generated by our system has been appended to several stubs and successfully retained in Wikipedia.

KW - Text summarization

KW - Topic modeling

KW - Wikipedia generation

UR - http://www.scopus.com/inward/record.url?scp=84959229664&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84959229664&partnerID=8YFLogxK

U2 - 10.1145/2682571.2797073

DO - 10.1145/2682571.2797073

M3 - Conference contribution

AN - SCOPUS:84959229664

SN - 9781450333078

SP - 117

EP - 120

BT - DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering

PB - Association for Computing Machinery, Inc

ER -