Classifying scientific publications using abstract features

Cornelia Caragea, Adrian Silvescu, Saurabh Kataria, Doina Caragea, Prasenjit Mitra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Citations (Scopus)

Abstract

With the exponential increase in the number of documents available online, e.g., news articles, weblogs, and scientific documents, effective and efficient classification methods are required to deliver the appropriate information to specific users or groups. The performance of document classifiers depends critically, among other things, on the choice of feature representation. The commonly used "bag of words" representation can result in a large number of features. Feature abstraction reduces a classifier's input size by learning an abstraction hierarchy over the set of words; a cut through the hierarchy specifies a compressed model in which the nodes on the cut represent abstract features. In this paper, we compare feature abstraction with two other methods for dimensionality reduction: feature selection and Latent Dirichlet Allocation (LDA). Experimental results on two data sets of scientific publications show that classifiers trained using abstract features significantly outperform those trained using the features that have the highest average mutual information with the class, as well as those trained using the topic distribution and topic words output by LDA. Furthermore, we propose an approach to automatically identifying a cut in order to trade off classifier complexity against performance. Our results demonstrate the feasibility of the proposed approach.
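The two ideas contrasted in the abstract can be illustrated in miniature. The sketch below is not the paper's implementation: the cut, the word-to-cluster mapping, and all names are hypothetical, and `mutual_information` is simply the standard definition of I(Word; Class) that the feature-selection baseline ranks words by.

```python
import math
from collections import Counter

# Hypothetical cut through a learned word-abstraction hierarchy: each
# node on the cut (an "abstract feature") pools a set of words. The
# paper learns the hierarchy from data; this mapping is made up.
CUT = {
    "svm": "classifier_terms", "bayes": "classifier_terms",
    "gene": "biology_terms",   "protein": "biology_terms",
    "corpus": "nlp_terms",     "parser": "nlp_terms",
}

def abstract_bag_of_words(tokens):
    """Compress a bag-of-words by summing the counts of words that fall
    under the same abstract feature on the cut; out-of-vocabulary words
    are dropped here for brevity."""
    return Counter(CUT[t] for t in tokens if t in CUT)

def mutual_information(joint):
    """I(Word; Class) from a joint count table {(word, class): count},
    the criterion the feature-selection baseline ranks words by."""
    total = sum(joint.values())
    pw, pc = Counter(), Counter()
    for (w, c), n in joint.items():
        pw[w] += n
        pc[c] += n
    return sum((n / total) * math.log2(n * total / (pw[w] * pc[c]))
               for (w, c), n in joint.items())

doc = ["svm", "bayes", "corpus", "svm", "table"]
print(abstract_bag_of_words(doc))   # Counter({'classifier_terms': 3, 'nlp_terms': 1})
# A word whose occurrence is independent of the class carries zero
# mutual information and would be discarded by the baseline:
print(mutual_information({("w", 0): 1, ("w", 1): 1}))  # 0.0
```

In the paper the cut is identified automatically to trade classifier complexity against performance; here it is fixed by hand purely to show how pooling counts shrinks the feature space.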

Original language: English
Title of host publication: SARA 2011 - Proceedings of the 9th Symposium on Abstraction, Reformulation, and Approximation
Pages: 26-33
Number of pages: 8
Publication status: Published - 2011
Externally published: Yes
Event: 9th Symposium on Abstraction, Reformulation, and Approximation, SARA 2011 - Cardona, Catalonia
Duration: 17 Jul 2011 - 18 Jul 2011

Other

Other: 9th Symposium on Abstraction, Reformulation, and Approximation, SARA 2011
City: Cardona, Catalonia
Period: 17/7/11 - 18/7/11

Fingerprint

  • Classifiers
  • Feature extraction

ASJC Scopus subject areas

  • Computer Science Applications

Cite this

Caragea, C., Silvescu, A., Kataria, S., Caragea, D., & Mitra, P. (2011). Classifying scientific publications using abstract features. In SARA 2011 - Proceedings of the 9th Symposium on Abstraction, Reformulation, and Approximation (pp. 26-33).

@inproceedings{be4945db525e4df8a01a1e0d54f6d3a9,
title = "Classifying scientific publications using abstract features",
author = "Cornelia Caragea and Adrian Silvescu and Saurabh Kataria and Doina Caragea and Prasenjit Mitra",
year = "2011",
language = "English",
isbn = "9781577355434",
pages = "26--33",
booktitle = "SARA 2011 - Proceedings of the 9th Symposium on Abstraction, Reformulation, and Approximation",

}

TY - GEN

T1 - Classifying scientific publications using abstract features

AU - Caragea, Cornelia

AU - Silvescu, Adrian

AU - Kataria, Saurabh

AU - Caragea, Doina

AU - Mitra, Prasenjit

PY - 2011

Y1 - 2011

UR - http://www.scopus.com/inward/record.url?scp=84890574323&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84890574323&partnerID=8YFLogxK

M3 - Conference contribution

SN - 9781577355434

SP - 26

EP - 33

BT - SARA 2011 - Proceedings of the 9th Symposium on Abstraction, Reformulation, and Approximation

ER -