Protein sequence classification using feature hashing

Cornelia Caragea, Adrian Silvescu, Prasenjit Mitra

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

21 Citations (Scopus)

Abstract

Recent advances in next-generation sequencing technologies have resulted in an exponential increase in protein sequence data. The k-gram representation used for protein sequence classification usually results in prohibitively high-dimensional input spaces for large values of k. Applying data mining algorithms to these input spaces may be intractable because of the large number of dimensions. Hence, dimensionality reduction techniques can be crucial for both the performance and the complexity of the learning algorithms. We study the applicability of feature hashing to protein sequence classification: the original high-dimensional space is reduced by mapping features to hash keys, so that multiple features can be mapped (at random) to the same key, and the counts of features sharing a key are aggregated. We compare feature hashing with the bag of k-grams and with feature selection approaches. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
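The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the fixed k (the paper's keywords mention variable length k-grams), the MD5-based hash, the bucket count, the toy sequence, and all function names (`kgrams`, `bag_of_kgrams`, `hash_key`, `hashed_features`) are assumptions made for illustration only. The sketch counts the k-grams of a sequence, then maps each k-gram to one of a fixed number of hash keys and sums the counts of k-grams that share a key.

```python
import hashlib
from collections import Counter


def kgrams(sequence, k):
    """Return all overlapping k-grams (substrings of length k) of a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_kgrams(sequence, k):
    """Sparse bag-of-k-grams representation: k-gram -> count."""
    return Counter(kgrams(sequence, k))


def hash_key(feature, num_buckets):
    """Map a feature string to a bucket index.

    MD5 is an assumption chosen only because it is stable across runs;
    the abstract does not say which hash function the paper uses.
    """
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hashed_features(sequence, k, num_buckets):
    """Hashed representation: counts of k-grams sharing a key are summed."""
    vector = [0] * num_buckets
    for gram, count in bag_of_kgrams(sequence, k).items():
        vector[hash_key(gram, num_buckets)] += count
    return vector


if __name__ == "__main__":
    seq = "MKVLAAGIVGLLLAQ"  # made-up toy sequence, not real data
    k = 3
    # Over the 20 standard amino acids there are 20**k possible k-grams,
    # which is why the unhashed input space grows so quickly with k.
    print("possible k-grams for k = 3:", 20 ** k)
    vec = hashed_features(seq, k, num_buckets=16)
    print("hashed vector length:", len(vec))
    print("total count preserved:", sum(vec) == len(seq) - k + 1)
```

Because colliding k-grams simply add their counts, the length of the hashed vector is set by the bucket count rather than by the number of distinct k-grams, while the total count is unchanged.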

Original language: English
Title of host publication: Proceedings - 2011 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2011
Pages: 538-543
Number of pages: 6
DOI: https://doi.org/10.1109/BIBM.2011.91
Publication status: Published - 2011
Externally published: Yes
Event: 2011 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2011 - Atlanta, GA, United States
Duration: 12 Nov 2011 to 15 Nov 2011

Fingerprint

Proteins
Data Mining
Learning algorithms
Feature extraction
Learning
Technology

Keywords

  • dimensionality reduction
  • feature hashing
  • variable length k-grams

ASJC Scopus subject areas

  • Biomedical Engineering
  • Health Informatics
  • Health Information Management

Cite this

Caragea, C., Silvescu, A., & Mitra, P. (2011). Protein sequence classification using feature hashing. In Proceedings - 2011 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2011 (pp. 538-543). [6120498] https://doi.org/10.1109/BIBM.2011.91

@inproceedings{ccb2bc82f55143b6843ab5bc09b58d02,
title = "Protein sequence classification using feature hashing",
abstract = "Recent advances in next-generation sequencing technologies have resulted in an exponential increase in protein sequence data. The k-gram representation, used for protein sequence classification, usually results in prohibitively high dimensional input spaces, for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. We study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is reduced by mapping features to hash keys, such that multiple features can be mapped (at random) to the same key, and aggregating their counts. We compare feature hashing with the bag of k-grams and feature selection approaches. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.",
keywords = "dimensionality reduction, feature hashing, variable length k-grams",
author = "Cornelia Caragea and Adrian Silvescu and Prasenjit Mitra",
year = "2011",
doi = "10.1109/BIBM.2011.91",
language = "English",
isbn = "9780769545745",
pages = "538--543",
booktitle = "Proceedings - 2011 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2011",

}

TY - GEN

T1 - Protein sequence classification using feature hashing

AU - Caragea, Cornelia

AU - Silvescu, Adrian

AU - Mitra, Prasenjit

PY - 2011

Y1 - 2011

AB - Recent advances in next-generation sequencing technologies have resulted in an exponential increase in protein sequence data. The k-gram representation, used for protein sequence classification, usually results in prohibitively high dimensional input spaces, for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. We study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is reduced by mapping features to hash keys, such that multiple features can be mapped (at random) to the same key, and aggregating their counts. We compare feature hashing with the bag of k-grams and feature selection approaches. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.

KW - dimensionality reduction

KW - feature hashing

KW - variable length k-grams

UR - http://www.scopus.com/inward/record.url?scp=84856044057&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84856044057&partnerID=8YFLogxK

U2 - 10.1109/BIBM.2011.91

DO - 10.1109/BIBM.2011.91

M3 - Conference contribution

SN - 9780769545745

SP - 538

EP - 543

BT - Proceedings - 2011 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2011

ER -