Protein Sequence Classification Using Feature Hashing Metadata



  • Main Title: Protein Sequence Classification Using Feature Hashing


  • Author: Caragea, Cornelia
    Creator Type: Personal
    Creator Info: University of North Texas
  • Author: Silvescu, Adrian
    Creator Type: Personal
    Creator Info: Naviance, Inc.
  • Author: Mitra, Prasenjit
    Creator Type: Personal
    Creator Info: Pennsylvania State University


  • Name: BioMed Central Ltd.
    Place of Publication: [London, United Kingdom]


  • Creation: 2012-06-21


  • Language: English


  • Physical Description: 8 p.: ill.
  • Content Description: Article on protein sequence classification using feature hashing.


  • Keyword: feature hashing
  • Keyword: variable length k-grams
  • Keyword: dimensionality reduction


  • Journal: Proteome Science, 2012, London: BioMed Central Ltd.


  • Publication Title: Proteome Science
  • Volume: 10
  • Issue: Suppl 1
  • Pages: 8
  • Peer Reviewed: True


  • Name: UNT Scholarly Works
    Code: UNTSW


  • Name: UNT College of Engineering
    Code: UNTCOE


  • Rights Access: public

Resource Type

  • Article


  • Text


  • DOI: 10.1186/1477-5956-10-S1-S14
  • Archival Resource Key: ark:/67531/metadc181699


  • Academic Department: Computer Science and Engineering


  • Display Note: This article is part of the supplement: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2011: Proteome Science.
  • Display Note: Abstract: Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high dimensional input spaces, for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of a recently introduced feature hashing technique to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features, using a hash function, into a lower-dimensional space, i.e., mapping features to hash keys, where multiple features can be mapped (at random) to the same hash key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" and feature selection approaches. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
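
The hashing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (kgrams, bucket, hashed_counts), the choice of MD5 as the hash function, the fixed k, and the example sequence are all assumptions made for illustration. Each k-gram is mapped to one of n_buckets hash keys, and the counts of k-grams that collide on the same key are summed.

```python
import hashlib


def kgrams(sequence, k):
    """Return all overlapping substrings of length k (hypothetical helper)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bucket(feature, n_buckets):
    """Map a feature string to a hash key in [0, n_buckets).

    MD5 is an arbitrary choice made here so that the mapping is stable
    across runs; the record does not say which hash function the paper uses.
    """
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hashed_counts(sequence, k, n_buckets):
    """Build a fixed-length count vector by hashing each k-gram.

    k-grams that collide on the same hash key have their counts aggregated,
    so the vector length is n_buckets regardless of how many distinct
    k-grams exist.
    """
    vector = [0] * n_buckets
    for gram in kgrams(sequence, k):
        vector[bucket(gram, n_buckets)] += 1
    return vector


if __name__ == "__main__":
    # Illustrative amino-acid string, not data from the article.
    example = "MKVLAAGIVGLLLA"
    print(hashed_counts(example, k=3, n_buckets=8))
```

With 20 amino acids there are up to 20^k distinct k-grams, so the unhashed "bag of k-grams" vector grows exponentially in k, while the hashed vector keeps whatever length is chosen for n_buckets. The cost is that colliding k-grams become indistinguishable to the classifier.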