Combining Hashing and Abstraction in Sparse High Dimensional Feature Spaces Page: 1
The following text was automatically extracted from the image on this page using optical character recognition software:
Combining Hashing and Abstraction in Sparse High Dimensional Feature Spaces
The Pennsylvania State University
College Park, PA 16801, USA
Washington, D.C., USA
The Pennsylvania State University
College Park, PA 16801, USA
With the exponential increase in the number of docu-
ments available online, e.g., news articles, weblogs, sci-
entific documents, the development of effective and effi-
cient classification methods is needed. The performance
of document classifiers critically depends, among other
things, on the choice of the feature representation. The
commonly used "bag of words" and n-gram representa-
tions can result in prohibitively high dimensional input
spaces. Data mining algorithms applied to these input
spaces may be intractable due to the large number of
dimensions. Thus, dimensionality reduction algorithms
that can process data into features fast at runtime, ide-
ally in constant time per feature, are greatly needed in
high throughput applications, where the number of fea-
tures and data points can be in the order of millions. One
promising line of research to dimensionality reduction
is feature clustering. We propose to combine two types
of feature clustering, namely hashing and abstraction
based on hierarchical agglomerative clustering, in order
to take advantage of the strengths of both techniques.
Experimental results on two text data sets show that the
combined approach uses significantly smaller number
of features and gives similar performance when com-
pared with the "bag of words" and n-gram approaches.
Recent World Wide Web advances have resulted in large
amounts of online text data such as news articles, blogs, and
scientific documents. The proliferation of such data poses
several challenges to the data mining community. In particu-
lar, these data require effective and efficient methods for text
classification, ranking, organization, indexing, and summa-
rization. The "bag of words" and n-gram feature represen-
tations, commonly used for text classification, usually result
in prohibitively high dimensional input spaces. Data mining
algorithms applied to these input spaces may be intractable
due to the large number of dimensions. Hence, using dimen-
sionality reduction techniques can be crucial for the perfor-
mance and the complexity of the learning algorithms.
The main idea behind dimensionality reduction is to
project the original high-dimensional data into a lower-
dimensional space, in which patterns in the data can be
Copyright 2012, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
more easily identified. Models such as Principal Component
Analysis (PCA) (Jolliffe 1986), Latent Dirichlet Allocation
(LDA) (Blei, Ng, and Jordan 2003), Probabilistic Latent Se-
mantic Analysis (PLSA) (Hofmann 1999), Latent Semantic
Indexing (LSI) (Deerwester et al. 1990) are widely used to
perform dimensionality reduction for text classification.
Unfortunately, for very high-dimensional data, with hun-
dreds of thousands of dimensions, processing data instances
into feature vectors at runtime, using these models, is com-
putationally very expensive. For example, given a learned
LDA model M and a new instance x, representing x as a
feature vector requires inference of unknown hidden topics
0, i.e., the posterior probability p(0x, M) needs to be esti-
mated. Because the number of new instances can be in the
order of millions in high throughput data mining applica-
tions such as online text classification, ranking, ad selection,
these approaches are not necessarily very efficient. There-
fore, dimensionality reduction algorithms that can process
data into features fast at runtime, ideally in constant time
per feature, are greatly needed.
One promising line of research to dimensionality reduc-
tion is feature clustering. Two types of feature clustering are
by bottom-up hierarchical agglomerative clustering (Duda,
Hart, and Stork 2001) (e.g., using the Jensen-Shannon diver-
gence (Lin 1991)), and by hashing (or random clustering).
Feature clustering by hierarchical agglomerative cluster-
ing finds clusters of "similar" features, called abstractions,
where the feature similarity is measured by distance met-
rics such as the Jensen-Shannon divergence between the
probability distributions of the class variable given the fea-
tures (Baker and McCallum 1998; Silvescu, Caragea, and
Honavar 2009; Slonim and Tishby 1999). High-dimensional
feature vectors are then compressed by abstracting out the
differences between features in the same cluster and adding
up their counts. We will refer to this as feature abstraction.
Feature abstraction effectively reduces the number of fea-
tures by one to three orders of magnitude without sacrific-
ing classification performance (Baker and McCallum 1998;
Silvescu, Caragea, and Honavar 2009), while maintaing a
constant processing time per feature at runtime (i.e., evalu-
ating abstractions needs only an array entry look-up). How-
ever, one of the main shortcomings of feature abstraction is
that, for very high dimensional spaces, e.g., d = 226, it re-
quires O(d) space to store the array that specifies the mem-
Here’s what’s next.
This paper can be searched. Note: Results may vary based on the legibility of text within the document.
Tools / Downloads
Get a copy of this page or view the extracted text.
Citing and Sharing
Basic information for referencing this web page. We also provide extended guidance on usage rights, references, copying or embedding.
Reference the current page of this Paper.
Caragea, Cornelia; Silvescu, Adrian & Mitra, Prasenjit. Combining Hashing and Abstraction in Sparse High Dimensional Feature Spaces, paper, 2012; [Palo Alto, California]. (digital.library.unt.edu/ark:/67531/metadc181674/m1/1/: accessed November 18, 2018), University of North Texas Libraries, Digital Library, digital.library.unt.edu; crediting UNT College of Engineering.