membership of features into abstractions. For very large values
of d, storing the array and performing array entry look-up
operations may become difficult.
Feature hashing (Shi et al. 2009; Weinberger et al. 2009;
Forman and Kirshenbaum 2008; Langford, Li, and Strehl
2007) offers a very inexpensive, yet effective, approach to reducing the number of features. Feature hashing allows random collisions of features into the latent factors. The original high-
dimensional space is "reduced" by hashing the features, us-
ing a hash function, into a lower-dimensional space, i.e.,
mapping features to hash keys, where multiple features can
be mapped (at random) to the same key, and "aggregating"
their counts. Although this method is very effective in reducing the number of features from very high (e.g., 2^26) to mid-size (e.g., 2^16) dimensions, feature hashing can result in significant loss of information, especially when hash collisions occur between highly frequent features, which can have significantly different class distributions.
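To make the mapping concrete, here is a minimal Python sketch of the hashing trick described above. It is our own illustration: the MD5-based bucket function and all names are hypothetical choices, not the hash functions used in the cited work.

import hashlib
from collections import Counter

def bucket(token, m):
    # Stable hash of the token, reduced to one of m buckets.
    h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
    return h % m

def hash_features(tokens, m=2**16):
    # Map each token to its bucket and aggregate its counts; distinct
    # tokens that land in the same bucket collide at random.
    vec = Counter()
    for tok in tokens:
        vec[bucket(tok, m)] += 1
    return vec  # sparse vector: bucket index -> aggregated count

# Example: a very high-dimensional vocabulary is reduced to 2^16 buckets.
print(hash_features("the cat sat on the mat".split()))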
Given the complementary strengths of hashing and ab-
straction, one question that can be raised is the following:
Can one design an effective and efficient approach to reduce
the number of feature dimensions by exploiting the strengths
of these techniques (hashing and abstraction) and overcome
their limitations? The research that we describe in this paper
addresses specifically this question.
Contributions. We present an approach to dimensionality reduction that combines hashing and abstraction to generate accurate and concise models, while maintaining a constant processing time per feature at runtime. More precisely, we propose to first hash the original high-dimensional space to mid-size dimensions (e.g., 2^16 or 2^14), an operation that does not significantly distort the data, and then use abstraction to further reduce the dimensionality with little or no loss in performance. The use of hashing minimizes the space required by abstraction. The use of abstraction enforces collisions between features that have "similar" class distributions, as opposed to allowing random collisions, which could result in significant loss of information.
We empirically show that combining hashing and abstraction: (a) can result in models that use a significantly smaller number of features and achieve similar, or sometimes better, performance compared to the "bag of words" approach; (b) can provide a way to handle the very high dimensionality that results from using n-grams (i.e., sequences of n contiguous words); (c) significantly outperforms the combination of hashing and feature selection (Guyon and Elisseeff 2003).
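As a rough sketch of this two-stage idea (illustrative only: we represent each hashed feature by its empirical class distribution and use k-means as a stand-in for the abstraction procedure; all function names are ours):

import numpy as np
from sklearn.cluster import KMeans

def class_distributions(X, y, n_classes):
    # Row j: empirical class distribution of hashed feature j, i.e.,
    # how feature j's counts spread over the class labels.
    d = X.shape[1]
    counts = np.zeros((d, n_classes))
    for c in range(n_classes):
        counts[:, c] = X[y == c].sum(axis=0)
    counts += 1e-9  # smoothing so empty features normalize safely
    return counts / counts.sum(axis=1, keepdims=True)

def hash_then_abstract(X_hashed, y, n_classes, n_abstractions):
    # Group hashed features with "similar" class distributions so that
    # collisions are informed rather than random (k-means here is a
    # stand-in for the paper's abstraction algorithm).
    dists = class_distributions(X_hashed, y, n_classes)
    labels = KMeans(n_clusters=n_abstractions, n_init=10).fit_predict(dists)
    # Aggregate the counts of all features mapped to each abstraction.
    X_abs = np.zeros((X_hashed.shape[0], n_abstractions))
    for j, a in enumerate(labels):
        X_abs[:, a] += X_hashed[:, j]
    return X_abs, labels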
A variety of approaches to dimensionality reduction have
been studied in the literature as described below.
Feature selection (Guyon and Elisseeff 2003; Yang and Pedersen 1997) reduces the number of features by selecting a subset of the available features based on some chosen criterion. For example, feature selection by average mutual information selects the top words that have the highest average mutual information with the class variable (Yang and Pedersen 1997).
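For illustration, a minimal sketch of such top-k selection using scikit-learn's mutual information estimator (our choice of library and names, not the cited papers'):

import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_top_k(X, y, k):
    # Score each feature by its estimated mutual information with the
    # class variable and keep the k highest-scoring columns.
    mi = mutual_info_classif(X, y, discrete_features=True)
    top = np.argsort(mi)[::-1][:k]
    return X[:, top], top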
Principal Component Analysis (PCA) (Jolliffe 1986) finds
a projection of the original d-dimensional input space into
a lower m-dimensional orthogonal space, which captures
as much of the data variance as possible, by exploiting the
eigen-structure of the data covariance matrix. PCA is com-
putationally very expensive (Golub and van Loan 1983), al-
though less expensive methods for finding only m eigen-
vectors and eigenvalues were developed (see for exam-
ple (Roweis 1998)). Singular Value Decomposition (SVD)
(Berry 1992; Deerwester et al. 1990) is directly applied to
the data matrix. SVD is also computationally expensive, al-
though SVD methods for sparse text data were developed
with smaller computational complexity (Papadimitriou et al.
1998). However, neither PCA nor SVD is efficient at reducing the dimensionality of a large number of new instances.
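A compact sketch of the projection both methods compute, using NumPy's SVD on the centered data matrix (an illustration, not the specific algorithms cited above):

import numpy as np

def pca_reduce(X, m):
    # Center the data and project it onto the top-m right singular
    # vectors of the centered data matrix; this is PCA, since those
    # vectors are the eigenvectors of the data covariance matrix.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:m].T  # n x m representation capturing most variance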
Topic models such as Latent Dirichlet Allocation (LDA)
(Blei, Ng, and Jordan 2003), Probabilistic Latent Seman-
tic Analysis (PLSA) (Hofmann 1999), and Latent Semantic
Indexing (LSI) (Deerwester et al. 1990) are dimensionality
reduction models, designed to uncover hidden topics, i.e.,
clusters of semantically related words that co-occur in the
documents. LSI uses SVD to identify topics, which are then used to represent documents in a low-dimensional "topic" space; hence, it inherits the computational cost of SVD. LDA models each document as a mixture of topics, and each topic as a distribution
over the words in the vocabulary. The topic distribution of
a document can be seen as a low-dimensional "topic" rep-
resentation. However, LDA requires inference at runtime to
estimate the topic distribution.
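For illustration only, scikit-learn's variational LDA (not the original implementations cited above) yields this low-dimensional representation as follows:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "stocks fell as markets slid",
        "the dog chased the cat",
        "traders sold shares at a loss"]
X = CountVectorizer().fit_transform(docs)          # bag-of-words counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)   # per-document topic distribution (n x 2)
# theta is the low-dimensional "topic" representation; mapping a new
# document requires lda.transform(...), i.e., inference at runtime.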
Random projections (Johnson and Lindenstrauss 1984;
Achlioptas 2003; Liberty, Ailon, and Singer 2008; Bingham
and Mannila 2001; Rahimi and Recht 2008) project a high d-dimensional space into a lower m-dimensional space using a random m × d matrix R with unit-length columns. Bingham and Mannila (2001) showed empirically that random projections do not significantly distort the data and, although their results are not as accurate as those of SVD, they are computationally much less expensive.
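A direct sketch of this construction (illustrative: a Gaussian R with columns normalized to unit length; the function name is ours):

import numpy as np

def random_projection(X, m, seed=0):
    # Build a random m x d matrix R with unit-length columns and map
    # each d-dimensional row of X to the m-dimensional vector R @ x.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.standard_normal((m, d))
    R /= np.linalg.norm(R, axis=0)   # normalize each column
    return X @ R.T                   # n x m projected data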
Feature hashing (or random clustering). Shi et al. (2009)
and Weinberger et al. (2009) presented hash kernels to map
high dimensional input spaces into low dimensional spaces
and showed them to be highly accurate for large-scale classification and large-scale multitask learning when the dimension of the target low-dimensional space is sufficiently large. Ganchev and
Dredze (2008) empirically showed that hash features can
produce accurate results on various NLP applications. For-
man and Kirshenbaum (2008) proposed a fast feature ex-
traction approach by combining parsing and hashing for
text classification and indexing. Indyk and Motwani (1998)
and Gionis, Indyk and Motwani (1999) presented Locality-
Sensitive Hashing (LSH), a random hashing/projection tech-
nique for approximate nearest-neighbor query problems.
Objects (e.g., images or text documents) are hashed using
multiple hash functions such that, for each hash function,
collisions are more likely to happen between objects that are close to each other than between objects that are far apart. A
query object is then hashed, using the same hash functions,
and the objects stored in buckets containing the query object
are ranked and retrieved as the nearest neighbors. However,
LSH is computationally more expensive than hash kernels,
which require only adding up the vector coordinates with the