# Combining Hashing and Abstraction in Sparse High Dimensional Feature Spaces Page: 3

This
**paper**
is part of the collection entitled:
UNT Scholarly Works and
was provided to Digital Library
by the UNT College of Engineering.

#### Extracted Text

The following text was automatically extracted from the image on this page using optical character recognition software:

same hash key (Shi et al. 2009).

Feature Abstraction. Information Bottleneck (IB) and its

variants (Tishby, Pereira, and Bialek 1999; Slonim and

Tishby 1999; Pereira, Tishby, and Lee 1993) are distribu-

tional clustering based approaches, designed to find a com-

pression of a variable X while preserving as much infor-

mation as possible about a target variable Y. Baker and Mc-

Callum (1998) applied distributional clustering to reduce the

dimensionality of the feature space for document classifica-

tion tasks. Silvescu et al. (2009) applied agglomerative IB

(Slonim and Tishby 1999) to simplify the data representation

on biological sequence classification tasks. Other feature ab-

straction methods include: automatic construction of word

taxonomies from text data and their usage to text classifi-

cation (Kang et al. 2005); compression of conditional prob-

ability tables in Bayesian networks using abstraction-based

search (desJardins, Getoor, and Koller 2000).

In contrast to the approaches discussed above, we present

a hybrid approach that combines hashing (Shi et al. 2009)

and abstraction (Silvescu, Caragea, and Honavar 2009) to

exploit their strengths, and address and minimize their lim-

itations. As feature selection is yet another accurate and ef-

ficient dimensionality reduction method that can be com-

bined with feature hashing, we compare our approach with

the combination of hashing and feature selection.

Methods

The "bag of words" and n-gram approaches construct a vo-

cabulary of size d, which contains all words or n-grams in

a collection of documents. A document is represented as a

vector x with as many entries as the words or n-grams in the

vocabulary, where an entry k in x can record the frequency

(in the document) of the kth word or n-gram in the vocabu-

lary, denoted by xk. Because only a small number of words

(compared to the vocabulary size) occurs in a document, the

representation of x is very sparse, i.e., only a small number

of entries of x is non-zero. However, storing the parame-

ter vectors in the original input space requires 0(d) num-

bers, which can become difficult given today's very large

collections of documents. The combination of hashing and

abstraction helps reduce the size of the parameter vectors.

Feature Hashing

Feature hashing (Shi et al. 2009; Weinberger et al. 2009;

Forman and Kirshenbaum 2008; Langford, Li, and Strehl

2007) is a dimensionality reduction technique, in which

high-dimensional input vectors x of size d are hashed into

lower-dimensional feature vectors xh of size b. Let S denote

the set of all possible strings and h and ( be two hash func-

tions, such that h : S - {0, , b- 1} and : S - {+1},

respectively. Each token in a document is directly mapped,

using h', into a hash key, which represents the index of the

token in the feature vector Xh, such that the hash key is a

number between 0 and b - 1. Each index in xh stores the

1Note that h can be any hash function, e.g. hashCode () of

the Java String class, or murmurHash function available online

at http://sites.google.com/site/murmurhash/.value ("frequency counts") of the corresponding hash fea-

ture. The hash function ( indicates whether to increment or

decrement the hash dimension of the token, which renders

the hash feature vector xh to be unbiased (see (Weinberger

et al. 2009) for more details).

Thus, an entry i in xh records the "frequency counts" of

tokens that are hashed together, at random, into the same

hash key i. That is, x = Ek:h(k)=i (k)xk, for k

0, . - , d - 1 and i = 0, - - . , b - 1. Note that in the case

of ~ 1, x represents the actual frequency counts.

As can be seen, multiple tokens can be mapped, through

h, into the same hash key. According to the birthday para-

dox, if there are at least b features, then collisions are

likely to happen (Shi et al. 2009), and hence, useful infor-

mation necessary for high accuracy classification could be

lost through feature hashing. However, words in a document

collection typically follow a Zipf distribution, i.e., only very

few words occur with high frequency, whereas the majority

of them occur very rarely. Because hash collisions are in-

dependent of word frequencies, most collisions are likely to

happen between infrequent words. Weinberger et al. (2009)

have proven that, for a feature vector x such that x2 = 1,

the length of x is preserved with high probability, for a suf-

ficiently large dimension (or hash size) b and a sufficiently

small magnitude of x, i.e., Ixll (lower and upper bounds

are theoretically derived (Weinberger et al. 2009)).

However, for many practical applications, the value of

b can be smaller than the theoretical lower bound. This

may be problematic as the smaller the size b of the hash

vector xh becomes, the more collisions occur in the data.

Even a single collision of very high frequency words, with

different class distributions, can result in significant loss

of information. In order to avoid such "bad" collisions,

we propose to combine feature hashing (Shi et al. 2009;

Weinberger et al. 2009) with feature abstraction (Silvescu,

Caragea, and Honavar 2009). Specifically, we first analyze

what is the dimension b at which the performance of clas-

sifiers that use hash features starts degrading due to "bad"

hash collisions. We propose to use feature hashing to reduce

very high dimensional input vectors (e.g., 226) into mid-size

dimensional hash vectors (e.g., 216), before the performance

starts degrading, and further reduce the dimensionality of

the hash vectors to smaller dimensions (e.g., to 210) using

feature abstraction that groups together hash features with

"similar" class distributions. Next, we present the combina-

tion of feature hashing with feature abstraction.

Feature Abstraction over Hash Vectors

Feature abstraction effectively reduces a classifier input size

by clustering similar features into an abstraction hierarchy.

Let 7- denote a set of hash features of size b, and rn de-

note the dimension of the desired reduced space for abstrac-

tion (m < b). An abstraction hierarchy (AH) T over 7-1 is

defined as a rooted tree such that the leaf nodes correspond

to the hash features hi in 7-1 and the internal nodes corre-

spond to abstractions or clusters of hash features. An m-cut

7Ym through T is a set of m nodes, which forms a partition

of 7-, that isy,7m {al : t, - - -, am : -m}, where aj de-

## Upcoming Pages

Here’s what’s next.

## Search Inside

This paper can be searched. **Note: **Results may vary based on the legibility of text within the document.

## Tools / Downloads

Get a copy of this page or view the extracted text.

## Citing and Sharing

Basic information for referencing this web page. We also provide extended guidance on usage rights, references, copying or embedding.

### Reference the current page of this Paper.

Caragea, Cornelia; Silvescu, Adrian & Mitra, Prasenjit. Combining Hashing and Abstraction in Sparse High Dimensional Feature Spaces, paper, 2012; [Palo Alto, California]. (digital.library.unt.edu/ark:/67531/metadc181674/m1/3/: accessed September 22, 2018), University of North Texas Libraries, Digital Library, digital.library.unt.edu; crediting UNT College of Engineering.