Classifier Stacking and Voting for Text Filtering
Rada MIHALCEA
University of North Texas
Denton, Texas, 76203-1366
rada@cs.unt.edu
Abstract
This paper summarizes the approach and the results of the TextCat system participating in the Filtering track in the Text
Retrieval Conference 2002. The system relies primarily on statistical methods, and was designed with the main purpose of
having a backbone system in which we can further integrate semantic components, and evaluate their relative performance
as compared to traditional statistical approaches. The system is therefore simple, and is based on techniques for keyword
extraction and various classifier combinations, including stacking and voting. TextCat participated in the Batch and
Routing tasks. In the Batch task, it achieved scores of 39.02% normalized utility and 26.37% F-measure, averaged over all topics. The averaged uninterpolated precision for our best routing submission was 14.16%.
1. Introduction
The Filtering track has quite a long history in the Text Retrieval Conference (TREC) series. The goal of the track is to measure the ability of systems to classify new documents as relevant or irrelevant with respect to a given topic. While there are three different tasks organized within the Filtering track - adaptive, batch, and routing - our Text Categorization (TextCat)
system participated only in the last two tasks. A few changes to the system would probably have allowed
us to run TextCat on the adaptive filtering data as well; however, we decided to focus on the classification capabilities of the system, rather than on its adaptability to new incoming data. This is mainly because the purpose for building TextCat was to have a backbone text classification system, in which we can further integrate semantic modules and evaluate their relative performance as compared to the simple statistical approach. This follows up on our previous work in semantic-based Information Retrieval
(Mihalcea, 2002), where various degrees of semantic knowledge were integrated into an existing Information Retrieval system (SMART (Salton and Lesk,
1971)). To extend this work to the text classification problem, we first needed a basic text categorization system, which could then be expanded with more sophisticated modules. Since we were not able to find such a tool (reliable, free for download,
with complete source code), we started building our own text categorization system, which ultimately resulted in the UNT TextCat system.
2. UNT TextCat
As stated in the title, TextCat relies on combinations of simple text classifiers, which include stacking and voting. Starting with a basic ngram-based
classifier and a rule-based classifier, we generate a
range of new classifiers by making simple changes in
the value of their input parameters. First, stacking is done by applying the rule-based classifier to the output produced by the ngram-based classifier. Second, classifier voting is performed using various degrees of inter-classifier agreement. In turn, different voting schemes generate new classifiers.
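As an illustration of the two combination schemes, the following is a minimal sketch; the classifier functions, labels, and agreement thresholds are our own illustrative assumptions, not the paper's actual implementation:

```python
# Minimal sketch of stacking and agreement-based voting over binary
# text classifiers. All names here are illustrative, not from the paper.

def stack(base, meta):
    """Stacking: apply a second (e.g. rule-based) classifier to the
    output of a first (e.g. ngram-based) classifier."""
    def stacked(document):
        first_pass = base(document)        # label from the base classifier
        return meta(document, first_pass)  # meta classifier may revise it
    return stacked

def vote(classifiers, document, min_agreement):
    """Voting: mark a document relevant only when at least
    `min_agreement` of the classifiers say so."""
    votes = sum(1 for clf in classifiers if clf(document) == "relevant")
    return "relevant" if votes >= min_agreement else "irrelevant"
```

Varying `min_agreement` from 1 (any classifier suffices) up to the number of classifiers (unanimity) yields a family of distinct combined classifiers, which is how different voting schemes can generate new classifiers.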
For the Batch task, we had a total of thirteen
stacked classifiers, which were then fed to the voting
scheme, such that ten more combined classifiers were
generated. Out of this total of twenty-three classifiers,
one was chosen according to its performance during
cross validation runs performed on the training data.
This tuning on training data was done separately for
the normalized utility measure and for the F-measure,
resulting in two different submissions, UNTextCatSU
(run optimized for the T11SU measure), and UNTextCatF (run optimized for the T11F measure).
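For reference, the T11SU measure used for this tuning is, to our understanding of the TREC-11 Filtering track definition, a scaled linear utility; the sketch below uses our own variable names:

```python
def t11su(rel_retrieved, nonrel_retrieved, total_relevant, min_nu=-0.5):
    """Normalized linear utility as defined (to our understanding) for
    the TREC-11 Filtering track: the raw utility T11U = 2*R+ - N+ is
    divided by its maximum possible value, clipped below at min_nu,
    and rescaled to [0, 1]."""
    t11u = 2 * rel_retrieved - nonrel_retrieved
    t11nu = t11u / (2 * total_relevant)  # normalize by the maximum utility
    return (max(t11nu, min_nu) - min_nu) / (1 - min_nu)
```

Under this definition, a run that retrieves all relevant and no non-relevant documents scores 1.0, while any run whose raw utility falls at or below the clipping floor scores 0.0.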
For the Routing task, we used a single combination of the thirteen stacked classifiers (run UNTextCatR), and a combination of the thirteen ngram-
Mihalcea, Rada, 1974-. Classifier Stacking and Voting for Text Filtering, paper, November 2002; [Gaithersburg, Maryland]. (https://digital.library.unt.edu/ark:/67531/metadc30942/m1/1/: accessed April 23, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu; crediting UNT College of Engineering.