Using Encyclopedic Knowledge for Automatic Topic Identification
Kino Coursey
University of North Texas and Daxtron Laboratories, Inc.
Rada Mihalcea and William Moen
University of North Texas
Abstract
This paper presents a method for automatic topic identification using an encyclopedic graph derived from Wikipedia. The system is found to exceed the performance of previously proposed machine learning algorithms for topic identification, with an annotation consistency comparable to human annotation.
1 Introduction
With exponentially increasing amounts of text being generated, it is important to find methods that can annotate and organize documents in meaningful ways. In addition to the content of the document itself, other relevant information about a document, such as related topics, can often enable a faster and more effective search or classification. Document topics have long been used by librarians to improve the retrieval of a document, and to provide background or associated information for browsing by human users. They can also support search, background information gathering, contextualization tasks, and enhanced relevancy measures.
The goal of the work described in this paper is to automatically find topics that are relevant to an input document. We refer to this task as "topic identification" (Medelyan and Witten, 2008). For instance, starting with a document on "United States in the Cold War," we want to identify relevant topics such as "history," "Global Conflicts," "Soviet Union," and so forth. We propose an unsupervised method for topic identification, based on a biased graph centrality algorithm applied to a large knowledge graph built from Wikipedia.
The task of topic identification goes beyond keyword extraction (Mihalcea and Csomai, 2007), since relevant topics may not necessarily be mentioned in the document, and instead have to be obtained from repositories of external knowledge. The task is also different from text classification (Gabrilovich and Markovitch, 2006), since the topics are either not known in advance or are provided in the form of a controlled vocabulary with thousands of entries, and thus no classification can be performed. Instead, with topic identification, we aim to find topics (or categories¹) that are relevant to the document at hand, which can be used to enrich the content of the document with relevant external knowledge.
2 Dynamic Ranking of Topic Relevance
Our method is based on the premise that external encyclopedic knowledge can be used to identify relevant topics for a given document.
The method consists of two main steps. In the first step, we build a knowledge graph of encyclopedic concepts based on Wikipedia, where the nodes in the graph are represented by the entities and categories that are defined in this encyclopedia. The edges between the nodes are represented by their relation of proximity inside the Wikipedia articles. The graph is built once and then stored offline, so that it can be efficiently used for the identification of topics in new documents.
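The graph-building step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the article format (the keys `title`, `mentions`, `categories`), the `window` parameter used to approximate "proximity," the `Category:` prefix, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy stand-ins for parsed Wikipedia articles (hypothetical format).
ARTICLES = [
    {"title": "Cold War",
     "mentions": ["United States", "Soviet Union", "NATO"],
     "categories": ["Global conflicts", "History"]},
    {"title": "Soviet Union",
     "mentions": ["Cold War", "Moscow"],
     "categories": ["History"]},
]


def build_graph(articles, window=1):
    """Return an undirected adjacency map over entities and categories.

    Assumption: two entities are "in proximity" when they are mentioned
    within `window` positions of each other in the same article.
    """
    graph = defaultdict(set)

    def connect(a, b):
        if a != b:
            graph[a].add(b)
            graph[b].add(a)

    for article in articles:
        title = article["title"]
        graph[title]  # make sure the node exists even if it has no edges
        # Link the article's entity to each of its categories.
        for category in article["categories"]:
            connect(title, "Category:" + category)
        mentions = article["mentions"]
        for i, mention in enumerate(mentions):
            # Link the article's entity to every entity it mentions.
            connect(title, mention)
            # Link mentions that occur close to each other.
            for other in mentions[i + 1 : i + 1 + window]:
                connect(mention, other)
    return {node: sorted(neighbours) for node, neighbours in graph.items()}


KNOWLEDGE_GRAPH = build_graph(ARTICLES, window=1)
```

Because the graph is built once and reused, the returned dictionary could be serialized (for example with `json.dump`) and loaded when new documents arrive.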
In the second step, for each input document, we first identify the important encyclopedic concepts in the text, and thus create links between the content of the document and the external encyclopedic graph. Next, we run a biased graph centrality algorithm on the entire graph, so that all the nodes in the external knowledge repository are ranked based on their relevance to the input document. We use a variation
¹ Throughout the paper, we use the terms "topic" and "category" interchangeably.
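The ranking step described in this section can be sketched with personalized PageRank, one well-known biased centrality algorithm. This page does not say which variation the authors use, so the choice of algorithm, the `damping` and `iterations` values, and the toy graph below are assumptions for illustration only. The seed nodes stand for the encyclopedic concepts identified in the input document.

```python
# Toy undirected knowledge graph (hypothetical data); every neighbour
# also appears as a key.
TOY_GRAPH = {
    "United States": ["Cold War"],
    "Soviet Union": ["Cold War", "Category:History"],
    "Cold War": ["United States", "Soviet Union", "Category:Global conflicts"],
    "Category:Global conflicts": ["Cold War"],
    "Category:History": ["Soviet Union"],
    "Baking": ["Category:Cooking"],
    "Category:Cooking": ["Baking"],
}


def biased_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Rank all nodes, biasing the random restart toward the seed nodes."""
    seeds = [s for s in seeds if s in graph]
    if not seeds:
        raise ValueError("no seed concept is present in the graph")
    # Restart distribution: uniform over the seeds, zero elsewhere.
    bias = {node: 0.0 for node in graph}
    for seed in seeds:
        bias[seed] += 1.0 / len(seeds)
    rank = {node: 1.0 / len(graph) for node in graph}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) * bias[node] for node in graph}
        for node, neighbours in graph.items():
            # A node without neighbours hands its score back to the seeds.
            targets = neighbours if neighbours else seeds
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Concepts found in a document on "United States in the Cold War".
RANK = biased_pagerank(TOY_GRAPH, ["United States", "Soviet Union"])
# Candidate topics are the category nodes, most relevant first.
TOPICS = sorted(
    (node for node in RANK if node.startswith("Category:")),
    key=RANK.get,
    reverse=True,
)
```

In this toy graph the cooking category has no path to the seed concepts, so its score decays toward zero and it ends up last in `TOPICS`, while categories connected to the document's concepts rank above it.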
Coursey, Kino High; Mihalcea, Rada, 1974- & Moen, William E. Using Encyclopedic Knowledge for Automatic Topic Identification, paper, May 2009; [Stroudsburg, Pennsylvania]. (digital.library.unt.edu/ark:/67531/metadc31022/m1/1/: accessed October 20, 2017), University of North Texas Libraries, Digital Library, digital.library.unt.edu; crediting UNT College of Engineering.