PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents Page: 1,105
This article is part of the collection entitled: UNT Scholarly Works and was provided to UNT Digital Library by the UNT College of Engineering.
PositionRank: An Unsupervised Approach to Keyphrase Extraction
from Scholarly Documents
Corina Florescu and Cornelia Caragea
Computer Science and Engineering
University of North Texas, USA
CorinaFlorescu@my.unt.edu, ccaragea@unt.edu
Abstract
The large and growing amounts of online scholarly data present both
challenges and opportunities to enhance knowledge discovery. One such
challenge is to automatically extract a small set of keyphrases from a
document that can accurately describe the document's content and can
facilitate fast information processing. In this paper, we propose
PositionRank, an unsupervised model for keyphrase extraction from
scholarly documents that incorporates information from all positions of a
word's occurrences into a biased PageRank. Our model obtains remarkable
improvements in performance over PageRank models that do not take into
account word positions as well as over strong baselines for this task.
Specifically, on several datasets of research papers, PositionRank
achieves improvements as high as 29.09%.
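
The idea of biasing PageRank toward words that occur early and often can be illustrated with a small sketch. This is not the authors' implementation: the function name `position_biased_pagerank`, the inverse-position weight (each occurrence at position p contributes 1/p), the co-occurrence window, and the damping value 0.85 are all assumptions chosen for illustration, since this page does not specify them.

```python
from collections import defaultdict


def position_biased_pagerank(words, window=2, alpha=0.85, iterations=50):
    """Score words with a PageRank whose restart distribution favors
    words that occur early and often (illustrative assumption)."""
    # Undirected co-occurrence graph: the edge weight counts how often two
    # distinct words fall inside the same window of `window` consecutive tokens.
    weights = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            v = words[j]
            if v != w:
                weights[w][v] += 1.0
                weights[v][w] += 1.0

    # Assumed position weight: sum of 1/position over all occurrences,
    # normalized so the weights form a probability distribution.
    bias = defaultdict(float)
    for pos, w in enumerate(words, start=1):
        bias[w] += 1.0 / pos
    total = sum(bias.values())
    bias = {w: b / total for w, b in bias.items()}

    # Power iteration: with probability alpha follow a weighted edge,
    # otherwise restart according to the position-based bias.
    scores = {w: 1.0 / len(bias) for w in bias}
    out_sum = {w: sum(weights[w].values()) for w in bias}
    for _ in range(iterations):
        new_scores = {}
        for w in bias:
            incoming = sum(
                weights[v][w] / out_sum[v] * scores[v] for v in weights[w]
            )
            new_scores[w] = (1 - alpha) * bias[w] + alpha * incoming
        scores = new_scores
    return scores


tokens = ["graph", "ranking", "model", "graph", "keyphrase", "ranking", "noise"]
for word, score in sorted(position_biased_pagerank(tokens).items(),
                          key=lambda item: -item[1]):
    print(f"{word}: {score:.3f}")
```

In this toy input, "graph" occurs at positions 1 and 4 while "noise" occurs only at position 7, so the restart distribution favors "graph" and it ends up with the higher score. A real system would also filter candidate words (for example by part of speech) and assemble phrases from the ranked words; those steps are omitted here.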
1 Introduction
The current Scholarly Web contains many millions of scientific documents.
For example, Google Scholar is estimated to have more than 100 million
documents. On one hand, these rapidly growing scholarly document
collections offer benefits for knowledge discovery, and on the other hand,
finding useful information has become very challenging. Keyphrases
associated with a document typically provide a high-level topic
description of the document and can allow for efficient information
processing. In addition, keyphrases are shown to be rich sources of
information in many natural language processing and information retrieval
tasks such as scientific paper summarization, classification,
recommendation, clustering, and search (Abu-Jbara and Radev, 2011;
Qazvinian et al., 2010; Jones and Staveley, 1999; Zha, 2002; Zhang et al.,
2004; Hammouda et al., 2005). Due to their importance, many approaches to
keyphrase extraction have been proposed in the literature along two lines
of research: supervised and unsupervised (Hasan and Ng, 2014, 2010).
In the supervised line of research, keyphrase extraction is formulated as
a binary classification problem, where candidate phrases are classified as
either positive (i.e., keyphrases) or negative (i.e., non-keyphrases)
(Frank et al., 1999; Hulth, 2003). Various feature sets and classification
algorithms yield different extraction systems. For example, Frank et al.
(1999) developed a system that extracts two features for each candidate
phrase, i.e., the tf-idf of the phrase and its distance from the beginning
of the target document, and uses them as input to Naive Bayes classifiers.
Although supervised approaches typically perform better than unsupervised
approaches (Kim et al., 2013), the requirement for large human-annotated
corpora for each field of study has led to significant attention towards
the design of unsupervised approaches.
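
As a rough illustration of the two features mentioned above, the sketch below computes a tf-idf value and a relative first-occurrence position for one candidate phrase. The helper name `phrase_features`, the smoothed idf formula, and the normalization by document length are assumptions made for this example; they are not taken from Frank et al. (1999), and the classifier itself is not shown.

```python
import math


def phrase_features(phrase, doc_tokens, corpus):
    """Return (tf-idf, relative first-occurrence position) for one phrase.

    `phrase` and every document are lists of tokens. The formulas are
    illustrative assumptions, not a specific published definition.
    """
    n = len(phrase)

    def occurrences(tokens):
        # Start indices at which the phrase appears as a contiguous run.
        return [i for i in range(len(tokens) - n + 1)
                if tokens[i:i + n] == phrase]

    hits = occurrences(doc_tokens)
    if not hits:
        # Absent phrase: zero tf-idf, position treated as "end of document".
        return 0.0, 1.0

    tf = len(hits) / len(doc_tokens)
    df = sum(1 for d in corpus if occurrences(d))
    idf = math.log((1 + len(corpus)) / (1 + df))  # smoothed idf (assumed)
    first = hits[0] / len(doc_tokens)  # 0.0 means the very first token
    return tf * idf, first


doc = "keyphrase extraction ranks candidate phrases".split()
corpus = [doc, "graph ranking".split(), "naive bayes".split()]
print(phrase_features(["keyphrase", "extraction"], doc, corpus))
print(phrase_features(["candidate", "phrases"], doc, corpus))
```

Each candidate phrase would be turned into such a feature pair and, together with a keyphrase or non-keyphrase label, passed to a classifier of choice.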
In the unsupervised line of research, keyphrase extraction is formulated
as a ranking problem with graph-based ranking techniques being considered
state-of-the-art (Hasan and Ng, 2014). These graph-based techniques
construct a word graph from each target document, such that nodes
correspond to words and edges correspond to word association patterns.
Nodes are then ranked using graph centrality measures such as PageRank
(Mihalcea and Tarau, 2004; Liu et al., 2010) or HITS (Litvak and Last,
2008), and the top ranked phrases are returned as keyphrases. Since their
introduction, many graph-based extensions have been proposed, which aim at
modeling various types of information. For example, Wan and Xiao (2008)
proposed a model that incorporates a local
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1105-1115
Vancouver, Canada, July 30 - August 4, 2017. ©2017 Association for Computational Linguistics
https://doi.org/10.18653/v1/P17-1102
Florescu, Corina & Caragea, Cornelia. PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents, article, August 2017; Stroudsburg, Pennsylvania. (https://digital.library.unt.edu/ark:/67531/metadc990953/m1/1/: accessed March 28, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu; crediting UNT College of Engineering.