Cross-lingual Semantic Relatedness Using Encyclopedic Knowledge Page: 1
The following text was automatically extracted from the image on this page using optical character recognition software:
Cross-lingual Semantic Relatedness Using Encyclopedic Knowledge
Samer Hassan and Rada Mihalcea
Department of Computer Science
University of North Texas
In this paper, we address the task of cross-
lingual semantic relatedness. We intro-
duce a method that relies on the informa-
tion extracted from Wikipedia, by exploit-
ing the interlanguage links available be-
tween Wikipedia versions in multiple lan-
guages. Through experiments performed
on several language pairs, we show that
the method performs well, with a perfor-
mance comparable to monolingual mea-
sures of relatedness.
Given the accelerated growth of the number of
multilingual documents on the Web and else-
where, the need for effective multilingual and
cross-lingual text processing techniques is becom-
ing increasingly important. In this paper, we
address the task of cross-lingual semantic relat-
edness, and introduce a method that relies on
Wikipedia in order to calculate the relatedness of
words across languages. For instance, given the
word factory in English and the word lavoratore
in Italian (En. worker), the method can measure
the relatedness of these two words despite the fact
that they belong to two different languages.
Measures of cross-language relatedness are use-
ful for a large number of applications, including
cross-language information retrieval (Nie et al.,
1999; Monz and Dorr, 2005), cross-language text
classification (Gliozzo and Strapparava, 2006),
lexical choice in machine translation (Och and
Ney, 2000; Bangalore et al., 2007), induction
of translation lexicons (Schafer and Yarowsky,
2002), cross-language annotation and resource
projections to a second language (Riloff et al.,
2002; Hwa et al., 2002; Mohammad et al., 2007).
The method we propose is based on a measure
of closeness between concept vectors automati-
cally built from Wikipedia, which are mapped via
rada@cs unt. edu
the Wikipedia interlanguage links. Unlike previ-
ous methods for cross-language mapping, which
are typically limited by the availability of bilingual
dictionaries or parallel texts, the method proposed
in this paper can be used to measure the related-
ness of word pairs in any of the 250 languages for
which a Wikipedia version exists.
The paper is organized as follows. We first pro-
vide a brief overview of Wikipedia, followed by
a description of the method to build concept vec-
tors based on this encyclopedic resource. We then
show how these concept vectors can be mapped
across languages for a cross-lingual measure of
word relatedness. Through evaluations run on six
language pairs, connecting English, Spanish, Ara-
bic and Romanian, we show that the method is ef-
fective at capturing the cross-lingual relatedness of
words, with results comparable to the monolingual
measures of relatedness.
Wikipedia is a free online encyclopedia, represent-
ing the outcome of a continuous collaborative ef-
fort of a large number of volunteer contributors.
Virtually any Internet user can create or edit a
Wikipedia webpage, and this "freedom of contri-
bution" has a positive impact on both the quantity
(fast-growing number of articles) and the quality
(potential errors are quickly corrected within the
collaborative environment) of this online resource.
The basic entry in Wikipedia is an article (or
page), which defines and describes an entity or
an event, and consists of a hypertext document
with hyperlinks to other pages within or outside
Wikipedia. The role of the hyperlinks is to guide
the reader to pages that provide additional infor-
mation about the entities or events mentioned in
an article. Articles are organized into categories,
which in turn are organized into hierarchies. For
instance, the article automobile is included in the
category vehicle, which in turn has a parent cate-
Here’s what’s next.
This paper can be searched. Note: Results may vary based on the legibility of text within the document.
Tools / Downloads
Get a copy of this page or view the extracted text.
Citing and Sharing
Basic information for referencing this web page. We also provide extended guidance on usage rights, references, copying or embedding.
Reference the current page of this Paper.
Hassan, Samer & Mihalcea, Rada, 1974-. Cross-lingual Semantic Relatedness Using Encyclopedic Knowledge, paper, August 2009; [Stroudsburg, Pennsylvania]. (digital.library.unt.edu/ark:/67531/metadc31012/m1/1/: accessed September 21, 2017), University of North Texas Libraries, Digital Library, digital.library.unt.edu; crediting UNT College of Engineering.