Part of   Probability of having sense number
speech    1        2        3       4       5       6       7       8        >8
Noun      78.52%   12.73%   4.40%   2.07%   0.98%   0.52%   0.34%   0.17%    <0.1%
Verb      61.01%   19.22%   7.89%   4.12%   2.64%   1.47%   0.98%   0.65%    <0.5%
Adj       80.98%   12.35%   3.96%   1.41%   0.51%   0.25%   0.16%   0.16%    <0.05%
Adv       83.84%   11.24%   3.67%   0.61%   0.42%   0.15%   0.03%   0.009%   <0.009%

Table 3: Statistics on SemCor: distribution of senses for nouns, verbs, adjectives and adverbs

words, the noun "race" is determined to have different senses in these two examples.
Reformulated, this principle becomes: if a word, in two different contexts, is paronymically related to the same words, then the word has similar meanings in the given contexts.
The translation into WordNet terminology cannot be done for all parts of speech, as there are no WordNet relations between verbs and nouns; therefore, paronymic relations of the form act - actor cannot be extracted. This principle can be used for adjectives and adverbs, where a pertainymy relation is defined. The following rule is derived:
Rule SP3 If S1 and S2 are two synsets representing two senses of a given word, and if S1 and S2 have the same pertainym, then S1 and S2 can be collapsed together into one single synset S12.
Example: Senses #1 and #5 for the adverb lightly are:
S1 = {lightly} (without good reason)
    pertainym {light#5} (psychologically light)
S5 = {lightly} (with indifference or without dejection)
    pertainym {light#5} (psychologically light)
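As an illustration of how Rule SP3 could be checked mechanically, the following sketch uses NLTK's WordNet interface. This is not the authors' tooling, and NLTK ships a newer WordNet version than the paper used, so sense numbers and the resulting groupings may differ:

```python
from collections import defaultdict
from nltk.corpus import wordnet as wn

def sp3_candidates(word, pos=wn.ADV):
    """Group senses of `word` by shared pertainym lemmas (Rule SP3)."""
    by_pertainym = defaultdict(list)
    for syn in wn.synsets(word, pos=pos):
        for lemma in syn.lemmas():
            if lemma.name() != word:
                continue  # only consider pertainyms of the word itself
            for pert in lemma.pertainyms():
                by_pertainym[pert].append(syn)
    # senses sharing a pertainym are candidates for collapsing into one synset
    return {p: syns for p, syns in by_pertainym.items() if len(syns) > 1}

# e.g. senses of the adverb "lightly" that point to the same adjective
print(sp3_candidates('lightly'))
```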
4 Probabilistic principles
Besides the principles presented in the previous section, the polysemy of WordNet can be reduced based on the frequency of senses and the probability of having particular synsets used in a text. By dropping synsets with a very low probability of occurrence, we can reduce the number of senses a word might have.

We need (1) a distribution of sense frequencies for the different parts of speech and (2) a method of deriving the probability of a synset occurring in a text, starting from the probabilities of its component words.
To determine the distribution of word sense frequencies, we again used SemCor, as the only available corpus in which all words are sense tagged using WordNet. Table 3 shows the sense distributions for nouns, verbs, adjectives and adverbs.
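A rough way to reproduce such a distribution with current tools is to walk NLTK's SemCor reader and map each sense-tagged lemma back to its sense number, i.e. its rank in wn.synsets. This is a sketch, not the authors' procedure; NLTK's SemCor is mapped to WordNet 3.0, so the counts will not match Table 3 exactly:

```python
from collections import Counter
from nltk.corpus import semcor, wordnet as wn
from nltk.corpus.reader.wordnet import Lemma

counts = {'n': Counter(), 'v': Counter(), 'a': Counter(), 'r': Counter()}
for sent in semcor.tagged_sents(tag='sem'):
    for chunk in sent:
        label = chunk.label() if hasattr(chunk, 'label') else None
        if not isinstance(label, Lemma):
            continue  # skip untagged chunks and named entities
        syn = label.synset()
        pos = 'a' if syn.pos() == 's' else syn.pos()  # fold satellite adjectives
        senses = wn.synsets(label.name(), pos=pos)
        if syn in senses:
            counts[pos][senses.index(syn) + 1] += 1  # sense number = rank + 1

for pos, dist in counts.items():
    total = sum(dist.values())
    print(pos, [(k, round(100 * v / total, 2)) for k, v in sorted(dist.items())[:8]])
```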
Let us denote a synset with $S = \{W_{i_1}, W_{i_2}, \ldots, W_{i_n}\}$, meaning that the synset is composed of words having senses $i_1, i_2, \ldots, i_n$. If we denote with $P_{i_k}$ the probability of occurrence of a word having sense $i_k$, then the probability of occurrence $P_S$ for the synset $S$ equals the sum of the probabilities of occurrence of the component words, i.e. $P_S = \sum_{k=1}^{n} P_{i_k}$.
In order to reduce the granularity of WordNet without introducing too much ambiguity, we use this formula together with probabilities derived from SemCor, and drop those synsets with a probability of occurrence $P_S$ smaller than a given threshold. The following rule is derived:
Rule PP1 If $S = \{W_{i_1}, W_{i_2}, \ldots, W_{i_n}\}$ is a synset with probability of occurrence $P_S = \sum_{k=1}^{n} P_{i_k} < Max_P$, then $S$ can be considered a very rarely occurring synset and it can be dropped.
Example: The noun synset S = {draft#11, draught#5, drawing#6} (the act of moving a load by drawing or pulling) has the probability of occurrence $P_S = P_{11} + P_5 + P_6 \approx 0.1\% + 0.98\% + 0.52\% = 1.6\%$. For $Max_P$ set to 2.0%, this synset can be dropped.
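A minimal sketch of Rule PP1, hard-coding the noun row of Table 3 and treating the open-ended >8 bucket as 0.1% (an assumption; the table only gives an upper bound for it):

```python
# Noun sense-number probabilities (in %) from Table 3
NOUN_SENSE_PROB = {1: 78.52, 2: 12.73, 3: 4.40, 4: 2.07,
                   5: 0.98, 6: 0.52, 7: 0.34, 8: 0.17}
TAIL_PROB = 0.1  # senses beyond 8 are only bounded (<0.1%); treated as 0.1 here

def synset_probability(sense_numbers):
    """P_S = sum of P_ik over the sense numbers of the component words."""
    return sum(NOUN_SENSE_PROB.get(k, TAIL_PROB) for k in sense_numbers)

MAX_P = 2.0  # threshold in %, as in the example above

# {draft#11, draught#5, drawing#6}: P_S ~ 0.1 + 0.98 + 0.52 = 1.6%
p_s = synset_probability([11, 5, 6])
if p_s < MAX_P:
    print(f"P_S = {p_s:.2f}% < {MAX_P}% -> drop this synset (Rule PP1)")
```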
Note that this way of computing the probability of a synset does not make reference to the component words themselves, but to their senses; thus we do not have to deal with the data sparseness problem that would result from the limited size of the corpus.
5 Applying the principles on WordNet
We applied these semantic and probabilistic principles on WordNet and generated two new versions, called EZ.WordNet.1 and EZ.WordNet.2.

The semantic principles resulted in collapsed synsets, while the probabilistic principles determined which synsets can be dropped. By applying these rules, we obtain a reduction in the number of synsets, and implicitly a reduction in the number of word senses.
There are two parameters used by the reduction rules: K, the minimum number of common synonyms between two synsets, as required by Rule SP1.3, and $Max_P$, the maximum probability threshold for Rule PP1. Depending on the values selected for these parameters, one can obtain sense inventories closer to the original WordNet, but with a smaller reduction in polysemy, or versions of WordNet with a higher reduction in polysemy but with more synsets modified with respect to WordNet.