Corpus-based and Knowledge-based Measures of Text Semantic Similarity

Date: July 2006
Creator: Mihalcea, Rada, 1974-; Corley, Courtney & Strapparava, Carlo, 1962-
Description: Abstract: This paper presents a method for measuring the semantic similarity of texts, using corpus-based and knowledge-based measures of similarity. Previous work on this problem has focused mainly on either large documents (e.g. text classification, information retrieval) or individual words (e.g. synonymy tests). Given that a large fraction of the information available today, on the Web and elsewhere, consists of short text snippets (e.g. abstracts of scientific documents, image captions, product descriptions), the authors focus on measuring the semantic similarity of short texts. Through experiments performed on a paraphrase data set, the authors show that the semantic similarity method outperforms methods based on simple lexical matching, yielding an error rate reduction of up to 13% with respect to the traditional vector-based similarity metric.
Contributing Partner: UNT College of Engineering
Building Multilingual Semantic Networks with Non-Expert Contributions over the Web

Date: November 2003
Creator: Ayewah, Nathanial; Mihalcea, Rada, 1974- & Nastase, Vivi
Description: This paper discusses building multilingual semantic networks. Abstract: We present a system that allows non-expert Web users to contribute towards building a multilingual lexical resource. Our study focuses on the Romanian-English language pair, and the target resource is a Romanian WordNet strongly connected to the English WordNet. We use a bilingual dictionary, a monolingual definition dictionary, and documents on the Web to build synsets, attach a gloss to each, and provide some examples. The results of the semi-automatic acquisition system are evaluated by two human judges and compared to automatic approaches to building a Romanian WordNet.
Contributing Partner: UNT College of Engineering
Measuring the Semantic Similarity of Texts

Date: June 2005
Creator: Corley, Courtney & Mihalcea, Rada
Description: This paper presents a knowledge-based method for measuring the semantic similarity of texts. While there is a large body of previous work focused on finding the semantic similarity of concepts and words, the application of these word-oriented methods to text similarity has not yet been explored. The authors introduce a method that combines word-to-word similarity metrics into a text-to-text metric, and they show that this method outperforms the traditional text similarity metrics based on lexical matching.
Contributing Partner: UNT College of Engineering
Putting Pieces Together: Combining FrameNet, VerbNet and WordNet for Robust Semantic Parsing

Date: 2005
Creator: Shi, Lei & Mihalcea, Rada
Description: This paper describes the authors' work in integrating three different lexical resources, FrameNet, VerbNet, and WordNet, into a unified, richer knowledge base, with the goal of enabling more robust semantic parsing. The construction of each of these lexical resources has required many years of laborious human effort, and each has its strengths and shortcomings. By linking them together, the authors build an improved resource in which (1) the coverage of FrameNet is extended, (2) the VerbNet lexicon is augmented with frame semantics, and (3) selectional restrictions are implemented using WordNet semantic classes. The synergistic exploitation of various lexical resources is crucial for many complex language processing applications, and the authors show that it is once again effective in building a robust semantic parser.
Contributing Partner: UNT College of Engineering
SemEval-2007 Task 14: Affective Text

Date: June 2007
Creator: Strapparava, Carlo & Mihalcea, Rada
Description: This paper discusses affective text. The "Affective Text" task focuses on the classification of emotions and valence (positive/negative polarity) in news headlines, and is meant as an exploration of the connection between emotions and lexical semantics. In this paper, the authors describe the data set used in the evaluation and the results obtained by the participating systems.
Contributing Partner: UNT College of Engineering
Linguistic Ethnography: Identifying Dominant Word Classes in Text

Date: March 2009
Creator: Pulman, Stephen & Mihalcea, Rada, 1974-
Description: This paper discusses linguistic ethnography. Abstract: In this paper, we propose a method for "linguistic ethnography", a general mechanism for characterizing texts with respect to the dominance of certain classes of words. Using humor as a case study, we explore the automatic learning of salient word classes, including semantic classes (e.g., person, animal), psycholinguistic classes (e.g., tentative, cause), and affective load (e.g., anger, happiness). We measure the reliability of the derived word classes and their associated dominance scores by showing significant correlation across different corpora.
Contributing Partner: UNT College of Engineering