21 Matching Results


PNNL: A Supervised Maximum Entropy Approach to Word Sense Disambiguation

Description: In this paper, we describe the PNNL Word Sense Disambiguation system as applied to the English All-Words task in SemEval 2007. We use a supervised learning approach, employing a large number of features and using Information Gain for dimension reduction. Our Maximum Entropy approach combined with a rich set of features produced results that are significantly better than the baseline and achieved the highest F-score for the fine-grained English All-Words subtask.
Date: June 23, 2007
Creator: Tratz, Stephen C.; Sanfilippo, Antonio P.; Gregory, Michelle L.; Chappell, Alan R.; Posse, Christian & Whitney, Paul D.
Partner: UNT Libraries Government Documents Department
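
The record above describes a supervised pipeline built from Information Gain feature selection followed by a Maximum Entropy classifier. The following minimal sketch illustrates that general recipe on toy data; the feature names, the toy "bank" examples, and the use of scikit-learn's LogisticRegression (a multinomial logistic regression, which is a maximum entropy model) are illustrative assumptions, not the authors' PNNL system.

```python
# Illustrative sketch: Information Gain feature selection followed by a
# maximum-entropy (multinomial logistic regression) sense classifier.
import math
from collections import Counter
from sklearn.linear_model import LogisticRegression

def information_gain(feature_col, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for one discrete feature column."""
    def entropy(ys):
        n = len(ys)
        return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())
    h_y = entropy(labels)
    cond = 0.0
    for value in set(feature_col):
        subset = [y for x, y in zip(feature_col, labels) if x == value]
        cond += len(subset) / len(labels) * entropy(subset)
    return h_y - cond

# Toy training data for the ambiguous word "bank":
# features = [has "river" nearby, has "money" nearby, is capitalized]
X = [[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1], [0, 1, 0]]
y = ["shore", "shore", "finance", "finance", "finance"]

# Rank features by information gain and keep the top k (dimension reduction).
k = 2
gains = [information_gain([row[j] for row in X], y) for j in range(len(X[0]))]
keep = sorted(range(len(gains)), key=lambda j: gains[j], reverse=True)[:k]
X_reduced = [[row[j] for j in keep] for row in X]

# Train the maximum entropy classifier on the reduced feature set.
clf = LogisticRegression(max_iter=1000).fit(X_reduced, y)
print(clf.predict([[1, 0]]))  # features suggesting the "shore" sense
```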

[Alexis Palmer giving a presentation at the 2017 Symposium on Developing Infrastructure for Computational Resources on South Asian Languages]

Description: Photograph of Alexis Palmer giving a presentation. This is during the presentation "A View from CL/NLP" at the 2017 Symposium on Developing Infrastructure for Computational Resources on South Asian Languages.
Date: November 17, 2017
Creator: University of North Texas. College of Information.
Partner: UNT College of Information

Exploration of Visual, Acoustic, and Physiological Modalities to Complement Linguistic Representations for Sentiment Analysis

Description: This research is concerned with the identification of sentiment in multimodal content. This is of particular interest given the increasing presence of subjective multimodal content on the web and other sources, which represents a rich and vast source of people's opinions, feelings, and experiences. Despite the need for tools that can identify opinions in the presence of diverse modalities, most current methods for sentiment analysis are designed for textual data only, and few attempts have been made to address this problem. The dissertation investigates techniques for augmenting linguistic representations with acoustic, visual, and physiological features. The potential benefits of using these modalities include linguistic disambiguation, visual grounding, and the integration of information about people's internal states. The main goal of this work is to build computational resources and tools that allow sentiment analysis to be applied to multimodal data. This thesis makes three important contributions. First, it shows that modalities such as audio, video, and physiological data can be successfully used to improve existing linguistic representations for sentiment analysis. We present a method that integrates linguistic features with features extracted from these modalities. Features are derived from verbal statements, audiovisual recordings, thermal recordings, and physiological sensor signals. The resulting multimodal sentiment analysis system is shown to significantly outperform the use of language alone. Using this system, we were able to predict the sentiment expressed in video reviews and also the sentiment experienced by viewers while exposed to emotionally loaded content. Second, the thesis provides evidence of the portability of the developed strategies to other affect recognition problems. We provide support for this by studying the deception detection problem. Third, this thesis contributes several multimodal datasets that will enable further research in sentiment and deception detection.
Access: This item is restricted to UNT Community Members. Login required if off-campus.
Date: December 2014
Creator: Pérez-Rosas, Verónica
Partner: UNT Libraries
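
The dissertation above describes integrating linguistic features with acoustic, visual, and physiological features. The sketch below shows one generic way such feature-level fusion is often set up; the feature block names, dimensions, random toy data, and the LinearSVC classifier are assumptions for illustration only, not the dissertation's actual pipeline.

```python
# Assumed feature-level ("early") fusion: concatenate per-utterance feature
# blocks from several modalities and train one sentiment classifier.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 40  # toy number of annotated utterances

# Hypothetical per-utterance feature blocks (dimensions are made up).
linguistic    = rng.normal(size=(n, 20))  # e.g. bag-of-words / lexicon scores
acoustic      = rng.normal(size=(n, 10))  # e.g. pitch and energy statistics
visual        = rng.normal(size=(n, 8))   # e.g. facial action unit intensities
physiological = rng.normal(size=(n, 4))   # e.g. skin temperature, heart rate

labels = rng.integers(0, 2, size=n)       # 0 = negative, 1 = positive

# Early fusion: one joint representation per utterance.
fused = np.hstack([linguistic, acoustic, visual, physiological])

clf = LinearSVC().fit(fused, labels)
print("training accuracy:", clf.score(fused, labels))
```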

A View from CL/NLP

Description: Presentation for the 2017 Symposium on Developing Infrastructure for Computational Resources on South Asian Languages. This presentation provides an overview of the data structures needed for computational linguistics and natural language processing.
Date: November 17, 2017
Creator: Palmer, Alexis
Partner: UNT College of Information

[Alexis Palmer speaking to attendees at the 2017 Symposium on Developing Infrastructure for Computational Resources on South Asian Languages]

Description: Photograph of Alexis Palmer speaking to attendees. This is during the presentation "A View from CL/NLP" at the 2017 Symposium on Developing Infrastructure for Computational Resources on South Asian Languages.
Date: November 17, 2017
Creator: University of North Texas. College of Information.
Partner: UNT College of Information

A View from CL/NLP

Description: Video recording of a presentation session at the 2017 Symposium on Developing Infrastructure for Computational Resources on South Asian Languages. In this session, the presenter provides an overview of the data structures needed for computational linguistics and natural language processing.
Date: November 17, 2017
Creator: Palmer, Alexis
Partner: UNT College of Information

LinguisticBelief: a java application for linguistic evaluation using belief, fuzzy sets, and approximate reasoning.

Description: LinguisticBelief is a Java computer code that evaluates combinations of linguistic variables using an approximate reasoning rule base. Each variable is comprised of fuzzy sets, and a rule base describes the reasoning on combinations of the variables' fuzzy sets. Uncertainty is considered and propagated through the rule base using the belief/plausibility measure. The mathematics of fuzzy sets, approximate reasoning, and belief/plausibility are complex. Without an automated tool, this complexity precludes their application to all but the simplest of problems. LinguisticBelief automates the use of these techniques, allowing complex problems to be evaluated easily. LinguisticBelief can be used free of charge on any Windows XP machine. This report documents the use and structure of the LinguisticBelief code and the deployment package for installation on client machines.
Date: March 1, 2007
Creator: Darby, John L.
Partner: UNT Libraries Government Documents Department
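
The report above describes evaluating combinations of linguistic variables through a rule base while propagating uncertainty with belief/plausibility. The following minimal sketch shows the general shape of that kind of evaluation on a made-up risk example; the variable names, the masses, the rule base, and the restriction to singleton focal elements are assumptions, not the LinguisticBelief code itself.

```python
# Two linguistic variables, each with fuzzy sets carrying basic belief masses,
# are combined through a small rule base; belief for an output category is
# then read off the propagated masses.
from itertools import product

# Basic belief assignment over each variable's fuzzy sets (masses sum to 1).
severity   = {"low": 0.3, "high": 0.7}
likelihood = {"rare": 0.6, "frequent": 0.4}

# Approximate-reasoning rule base: combination of input sets -> output set.
rules = {
    ("low", "rare"): "acceptable",
    ("low", "frequent"): "tolerable",
    ("high", "rare"): "tolerable",
    ("high", "frequent"): "unacceptable",
}

# Propagate masses through the rule base (independent evidence assumed).
risk_mass = {}
for (s, m_s), (l, m_l) in product(severity.items(), likelihood.items()):
    out = rules[(s, l)]
    risk_mass[out] = risk_mass.get(out, 0.0) + m_s * m_l

# With only singleton focal elements, belief and plausibility equal the mass.
print(risk_mass)
print("Bel(unacceptable) =", risk_mass.get("unacceptable", 0.0))
```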

Cross Language Information Retrieval for Languages with Scarce Resources

Description: Our generation has experienced one of the most dramatic changes in how society communicates. Today, we have online information on almost any imaginable topic. However, most of this information is available in only a few dozen languages. In this thesis, I explore the use of parallel texts to enable cross-language information retrieval (CLIR) for languages with scarce resources. To build the parallel text I use the Bible. I evaluate different variables and their impact on the resulting CLIR system, specifically: (1) the CLIR results when using different amounts of parallel text; (2) the role of paraphrasing on the quality of the CLIR output; (3) the impact on accuracy when translating the query versus translating the collection of documents; and finally (4) how the results are affected by the use of different dialects. The results show that all these variables have a direct impact on the quality of the CLIR system.
Date: May 2009
Creator: Loza, Christian
Partner: UNT Libraries
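
The thesis above builds a CLIR system from a sentence-aligned parallel text and compares query translation against document translation. The toy sketch below shows one common way a crude translation table can be derived from aligned sentence pairs for query translation; the verse pairs, the query, and the Dice association score are illustrative assumptions, not the thesis's actual method or data.

```python
# Derive a word-translation table from a sentence-aligned parallel text by
# co-occurrence, score candidates with the Dice coefficient, and translate
# a query word by word for cross-language retrieval.
from collections import Counter
from itertools import product

# Toy sentence-aligned parallel corpus (source language -> English).
parallel = [
    ("en el principio", "in the beginning"),
    ("el principio del mundo", "the beginning of the world"),
    ("en el mundo", "in the world"),
]

src_freq, tgt_freq, cooc = Counter(), Counter(), Counter()
for src, tgt in parallel:
    src_words, tgt_words = set(src.split()), set(tgt.split())
    src_freq.update(src_words)
    tgt_freq.update(tgt_words)
    cooc.update(product(src_words, tgt_words))

def translate(word):
    """Pick the target word with the highest Dice association score."""
    scores = {
        t: 2 * cooc[(word, t)] / (src_freq[word] + tgt_freq[t])
        for (s, t) in cooc if s == word
    }
    return max(scores, key=scores.get) if scores else word

print([translate(w) for w in ["principio", "mundo"]])  # ['beginning', 'world']
```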

Improving Topic Tracking with Domain Chaining

Description: Topic Detection and Tracking (TDT) research has produced some successful statistical tracking systems. While lexical chaining, a non-statistical approach, was also applied to the tracking task by Carthy and Stokes for the 2001 TDT evaluation, an efficient tracking system based on this technology has yet to be developed. In this thesis we investigate two new techniques which can improve Carthy's original design. First, at the core of our system is a semantic domain chainer. This chainer relies not only on the WordNet database for semantic relationships but also on Magnini's semantic domain database, which is an extension of WordNet. The domain-chaining algorithm runs in linear time. Second, to handle proper nouns, we gather all proper nouns that occur in a news story into a single chain reserved for them. In this thesis we also discuss the linguistic limitations of lexical chainers in representing textual meaning.
Date: August 2003
Creator: Yang, Li
Partner: UNT Libraries
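
The thesis above chains story words by semantic domain and keeps a separate chain for proper nouns. The toy sketch below illustrates that chaining idea in its simplest form; the hand-made word-to-domain lookup (a stand-in for Magnini's WordNet Domains database), the capitalization heuristic for proper nouns, and the sample story are assumptions, not the thesis's system.

```python
# Group story words into chains by shared semantic domain, plus one chain
# reserved for proper nouns; a single pass keeps the algorithm linear.
from collections import defaultdict

# Hypothetical word -> semantic domain lookup (stand-in for WordNet Domains).
DOMAINS = {
    "goal": "sport", "match": "sport", "striker": "sport",
    "election": "politics", "vote": "politics", "parliament": "politics",
}

def build_chains(tokens):
    chains = defaultdict(list)   # domain -> chain of words
    proper_nouns = []            # single chain reserved for proper nouns
    for tok in tokens:
        if tok[0].isupper():
            proper_nouns.append(tok)
        elif tok.lower() in DOMAINS:
            chains[DOMAINS[tok.lower()]].append(tok.lower())
    return dict(chains), proper_nouns

story = "Ronaldo scored a late goal as the match ended in Madrid".split()
print(build_chains(story))
# ({'sport': ['goal', 'match']}, ['Ronaldo', 'Madrid'])
```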

Graph-based Centrality Algorithms for Unsupervised Word Sense Disambiguation

Description: This thesis introduces an innovative methodology that combines traditional dictionary-based approaches to word sense disambiguation (semantic similarity measures and overlap of word glosses, both based on WordNet) with graph-based centrality methods, namely the degree of the vertices, PageRank, closeness, and betweenness. The approach is completely unsupervised and is based on creating graphs for the words to be disambiguated. In the first stage of our experiments, we try several possible combinations of the semantic similarity measures. The next stage scores individual vertices in the previously created graphs using several graph connectivity measures. During the final stage, several voting schemes are applied to the results obtained from the different centrality algorithms. The most important contributions of this work are not only that the approach is novel and works well, but also that it has great potential for overcoming the knowledge-acquisition bottleneck that has apparently brought research in supervised WSD to a plateau. The type of research reported in this thesis, which does not require manually annotated data, holds considerable promise, and our work is an early, if small, step in this direction. The complete system is built and tested on standard benchmarks, and is comparable with work done on graph-based word sense disambiguation as well as lexical chains. The evaluation indicates that the right combination of the above-mentioned metrics can be used to develop an unsupervised disambiguation engine as powerful as the state of the art in WSD.
Date: December 2008
Creator: Sinha, Ravi Som
Partner: UNT Libraries
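
The thesis above builds a graph over candidate senses and scores vertices with centrality measures such as degree and PageRank. The following sketch shows that construction on a toy sentence; the sense labels, the hard-coded similarity weights, and the choice of PageRank as the single centrality measure are assumptions for illustration, not the thesis's WordNet-based similarity measures or voting schemes.

```python
# Candidate senses are vertices, pairwise similarities are edge weights, and
# graph centrality selects one sense per word.
import networkx as nx

# Candidate senses for each target word (hypothetical labels).
senses = {
    "bank": ["bank#finance", "bank#shore"],
    "deposit": ["deposit#money", "deposit#sediment"],
    "interest": ["interest#fee", "interest#hobby"],
}

# Toy pairwise similarities between senses of *different* words.
similarities = [
    ("bank#finance", "deposit#money", 0.9),
    ("bank#finance", "interest#fee", 0.8),
    ("deposit#money", "interest#fee", 0.7),
    ("bank#shore", "deposit#sediment", 0.6),
    ("bank#shore", "interest#hobby", 0.1),
]

G = nx.Graph()
for u, v, w in similarities:
    G.add_edge(u, v, weight=w)

scores = nx.pagerank(G, weight="weight")  # one of several centrality options

# For each word, keep its highest-scoring candidate sense.
chosen = {word: max(cands, key=lambda s: scores.get(s, 0.0))
          for word, cands in senses.items()}
print(chosen)  # e.g. {'bank': 'bank#finance', ...}
```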

Automatic Language Identification for Metadata Records: Measuring the Effectiveness of Various Approaches

Description: Automatic language identification has been applied to short texts such as queries in information retrieval, but it has not yet been applied to metadata records. Applying this technology to metadata records, particularly their title elements, would enable creators of metadata records to obtain a value for the language element, which is often left blank due to a lack of linguistic expertise. It would also enable the addition of the language value to existing metadata records that currently lack one. Titles pose a particular challenge for language identification mainly due to their shortness, a factor which increases the difficulty of accurately identifying a language. This study implemented four proven approaches to language identification as well as one open-source approach on a collection of multilingual titles of books and movies. Of the five approaches considered, a reduced N-gram frequency profile and distance measure approach outperformed all others, accurately identifying over 83% of all titles in the collection. Future plans are to offer this technology to curators of digital collections.
Date: May 2015
Creator: Knudson, Ryan Charles
Partner: UNT Libraries
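
The study above found a reduced N-gram frequency profile and distance measure approach most effective. The sketch below shows the classic form of that technique (in the style of Cavnar and Trenkle's out-of-place measure); the training snippets, the sample title, and the profile size are toy stand-ins, not the study's data or exact configuration.

```python
# Each language is represented by its most frequent character trigrams; a
# title is assigned to the language with the smallest "out-of-place" distance.
from collections import Counter

def profile(text, n=3, top=50):
    """Ranked list of the `top` most frequent character n-grams."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; unseen n-grams get a maximum penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(i - lang_profile.index(g)) if g in lang_profile else penalty
        for i, g in enumerate(doc_profile)
    )

training = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y el gato",
}
profiles = {lang: profile(text) for lang, text in training.items()}

title = "the story of the brown dog"
best = min(profiles, key=lambda lang: out_of_place(profile(title), profiles[lang]))
print(best)  # expected: 'english'
```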

A High Accuracy Method for Semi-supervised Information Extraction

Description: Customization to specific domains of discourse and/or user requirements is one of the greatest challenges for today's Information Extraction (IE) systems. While demonstrably effective, both rule-based and supervised machine learning approaches to IE customization place too high a burden on the user. Semi-supervised learning approaches may in principle offer a more resource-effective solution but are still insufficiently accurate to warrant realistic application. We demonstrate that this limitation can be overcome by integrating fully supervised learning techniques within a semi-supervised IE approach, without increasing resource requirements.
Date: April 22, 2007
Creator: Tratz, Stephen C. & Sanfilippo, Antonio P.
Partner: UNT Libraries Government Documents Department

The Value of Everything: Ranking and Association with Encyclopedic Knowledge

Description: This dissertation describes WikiRank, an unsupervised method of assigning relative values to elements of a broad coverage encyclopedic information source in order to identify those entries that may be relevant to a given piece of text. The valuation given to an entry is based not on textual similarity but instead on the links that associate entries, and an estimation of the expected frequency of visitation that would be given to each entry based on those associations in context. This estimation of relative frequency of visitation is embodied in modifications to the random walk interpretation of the PageRank algorithm. WikiRank is an effective algorithm to support natural language processing applications. It is shown to exceed the performance of previous machine learning algorithms for the task of automatic topic identification, providing results comparable to that of human annotators. Second, WikiRank is found useful for the task of recognizing text-based paraphrases on a semantic level, by comparing the distribution of attention generated by two pieces of text using the encyclopedic resource as a common reference. Finally, WikiRank is shown to have the ability to use its base of encyclopedic knowledge to recognize terms from different ontologies as describing the same thing, and thus allowing for the automatic generation of mapping links between ontologies. The conclusion of this thesis is that the "knowledge access heuristic" is valuable and that a ranking process based on a large encyclopedic resource can form the basis for an extendable general purpose mechanism capable of identifying relevant concepts by association, which in turn can be effectively utilized for enumeration and comparison at a semantic level.
Date: December 2009
Creator: Coursey, Kino High
Partner: UNT Libraries
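
The dissertation above values encyclopedia entries through a modified random-walk (PageRank-style) interpretation of the link structure, biased toward entries associated with the input text. The toy sketch below shows the general shape of such a biased random walk; the article names, the link graph, and the use of networkx's personalized PageRank are illustrative assumptions, not the WikiRank system or Wikipedia's actual link graph.

```python
# Entries mentioned in a text seed a personalized PageRank over an
# encyclopedia's link graph; the resulting scores rank related entries.
import networkx as nx

# Hypothetical encyclopedia link structure (article -> linked articles).
links = {
    "Jaguar": ["Cat", "South America", "Car"],
    "Cat": ["Mammal"],
    "Car": ["Engine"],
    "Mammal": ["Cat"],
    "South America": ["Jaguar"],
    "Engine": ["Car"],
}
G = nx.DiGraph([(src, dst) for src, dsts in links.items() for dst in dsts])

# Entries explicitly mentioned in the input text receive the restart mass.
mentioned = {"Jaguar": 1.0, "South America": 1.0}
scores = nx.pagerank(G, personalization=mentioned)

for entry, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{entry:15s} {score:.3f}")
```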

A Minimally Supervised Word Sense Disambiguation Algorithm Using Syntactic Dependencies and Semantic Generalizations

Description: Natural language is inherently ambiguous. For example, the word "bank" can mean a financial institution or a river shore. Finding the correct meaning of a word in a particular context is a task known as word sense disambiguation (WSD), which is essential for many natural language processing applications such as machine translation, information retrieval, and others. While most current WSD methods try to disambiguate a small number of words for which enough annotated examples are available, the method proposed in this thesis attempts to address all words in unrestricted text. The method is based on constraints imposed by syntactic dependencies and concept generalizations drawn from an external dictionary. The method was tested on standard benchmarks as used during the SENSEVAL-2 and SENSEVAL-3 WSD international evaluation exercises, and was found to be competitive.
Date: December 2005
Creator: Faruque, Md. Ehsanul
Partner: UNT Libraries
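
The thesis above disambiguates words using constraints from syntactic dependencies together with concept generalizations drawn from an external dictionary. The toy sketch below illustrates that idea on the "bank" example; the sense inventory, the hypernym-like concept sets, and the table of verb selectional preferences are hand-made assumptions, not the thesis's actual resources or algorithm.

```python
# Choose the sense whose concept generalizations best match what the
# syntactically governing verb typically selects for.

# Hypothetical generalizations (hypernym-like concepts) for senses of "bank".
SENSE_CONCEPTS = {
    "bank#financial_institution": {"institution", "organization", "business"},
    "bank#river_shore": {"slope", "land", "geological_formation"},
}

# Hypothetical selectional preferences of the governing verb, indexed by
# (verb, dependency relation).
VERB_PREFERENCES = {
    ("deposit", "obj_into"): {"institution", "container"},
    ("walk", "obj_along"): {"land", "path"},
}

def disambiguate(verb, relation):
    expected = VERB_PREFERENCES.get((verb, relation), set())
    # Pick the sense whose generalizations overlap most with the expectations.
    return max(SENSE_CONCEPTS, key=lambda s: len(SENSE_CONCEPTS[s] & expected))

print(disambiguate("deposit", "obj_into"))  # bank#financial_institution
print(disambiguate("walk", "obj_along"))    # bank#river_shore
```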

Design and Implementation of a TRAC Processor for Fairchild F24 Computer

Description: TRAC is a text-processing language for use with a reactive typewriter. The thesis describes the design and implementation of a TRAC processor for the Fairchild F24 computer. Chapter I introduces some text processing concepts, the TRAC operations, and the implementation procedures. Chapter II examines the history and characteristics of the TRAC language. The next chapter specifies the TRAC syntax and primitive functions. Chapter IV covers the algorithms used by the processor. The last chapter discusses the design experience from programming the processor, examines the reactive action caused by the processor, and suggests adding external storage primitive functions for a future version of the processor.
Date: August 1974
Creator: Chi, Ping Ray
Partner: UNT Libraries

Text Processing for Technical Reports (Direct Computer-Assisted Origination, Editing, and Output of Text)

Description: Report documenting the creation of a computer program (written in FORTRAN and MACRO) to assist researchers in writing technical documents that include formulas and graphics. It includes operating instructions for using the program and example documents.
Date: October 1980
Creator: DeVolpi, Alexander; Fenrick, M. R.; Stanford, G. S.; Fink, C. L. & Rhodes, E. A.
Partner: UNT Libraries Government Documents Department

Linguistic evaluation of terrorist scenarios: example application.

Description: In 2005, a group of international decision makers developed a manual process for evaluating terrorist scenarios. That process has been implemented in the approximate reasoning Java software tool, LinguisticBelief, released in FY2007. One purpose of this report is to show the flexibility of the LinguisticBelief tool to automate a custom model developed by others. LinguisticBelief evaluates combinations of linguistic variables using an approximate reasoning rule base. Each variable is comprised of fuzzy sets, and a rule base describes the reasoning on combinations of the variables' fuzzy sets. Uncertainty is considered and propagated through the rule base using the belief/plausibility measure. This report documents the evaluation and rank-ordering of several example terrorist scenarios for the existing process implemented in our software. LinguisticBelief captures and propagates uncertainty and allows easy development of an expanded, more detailed evaluation, neither of which is feasible using a manual evaluation process. In conclusion, the LinguisticBelief tool is able to (1) automate an expert-generated reasoning process for the evaluation of the risk of terrorist scenarios, including uncertainty, and (2) quickly evaluate and rank-order scenarios of concern using that process.
Date: March 1, 2007
Creator: Darby, John L.
Partner: UNT Libraries Government Documents Department

Identification of threats using linguistics-based knowledge extraction.

Description: One of the challenges increasingly facing intelligence analysts, along with professionals in many other fields, is the vast amount of data which needs to be reviewed and converted into meaningful information, and ultimately into rational, wise decisions by policy makers. The advent of the world wide web (WWW) has magnified this challenge. A key hypothesis which has guided us is that threats come from ideas (or ideology), and ideas are almost always put into writing before the threats materialize. While in the past the 'writing' might have taken the form of pamphlets or books, today's medium of choice is the WWW, precisely because it is a decentralized, flexible, and low-cost method of reaching a wide audience. However, a factor which complicates matters for the analyst is that material published on the WWW may be in any of a large number of languages. In 'Identification of Threats Using Linguistics-Based Knowledge Extraction', we have sought to use Latent Semantic Analysis (LSA) and other similar text analysis techniques to map documents from the WWW, in whatever language they were originally written, to a common language-independent vector-based representation. This then opens up a number of possibilities. First, similar documents can be found across language boundaries. Secondly, a set of documents in multiple languages can be visualized in a graphical representation. These alone offer potentially useful tools and capabilities to the intelligence analyst whose knowledge of foreign languages may be limited. Finally, we can test the over-arching hypothesis--that ideology, and more specifically ideology which represents a threat, can be detected solely from the words which express the ideology--by using the vector-based representation of documents to predict additional features (such as the ideology) within a framework based on supervised learning. In this report, we present the results of a three-year project of the same name. We ...
Date: September 1, 2008
Creator: Chew, Peter A.
Partner: UNT Libraries Government Documents Department
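
The report above uses Latent Semantic Analysis to map documents into a common vector-based representation so that similar documents can be found and compared. The sketch below shows only the underlying monolingual LSA machinery on toy documents; the documents, the number of components, and the scikit-learn pipeline are illustrative assumptions, and the project's cross-language mapping and supervised prediction steps are not reproduced here.

```python
# A term-document matrix is factored with a truncated SVD (LSA) and documents
# are compared by cosine similarity in the resulting low-dimensional space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "shared ideas spread quickly through online writing",
    "online writing spreads shared ideas quickly",
    "the recipe calls for two cups of flour and sugar",
    "mix the flour and sugar before baking",
]

# Term-document matrix -> low-rank "semantic" space.
X = CountVectorizer().fit_transform(documents)
vectors = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Documents expressing similar content end up close together.
print(cosine_similarity(vectors[:1], vectors[1:]))
```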