Keywords in the mist: Automated keyword extraction for very large documents and back of the book indexing.

PDF Version Also Available for Download.

Description

This research addresses the problem of automatic keyphrase extraction from large documents and back of the book indexing. The potential benefits of automating this process are far reaching, from improving information retrieval in digital libraries, to saving countless man-hours by helping professional indexers creating back of the book indexes. The dissertation introduces a new methodology to evaluate automated systems, which allows for a detailed, comparative analysis of several techniques for keyphrase extraction. We introduce and evaluate both supervised and unsupervised techniques, designed to balance the resource requirements of an automated system and the best achievable performance. Additionally, a number of ... continued below

Creation Information

Csomai, Andras May 2008.

Context

This dissertation is part of the collection entitled: UNT Theses and Dissertations and was provided by UNT Libraries to Digital Library, a digital repository hosted by the UNT Libraries. It has been viewed 679 times , with 6 in the last month . More information about this dissertation can be viewed below.

Who

People and organizations associated with either the creation of this dissertation or its content.

Chair

Committee Members

Publisher

Rights Holder

For guidance see Citations, Rights, Re-Use.

  • Csomai, Andras

Provided By

UNT Libraries

With locations on the Denton campus of the University of North Texas and one in Dallas, UNT Libraries serves the school and the community by providing access to physical and online collections; The Portal to Texas History and UNT Digital Libraries; academic research, and much, much more.

Contact Us

What

Descriptive information to help identify this dissertation. Follow the links below to find similar items on the Digital Library.

Description

This research addresses the problem of automatic keyphrase extraction from large documents and back of the book indexing. The potential benefits of automating this process are far reaching, from improving information retrieval in digital libraries, to saving countless man-hours by helping professional indexers creating back of the book indexes. The dissertation introduces a new methodology to evaluate automated systems, which allows for a detailed, comparative analysis of several techniques for keyphrase extraction. We introduce and evaluate both supervised and unsupervised techniques, designed to balance the resource requirements of an automated system and the best achievable performance. Additionally, a number of novel features are proposed, including a statistical informativeness measure based on chi statistics; an encyclopedic feature that taps into the vast knowledge base of Wikipedia to establish the likelihood of a phrase referring to an informative concept; and a linguistic feature based on sophisticated semantic analysis of the text using current theories of discourse comprehension. The resulting keyphrase extraction system is shown to outperform the current state of the art in supervised keyphrase extraction by a large margin. Moreover, a fully automated back of the book indexing system based on the keyphrase extraction system was shown to lead to back of the book indexes closely resembling those created by human experts.

Subjects

Language

Identifier

Unique identifying numbers for this dissertation in the Digital Library or other systems.

Collections

This dissertation is part of the following collection of related materials.

UNT Theses and Dissertations

Theses and dissertations represent a wealth of scholarly and artistic content created by masters and doctoral students in the degree-seeking process. Some ETDs in this collection are restricted to use by the UNT community.

What responsibilities do I have when using this dissertation?

When

Dates and time periods associated with this dissertation.

Creation Date

  • May 2008

Added to The UNT Digital Library

  • Oct. 2, 2008, 4:42 p.m.

Description Last Updated

  • Jan. 21, 2014, 1:49 p.m.

Usage Statistics

When was this dissertation last used?

Yesterday: 0
Past 30 days: 6
Total Uses: 679

Interact With This Dissertation

Here are some suggestions for what to do next.

Start Reading

PDF Version Also Available for Download.

Citations, Rights, Re-Use

Csomai, Andras. Keywords in the mist: Automated keyword extraction for very large documents and back of the book indexing., dissertation, May 2008; Denton, Texas. (digital.library.unt.edu/ark:/67531/metadc6118/: accessed September 22, 2017), University of North Texas Libraries, Digital Library, digital.library.unt.edu; .