Automatic Language Identification for Metadata Records: Measuring the Effectiveness of Various Approaches Page: 6
This dissertation is part of the collection entitled: UNT Theses and Dissertations and was provided to UNT Digital Library by the UNT Libraries.
language element will be returned by such a search. As mentioned above, the language element
is not always present in metadata records. Let us say an estimated ten documents in the
collection are in fact in Bulgarian, but only three of the ten have a Bulgarian value in the
language element. Recall is a commonly used measure in information retrieval: the number of
relevant documents retrieved by a search divided by the number of relevant documents in the
collection, or R = rdr/rdc, where R is recall, rdr is the number of relevant documents
returned by a search, and rdc is the number of relevant documents in the collection. The recall
of the hypothetical search above for Bulgarian documents would then be 3/10 = 0.30, or 30%,
which is quite low by the standards of state-of-the-art information retrieval. The hypothetical
Bulgarian user would then be unable to find more than three of the ten Bulgarian documents
present in the hypothetical collection, short of manually inspecting each title in the entire
collection or performing an exhaustive search.
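The recall calculation for the hypothetical Bulgarian search can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names `language` and `actual_language`, the code `"bul"`, and the five English filler records are all invented for the example and do not come from any real collection or metadata schema.

```python
# Hypothetical collection: ten documents are actually in Bulgarian, but only
# three of them carry a Bulgarian value in the language element.
# Field names and language codes are assumptions made for this sketch.
records = [
    {"id": i, "actual_language": "bul", "language": "bul" if i < 3 else None}
    for i in range(10)
]
# A few non-Bulgarian records so the collection is not all one language.
records += [
    {"id": 10 + i, "actual_language": "eng", "language": "eng"}
    for i in range(5)
]


def search_by_language(collection, code):
    """Return the records whose language element equals the given code."""
    return [r for r in collection if r["language"] == code]


def recall(rdr, rdc):
    """Recall R = rdr / rdc (relevant returned / relevant in collection)."""
    if rdc <= 0:
        raise ValueError("rdc must be positive")
    if not 0 <= rdr <= rdc:
        raise ValueError("rdr must lie between 0 and rdc")
    return rdr / rdc


returned = search_by_language(records, "bul")
rdr = sum(1 for r in returned if r["actual_language"] == "bul")
rdc = sum(1 for r in records if r["actual_language"] == "bul")
print(recall(rdr, rdc))  # 0.3
```

The seven Bulgarian records with an empty language element never match the search, which is why recall stays at 0.3 however the query is phrased.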
This problem is undoubtedly occurring more often as documents and digital materials in various
languages from around the globe proliferate. It exposes a gap in effective search and retrieval
in any digital collection: users are unable to retrieve documents relevant to a query that
specifies the language of the content of the desired documents.
Research question
In light of the problem above, I answer the following question: Of the various approaches to
automatic language identification, which one is most effective for the accurate language
identification of metadata records, specifically, the title elements?
Knudson, Ryan Charles. Automatic Language Identification for Metadata Records: Measuring the Effectiveness of Various Approaches, dissertation, May 2015; Denton, Texas. (https://digital.library.unt.edu/ark:/67531/metadc801895/m1/12/?rotate=270: accessed July 16, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu.