Automatic Language Identification for Metadata Records: Measuring the Effectiveness of Various Approaches Page: 5
v, 92 pages : illustrations (some color)
A recent project funded by the Institute of Museum and Library Services (IMLS)
(www.imls.gov/) required a group of researchers, including me, to ascertain the language of
two million metadata records as part of a plan to translate only the English records into both
Spanish and Chinese. For this purpose, I developed a language identification program to
separate records determined to be English from records determined to be non-English.
In experiments, open-source language identification programs identified languages less
accurately than the in-house program.
The language of metadata records may need to be identified for one of two reasons: first, to supply a
value for metadata schemas that contain a "language" element; and second, to support translation efforts,
whether manual or by machine.
Consider the first of these reasons. Few multilingual digital collections
hold materials in more than approximately five languages, and even fewer have polyglot staff members to
identify languages or translate materials. The World Digital Library, for example, has a collection
including contents in over 100 languages and a fully functional interface in seven languages
("About the World", n.d.). Another multilingual digital collection is the International Children's
Digital Library, which includes 4619 books in 59 languages and a user interface available in five
languages ("Library fast facts", n.d.). However, most libraries and digital collections, as noted
above, do not have the multilingual staff necessary for these types of collections.
Consider a public library in suburban Texas, where a recent immigrant from Bulgaria has just
moved to the area and wishes to locate some reading material in his or her mother tongue,
Bulgarian. When the recent immigrant searches the collection for Bulgarian titles, with the help
of a library staff member, only the metadata records which have an accurate value in the
Knudson, Ryan Charles. Automatic Language Identification for Metadata Records: Measuring the Effectiveness of Various Approaches, dissertation, May 2015; Denton, Texas. (https://digital.library.unt.edu/ark:/67531/metadc801895/m1/11/: accessed July 16, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu.