Automatic Language Identification for Metadata Records: Measuring the Effectiveness of Various Approaches Page: 79
v, 92 pages : illustrations (some color)
In this study, I consider five approaches to automatic language identification, evaluated on a
multilingual test set of book and movie titles that I created. Each approach aims to accurately
identify the language of each title, with the set of candidate languages modeled after the 21
Europarl languages. The approaches I consider are the following: Cavnar and Trenkle's (1994)
N-gram frequency profile and distance measure; a reduced N-gram frequency profile and distance
measure; a vector-space model; naive Bayes; and an open-source approach (Python's language ID).
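As a rough illustration of the first approach in the list above, the following is a minimal sketch of a Cavnar and Trenkle style classifier: each language is represented by a ranked list of its most frequent character N-grams, and a title is assigned to the language whose ranking differs least from the title's own ranking (the "out-of-place" distance). The function names, the `TRAINING` sentences, and the parameters `max_n` and `top_k` are assumptions made for this example; they are not taken from the dissertation, and the sketch does not attempt to reproduce the reduced-profile variant.

```python
from collections import Counter

def ngram_profile(text, max_n=5, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank (0 = most frequent)."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # underscores mark word boundaries (a simplification)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency; ties are broken alphabetically so ranks are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )

def identify(title, lang_profiles):
    """Return the language whose profile is closest to the title's profile."""
    doc = ngram_profile(title)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# Hypothetical toy training text; a real system would train on a large corpus per language.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "fr": "le renard brun saute par dessus le chien paresseux et le chat dort dans la maison",
}
profiles = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

print(identify("the cat and the dog", profiles))
```

With training text this small the result is only indicative. Short inputs such as titles yield few N-grams, which is one reason the choice of profile size and distance measure matters for metadata records.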
The reduced N-gram frequency profile and distance measure approach outperforms all others by
over 6%. This approach accurately identifies 68 of 81 multilingual titles, a total accuracy of
83.95%. This finding demonstrates that the reduced N-gram frequency profile approach is the
most suitable one to implement for digital collections.
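The reported accuracy follows directly from the counts given above. A small check, with variable names chosen for this example only:

```python
correct = 68  # titles identified correctly, as reported in the text
total = 81    # titles in the multilingual test set, as reported in the text

accuracy = correct / total * 100
print(f"{accuracy:.2f}%")  # prints "83.95%"
```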
The study concludes with plans for future research in this area. These include refining the
best-performing approach; expanding the domain of the training data; increasing the number of
languages considered; and creating an open-source program for language identification and
offering it to digital collection curators for use on their metadata.
Knudson, Ryan Charles. Automatic Language Identification for Metadata Records: Measuring the Effectiveness of Various Approaches, dissertation, May 2015; Denton, Texas. (https://digital.library.unt.edu/ark:/67531/metadc801895/m1/85/: accessed July 16, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu.