Automatic Language Identification for Metadata Records: Measuring the Effectiveness of Various Approaches Page: 25
v, 92 pages : illustrations (some color)
identification of character N-grams with N=5. Their data consisted of query-type strings with an
average length of 7.2 words, each word averaging 6-7 characters, taken from newswire texts.
Vector-space model algorithm for language identification
Another approach to language identification seen in the literature is the vector-space model,
which represents each language and the test text as vectors in a common feature space and
compares them using the cosine similarity measure; the text is assigned to the language whose
vector it most resembles. This approach has also been successful in identifying the language of
a text. Gottron and Lipka (2010) used a vector-space model to identify the language of news
headlines, reaching an accuracy of 54.68% for character N-grams with N=1 and 75.37% for
character N-grams with N=5.
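The comparison described above can be illustrated with a minimal sketch. It assumes raw
character N-gram counts as vector components and a single short training string per language;
the function names and the toy training sentences are invented for illustration and are not
taken from Gottron and Lipka (2010), whose feature weighting and training data may differ.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n):
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction to compare
    return dot / (norm_a * norm_b)


def identify_language(text, profiles, n):
    """Return the label of the language profile most similar to the text."""
    query = char_ngrams(text, n)
    return max(profiles, key=lambda lang: cosine_similarity(query, profiles[lang]))


# Toy training strings, invented for illustration only (assumption).
training = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso y el gato",
}
n = 3  # N-gram length; the studies cited above report results for N=1 and N=5
profiles = {lang: char_ngrams(sample, n) for lang, sample in training.items()}

print(identify_language("the dog and the fox", profiles, n))
```

With realistic training corpora, each language profile would be built from far more text, and the
choice of N would be tuned, since the accuracies reported above vary considerably with N.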
Dunning (1994) attempted to distinguish Spanish and English texts ranging in length from 10
bytes to 500 bytes. While Dunning's 99.9% accuracy using a Bayesian method with Markov
modeling may seem impressive, as may many of the accuracies achieved with other algorithms,
such accuracies would be difficult to achieve were the number of languages considered to
increase even slightly (Baldwin & Lui, 2010). For example, if one were to run the same
controlled tests on a group of documents written in five or six different languages, rather than
just two, this level of accuracy would likely not be reached. It would also be harder to reach
when the texts to be identified are shorter. The problem of identifying the language of
relatively short texts is discussed in Dunning (1994), Baldwin and Lui (2010), and Ceylan and Kim
(2009). Ceylan and Kim (2009) worked on identifying the language of search engine queries
ranging in length from two to three words.
Knudson, Ryan Charles. Automatic Language Identification for Metadata Records: Measuring the Effectiveness of Various Approaches, dissertation, May 2015; Denton, Texas. (https://digital.library.unt.edu/ark:/67531/metadc801895/m1/31/: accessed July 16, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu.