Automatic Language Identification for Metadata Records: Measuring the Effectiveness of Various Approaches Page: 4
v, 92 pages : illustrations (some color)
Spanish can still determine these to be two of the languages in question of the third and first
items. While this may very well be the case with a select few languages such as Dunning's
example above, consider the following example.
Given the following 20 character strings,
ca urmare publicarea
ezzel a magyar nyelv
sig inte som en egen
would it be "hardly" surprising for a person to recognize these strings as Romanian, Hungarian,
and Swedish, respectively? I believe the answer is that it would be just the contrary: quite
surprising for a person to accurately identify the three languages above. Just as in Dunning
(1994), the above texts are selected to purposefully exclude any diacritical marks or special
characters peculiar to the languages. Even so, consider how many persons with little or no
knowledge of Romanian, Hungarian, and Swedish could accurately identify the above languages.
The answer is very few, if any. On three occasions, I presented the three strings above to three separate
acquaintances, and in every case not a single string was accurately identified. The fact is, the few
persons who could accurately identify the above strings would likely be either polyglots or
trained linguists.
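The two properties claimed for the sample strings above (each is exactly 20 characters long, and none contains a diacritical mark or special character) can be checked mechanically. The following is a minimal sketch, not part of the original study; the names `samples` and `check_sample` are hypothetical, and "plain ASCII" is used here as an assumed stand-in for "no diacritics or special characters".

```python
# The three sample strings from the text, keyed by the language named there.
samples = {
    "Romanian": "ca urmare publicarea",
    "Hungarian": "ezzel a magyar nyelv",
    "Swedish": "sig inte som en egen",
}


def check_sample(text):
    """Return (length, is_plain_ascii) for one sample string.

    Assumption: a string with only ASCII characters has no diacritical
    marks or language-specific special characters.
    """
    return len(text), text.isascii()


for language, text in samples.items():
    length, plain = check_sample(text)
    print(f"{language}: {length} characters, ASCII only: {plain}")
```

A check of this kind confirms only that the strings carry no orthographic giveaways; it says nothing about how hard they are for a person or a program to identify.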
With the aforementioned increase in languages present on the World Wide Web and in digital
media, it is necessary, more than ever, to be able to accurately identify the languages of these
texts in order to process them for any purpose, including obtaining accurate language
identification for metadata records creation.
Knudson, Ryan Charles. Automatic Language Identification for Metadata Records: Measuring the Effectiveness of Various Approaches, dissertation, May 2015; Denton, Texas. (https://digital.library.unt.edu/ark:/67531/metadc801895/m1/10/: accessed July 16, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu.