The enhancement of machine translation for low-density languages using Web-gathered parallel texts.

Description:

The majority of the world's languages are poorly represented in informational media like radio, television, newspapers, and the Internet. Translation into and out of these languages may offer a way for speakers of these languages to interact with the wider world, but current statistical machine translation models are only effective with a large corpus of parallel texts - texts in two languages that are translations of one another - which most languages lack. This thesis describes the Babylon project which attempts to alleviate this shortage by supplementing existing parallel texts with texts gathered automatically from the Web -- specifically targeting pages that contain text in a pair of languages. Results indicate that parallel texts gathered from the Web can be effectively used as a source of training data for machine translation and can significantly improve the translation quality for text in a similar domain. However, the small quantity of high-quality low-density language parallel texts on the Web remains a significant obstacle.

Creator(s): Mohler, Michael Augustine Gaylord
Creation Date: December 2007
Partner(s):
UNT Libraries
Collection(s):
UNT Theses and Dissertations
Usage:
Total Uses: 125
Past 30 days: 0
Yesterday: 0
Creator (Author):
Publisher Info:
Publisher Name: University of North Texas
Place of Publication: Denton, Texas
Date(s):
  • Creation: December 2007
  • Digitized: March 6, 2008
Description:

The majority of the world's languages are poorly represented in informational media like radio, television, newspapers, and the Internet. Translation into and out of these languages may offer a way for speakers of these languages to interact with the wider world, but current statistical machine translation models are only effective with a large corpus of parallel texts - texts in two languages that are translations of one another - which most languages lack. This thesis describes the Babylon project which attempts to alleviate this shortage by supplementing existing parallel texts with texts gathered automatically from the Web -- specifically targeting pages that contain text in a pair of languages. Results indicate that parallel texts gathered from the Web can be effectively used as a source of training data for machine translation and can significantly improve the translation quality for text in a similar domain. However, the small quantity of high-quality low-density language parallel texts on the Web remains a significant obstacle.

Degree:
Level: Master's
Discipline: Computer Science
Language(s):
Subject(s):
Keyword(s): Machine translation | text alignment | biblical texts | Nahuatl | Quechua | parallel texts | low-density languages
Contributor(s):
Partner:
UNT Libraries
Collection:
UNT Theses and Dissertations
Identifier:
  • OCLC: 228503664 |
  • ARK: ark:/67531/metadc5140
Resource Type: Thesis or Dissertation
Format: Text
Rights:
Access: Public
License: Copyright
Holder: Mohler, Michael Augustine Gaylord
Statement: Copyright is held by the author, unless otherwise noted. All rights reserved.