Mapping Texts: Combining Text-Mining and Geo-Visualization To Unlock The Research Potential of Historical Newspapers Page: 13
process because of the widespread use of hyphenation and word breaks (such as "pre-diction" for
"prediction") which newspaper editors have long used to fit their texts into narrow columns.
OCR on clean images of historical newspapers can achieve high levels of accuracy, but
poorly imaged pages can produce low levels of OCR recognition and accuracy. These limitations,
therefore, often introduce mistakes into scanned texts (such as replacing "l" with "1", as in
"1imitations" for "limitations"). That can matter enormously for a researcher attempting to
determine how often a certain term was used in a particular location or time period. If poor
imaging (and therefore poor OCR results) meant that "Lincoln" was often rendered as "Linco1n" in a
data set, that should affect how a scholar researching newspaper patterns surrounding Abraham
Lincoln would go about his or her work.
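To illustrate how such misreadings can distort term counts, the following is a minimal sketch in Python that counts a term both exactly and with tolerance for commonly confused characters. The confusion table and the function names (CONFUSABLE, variant_pattern, count_term) are hypothetical and chosen only for illustration; they do not describe the method used in this project.

```python
import re

# Hypothetical confusion classes: characters that OCR commonly swaps.
# This table is an illustrative assumption, not taken from the paper.
CONFUSABLE = {
    "l": "l1I",
    "i": "i1l",
    "o": "o0",
}


def variant_pattern(term):
    """Build a regex matching `term` and its likely OCR misreadings."""
    parts = []
    for ch in term:
        alternatives = CONFUSABLE.get(ch.lower())
        if alternatives:
            # Include the original character so its own case still matches.
            chars = set(alternatives) | {ch}
            parts.append("[" + "".join(sorted(chars)) + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b")


def count_term(term, text):
    """Return (exact_count, tolerant_count) for `term` in `text`."""
    exact = len(re.findall(r"\b" + re.escape(term) + r"\b", text))
    tolerant = len(variant_pattern(term).findall(text))
    return exact, tolerant


sample = "Lincoln spoke. Linco1n arrived. Mr. Lincoln left."
exact, tolerant = count_term("Lincoln", sample)
print(exact, tolerant)
```

The gap between the two counts gives a rough indication of how many mentions an exact-match search would miss. A tolerant pattern can also produce false matches, so it is a diagnostic aid rather than a correction.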
As a result, we needed to develop methods for allowing researchers to parse not just the
quantity of the OCR data, but also some measure of its quality. We therefore set about
experimenting with developing a transparent model for exposing the quantity and quality of
information in our newspapers database.
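One plausible way to express quantity and quality together is to report the total number of words alongside the share of those words found in a dictionary. The sketch below assumes that definition; the function name ocr_stats and the tiny word list are hypothetical, and this page of the paper does not specify the formula actually used.

```python
import re


def ocr_stats(text, dictionary):
    """Return total tokens, dictionary-recognized tokens, and their ratio.

    `dictionary` is any set of lowercase words; the ratio is an assumed
    proxy for OCR quality, not a measure defined in the source text.
    """
    # Runs of letters and digits; punctuation and underscores split tokens.
    tokens = re.findall(r"[^\W_]+", text)
    recognized = sum(1 for t in tokens if t.lower() in dictionary)
    total = len(tokens)
    ratio = recognized / total if total else 0.0
    return {"total": total, "recognized": recognized, "ratio": ratio}


# Hypothetical word list, for illustration only.
words = {"the", "president", "spoke"}
stats = ocr_stats("The president sp0ke", words)
print(stats["total"], stats["recognized"], round(stats["ratio"], 2))
```

Reporting the raw counts as well as the ratio keeps the measure transparent: a reader can see whether a low score comes from a short page or from a badly recognized one.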
SCRUBBING THE OCR
Because the newspaper corpus was so large, we had to develop programmatic methods of
formatting and assessing the data. Our first task was to scrub the corpus and try to correct simple
recurring errors introduced by the OCR process:
* Common misspellings introduced by OCR could be detected and corrected, for example, by
systematically comparing the words in our corpus to English-language dictionaries. For this
task, we used the GNU Aspell dictionary (which is freely available and fully compatible with
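The dictionary comparison described in the item above can be sketched as follows. This example does not call GNU Aspell; as a stand-in it uses the difflib module from the Python standard library with a small hypothetical word list, so the names WORDS and correct_token and the similarity cutoff are assumptions made for illustration.

```python
import difflib

# Hypothetical word list standing in for a full English dictionary.
WORDS = ["limitations", "newspaper", "prediction", "president"]


def correct_token(token, words=WORDS, cutoff=0.8):
    """Return the closest dictionary word, or the token unchanged.

    Tokens already in the word list are left alone. Otherwise the most
    similar word at or above `cutoff` (an assumed threshold) is returned
    in lowercase.
    """
    lower = token.lower()
    if lower in words:
        return token
    matches = difflib.get_close_matches(lower, words, n=1, cutoff=cutoff)
    return matches[0] if matches else token


print(correct_token("1imitations"))
print(correct_token("xyzzy"))
```

A fairly strict cutoff is used here because an aggressive corrector can silently replace rare but legitimate words, such as proper names, with more common ones.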
Torget, Andrew J., 1978-; Mihalcea, Rada, 1974-; Christensen, Jon & McGhee, Geoff. Mapping Texts: Combining Text-Mining and Geo-Visualization To Unlock The Research Potential of Historical Newspapers, paper, 2011; (https://digital.library.unt.edu/ark:/67531/metadc83797/m1/13/: accessed April 19, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu; crediting UNT College of Arts and Sciences.