Topic Modeling on Historical Newspapers
Tze-I Yang
Dept. of Comp. Sci. & Eng.
University of North Texas
tze-iyang@my.unt.edu

Andrew J. Torget
Dept. of History
University of North Texas
andrew.torget@unt.edu

Rada Mihalcea
Dept. of Comp. Sci. & Eng.
University of North Texas
rada@cs.unt.edu

Abstract
In this paper, we explore the task of automatic
text processing applied to collections of his-
torical newspapers, with the aim of assisting
historical research. In particular, in this first
stage of our project, we experiment with the
use of topical models as a means to identify
potential issues of interest for historians.
1 Newspapers in Historical Research
Surviving newspapers are among the richest sources
of information available to scholars studying peo-
ples and cultures of the past 250 years, particularly
for research on the history of the United States.
Throughout the nineteenth and twentieth centuries,
newspapers served as the central venues for nearly
all substantive discussions and debates in American
society. By the mid-nineteenth century, nearly every
community (no matter how small) boasted at least
one newspaper. Within these pages, Americans ar-
gued with one another over politics, advertised and
conducted economic business, and published arti-
cles and commentary on virtually all aspects of so-
ciety and daily life. Only here can scholars find edi-
torials from the 1870s on the latest political contro-
versies, advertisements for the latest fashions, arti-
cles on the latest sporting events, and languid poetry
from a local artist, all within one source. Newspa-
pers, in short, document more completely the full
range of the human experience than nearly any other
source available to modern scholars, providing win-
dows into the past available nowhere else.
Despite their remarkable value, newspapers have
long remained among the most underutilized historical resources. The reason for this paradox is quite
simple: the sheer volume and breadth of informa-
tion available in historical newspapers has, ironi-
cally, made it extremely difficult for historians to
go through them page-by-page for a given research
project. A historian, for example, might need to
wade through tens of thousands of newspaper pages
in order to answer a single research question (with
no guarantee of stumbling onto the necessary infor-
mation).
Recently, both the research potential and prob-
lem of scale associated with historical newspapers
have expanded greatly due to the rapid digitization of
these sources. The National Endowment for the Hu-
manities (NEH) and the Library of Congress (LOC),
for example, are sponsoring a nationwide historical
digitization project, Chronicling America, geared to-
ward digitizing all surviving historical newspapers
in the United States, from 1836 to the present. This
project recently digitized its one millionth page (and
is projected to have more than 20 million pages
within a few years), opening a vast wealth of his-
torical newspapers in digital form.
While projects such as Chronicling America have
indeed increased access to these important sources,
they have also increased the problem of scale that
has long prevented scholars from using these sources
in meaningful ways. Indeed, without tools and
methods capable of handling such large datasets -
and thus sifting out meaningful patterns embedded
within them - scholars find themselves confined to
performing only basic word searches across enor-
mous collections. These simple searches can, in-
deed, find stray information scattered in unlikely
Yang, Tze-I; Torget, Andrew J., 1978- & Mihalcea, Rada, 1974-. Topic Modeling on Historical Newspapers, paper, June 2011; (https://digital.library.unt.edu/ark:/67531/metadc83799/m1/1/: accessed April 19, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu; crediting UNT College of Arts and Sciences.