Analyzing the Persistence of Referenced
Web Resources with Memento
Robert Sanderson
Los Alamos National Laboratory
Los Alamos
NM 87544, USA
+1 (505) 665-5804
rsanderson@lanl.gov
Mark Phillips
University of North Texas
Denton
TX 76203, USA
+1 (960) 565-2415
Mark.Phillips@unt.edu
Herbert Van de Sompel
Los Alamos National Laboratory
Los Alamos
NM 87544, USA
+1 (505) 667-1267
herbertv@lanl.gov
ABSTRACT
In this paper we present the results of a study into the persistence
and availability of web resources referenced from papers in
scholarly repositories. Two repositories with different
characteristics, arXiv and the UNT digital library, are studied to
determine if the nature of the repository, or of its content, has a
bearing on the availability of the web resources cited by that
content. Memento makes it possible to automate discovery of
archived resources and to consider the time between the
publication of the research and the archiving of the referenced
URLs. This automation allows us to process more than 160000
URLs, the largest known such study, and the repository metadata
allows consideration of the results by discipline. The results are
startling: 45% (66096) of the URLs referenced from arXiv still
exist, but are not preserved for future generations, and 28% of
resources referenced by UNT papers have been lost. Moving
forwards, we provide some initial recommendations, including
that repositories should publish URL lists extracted from papers
that could be used as seeds for web archiving systems.
Categories and Subject Descriptors
H.5.4 [Information Interfaces and Presentation]: Hypertext/
Hypermedia - Architectures, Navigation.
General Terms
Experimentation
Keywords
Digital Preservation, Repositories, Web Persistence
1. INTRODUCTION
As repositories become more aligned with the web architecture
and links to and from their content proliferate, the role of the
repository moves away from that of a curated content silo and
toward knowledge infrastructure for research. This infrastructure
is the foundation of the entire research community, and with
scholarly communication in the midst of the transition from print
to digital, resource preservation has become an area of increasing
concern.
The current generation of repositories performs the important task
of preserving copies of scholarly research output; however,
maintaining access to research inputs is equally crucial to enable
future understanding and reproducibility. Those inputs, both data
and prior work, are increasingly online and often not maintained
within the stable and managed confines of a repository.
This paper considers the extent to which web resources,
referenced throughout academic works in repositories, are still
available. Using the Memento HTTP extensions [15] for access
to historical web content, we can go beyond previous studies and
determine not only if there is still a resource at the cited URL, but
whether or not there are copies in archives and even consider the
difference between the publication date of the citing work and the
time of the closest archived copy. The time difference is a crucial
factor not previously considered, as the information at any given
URL can be modified at any time. A copy is thus more likely to
be an accurate representation of the intended target of the citation
if it is archived close to the publication date of the paper.
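The time difference described above can be illustrated with a minimal sketch. It assumes the Memento convention of an `Accept-Datetime` request header and a `Memento-Datetime` response header, both in HTTP-date format; the function names `accept_datetime_header` and `archive_gap_days` are hypothetical and are not taken from the study. No network request is made here; the sketch only shows how the header might be built and how the gap could be computed once an archive has answered.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(published: datetime) -> dict:
    """Build an Accept-Datetime header asking for the copy closest to `published`.

    `published` must be timezone-aware. (Hypothetical helper, for illustration.)
    """
    as_utc = published.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_utc, usegmt=True)}


def archive_gap_days(published: datetime, memento_datetime: str) -> float:
    """Days between publication and the archived copy.

    `memento_datetime` is the value of a Memento-Datetime response header.
    A positive result means the copy was archived after publication,
    a negative result means it was archived before.
    """
    archived = parsedate_to_datetime(memento_datetime)
    return (archived - published).total_seconds() / 86400.0


if __name__ == "__main__":
    published = datetime(2005, 1, 1, tzinfo=timezone.utc)
    print(accept_datetime_header(published))
    print(archive_gap_days(published, "Thu, 03 Mar 2005 00:00:00 GMT"))
```

A small absolute gap suggests the archived copy is more likely to reflect what the author actually cited; a large gap leaves more room for the content at the URL to have changed.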
This research utilizes two repositories of scholarly communication
of very different types. The first is an institutional repository of
the electronic theses and dissertations of the students of the
University of North Texas¹. Although the total number of
documents is relatively low, fewer than 4000, the subject matter is
wide ranging, covering the full spectrum of a modern university.
The second collection analyzed, arXiv², is a large
multi-institutional subject repository of published works in High Energy
Physics and related disciplines. The number of documents is two
orders of magnitude higher, on the order of 400000.
By processing the referenced URLs and metadata from works
stored in the two repositories, we aim to address the questions:
1. To what extent are web resources referenced from
works in the repositories still available at their original
URLs or from archives of web resources?
2. How long is the period between the publication of a
paper and the archiving of a resource cited by the paper?
3. Does the nature of the repository or the academic level
and discipline of the work have any bearing on the
previous two questions?
The answers to these questions will determine if there is a need
for repositories to take an active role in the preservation of
referenced web resources and the nature of that role.
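As a rough illustration of how the first question might be operationalized, the sketch below combines two observations per URL (is it still live at its original location, and does an archive hold a copy) into four outcomes and reports their shares. The category names and the functions `classify` and `summarize` are assumptions made for this example, not the terminology or method of the study.

```python
from collections import Counter
from enum import Enum


class Status(Enum):
    # Hypothetical labels chosen for this sketch only.
    LIVE_AND_ARCHIVED = "live and archived"
    LIVE_NOT_ARCHIVED = "live but not archived"
    ARCHIVED_ONLY = "archived only"
    LOST = "lost"


def classify(live: bool, archived: bool) -> Status:
    """Map the two availability checks for one URL onto a single status."""
    if live and archived:
        return Status.LIVE_AND_ARCHIVED
    if live:
        return Status.LIVE_NOT_ARCHIVED
    if archived:
        return Status.ARCHIVED_ONLY
    return Status.LOST


def summarize(observations):
    """Return the fraction of URLs in each status.

    `observations` is an iterable of (live, archived) boolean pairs.
    """
    counts = Counter(classify(live, archived) for live, archived in observations)
    total = sum(counts.values())
    return {s: (counts[s] / total if total else 0.0) for s in Status}


if __name__ == "__main__":
    sample = [(True, True), (True, False), (False, True), (False, False)]
    for status, share in summarize(sample).items():
        print(f"{status.value}: {share:.0%}")
```

Grouping the same counts by repository or by discipline, using the repository metadata, would address the third question in the same way.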
2. BACKGROUND WORK
Much research has been invested already in analyzing "link rot"
(when a hyperlink fails to resolve) in academic publications
across various disciplines. These studies give us a baseline
expectation for the availability of resources at their original URLs,
but the sample size is often very small. Previously, the number of
URLs checked to ascertain if the content existed elsewhere was
also limited by the manual nature of such checks.
¹ http://digital.library.unt.edu/
² http://www.arxiv.org/