Improving Access to Web Archives through Innovative Analysis of PDF Content

Description:

This paper discusses improving access to web archives through innovative analysis of PDF content. It covers the overall workflow and describes the tools used to extract document features. The findings suggest opportunities to develop retrieval tools that provide new ways of selecting content and building collections from large Web archives.

Creation Date: April 2013
Partner(s):
UNT Libraries
Collection(s):
UNT Scholarly Works
Creator (Author):
Phillips, Mark Edward

University of North Texas

Creator (Author):
Murray, Kathleen R.

University of North Texas

Degree:
Department: Libraries
Note:

Reprinted with permission of IS&T: The Society for Imaging Science and Technology, sole copyright owners of 'IS&T's Archiving 2013.'

Note:

Abstract: In 2008 five United States institutions collaborated to archive the U.S. federal government Web presence: the Library of Congress, the Internet Archive, the California Digital Library, the Government Printing Office, and the University of North Texas (UNT). Their objective was to document the changes coincident with the shift in leadership of the U.S. executive branch. The five partners identified key resources from the U.S. .gov Top Level Domain and completed crawls from September 2008 until March 2009. The resulting End of Term (EOT) 2008 Web Archive, a 16 TB dataset, was distributed to partners interested in providing local services and access to the archive. The UNT Libraries investigated Portable Document Format (PDF) files, a class of content many information professionals associate with the traditional notion of “discrete documents”. Over four million unique PDF documents were extracted from the Archive and a series of metadata and information extraction processes were conducted for each document. Additionally, derivative raster images of the first page of each document were created. These metrics were ingested into a database for further analysis, which brought to light previously hidden characteristics of the federal government’s Web-published content. The paper discusses the overall workflow and describes the tools used to extract document features. Findings suggest opportunities for the development of retrieval tools that will provide new ways of selecting content and building collections from large Web archives.

Physical Description:

7 p.

Subject(s):
Keyword(s): web archives | portable document formats | collection development | digital objects
Source: IS&T (the Society for Imaging Science and Technology) Archiving Conference, 2013, Washington, DC, United States
Relation (Is Version Of): Improving Access to Web Archives through Innovative Analysis of PDF Content, ark:/67531/metadc155638
Identifier:
  • ARK: ark:/67531/metadc155622
Resource Type: Paper
Format: Text
Rights:
Access: Public