This paper discusses improving access to web archives through innovative analysis of PDF content. It outlines the overall workflow and describes the tools used to extract document features. Findings suggest opportunities for the development of retrieval tools that will provide new ways of selecting content and building collections from large Web archives.
Physical Description
7 p.
Notes
Reprinted with permission of IS&T: The Society for Imaging Science and Technology, sole copyright owners of 'IS&T's Archiving 2013.'
Abstract: In 2008 five United States institutions collaborated to archive the U.S. federal government Web presence: the Library of Congress, the Internet Archive, the California Digital Library, the Government Printing Office, and the University of North Texas (UNT). Their objective was to document the changes coincident with the shift in leadership of the U.S. executive branch. The five partners identified key resources from the U.S. .gov Top Level Domain and completed crawls from September 2008 until March 2009. The resulting End of Term (EOT) 2008 Web Archive, a 16 TB dataset, was distributed to partners interested in providing local services and access to the archive. The UNT Libraries investigated Portable Document Format (PDF) files, a class of content many information professionals associate with the traditional notion of “discrete documents”. Over four million unique PDF documents were extracted from the Archive and a series of metadata and information extraction processes were conducted for each document. Additionally, derivative raster images of the first page of each document were created. These metrics were ingested into a database for further analysis, which brought to light previously hidden characteristics of the federal government’s Web-published content. The paper discusses the overall workflow and describes the tools used to extract document features. Findings suggest opportunities for the development of retrieval tools that will provide new ways of selecting content and building collections from large Web archives.
Improving Access to Web Archives through Innovative Analysis of PDF Content, ark:/67531/metadc155638
Collections
This paper is part of the following collection of related materials.
UNT Scholarly Works
Materials from the UNT community's research, creative, and scholarly activities and UNT's Open Access Repository. Access to some items in this collection may be restricted.
This presentation discusses improving access to web archives through innovative analysis of PDF content. It includes a background of the End of Term (EOT) 2008 Presidential Web Archive, a collaborative web archiving project, collection development with web archive content, and the workflow and processes involved in these projects.
Relationship to this item: (Is Version Of)