Improving Access to Web Archives through Innovative Analysis of PDF Content

Description

This paper discusses improving access to web archives through innovative analysis of PDF content. It covers the overall workflow and describes the tools used to extract document features. Findings suggest opportunities for developing retrieval tools that offer new ways of selecting content and building collections from large web archives.

Physical Description

7 p.

Creation Information

Phillips, Mark Edward & Murray, Kathleen R. April 2013.

Context

This paper is part of the collection entitled: UNT Scholarly Works and was provided by the UNT Libraries to the UNT Digital Library, a digital repository hosted by the UNT Libraries.

Who

People and organizations associated with either the creation of this paper or its content.

Authors

  • Phillips, Mark Edward
  • Murray, Kathleen R.

Provided By

UNT Libraries

The UNT Libraries serve the university and community by providing access to physical and online collections, fostering information literacy, supporting academic research, and much, much more.

What

Descriptive information to help identify this paper.

Notes

Reprinted with permission of IS&T: The Society for Imaging Science and Technology, sole copyright owner of 'IS&T's Archiving 2013.'

Abstract: In 2008 five United States institutions collaborated to archive the U.S. federal government Web presence: the Library of Congress, the Internet Archive, the California Digital Library, the Government Printing Office, and the University of North Texas (UNT). Their objective was to document the changes coincident with the shift in leadership of the U.S. executive branch. The five partners identified key resources from the U.S. .gov Top Level Domain and completed crawls from September 2008 until March 2009. The resulting End of Term (EOT) 2008 Web Archive, a 16 TB dataset, was distributed to partners interested in providing local services and access to the archive. The UNT Libraries investigated Portable Document Format (PDF) files, a class of content many information professionals associate with the traditional notion of “discrete documents”. Over four million unique PDF documents were extracted from the Archive and a series of metadata and information extraction processes were conducted for each document. Additionally, derivative raster images of the first page of each document were created. These metrics were ingested into a database for further analysis, which brought to light previously hidden characteristics of the federal government’s Web-published content. The paper discusses the overall workflow and describes the tools used to extract document features. Findings suggest opportunities for the development of retrieval tools that will provide new ways of selecting content and building collections from large Web archives.
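The workflow outlined in the abstract (identify unique PDFs, extract per-document features, ingest them into a database) can be illustrated with a minimal Python sketch. This is not the authors' tooling, which the paper itself describes. The function names, the table schema, and the choice of SHA-256 and SQLite are all assumptions made for illustration, and the only features recorded here are file size and the version string from the PDF header.

```python
import hashlib
import re
import sqlite3

# Hypothetical schema: one row per unique PDF, keyed by its content hash.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pdf_features (
    sha256      TEXT PRIMARY KEY,
    size_bytes  INTEGER NOT NULL,
    pdf_version TEXT
)
"""

# A PDF file normally begins with a header such as "%PDF-1.4".
HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")


def extract_features(data: bytes) -> dict:
    """Derive a few simple features from raw PDF bytes (illustrative only)."""
    match = HEADER_RE.match(data[:16])
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        "pdf_version": match.group(1).decode("ascii") if match else None,
    }


def ingest(conn: sqlite3.Connection, payloads) -> int:
    """Store features for each payload, skipping byte-identical duplicates.

    Returns the number of unique documents newly added to the table.
    """
    conn.execute(SCHEMA)
    before = conn.total_changes
    for data in payloads:
        f = extract_features(data)
        # The primary key on sha256 makes repeated content a no-op.
        conn.execute(
            "INSERT OR IGNORE INTO pdf_features VALUES (?, ?, ?)",
            (f["sha256"], f["size_bytes"], f["pdf_version"]),
        )
    conn.commit()
    return conn.total_changes - before
```

Keying the table on a content hash is one simple way to arrive at a set of unique documents when the same file was captured at several URLs or crawl dates; richer features (embedded metadata, page counts, first-page images) would need a PDF library and are left out of this sketch.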

Source

  • IS&T (the Society for Imaging Science and Technology) Archiving Conference, April 2-5, 2013, Washington, D.C., United States

Identifier

Unique identifying numbers for this paper in the Digital Library or other systems.

Relationships

Collections

This paper is part of the following collection of related materials.

UNT Scholarly Works

Materials from the UNT community's research, creative, and scholarly activities and UNT's Open Access Repository. Access to some items in this collection may be restricted.

Related Items

Improving Access to Web Archives through Innovative Analysis of PDF Content (Presentation)

This presentation discusses improving access to web archives through innovative analysis of PDF content. It includes a background of the End of Term (EOT) 2008 Presidential Web Archive, a collaborative web archiving project, collection development with web archive content, and the workflow and processes involved in these projects.

Relationship to this item: (Is Version Of)

Improving Access to Web Archives through Innovative Analysis of PDF Content, ark:/67531/metadc155638

When

Dates and time periods associated with this paper.

Creation Date

  • April 2013

Added to The UNT Digital Library

  • April 16, 2013, 8:47 a.m.

Description Last Updated

  • Dec. 4, 2023, 11:56 a.m.

Phillips, Mark Edward & Murray, Kathleen R. Improving Access to Web Archives through Innovative Analysis of PDF Content, paper, April 2013; (https://digital.library.unt.edu/ark:/67531/metadc155622/: accessed February 8, 2025), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu.