Improving Access to Web Archives through Innovative Analysis of PDF Content

PDF Version Also Available for Download.

Description

This paper discusses improving access to web archives through innovative analysis of PDF content. It outlines the overall workflow and describes the tools used to extract document features. Findings suggest opportunities for developing retrieval tools that will provide new ways of selecting content and building collections from large Web archives.

Physical Description

7 p.

Creation Information

Phillips, Mark Edward & Murray, Kathleen R. April 2013.

Context

This paper is part of the collection entitled UNT Scholarly Works and was provided by UNT Libraries to the UNT Digital Library, a digital repository hosted by the UNT Libraries. It has been viewed 280 times. More information about this paper can be viewed below.

Who

People and organizations associated with either the creation of this paper or its content.

Authors

  • Phillips, Mark Edward
  • Murray, Kathleen R.

Rights Holders

For guidance see Citations, Rights, Re-Use.

  • Unknown

Provided By

UNT Libraries

Library facilities at the University of North Texas function as the nerve center for teaching and academic research. In addition to a major collection of electronic journals, books and databases, five campus facilities house just under six million cataloged holdings, including books, periodicals, maps, documents, microforms, audiovisual materials, music scores, full-text journals and books. A branch library is located at the University of North Texas Dallas Campus.

What

Descriptive information to help identify this paper. Follow the links below to find similar items on the Digital Library.

Notes

Reprinted with permission of IS&T: The Society for Imaging Science and Technology, sole copyright owners of 'IS&T's Archiving 2013.'

Abstract: In 2008 five United States institutions collaborated to archive the U.S. federal government Web presence: the Library of Congress, the Internet Archive, the California Digital Library, the Government Printing Office, and the University of North Texas (UNT). Their objective was to document the changes coincident with the shift in leadership of the U.S. executive branch. The five partners identified key resources from the U.S. .gov Top Level Domain and completed crawls from September 2008 until March 2009. The resulting End of Term (EOT) 2008 Web Archive, a 16 TB dataset, was distributed to partners interested in providing local services and access to the archive. The UNT Libraries investigated Portable Document Format (PDF) files, a class of content many information professionals associate with the traditional notion of “discrete documents”. Over four million unique PDF documents were extracted from the Archive and a series of metadata and information extraction processes were conducted for each document. Additionally, derivative raster images of the first page of each document were created. These metrics were ingested into a database for further analysis, which brought to light previously hidden characteristics of the federal government’s Web-published content. The paper discusses the overall workflow and describes the tools used to extract document features. Findings suggest opportunities for the development of retrieval tools that will provide new ways of selecting content and building collections from large Web archives.
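The abstract describes three steps that can be illustrated in code: reducing captured PDFs to unique documents, extracting features from each one, and ingesting those features into a database. The following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not name the tools or the deduplication method, so the SHA-1 content digest, the `pdf_features` table, the function names, and the `example.gov` sample captures are all hypothetical. The only feature taken from the PDF format itself is the version number in the `%PDF-x.y` file header.

```python
import hashlib
import re
import sqlite3

# A PDF file begins with a header of the form "%PDF-x.y".
PDF_HEADER = re.compile(rb"%PDF-(\d\.\d)")


def extract_features(url, payload):
    """Build a small feature record for one PDF payload (hypothetical schema)."""
    match = PDF_HEADER.match(payload)
    return {
        "digest": hashlib.sha1(payload).hexdigest(),  # assumed identity key
        "url": url,
        "size_bytes": len(payload),
        "pdf_version": match.group(1).decode("ascii") if match else None,
    }


def unique_records(captures):
    """Keep the first capture seen for each distinct content digest."""
    seen = {}
    for url, payload in captures:
        record = extract_features(url, payload)
        seen.setdefault(record["digest"], record)
    return list(seen.values())


def ingest(conn, records):
    """Store feature records in a SQLite table for later analysis."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pdf_features ("
        "digest TEXT PRIMARY KEY, url TEXT, "
        "size_bytes INTEGER, pdf_version TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO pdf_features "
        "VALUES (:digest, :url, :size_bytes, :pdf_version)",
        records,
    )
    conn.commit()


# Hypothetical sample data: two captures share identical content.
captures = [
    ("http://example.gov/a.pdf", b"%PDF-1.4\nfirst document"),
    ("http://example.gov/copy-of-a.pdf", b"%PDF-1.4\nfirst document"),
    ("http://example.gov/b.pdf", b"%PDF-1.6\nsecond document"),
]

records = unique_records(captures)
conn = sqlite3.connect(":memory:")
ingest(conn, records)

# Example analysis query: number of unique documents per PDF version.
versions = conn.execute(
    "SELECT pdf_version, COUNT(*) FROM pdf_features "
    "GROUP BY pdf_version ORDER BY pdf_version"
).fetchall()
print(versions)
```

Keying the table on the content digest means a document captured at several URLs is counted once, which is one plausible reading of "unique PDF documents" in the abstract.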

Source

  • IS&T: The Society for Imaging Science and Technology Archiving Conference, 2013, Washington, D.C., United States

Collections

This paper is part of the following collection of related materials.

UNT Scholarly Works

The Scholarly Works Collection is home to materials from the University of North Texas community's research, creative, and scholarly activities and serves as UNT's Open Access Repository. It brings together articles, papers, artwork, music, research data, reports, presentations, and other scholarly and creative products representing the expertise in our university community. Access to some items in this collection may be restricted.

Related Items

Improving Access to Web Archives through Innovative Analysis of PDF Content (Presentation)

This presentation discusses improving access to web archives through innovative analysis of PDF content. It includes a background of the End of Term (EOT) 2008 Presidential Web Archive, a collaborative web archiving project, collection development with web archive content, and the workflow and processes involved in these projects.

Relationship to this item: (Is Version Of)

When

Dates and time periods associated with this paper.

Creation Date

  • April 2013

Added to The UNT Digital Library

  • April 16, 2013, 8:47 a.m.

Description Last Updated

  • Jan. 9, 2015, 4:50 p.m.

Usage Statistics

When was this paper last used?

Yesterday: 0
Past 30 days: 0
Total Uses: 280

Citations, Rights, Re-Use

Phillips, Mark Edward & Murray, Kathleen R. Improving Access to Web Archives through Innovative Analysis of PDF Content, paper, April 2013; (digital.library.unt.edu/ark:/67531/metadc155622/: accessed March 28, 2017), University of North Texas Libraries, Digital Library, digital.library.unt.edu.