End of Term Web Archive Dataset: Longitudinal Web Archive of .GOV and .MIL Domains

PDF Version Also Available for Download.

Description

This article describes the End of Term (EOT) Web Archive Dataset, a longitudinal dataset of the US federal web that includes publicly available .gov and .mil domains, created during the 2008, 2012, 2016, and 2020 presidential elections in the United States. Based on the End of Term Web Archive, the dataset comprises 461 TB of WARC data and accompanying derivative files in WAT, WET, and CDX formats.

Physical Description

4 p.

Creation Information

Phillips, Mark Edward; Phillips, Kristy & Alam, Sawood. October 3, 2023.

Context

This article is part of the collection entitled: UNT Scholarly Works and was provided by the UNT College of Information to the UNT Digital Library, a digital repository hosted by the UNT Libraries. It has been viewed 36 times. More information about this article can be viewed below.

Who

People and organizations associated with either the creation of this article or its content.

Authors

Provided By

UNT College of Information

Situated at the intersection of people, technology, and information, the College of Information's faculty, staff and students invest in innovative research, collaborative partnerships, and student-centered education to serve a global information society. The college offers programs of study in information science, learning technologies, and linguistics.

What

Descriptive information to help identify this article. Follow the links below to find similar items on the Digital Library.

Notes

Abstract: The End of Term (EOT) Web Archive Dataset presents a longitudinal dataset of the US federal web which includes publicly available .gov and .mil domains, created during the 2008, 2012, 2016, and 2020 presidential elections in the United States. Based on the End of Term Web Archive, this dataset presents 461TB of WARC data and accompanying derivative files in WAT, WET, and CDX format. A metadata sidecar file is also provided that contains content-based characterizations, including languages, character sets, format identifiers, and mime types. In addition to these derivative formats, CDX indexes in the ZipNum and Parquet formats that provide additional functionality to the dataset are included. The EOT dataset is freely available on the Amazon S3 platform as part of the Amazon Open Data Program.
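As a rough illustration of how the CDX derivative files mentioned in the abstract might be used, the sketch below parses space-delimited CDX lines and counts captures by crawl year and MIME type. It assumes the common 11-field CDX layout (`CDX N b a m s k r M S V g`); the actual EOT files may use a different field order, so check each file's header line first. The sample records, hostnames, and file names are invented for illustration and do not come from the dataset.

```python
from collections import Counter

# Assumed field order: the common 11-field CDX layout ("CDX N b a m s k r M S V g").
# The EOT files may differ; verify against the header line of each file.
CDX_FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "metatags", "length", "offset", "filename",
]


def parse_cdx_line(line):
    """Return a dict for one CDX record, or None for header, blank, or malformed lines."""
    line = line.strip()
    if not line or line.startswith("CDX"):
        return None
    parts = line.split(" ")
    if len(parts) != len(CDX_FIELDS):
        return None
    return dict(zip(CDX_FIELDS, parts))


def count_by_year_and_mime(lines):
    """Count captures keyed by (four-digit crawl year, MIME type)."""
    counts = Counter()
    for line in lines:
        record = parse_cdx_line(line)
        if record is None:
            continue
        year = record["timestamp"][:4]  # timestamps have the form YYYYMMDDhhmmss
        counts[(year, record["mimetype"])] += 1
    return counts


# Hypothetical sample records, invented for illustration only.
SAMPLE = [
    " CDX N b a m s k r M S V g",
    "gov,example)/ 20081105120000 http://example.gov/ text/html 200 AAAA - - 1234 0 crawl-2008.warc.gz",
    "gov,example)/report.pdf 20081106120000 http://example.gov/report.pdf application/pdf 200 BBBB - - 9876 1234 crawl-2008.warc.gz",
    "gov,example)/about 20081107120000 http://example.gov/about text/html 200 CCCC - - 2222 11110 crawl-2008.warc.gz",
    "mil,example)/ 20201110120000 http://example.mil/ text/html 200 DDDD - - 3333 0 crawl-2020.warc.gz",
]

if __name__ == "__main__":
    for (year, mime), n in sorted(count_by_year_and_mime(SAMPLE).items()):
        print(year, mime, n)
```

The same tally could be computed from the Parquet indexes with a columnar query engine; plain-text parsing is used here only to keep the sketch free of external dependencies.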

Source

  • 2023 ACM/IEEE Joint Conference on Digital Libraries, Institute of Electrical and Electronics Engineers, October 3, 2023, pp. 1-4
  • 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL), June 26-30, 2023. Santa Fe, NM, United States

Language

Item Type

Identifier

Unique identifying numbers for this article in the Digital Library or other systems.

Publication Information

  • Publication Title: 2023 ACM/IEEE Joint Conference on Digital Libraries
  • Page Start: 98
  • Page End: 101

Relationships

Collections

This article is part of the following collection of related materials.

UNT Scholarly Works

Materials from the UNT community's research, creative, and scholarly activities and UNT's Open Access Repository. Access to some items in this collection may be restricted.

Related Items

End of Term Web Archive Dataset: Longitudinal Web Archive of .GOV and .MIL Domains (Presentation)

End of Term Web Archive Dataset: Longitudinal Web Archive of .GOV and .MIL Domains

This presentation describes the End of Term (EOT) Web Archive Dataset, a longitudinal dataset of the US federal web that includes publicly available .gov and .mil domains, created during the 2008, 2012, 2016, and 2020 presidential elections in the United States. Based on the End of Term Web Archive, the dataset comprises 461 TB of WARC data and accompanying derivative files in WAT, WET, and CDX formats.

End of Term Web Archive Dataset: Longitudinal Web Archive of .GOV and .MIL Domains - ark:/67531/metadc2201582

When

Dates and time periods associated with this article.

Creation Date

  • October 3, 2023

Added to The UNT Digital Library

  • Dec. 14, 2023, 5:14 a.m.

Description Last Updated

  • Jan. 8, 2024, 12:55 p.m.

Phillips, Mark Edward; Phillips, Kristy & Alam, Sawood. End of Term Web Archive Dataset: Longitudinal Web Archive of .GOV and .MIL Domains, article, October 3, 2023; (https://digital.library.unt.edu/ark:/67531/metadc2201613/: accessed March 26, 2025), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu; crediting UNT College of Information.
