End of Term Web Archive Dataset: Longitudinal Web Archive of .GOV and .MIL Domains
Description
This article describes the End of Term (EOT) Web Archive Dataset, a longitudinal dataset of the US federal web that includes publicly available .gov and .mil domains and was created during the 2008, 2012, 2016, and 2020 presidential elections in the United States. Based on the End of Term Web Archive, the dataset presents 461 TB of WARC data and accompanying derivative files in WAT, WET, and CDX formats.
Situated at the intersection of people, technology, and information, the College of Information's faculty, staff and students invest in innovative research, collaborative partnerships, and student-centered education to serve a global information society. The college offers programs of study in information science, learning technologies, and linguistics.
Physical Description
4 p.
Notes
Abstract: The End of Term (EOT) Web Archive Dataset presents a longitudinal dataset of the US federal web which includes publicly available .gov and .mil domains, created during the 2008, 2012, 2016, and 2020 presidential elections in the United States. Based on the End of Term Web Archive, this dataset presents 461TB of WARC data and accompanying derivative files in WAT, WET, and CDX format. A metadata sidecar file is also provided that contains content-based characterizations, including languages, character sets, format identifiers, and mime types. In addition to these derivative formats, CDX indexes in the ZipNum and Parquet formats that provide additional functionality to the dataset are included. The EOT dataset is freely available on the Amazon S3 platform as part of the Amazon Open Data Program.
Publication Title:
2023 ACM/IEEE Joint Conference on Digital Libraries
Page Start:
98
Page End:
101
Relationships
End of Term Web Archive Dataset: Longitudinal Web Archive of .GOV and .MIL Domains - ark:/67531/metadc2201582
Collections
This article is part of the following collection of related materials.
UNT Scholarly Works
Materials from the UNT community's research, creative, and scholarly activities and UNT's Open Access Repository. Access to some items in this collection may be restricted.
Related presentation describing the End of Term (EOT) Web Archive Dataset, a longitudinal dataset of the US federal web that includes publicly available .gov and .mil domains and was created during the 2008, 2012, 2016, and 2020 presidential elections in the United States. Based on the End of Term Web Archive, the dataset presents 461 TB of WARC data and accompanying derivative files in WAT, WET, and CDX formats.
End of Term Web Archive Dataset: Longitudinal Web Archive of .GOV and .MIL Domains - ark:/67531/metadc2201582
Phillips, Mark Edward; Phillips, Kristy & Alam, Sawood. End of Term Web Archive Dataset: Longitudinal Web Archive of .GOV and .MIL Domains, article, October 3, 2023; (https://digital.library.unt.edu/ark:/67531/metadc2201613/: accessed March 26, 2025), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu; crediting UNT College of Information.