Search Results

Building Specialized Collections from Web Archives

Description: Presentation given at the Artificial Intelligence for Data Discovery and Reuse (AIDR) 2019 conference in Pittsburgh, Pennsylvania. This presentation discusses work on creating datasets of high-value publications and documents from web archives that can be used for machine learning research to help classify these large collections of data.
Date: May 2019
Creator: Caragea, Cornelia & Phillips, Mark Edward
Partner: UNT Libraries
open access

Building Specialized Collections from Web Archiving

Description: Short paper presented at Artificial Intelligence for Data Discovery and Reuse (AIDR). This short paper presents work on creating datasets of high-value publications and documents from web archives that can be used for machine learning research to help classify these large collections of data.
Date: May 2019
Creator: Caragea, Cornelia & Phillips, Mark Edward
Partner: UNT Libraries
open access

Programmatic Extraction of ‘Documents’ from Web Archives: Identifying Document Characteristics from Content Selector Interviews

Description: White paper documenting the results of interviews with professionals who manage collections of state or federal documents, and institutional repositories. These interviews gathered information about collection policies and characteristics of born-digital publications that are incorporated into these bodies of materials, to inform future machine learning algorithms.
Date: 2020
Creator: Fox, Nathaniel T.; Phillips, Mark Edward & Tarver, Hannah
Partner: UNT Libraries

UNT Scholarly Works PDF Dataset

Description: This dataset contains a set of 4,534 PDF files from the UNT Scholarly Works collection, the institutional repository for UNT in the UNT Digital Library. Each PDF has been labeled ForRepo because it has already been chosen for inclusion in the UNT Scholarly Works collection.
Date: September 12, 2018
Creator: Phillips, Mark Edward
Partner: UNT Libraries

Labeled PDF Dataset from End of Term (EOT) 2008 Web Archive

Description: This dataset contains a random sample of 2000 PDF documents from the usda.gov domain in the End of Term (EOT) 2008 Web Archive. These samples were categorized as being of interest for possible inclusion in the Technical Report Archive and Image Library (TRAIL). Each PDF has been sorted into two categories, Technical_Report and Not_Technical_Report.
Date: July 2018
Creator: Kirkwood, Patricia; Phillips, Mark Edward & Caldwell, Christopher
Partner: UNT Libraries

Extracting "Documents" from Web Archives

Description: Presentation given at the 2019 Texas Conference on Digital Libraries in Austin, Texas. This presentation discusses an IMLS-funded research grant to use machine learning techniques to help identify high-value publications from web archives.
Date: May 22, 2019
Creator: Phillips, Mark Edward; Caragea, Cornelia; Patel, Krutarth & Fox, Nathaniel T.
Partner: UNT Libraries

Leveraging Machine Learning to Extract Content-Rich Publications from Web Archives

Description: Poster presented at the 2019 Texas Conference on Digital Libraries (TCDL-2019). This poster discusses ways of identifying content-rich documents among the wealth of materials available via web archives. This research attempts to answer the following two research questions: 1. What role do web-published documents and publications play in developing collections in the broad categories of institutional repositories, state government documents, and publications from the federal government? …
Date: May 22, 2019
Creator: Fox, Nathaniel T. & Phillips, Mark Edward
Partner: UNT Libraries
open access

Programmatic Extraction of ‘Documents’ from Web Archives

Description: Data management plan for the grant "Programmatic Extraction of ‘Documents’ from Web Archives." This research project seeks to evaluate the use of machine learning algorithms to successfully identify and extract publications contained in existing Web archives. Identifying these documents will empower libraries, archives, and museums to meet their curatorial missions.
Date: 2017-12-01/2020-11-30
Creator: Phillips, Mark Edward & Caragea, Cornelia
Partner: UNT Libraries

Leveraging Machine Learning to Extract Content-Rich Publications from Web Archives

Description: Presentation for the 2019 International Internet Preservation Consortium General Assembly and Web Archiving Conference. This presentation discusses research into leveraging machine learning to identify PDFs relevant to a collection from archived records.
Date: June 6, 2019
Creator: Phillips, Mark Edward; Caragea, Cornelia; Patel, Krutarth & Fox, Nathaniel T.
Partner: International Internet Preservation Consortium

Labeled PDF Dataset from UNT.edu

Description: This dataset contains a random sample of 2000 PDF documents from the Spring 2017 Web Archive of the unt.edu domain (https://digital.library.unt.edu/ark:/67531/metadc993363/) that have been sorted into two categories, ForRepo and NotForRepo.
Date: November 15, 2017
Creator: Andrews, Pamela & Phillips, Mark Edward
Partner: UNT Libraries