Search Results

Mining Newspaper Archives

Description: This presentation discusses mining online newspaper archives. Topics include the types of information found in these newspapers; the technology and standards for digitizing newspapers and offering effective search and navigation; ways to search, view, and browse the newspapers; and how to use the search results.
Date: February 2, 2012
Creator: Carlisle, Tara & Murray, Kathleen R.
Item Type: Presentation
Partner: UNT Libraries

Processing Non-English Content

Description: This presentation was given at the National Digital Newspaper Program (NDNP) Awardee Conference in Washington, D.C. It describes the NDNP New Mexico project's experience encoding language codes in ALTO OCR files to enable enhanced discovery of its Spanish-language content on the Chronicling America website.
Date: September 27, 2012
Creator: Weidner, Andrew
Item Type: Presentation
Partner: UNT Libraries

Freeform Cursive Handwriting Recognition Using a Clustered Neural Network

Description: Optical character recognition (OCR) software has advanced greatly in recent years. Machine-printed text can be scanned and converted to searchable text with word accuracy rates around 98%. Reasonably neat hand-printed text can be recognized with about 85% word accuracy. However, cursive handwriting remains a challenge, with state-of-the-art performance still around 75%. Algorithms based on hidden Markov models have been only moderately successful, while recurrent neural networks have delivered the best results to date. This thesis explored the feasibility of using a special type of feedforward neural network to convert freeform cursive handwriting to searchable text. The hidden nodes in this network were grouped into clusters, with each cluster being trained to recognize a unique character bigram. The network was trained on writing samples that were pre-segmented and annotated. Post-processing was facilitated in part by using the network to identify overlapping bigrams that were then linked together to form words and sentences. With dictionary-assisted post-processing, the network achieved word accuracy of 66.5% on a small, proprietary corpus. The contributions in this thesis are threefold: 1) the novel clustered architecture of the feedforward neural network, 2) the development of an expanded set of observers combining image masks, modifiers, and feature characterizations, and 3) the use of overlapping bigrams as the textual working unit to assist in context analysis and reconstruction.
Date: August 2015
Creator: Bristow, Kelly H.
Partner: UNT Libraries

Portal to Texas History Newspaper OCR Text Dataset: Galveston

Description: Dataset of OCR text from The Portal to Texas History and the Texas Digital Newspaper Program. This dataset includes titles from Galveston, Texas, from the years 1849 to 1897. Titles in this dataset include Galveston Weekly News and The Galveston Daily News. In all, there are 8,136 issues comprising 56,953 pages of text.
Date: November 12, 2015
Creator: Phillips, Mark Edward
Item Type: Dataset
Partner: UNT Libraries

Portal to Texas History Newspaper OCR Text Dataset: Denton

Description: Dataset of OCR text from The Portal to Texas History and the Texas Digital Newspaper Program. This dataset includes titles from Denton, Texas, from the years 1892 to 1911. Titles in this dataset include Denton County News, Denton County Record and Chronicle, Denton Evening News, Legal Tender, Record and Chronicle, The Denton County Record, and The Denton Monitor. In all, there are 690 issues comprising 4,686 pages of text.
Date: November 12, 2015
Creator: Phillips, Mark Edward
Item Type: Dataset
Partner: UNT Libraries

Portal to Texas History Newspaper OCR Text Dataset: San Antonio

Description: Dataset of OCR text from The Portal to Texas History and the Texas Digital Newspaper Program. This dataset includes titles from San Antonio, Texas, from the years 1874 to 1920. Titles in this dataset include San Antonio Daily Express, San Antonio Daily Light, San Antonio Express, The Daily Express, and The San Antonio Light. In all, there are 6,866 issues comprising 130,726 pages of text.
Date: November 12, 2015
Creator: Phillips, Mark Edward
Item Type: Dataset
Partner: UNT Libraries

Portal to Texas History Newspaper OCR Text Dataset: Bryan

Description: Dataset of OCR text from The Portal to Texas History and the Texas Digital Newspaper Program. This dataset includes titles from Bryan, Texas, from the years 1883 to 1922. Titles in this dataset include Bryan Daily Eagle, Bryan Daily Eagle and Pilot, Bryan Morning Eagle, Bryan Morning Eagle and Pilot, The Brazos Weekly Pilot, The Bryan Daily Eagle, The Bryan Eagle, and The Bryan Weekly Eagle and Pilot. In all, there are 5,843 issues comprising 27,360 pages of text.
Date: November 12, 2015
Creator: Phillips, Mark Edward
Item Type: Dataset
Partner: UNT Libraries

Portal to Texas History Newspaper OCR Text Dataset: El Paso

Description: Dataset of OCR text from The Portal to Texas History and the Texas Digital Newspaper Program. This dataset includes titles from El Paso, Texas, from the years 1881 to 1921. Titles in this dataset include El Paso Daily Herald, El Paso Daily Times, El Paso Herald, El Paso International Daily Times, El Paso Morning Times, El Paso Sunday Times, El Paso Times, The El Paso Daily Times, and The El Paso Time. In all, there are 17,104 issues comprising 177,640 pages of text.
Date: November 12, 2015
Creator: Phillips, Mark Edward
Item Type: Dataset
Partner: UNT Libraries

Portal to Texas History Newspaper OCR Text Dataset: Brenham

Description: Dataset of OCR text from The Portal to Texas History and the Texas Digital Newspaper Program. This dataset includes titles from Brenham, Texas, from the years 1876 to 1923. Titles in this dataset include Brenham Banner, Brenham Daily Banner, Brenham Daily Banner-Press, Brenham Evening Press, Brenham Weekly Banner, Brenham Weekly Banner-Press, and The Daily Banner. In all, there are 10,720 issues comprising 50,368 pages of text.
Date: November 12, 2015
Creator: Phillips, Mark Edward
Item Type: Dataset
Partner: UNT Libraries

Portal to Texas History Newspaper OCR Text Dataset: Gainesville

Description: Dataset of OCR text from The Portal to Texas History and the Texas Digital Newspaper Program. This dataset includes titles from Gainesville, Texas, from the years 1888 to 1897. Titles in this dataset include The Daily Hesperian and The Gainesville Daily Hesperian. In all, there are 2,286 issues comprising 9,359 pages of text.
Date: November 12, 2015
Creator: Phillips, Mark Edward
Item Type: Dataset
Partner: UNT Libraries

Portal to Texas History Newspaper OCR Text Dataset: McKinney

Description: Dataset of OCR text from The Portal to Texas History and the Texas Digital Newspaper Program. This dataset includes titles from McKinney, Texas, from the years 1880 to 1936. Titles in this dataset include Collin County Mercury, McKinney Weekly Democrat-Gazette, The Daily Courier, The Daily Gazette, The Democrat, The Democrat-Gazette, The Lion Roar, The McKinney Advocate, The McKinney Examiner, The McKinney Gazette, The Semi-Weekly Courier, The Southern Jerseyite, and The Weekly Democrat-Gazette. In all, there are 1,568 issues comprising 12,975 pages of text.
Date: November 12, 2015
Creator: Phillips, Mark Edward
Item Type: Dataset
Partner: UNT Libraries

Portal to Texas History Newspaper OCR Text Dataset: Temple

Description: Dataset of OCR text from The Portal to Texas History and the Texas Digital Newspaper Program. This dataset includes one title from Temple, Texas, from the years 1907 to 1922: Temple Daily Telegram. In all, there are 4,627 issues comprising 44,633 pages of text.
Date: November 12, 2015
Creator: Phillips, Mark Edward
Item Type: Dataset
Partner: UNT Libraries

Portal to Texas History Newspaper OCR Text Dataset: Abilene

Description: Dataset of OCR text from The Portal to Texas History and the Texas Digital Newspaper Program. This dataset includes titles from Abilene, Texas, from the years 1888 to 1923. Titles in this dataset include Abilene Daily Reporter, Abilene Morning Reporter, Abilene Semi-Weekly Farm Reporter, Abilene Semi-Weekly Reporter, Abilene Weekly Reporter, The Abilene Reporter, The Abilene Semi-Weekly Reporter, and The Abilene Weekly Reporter. In all, there are 7,208 issues comprising 62,871 pages of text.
Date: November 12, 2015
Creator: Phillips, Mark Edward
Item Type: Dataset
Partner: UNT Libraries

Portal to Texas History Newspaper OCR Text Dataset: Fort Worth

Description: Dataset of OCR text from The Portal to Texas History and the Texas Digital Newspaper Program. This dataset includes titles from Fort Worth, Texas, from the years 1883 to 1896. Titles in this dataset include Fort Worth Daily Gazette, Fort Worth Gazette, and Fort Worth Weekly Gazette. In all, there are 4,146 issues comprising 36,199 pages of text.
Date: November 12, 2015
Creator: Phillips, Mark Edward
Item Type: Dataset
Partner: UNT Libraries

Congressional Globe OCR Dataset

Description: Dataset of OCR text from the Congressional Globe collection in the UNT Digital Library. In all, this dataset contains 112 volumes and 104,615 pages of text.
Date: April 6, 2015
Creator: Phillips, Mark Edward
Item Type: Dataset
Partner: UNT Libraries

Portal to Texas History Newspaper OCR Text Dataset: Houston

Description: Dataset of OCR text from The Portal to Texas History and the Texas Digital Newspaper Program. This dataset includes titles from Houston, Texas, from the years 1893 to 1924. Titles in this dataset include The Houston Daily Post and The Houston Post. In all, there are 9,855 issues comprising 184,900 pages of text.
Date: November 12, 2015
Creator: Phillips, Mark Edward
Item Type: Dataset
Partner: UNT Libraries