Article investigating the relationship between image resolution and OCR (optical character recognition) performance, with a focus on both character-level accuracy and the integrity of subsequent text processing pipelines. The findings have practical implications for document digitization workflows, especially in resource-constrained environments where storing and processing high-resolution images may be hard to justify. It was presented at the 3rd International Workshop on Digital Language Archives, held on December 15-16, 2025 as part of the ACM/IEEE Joint Conference on Digital Libraries 2025.
Situated at the intersection of people, technology, and information, the College of Information's faculty, staff and students invest in innovative research, collaborative partnerships, and student-centered education to serve a global information society. The college offers programs of study in information science, learning technologies, and linguistics.
Physical Description
5 p.
Notes
Abstract: Despite advancements in OCR algorithms, the quality of input images remains a critical factor influencing recognition accuracy and subsequent text processing. For digital libraries, one question remains open: at what resolution should documents be stored? One might expect the highest resolution to be the best choice, but input quality also affects data storage and computing time, and the real influence of image resolution (and size) on OCR and subsequent tasks remains unclear. High-resolution images typically allow OCR engines to better distinguish character features, leading to improved recognition performance. Conversely, low-resolution images often result in increased character ambiguity, misclassifications, and noise, thereby reducing overall OCR reliability. These recognition errors not only compromise the immediate output quality but also propagate into downstream text processing tasks such as information retrieval, named entity recognition, and natural language understanding. In this paper, we investigate the relationship between image resolution and OCR performance, with a focus on both character-level accuracy and the integrity of subsequent text processing pipelines. By analyzing OCR outputs across a range of resolutions and evaluating their impact on various post-recognition tasks, we seek to identify resolution thresholds that balance processing efficiency with textual fidelity. The findings have practical implications for document digitization workflows, especially in resource-constrained environments where storing and processing high-resolution images may be hard to justify.
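Illustrative note: the abstract refers to character-level accuracy but this record does not state which metric the authors used. A common choice for this kind of comparison is the character error rate (CER), the edit distance between a reference transcription and the OCR output divided by the reference length. The sketch below shows that calculation under that assumption; the function names are hypothetical and are not taken from the article.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop so the row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length (assumed metric, see note above)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under this assumption, one would run the same page through an OCR engine at several resolutions and compare each output against a single ground-truth transcription; a lower CER indicates a more accurate transcription at that resolution.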
Publication Title: Proceedings of the 3rd International Workshop on Digital Language Archives: LangArc 2025
Page Start: 25
Page End: 29
Peer Reviewed: Yes
Relationships
Proceedings of the International Workshop on Digital Language Archives: LangArc-2025, ark:/67531/metadc2543332
Collections
This article is part of the following collection of related materials.
International Workshop on Digital Language Archives
This interactive workshop explores a broad scope of issues related to digital language archives—digital libraries that preserve and provide online access to language data. The collection includes proceedings and articles from the workshop.
Conference proceedings of the 3rd International Workshop on Digital Language Archives held on December 15-16, 2025 as part of the ACM/IEEE Joint Conference on Digital Libraries 2025. It includes 11 peer-reviewed papers that were presented at the workshop and an introduction from the workshop organizers.
Relationship to this item: (Is Part Of)
Proceedings of the International Workshop on Digital Language Archives: LangArc-2025, ark:/67531/metadc2543332
Boubehziz, Toufik; Koudoro-Parfait, Caroline & Lejeune, Gaël. Assessing the Impact of Image Resolution on OCR Transcription Accuracy, article, December 30, 2025; (https://digital.library.unt.edu/ark:/67531/metadc2543322/: accessed March 16, 2026), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu; crediting UNT College of Information.