Code4Lib Journal
Issue 19, 2013-01-15 ISSN 1940-5758
Metadata Analysis at the Command-Line
Over the past few years the University of North Texas Libraries' Digital Projects Unit (DPU) has developed a set of
metadata analysis tools, processes, and methodologies aimed at helping to focus limited quality control resources on the
areas of the collection where they might have the most benefit. The key to this work lies in its simplicity: records harvested
from OAI-PMH-enabled digital repositories are transformed into a format that makes them easily parsable using traditional
Unix/Linux-based command-line tools. This article describes the overall methodology, introduces two simple open-source
tools developed to help with the aforementioned harvesting and breaking, and provides example commands to demonstrate
some common metadata analysis requests. All software tools described in the article are available with an open-source
license via the author's GitHub account.
by Mark Phillips
The UNT Libraries' Digital Libraries Division is responsible for the creation and quality review of the majority of metadata
records in the UNT Libraries' Digital Collections. These collections contain items of similar format to other university library
collections of comparable size. Items in the collections include digitized and born-digital photographs, letters, documents,
maps, ledgers, technical reports, and theses and dissertations. The size and scope of these collections have grown
at an increasing rate over the past three years, with 83,000, 93,000, and 120,000 items added in fiscal
years 2010, 2011, and 2012, respectively. This continued growth means that a greater number of
metadata records are being created by an increasing number of metadata creators, which in turn causes wider variance in quality.
The need to analyze and report statistics for these metadata records has led the UNT Libraries to develop new tools and
processes to ensure that high quality metadata records are used throughout its digital library collections.
This article describes an approach in use at the UNT Libraries for harvesting metadata records from an OAI-PMH
repository and then transforming them into a simpler text format, which can easily be consumed by a number of standard
command-line tools freely available on most Unix- and Linux-based systems. Metadata quality, as defined for this article,
falls into three major areas.
Collection Level Analytics - How many of something are in the entire collection of metadata records? For example, how
many unique creators are represented in the collection? Which creator is associated with the most items? Which item in
the collection has the most creator, subject, or coverage instances? These analytics are helpful in communicating metrics
about the collection to others.
Metadata Completeness - How well does the collection's item-level metadata conform to various measures of
completeness? What are required fields for a given subset of metadata, and how well do the collection's metadata records
adhere to these requirements? How does this collection of metadata meet both metadata creator and metadata consumer
ideals of value?
Metadata Value Quality - Based on local requirements, how well do the values within a given metadata record match
those of standard or defined metadata specifications? For example, if a collection of metadata records utilizes the
Extended Date Time Format for the date values in the collection, how well do those values meet the
requirements of that format? Which values need to be changed in order to meet more of the requirements?
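As a rough illustration of the three areas above, the sketch below runs one check from each over a handful of flattened records. The tab-separated identifier, field, value layout, the sample records, the REQUIRED field list, and the simplified date pattern are all assumptions made for this example. They are not the output format of the tools described in this article, and the pattern covers only a small subset of the Extended Date Time Format.

```python
import re
from collections import Counter, defaultdict

# Hypothetical flattened records: one "identifier<TAB>field<TAB>value" line per value.
SAMPLE = """\
rec1\ttitle\tCounty map
rec1\tcreator\tSmith, Jane
rec1\tdate\t1923-05
rec2\ttitle\tLedger, volume 2
rec2\tcreator\tSmith, Jane
rec2\tcreator\tDoe, John
rec2\tdate\tMay 1923
rec3\tcreator\tSmith, Jane
rec3\tdate\t1950
"""

# Assumed local requirement: every record must carry these fields.
REQUIRED = ("title", "creator", "date")

# Simplified pattern: YYYY, YYYY-MM, or YYYY-MM-DD only (a small subset of EDTF).
SIMPLE_DATE = re.compile(r"\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?")


def parse(text):
    """Yield (identifier, field, value) tuples from tab-separated lines."""
    for line in text.splitlines():
        if line.strip():
            identifier, field, value = line.split("\t", 2)
            yield identifier, field, value


def creator_counts(rows):
    """Collection-level analytics: number of creator instances per name."""
    return Counter(value for _, field, value in rows if field == "creator")


def missing_fields(rows):
    """Completeness: map each incomplete record to the required fields it lacks."""
    present = defaultdict(set)
    for identifier, field, _ in rows:
        present[identifier].add(field)
    return {
        identifier: [f for f in REQUIRED if f not in fields]
        for identifier, fields in present.items()
        if not set(REQUIRED) <= fields
    }


def bad_dates(rows):
    """Value quality: date values that do not match the simplified pattern."""
    return [
        (identifier, value)
        for identifier, field, value in rows
        if field == "date" and not SIMPLE_DATE.fullmatch(value)
    ]


rows = list(parse(SAMPLE))
counts = creator_counts(rows)
incomplete = missing_fields(rows)
invalid = bad_dates(rows)

print(len(counts), "unique creators; most frequent:", counts.most_common(1))
print("records missing required fields:", incomplete)
print("date values needing review:", invalid)
```

Each function returns a short list of records or values to review, which is the kind of output that helps direct cleanup effort toward specific items.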
The tools and methodology explained in this article provide a way of identifying the areas of metadata cleanup on which to focus
our attention, and help answer the question, "If there is a limited amount of time to spend on metadata cleanup, what is the
best use of this time?"
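The transformation step described earlier, turning harvested OAI-PMH records into line-oriented text, might look roughly like the following. This is a minimal sketch rather than the tool introduced in this article: it assumes the records use the oai_dc (unqualified Dublin Core) format, the sample response and its identifier are invented, and the tab-separated output layout and the flatten function name are illustrative choices.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for OAI-PMH 2.0 and unqualified Dublin Core elements.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Invented sample response, used only to exercise the sketch.
SAMPLE_XML = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:rec1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>County map</dc:title>
          <dc:creator>Smith, Jane</dc:creator>
          <dc:creator>Doe, John</dc:creator>
          <dc:date>1923-05</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def flatten(xml_text):
    """Return one 'identifier<TAB>field<TAB>value' line per Dublin Core value."""
    lines = []
    root = ET.fromstring(xml_text)
    for record in root.iter(OAI + "record"):
        identifier = record.findtext(OAI + "header/" + OAI + "identifier")
        for element in record.iter():
            if not element.tag.startswith(DC):
                continue
            # Collapse internal whitespace so each value stays on one line.
            value = " ".join((element.text or "").split())
            if value:
                field = element.tag[len(DC):]
                lines.append(f"{identifier}\t{field}\t{value}")
    return lines


flat_lines = flatten(SAMPLE_XML)
print("\n".join(flat_lines))
```

Because every line carries its own record identifier and field name, lines in this shape can be filtered and counted with standard tools such as grep, sort, and uniq -c.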
Phillips, Mark Edward. Metadata Analysis at the Command-Line, article, January 15, 2013; (digital.library.unt.edu/ark:/67531/metadc157309/m1/1/: accessed February 23, 2017), University of North Texas Libraries, Digital Library, digital.library.unt.edu.