Classification Of The End-Of-Term Archive: Extending Collection Development Practices To Web Archives Page: 10
IMLS Award Number LG-06-09-0174-09
Group   # Parents   % Clusters           Average Relatedness
                    (75-Cluster Set)     Category *
1       <2          32%                  2.76
2       3-4         35%                  2.65
3       5-11        33%                  1.69
* 1: little or no relation; 2: somewhat related; 3: strongly related
Table 5. Average relatedness category for clusters based on number of SuDocs parent authors
There were 39 identical clusters in the 55-set and the 75-set. Seventy-two percent (n = 28) of these
clusters had strongly related content (Table 6; RC3). The 16 remaining clusters in the 55-set subdivided
into 36 clusters in the 75-set. A higher percentage of these 36 clusters were in RC3 (64%) than were the 16
clusters in the 55-set (44%) from which they derived.
                                        Relatedness Category *
# Clusters   Cluster Set                RC1    RC2    RC3
130          Combined sets              21%    18%    61%
39           Identical in both sets     18%    10%    72%
16           Unique to 55-Set           25%    31%    44%
36           Unique to 75-Set           22%    14%    64%
* 1: little or no relation; 2: somewhat related; 3: strongly related
Table 6. Average relatedness category for clusters based on number of SuDocs parent authors
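The counts and percentages quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable and function names are hypothetical, the whole-number cluster counts behind the 44% and 64% figures are inferred from the rounded percentages rather than reported in the text, and the reading of the 130 "Combined sets" clusters as 55 + 75 (with the 39 identical clusters counted once per set) is an assumption.

```python
# Cluster counts quoted in the text and in Table 6.
IDENTICAL = 39       # clusters identical in the 55-set and the 75-set
UNIQUE_55 = 16       # remaining clusters in the 55-set
UNIQUE_75 = 36       # clusters in the 75-set into which those 16 subdivided
IDENTICAL_RC3 = 28   # identical clusters rated strongly related (RC3)


def percent(part, whole):
    """Return part/whole as a percentage rounded to a whole number."""
    return round(100 * part / whole)


def implied_count(pct, whole):
    """Return the whole-number count closest to pct percent of whole.

    This is an inference from a rounded percentage, not a reported value.
    """
    return round(pct * whole / 100)


set_55 = IDENTICAL + UNIQUE_55   # size of the 55-cluster set
set_75 = IDENTICAL + UNIQUE_75   # size of the 75-cluster set
combined = set_55 + set_75       # assumed meaning of the "Combined sets" row

rc3_identical_pct = percent(IDENTICAL_RC3, IDENTICAL)
rc3_unique_55 = implied_count(44, UNIQUE_55)
rc3_unique_75 = implied_count(64, UNIQUE_75)

print(set_55, set_75, combined)
print(rc3_identical_pct)
print(rc3_unique_55, rc3_unique_75)
```

Under these assumptions the set sizes reconcile with the table's cluster counts, and the inferred RC3 counts round back to the reported 44% and 64%.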
We found that specifying a larger number of clusters in the cluster analysis algorithm resulted in more
clusters whose members' websites contained content that was strongly related. While the optimal number
of clusters to specify is an unknown, it is helpful to know that more topically related content is likely to be
identified by specifying larger numbers. In our project this translates to numbers greater than the number
of actual parent agencies in the SuDocs scheme. Additionally, clusters that contain the websites of a single
federal government parent agency are more likely to be identified by specifying larger numbers.
We analyzed the 75-cluster set further to determine whether the number of cluster members, the total
number of SuDocs authors (i.e., both parent and subordinate agencies), or the number of SuDocs parent
authors alone affected the clusters' relatedness categories. As illustrated in Table 7, neither the averages
nor the ranges for these three characteristics varied substantially across the relatedness categories.
However, the average number of SuDocs parents decreased as the relatedness of the clusters increased. This
is consistent with the data reported in Table 3.
Hartman, Cathy Nelson; Murray, Kathleen R. & Phillips, Mark Edward. Classification Of The End-Of-Term Archive: Extending Collection Development Practices To Web Archives, report, February 2013; (https://digital.library.unt.edu/ark:/67531/metadc152437/m1/12/: accessed April 25, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu.