Web document clustering using hyperlink structures Page: 1 of 22
This report is part of the collection entitled: Office of Scientific & Technical Information Technical Reports and was provided to UNT Digital Library by the UNT Libraries Government Documents Department.
Extracted Text
The following text was automatically extracted from the image on this page using optical character recognition software:
1
WEB DOCUMENT CLUSTERING USING HYPERLINK
STRUCTURES
XIAOFENG HE*t, HONGYUAN ZHA*, CHRIS H.Q. DINGt AND HORST D. SIMONt
Abstract. With the exponential growth of information on the World Wide Web, there is great
demand for developing efficient and effective methods for organizing and retrieving the information
available. Document clustering plays an important role in information retrieval and taxonomy man-
agement for the World Wide Web and remains an interesting and challenging problem in the field
of web computing. In this paper we consider document clustering methods exploring textual infor-
mation, hyperlink structure and co-citation relations. In particular, we apply the normalized-cut
clustering method developed in computer vision to the task of hyperdocument clustering. We also
explore some theoretical connections of the normalized-cut method to K-means method. We then
experiment with normalized-cut method in the context of clustering query result sets for web search
engines.
Keywords. World Wide Web, graph partitioning, cheeger constant, clustering method, K-means
method, normalized cut method, eigenvalue decomposition, power method.
1. Introduction. Currently the World Wide Web contains billions of documents
and it is still growing rapidly. Finding the relevant documents to satisfy a user's infor-
mation need is a very important and challenging task. Many commercial search en-
gines have been developed and used by millions of people all over the world. However,
the relevancy of documents returned in search engine result sets is still lacking, and
further research and development is needed to really make search engines a ubiquitous
information-seeking tool. The World Wide Web has a rich structure: it contains both
textual web documents and the hyperlinks that connect them. The web documents
and hyperlinks between them form a directed graph in which the web documents can
be viewed as vertices and the hyperlinks as directed edges. Algorithms have been de-
veloped utilizing this directed graph to extract information contained in a collection
of hyperlinked web documents. Kleinberg proposed HITS algorithm based purely on
hyperlink information to retrieve the most relevant information: authority and hub
documents for a user query [20]. However, if the hypertext collection consists of sev-
eral topics, authority and hub documents may only cover the most popular topics and
leave out the less popular ones. One way to remedy this situation is to first partition
the hypertext collection into topical groups, and present the search results as a list
of topics to the user. This leads to the need to cluster web documents based on both
the textual and hyperlink information.
There exists a large literature on clustering methods and algorithms [13, 19]. Gen-
erally speaking, the purpose of cluster analysis is to organize the data into meaningful
groups: the data objects in the same group are highly similar and those in different
groups are dissimilar. Judging the effectiveness of a clustering algorithm is difficult
and usually application-dependent. In this paper, we apply a similarity-based cluster-
ing method to the problem of clustering web documents. It utilizes a graph-theoretic
criterion called normalized cut which has its root in the study of graph isoperimetric
* Department of Computer Science and Engineering, The Pennsylvania State University, Uni-
versity Park, PA 16802, {xhe,zha}fcse.psu.edu. This work was supported in part by NSF grant
CCR-9901986.
t NERSC Division, Lawrence Berkeley National Laboratory, University of California, Berkeley,
CA 94720, {xfhe,chqding,hdsimon}Clbl.gov. Supported by Department of Energy through an LBL
LDRD fund.
Upcoming Pages
Here’s what’s next.
Search Inside
This report can be searched. Note: Results may vary based on the legibility of text within the document.
Tools / Downloads
Get a copy of this page or view the extracted text.
Citing and Sharing
Basic information for referencing this web page. We also provide extended guidance on usage rights, references, copying or embedding.
Reference the current page of this Report.
He, Xiaofeng; Zha, Hongyuan; Ding, Chris H.Q & Simon, Horst D. Web document clustering using hyperlink structures, report, May 7, 2001; Berkeley, California. (https://digital.library.unt.edu/ark:/67531/metadc735609/m1/1/: accessed March 18, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu; crediting UNT Libraries Government Documents Department.