Web document clustering using hyperlink structures Page: 3 of 22
This report is part of the collection entitled: Office of Scientific & Technical Information Technical Reports and was provided to UNT Digital Library by the UNT Libraries Government Documents Department.
Extracted Text
The following text was automatically extracted from the image on this page using optical character recognition software:
Web Document Clustering
unable to solve this problem. The authoritative documents returned to users by HITS
may represent only the most popular topic. Documents related to under-represented
topics have less chance to be returned to the users while perhaps they are what the
users are expecting. To overcome this problem, before ranking the web documents
and presenting them to the user, it's necessary to group the retrieved web documents
first, then return to the users the ranked documents for each group according to their
relevance.
The hyperlink structure of the World Wide Web provides us with rich information
on web communities. It contains a lot of latent human annotation of the web society.
This motivates us to cluster the web documents by partitioning the web link graph.
The web documents retrieved by the search engine are a mixture of high and low
qualities written by individuals of different backgrounds. To reduce the influence of the
low quality web documents and modulate other documents based on their relevance
during the partitioning, the text information needs to be incorporated into the link
graph as a factor of edge weight. The co-citation is another relevance measure between
two web documents based on how many other web documents create hyperlinks to
both of them. To further improve the quality of the clustering results, combining the
co-citation information into the link graph is also necessary.
3. Normalized cut, Cheeger constant and spectral graph partitioning.
In graph theory, a set of edges that separates the graph into two disconnected parts is
called an edge-cut of the graph. The edge-cut alone may cause unbalanced partition:
it tends to cut a portion of a graph with small number of vertices. In the context of
graph clustering, this is in general not desirable. Various graph partitioning criteria
have been proposed to measure the quality of the partitioning. In this section, we
concentrate on Cheeger constant and normalized cut, and we will also discuss their
relationship.
3.1. Cheeger constant and Normalized cut. To avoid partitioning out a
small part of a graph G by using edge-cut alone, many criteria utilize various nor-
malized forms of edge-cut which are generally obtained by dividing the edge-cut by
some measure of the size of the partitions. Given a graph G = (V, E) with edge
weight matrix W, let S be a subset of V and S = V - S, the complement of S in V.
One normalized form of edge-cut measuring the quality of the graph partition is the
Cheeger constant. It was studied by J. Cheeger in early 1970s [7].
Define
hG(S = cut(S, S)
hG (SS)
min(asso(S, V), asso(S, V))
where cut(S, S) = Eics,jcs W~j,asso(S, V) = Eiesjev W2. The Cheeger constant
hG of graph G is defined as:
hG = min hG(S)
S
Assuming asso(S, V) < asso(S, V), then
cut(S, V) = hG(S) - min(asso(S, V), asso(S, V))
> hG - asso(S,V).
The problem of determining the Cheeger constant is equivalent to solving the above
graph partition problem.3
Upcoming Pages
Here’s what’s next.
Search Inside
This report can be searched. Note: Results may vary based on the legibility of text within the document.
Tools / Downloads
Get a copy of this page or view the extracted text.
Citing and Sharing
Basic information for referencing this web page. We also provide extended guidance on usage rights, references, copying or embedding.
Reference the current page of this Report.
He, Xiaofeng; Zha, Hongyuan; Ding, Chris H.Q & Simon, Horst D. Web document clustering using hyperlink structures, report, May 7, 2001; Berkeley, California. (https://digital.library.unt.edu/ark:/67531/metadc735609/m1/3/: accessed March 18, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu; crediting UNT Libraries Government Documents Department.