For ambiguous, underspecified queries, retrieval systems rely on result set diversification techniques in order to ensure an adequate coverage of underlying topics such that the average user will find at least one of the returned documents useful. Previous attempts at result set diversification employed computationally expensive analyses of document content and query intent. In this paper, we instead rely on the inherent structure of the Web graph. Drawing from the locally dense distribution of similar topics across the hyperlink graph, we cast the diversification problem as optimizing coverage of the Web graph. In order to reduce the computational burden, we rely on modern sketching techniques to obtain highly efficient yet accurate approximate solutions. Our experiments on a snapshot of Wikipedia as well as the ClueWeb’12 dataset show ranking performance and execution times competitive with the state of the art at dramatically reduced memory requirements.
This paper has been accepted for presentation at the ACM CIKM Workshop on Big Network Analytics.