In many domains of information retrieval, system estimates of document relevance are based on multidimensional quality criteria that have to be accommodated in a unidimensional result ranking. Current solutions to this challenge are often inconsistent with the formal probabilistic framework in which constituent scores were estimated, or use sophisticated learning methods that make it difficult for humans to understand the origin of the final ranking. To address these issues, we introduce the use of copulas, a powerful statistical framework for modeling complex multi-dimensional dependencies, to information retrieval tasks. We provide a formal background to copulas and demonstrate their effectiveness on standard IR tasks such as combining multidimensional relevance estimates and fusion of results from multiple search engines. We introduce copula-based versions of standard relevance estimators and fusion methods and show that these lead to significant performance improvements on several tasks, as evaluated on large-scale standard corpora, compared to their non-copula counterparts. We also investigate criteria for understanding the likely effect of using copula models in a given retrieval scenario.

This work, together with Arjen P. de Vries and Kevyn Collins-Thompson, has been accepted for full oral presentation at the 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) in Dublin, Ireland.
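
To make the idea of copula-based score combination more concrete, the following is a minimal sketch and not the method from the paper. It assumes two relevance dimensions per document, maps each dimension to (0, 1] through its empirical CDF, and joins the two with a Clayton copula. The Clayton family, the parameter `theta`, and the function names `empirical_cdf`, `clayton_copula` and `fuse` are illustrative assumptions.

```python
from bisect import bisect_right


def empirical_cdf(scores):
    """Map raw scores to (0, 1]: the fraction of scores less than or equal to each score."""
    ordered = sorted(scores)
    n = len(ordered)
    return [bisect_right(ordered, s) / n for s in scores]


def clayton_copula(u, v, theta=2.0):
    """Clayton copula CDF C(u, v) for theta > 0 and u, v in (0, 1]."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def fuse(scores_a, scores_b, theta=2.0):
    """Combine two per-document score lists into one copula-based score per document."""
    u = empirical_cdf(scores_a)
    v = empirical_cdf(scores_b)
    return [clayton_copula(ui, vi, theta) for ui, vi in zip(u, v)]


# Hypothetical scores for three documents on two relevance dimensions.
topicality = [0.2, 0.9, 0.5]
credibility = [10.0, 30.0, 20.0]
fused = fuse(topicality, credibility)
ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
```

Because the marginals are normalized before they are joined, the two dimensions may live on entirely different scales. A larger `theta` makes the Clayton copula penalize documents that score low on either dimension more strongly.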

On 26 April, the 13th edition of the Dutch-Belgian Information Retrieval Workshop series, DIR 2013, will be hosted at Delft University of Technology in the Netherlands. The workshop serves as a forum for exchange and discussion on relevant challenges in the fields of information retrieval, data mining and natural language processing. DIR invites novel, previously unpublished work, compressed presentations of previous major international contributions, as well as demonstrations of applied research and industry applications.

Exploiting User Comments for Audio-visual Content Indexing and Retrieval

State-of-the-art content sharing platforms often require users to assign tags to pieces of media in order to make them easily retrievable. Since this task is sometimes perceived as tedious or boring, annotations can be sparse. Commenting, on the other hand, is a frequently used means of expressing user opinion towards shared media items. We propose the use of time series analyses in order to infer potential tags and indexing terms for audio-visual content from user comments. In this way, we mitigate the vocabulary gap between queries and document descriptors. Additionally, we show how large-scale encyclopedias such as Wikipedia can aid the task of tag prediction by serving as surrogates for high-coverage natural language vocabulary lists. Our evaluation is conducted on a corpus of several million real-world user comments from the popular video sharing platform YouTube, and demonstrates significant improvements in retrieval performance.

This work, together with Wen Li and Arjen P. de Vries, has been accepted for full oral presentation at the 35th European Conference on Information Retrieval (ECIR) in Moscow, Russia.

Designing Human-Readable User Profiles for Search Evaluation

Forming an accurate mental model of a user is crucial for the qualitative design and evaluation steps of many information-centric applications such as web search, content recommendation, or advertising. This process can often be time-consuming as search and interaction histories become verbose. We present and analyze the usefulness of concise human-readable user profiles in order to enhance system tuning and evaluation by means of user studies.

This work, together with Kevyn Collins-Thompson, Paul Bennett and Susan Dumais, has been accepted for poster presentation at the 35th European Conference on Information Retrieval (ECIR) in Moscow, Russia.

Personalizing Atypical Web Search Sessions

State-of-the-art web search personalization treats users as static or slowly evolving entities with a given set of preferences defined by their past behavior. However, recent publications as well as empirical evidence suggest that there is a significant number of search sessions in which users diverge from their regular search profiles in order to satisfy atypical, non-recurring information needs. In this work, we conduct a large-scale inspection of real-life search sessions to further the understanding of this problem. Subsequently, we design an automatic means of detecting and supporting such atypical sessions. We demonstrate significant improvements over state-of-the-art web search personalization techniques by accounting for the typicality of search sessions. The merit of the proposed method is evaluated based on web-scale search session data spanning several months of user activity.

This work, together with Kevyn Collins-Thompson, Paul Bennett and Susan Dumais, has been accepted for full oral presentation at the ACM International Conference on Web Search and Data Mining (WSDM) in Rome, Italy.

The 35th ACM SIGIR Conference was held in Portland, Oregon, USA. Every three years, the Gerard Salton Award is presented for long-lasting achievements in the field of information retrieval. This year, Prof. Dr. Norbert Fuhr received the award.

The Downside of Markup: Examining the Harmful Effects of CSS and Javascript on Indexing Today’s Web

The continued development and maturation of advanced HTML features such as Cascading Style Sheets (CSS), JavaScript, and AJAX, as well as their widespread adoption by browsers, has enabled web pages to flourish with sophistication and interactivity. Unfortunately, this presents challenges to the web search community, as a web page’s representation in the browser (i.e., what users see) can diverge dramatically from its raw HTML content (i.e., what search engines index and retrieve). For example, interactive pages may contain content in regions that are not visible before a user action, such as focusing a tab, but which are nonetheless still contained within the raw HTML. We study this divergence by comparing raw HTML to its fully rendered form across a number of metrics spanning presentation, geometry, and content, using a large, representative sample of popular web pages. We find that a large divergence currently exists, and we show via a historical analysis that this divergence has grown more pronounced over the last decade. Finally, we conduct a retrieval experiment which shows that this divergence is already influencing web retrieval in a negative manner, and that we can improve performance by making use of properties that are only available via pages’ rendered forms. The general finding of our study is that continuing to index the web via simple HTML parsing will diminish the effectiveness of retrieval on the modern web.

This paper has been accepted for publication at CIKM’12, Maui, USA.
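
The divergence between raw HTML and what a user sees can be illustrated with a small sketch. It is not the measurement pipeline from the paper: it assumes well-formed markup and only recognizes elements hidden through an inline `display: none` style, whereas a real renderer would also apply external style sheets and scripts. The class name `VisibilitySplitter` is hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and therefore must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class VisibilitySplitter(HTMLParser):
    """Split page text into visible text and text hidden by an inline style."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one flag per open element: hidden by its own inline style?
        self.visible = []
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        self.stack.append("display:none" in style)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Text is hidden if any enclosing element is hidden.
            target = self.hidden if any(self.stack) else self.visible
            target.append(text)


page = '<div><p>Shown</p><div style="display: none"><p>Tab two</p></div></div>'
splitter = VisibilitySplitter()
splitter.feed(page)
```

An indexer that parses only the raw HTML would treat both strings alike, while a user initially sees only the first one.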

In recent years, Web culture has grown increasingly enticing, centring many services around social media and collaboratively shared content. The vast range of possible uses of such community platforms includes viral marketing, collaborative tagging, recommendation and content creation. BooksOnline’12 aims to offer a forum for bringing together expertise from academia, industry, libraries and archives to facilitate the exchange of research and application of social media and collaboratively shared content in the field of digital libraries, with a specific focus on online books. In particular, the impact and social use of this technology on younger users, so-called digital natives, is of great interest to a number of stakeholders, from DL researchers to educators and publishers. The focus of this year’s workshop will thus be on how to create engaging reading experiences that readers would want to share.

BooksOnline’12 will encourage strong exploitation of the incentives and benefits of these major forms of massive on-line collaborations for digital libraries.

Workshop Format

The one-day workshop will include selected oral and poster sessions to present and discuss ongoing research efforts, and a break-out session to brainstorm around new ideas, research directions, proposals and implementation strategies, finishing with presentations to summarize the results of the break-out sessions.

As in previous years, we plan to host keynote speakers who are prominent in the area. Previous keynote speakers included Adam Farquhar (The British Library), Ville Miettinen (Microtask), James Crawford (Google Books), John Ockerbloom (University of Pennsylvania), and Brewster Kahle (Internet Archive).

