A Cross-Platform Collection of Social Network Profiles

The proliferation of Internet-enabled devices and services has led to a shifting balance between digital and analogue aspects of our everyday lives. In the face of this development there is a growing demand for the study of privacy hazards, the potential for unique user de-anonymization and information leakage between the various social media profiles many of us maintain. To enable the structured study of such adversarial effects, this paper presents a dedicated dataset of cross-platform social network personas (i.e., the same person has accounts on multiple platforms). The corpus comprises 850 users who generate predominantly English content. Each user entry contains the online footprint of the same natural person in three distinct social networks: Twitter, Instagram and Foursquare. In total, it encompasses over 2.5M tweets, 350k check-ins and 42k Instagram posts. We describe the collection methodology, characteristics of the dataset, and how to obtain it. Finally, we discuss one common use case, cross-platform user identification.

The dataset can be obtained in the data section and its description has been accepted for presentation at ACM SIGIR 2016.

Probabilistic Bag-of-Hyperlinks Models for Entity Linking

The goal of entity linking is to map spans of text to canonical entity representations such as Freebase entries or Wikipedia articles. It provides a foundation for various natural language processing tasks, including text understanding, summarization and machine translation. Name ambiguity, word polysemy, context dependencies, and a heavy-tailed distribution of entities contribute to the complexity of this problem. Here, we propose a probabilistic approach that makes use of an effective graphical model for collective entity linking, which resolves entity links jointly across an entire document. Our model captures local information from linkable token spans (i.e., mentions) and their surrounding context and combines it with a document-level prior of entity co-occurrences. The model is acquired automatically from entity-linked text repositories with a lightweight computational step for parameter adaptation. Loopy belief propagation is then used as an efficient approximate inference algorithm. Our method does not require extensive feature engineering but relies on simple sufficient statistics extracted from data, thus making it sufficiently fast for real-time usage. We evaluate it on a wide range of well-known entity linking benchmark datasets and show that our approach matches, and in many cases outperforms, existing state-of-the-art methods.
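The idea of combining a local mention-level score with a document-level co-occurrence prior can be illustrated with a toy sketch. This is not the model from the paper: it replaces loopy belief propagation with exhaustive search over all joint assignments (feasible only for a handful of mentions), and the function name, the candidate probabilities and the co-occurrence values are all invented for illustration.

```python
import math
from itertools import product


def link_jointly(candidates, cooccurrence, floor=1e-6):
    """Pick the jointly most probable entity for every mention.

    candidates:   mention -> {entity: local probability p(entity | mention)},
                  assumed strictly positive (hypothetical values).
    cooccurrence: frozenset({entity_a, entity_b}) -> pairwise compatibility.
    floor:        compatibility assumed for unseen entity pairs.
    """
    mentions = list(candidates)
    best, best_score = None, -math.inf
    # Brute force over all joint assignments; the paper uses loopy belief
    # propagation instead, which scales to realistic documents.
    for assignment in product(*(candidates[m] for m in mentions)):
        # Local evidence: how well each entity fits its own mention.
        score = sum(math.log(candidates[m][e]) for m, e in zip(mentions, assignment))
        # Document-level evidence: how well the chosen entities fit together.
        for i in range(len(assignment)):
            for j in range(i + 1, len(assignment)):
                pair = frozenset((assignment[i], assignment[j]))
                score += math.log(cooccurrence.get(pair, floor))
        if score > best_score:
            best, best_score = assignment, score
    return dict(zip(mentions, best))


candidates = {
    "Jordan": {"Michael_Jordan": 0.4, "Jordan_(country)": 0.6},
    "Bulls": {"Chicago_Bulls": 0.7, "Bull_(animal)": 0.3},
}
cooccurrence = {
    frozenset(("Michael_Jordan", "Chicago_Bulls")): 0.5,
    frozenset(("Jordan_(country)", "Bull_(animal)")): 0.01,
}
links = link_jointly(candidates, cooccurrence)
```

In this toy input, the mention "Jordan" taken on its own would favour the country, but the co-occurrence term pulls the joint decision towards the basketball player once "Bulls" is resolved in the same document.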

This paper has been accepted for presentation at the World Wide Web Conference (WWW) 2016 in Montreal, Canada.

Probabilistic Local Expert Retrieval

This paper proposes a range of probabilistic models of local expertise based on geo-tagged social network streams. We assume that frequent visits result in greater familiarity with the location in question. To capture this notion, we rely on spatio-temporal information from users’ online check-in profiles. We evaluate the proposed models on a large-scale sample of geo-tagged and manually annotated Twitter streams. Our experiments show that the proposed methods outperform both intuitive baselines and established models such as the Iterative Inference scheme.
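The underlying assumption, that frequent visits imply familiarity, can be made concrete with a minimal sketch. It is not one of the models proposed in the paper: it only estimates a smoothed P(user | location) from visit counts and ignores the temporal dimension entirely. The function name, the smoothing parameter `alpha` and the check-in data are assumptions made for illustration.

```python
from collections import Counter


def rank_local_experts(check_ins, location, alpha=1.0):
    """Rank users by an additively smoothed estimate of P(user | location).

    check_ins: iterable of (user, location) pairs (hypothetical format).
    alpha:     pseudo-count so that users without visits keep a small score.
    """
    check_ins = list(check_ins)
    visits = Counter(user for user, place in check_ins if place == location)
    users = sorted({user for user, _ in check_ins})
    total = sum(visits.values()) + alpha * len(users)
    scores = {user: (visits[user] + alpha) / total for user in users}
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


check_ins = [
    ("ann", "cafe"), ("ann", "cafe"), ("bob", "cafe"),
    ("bob", "gym"), ("cy", "gym"),
]
ranking = rank_local_experts(check_ins, "cafe")
```

Here "ann" ranks first for the cafe because she checked in there most often, while "cy" still receives a small non-zero score through smoothing.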

This paper has been accepted for presentation at ECIR 2016.

CIKM 2015 – Melbourne, Australia

My personal highlights among the oral paper presentations:

  • Chia-Jung Lee et al. An Optimization Framework for Merging Multiple Result Lists. The authors present a neural network-based approach to learning optimal result list fusion parameters for federated search.
  • David Maxwell et al. Searching and Stopping: An Analysis of Stopping Rules and Strategies. The authors investigate different models of search session termination, aiming to determine the point at which the user stops scanning the result list. To this end, they rely on behavioral theories of frustration and disgust.
  • Alessandro Sordoni et al. A Hierarchical Recurrent Encoder-Decoder for Generative Context-Aware Query Suggestion. This paper describes the use of hierarchical de-/encoders for query suggestion. Generating a word at a time, the method aims at suggesting contextualised query candidates while ensuring robustness to candidate frequency, making it an interesting option for tail information needs.
  • Tom Kenter et al. Ad Hoc Monitoring of Vocabulary Shifts over Time. The authors describe a distributional semantics approach to characterizing transient word meanings over time. Relying on semantics-preserving word embeddings, they are able to track changing term interpretations as well as changing terminology for the same concept as language and society evolve.
  • Daan Odijk et al. Struggling and Success in Web Search. (Best Student Paper) The paper describes a large-scale empirical study of search success as well as struggles in finding the desired content. The experiment leads to the development of a number of practical techniques for forecasting future user actions, ultimately making it possible to support users with systematic search strategy deficiencies.

SIGIR 2015, Santiago, Chile

My personal highlights among the oral paper presentations:

  • Bhaskar Mitra Exploring Session Context using Distributed Representations of Queries and Reformulations. The authors rely on convolutional neural networks in order to learn semantically similar query reformulation patterns. Each observed reformulation from the log is mapped into the vector space in order to group and forecast reformulations and, subsequently, improve query auto completion accuracy.
  • Christina Lioma, Jakob Grue Simonsen, Birger Larsen and Niels Dalum Hansen Non-Compositional Term Dependence for Information Retrieval. The authors tackle the challenge of estimating term dependencies by means of Markov random fields based on the notion of term compositionality, following the intuition that non-compositional terms show maximal dependence. In this way, they present an alternative to the popular co-occurrence based dependency estimation schemes.
  • Diane Kelly and Leif Azzopardi How many Results per Page? A Study of SERP Size, Search Behavior and User Experience. This paper studies the relationships among the number of results shown on a SERP, search behavior and user experience. The authors instrument the SERP, showing three, six or the standard ten organic links per page, investigating user experience as well as cognitive and physical workload.
  • Artem Grotov, Shimon Whiteson and Maarten de Rijke Bayesian Ranker Comparison based on Historical User Interactions. Instead of relying on live comparisons of production and candidate rankers, e.g., in an interleaving fashion, the authors propose a Bayesian scheme for estimating performance metrics and confidence levels on the basis of historic interactions. In this way, risky in vivo experiments can be avoided.

Exploiting Document Content for Efficient Aggregation of Crowdsourcing Votes

The use of crowdsourcing for document relevance assessment has been found to be a viable alternative to corpus annotation by highly trained experts. The question of quality control is a recurring challenge that is often addressed by aggregating multiple individual assessments of the same topic-document pair from independent workers. In the past, such aggregation schemes have been weighted or filtered by estimates of worker reliability based on a multitude of behavioral features. We propose an alternative approach by relying on document information. Inspired by the clustering hypothesis, we assume textually similar documents to show similar degrees of relevance towards a given topic. Following up on this intuition, we propagate crowd-generated relevance judgments to similar documents, effectively smoothing the distribution of relevance labels across the similarity space.

Our experiments are based on TREC Crowdsourcing Track data and show that even simple aggregation methods utilizing document similarity information significantly improve over majority voting in terms of accuracy as well as cost efficiency. Combining methods for both aggregation and active learning based on document information improves the results even further.
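The propagation idea can be sketched as a similarity-weighted vote. This is only an illustration under simplifying assumptions, not the aggregation method evaluated in the paper: documents are compared by cosine similarity over raw term counts, votes are binary, and the function names, the threshold and the example data are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def smoothed_labels(texts, votes, threshold=0.5):
    """Aggregate binary relevance votes, borrowing votes from similar documents.

    texts: doc_id -> document text.
    votes: doc_id -> list of 0/1 crowd votes (documents may have none).
    """
    vectors = {doc: Counter(text.lower().split()) for doc, text in texts.items()}
    labels = {}
    for doc in texts:
        weighted_sum = weight_total = 0.0
        for other, other_votes in votes.items():
            if not other_votes:
                continue
            # A document's own votes count fully, neighbours by similarity.
            weight = 1.0 if other == doc else cosine(vectors[doc], vectors[other])
            weighted_sum += weight * sum(other_votes)
            weight_total += weight * len(other_votes)
        labels[doc] = int(weight_total > 0 and weighted_sum / weight_total >= threshold)
    return labels


texts = {
    "d1": "solar panel efficiency study",
    "d2": "solar panel efficiency report",
    "d3": "football match results",
    "d4": "solar panel efficiency",
}
votes = {"d1": [1, 1, 1], "d2": [0], "d3": [0, 0]}
labels = smoothed_labels(texts, votes)
```

In this example, the single negative vote on "d2" is outweighed by the three positive votes on the near-identical "d1", and the unjudged "d4" inherits a label from its neighbours, whereas plain majority voting would label "d2" as non-relevant and leave "d4" without a label.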

This paper has been accepted for presentation at the 24th ACM Conference on Information and Knowledge Management (CIKM) in Melbourne, Australia.

An Eye-Tracking Study of Query Reformulation

Information about a user’s domain knowledge and interest can provide important signals for many information retrieval tasks such as query suggestion or result ranking. State-of-the-art user models rely on coarse-grained representations of the user’s previous knowledge about a topic or domain. We study query refinement using eye-tracking in order to gain precise and detailed insight into which terms the user was exposed to in a search session and which ones they showed a particular interest in. We measure fixations on the term level, allowing for a detailed model of user attention. To allow for widespread exploitation of our findings, we generalize from the restrictive eye-gaze tracking to using more accessible signals: mouse cursor traces. Based on the public API of a popular search engine, we demonstrate how query suggestion candidates can be ranked according to traces of user attention and interest, resulting in significantly better performance than achieved by an attention-oblivious industry solution. Our experiments suggest that modelling term-level user attention can be achieved with great reliability and holds significant potential for supporting a range of traditional IR tasks.

The full version of this work has been accepted for presentation at the 38th Annual ACM SIGIR Conference in Santiago, Chile.


Modelling Term Dependence with Copulas

Many generative language and relevance models assume conditional independence between the likelihood of observing individual terms. This assumption is obviously naive, but also hard to replace or relax. There are only very few term pairs that actually show significant conditional dependencies, while the vast majority of co-located terms have no implications for the document’s topical nature or relevance towards a given topic. It is exactly this situation that we capture in a formal framework: a limited number of meaningful dependencies in a system of largely independent observations. Making use of the formal copula framework, we describe the strength of causal dependency in terms of a number of established term co-occurrence metrics. Our experiments based on the well-known ClueWeb’12 corpus and TREC 2013 topics indicate significant gains in retrieval performance when we formally account for the dependency structure underlying pieces of natural language text.
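To illustrate how a copula couples two marginals with a tunable dependence strength, the following sketch uses the Clayton copula. The choice of the Clayton family is an assumption made purely for illustration; the abstract does not state which copula family the paper uses, and the inputs below are arbitrary values rather than term likelihoods from any corpus.

```python
def clayton_copula(u, v, theta):
    """Clayton copula C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta).

    u, v:  marginal cumulative probabilities in (0, 1].
    theta: dependence strength; 0 means independence, larger means stronger
           positive dependence (illustrative parameter, not from the paper).
    """
    if not (0.0 < u <= 1.0 and 0.0 < v <= 1.0):
        raise ValueError("u and v must lie in (0, 1]")
    if theta < 0.0:
        raise ValueError("this sketch only covers theta >= 0")
    if theta == 0.0:
        # Limiting case: the independence copula, i.e. the plain product.
        return u * v
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


independent = clayton_copula(0.5, 0.5, 0.0)  # 0.25, the product of the marginals
dependent = clayton_copula(0.5, 0.5, 2.0)    # larger than the product
```

With `theta` set to zero the joint value reduces to the product of the marginals, which corresponds to the independence assumption discussed above; a positive `theta` raises the joint value towards min(u, v), so dependence can be switched on only for the few term pairs where it matters.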

The full version of this work has been accepted for presentation at the 38th Annual ACM SIGIR Conference in Santiago, Chile.

ECIR 2015 in Vienna, Austria

These are my personal highlights of the oral paper presentations:

  • Morgan Harvey and Fabio Crestani Long Time, No Tweets! Time-aware Personalised Hashtag Suggestion. The authors recommend hashtag candidates for tweets in order to increase retrievability and organization of content in a microblogging environment. In particular, their method is based on temporal distribution patterns of tags observed in the training data.
  • Matthias Hagen et al.  A Corpus of Realistic Known-Item Topics with Associated Web Pages in the ClueWeb09. The authors present a collection of textual documents relating to the task of known item retrieval. Their selection was created by sampling questions from Yahoo Answers that were satisfied by resources in the ClueWeb’09 Web page corpus. As an aside, the authors annotate cases of false memories in which users’ original requests were misleading and needed substantial reformulation aid from the Q&A community.
  • Grace Yang et al.  Designing States, Actions, and Rewards for Using POMDP in Session Search. The authors present a model of user behaviour in search sessions based on reinforcement learning. In particular, they rely on Partially Observable Markov Decision Processes to capture the relevant components of the search process.
  • Horatiu Bota et al.  Exploring Composite Retrieval from the Users’ Perspective. (Best Paper) The authors study the emerging task of composite retrieval in which semantically related results from different content verticals are presented in so-called bundles. Based on an empirical study, they investigate bundle relevance, coherence and diversity.