Category Archives: Paper Abstract

Ranking and Feedback-based Stopping for Recall-centric Document Retrieval

Medical systematic reviews require researchers to identify the entire body of relevant literature. Algorithms that filter the candidate list for manual screening while maintaining nearly perfect recall can significantly decrease the workload. This paper presents a novel stopping criterion that estimates the score distribution of relevant articles from relevance feedback on randomly sampled articles (S-D Minimal Sampling). Using 20 training and 30 test topics, we achieve a mean recall of 93.3% while filtering out 59.1% of the articles. This approach achieves higher F2 scores at significantly reduced manual reviewing workloads. The method is especially suited for scenarios with sufficiently many relevant articles (> 5) that can be sampled and employed for relevance feedback.
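
As a rough illustration of the idea, the Python sketch below estimates the score distribution of relevant articles from feedback on a random sample and derives a stopping threshold. The Gaussian assumption, function names, and the synthetic data are illustrative only, not the paper's exact S-D Minimal Sampling procedure.

    # Sketch of a distribution-based stopping criterion for recall-centric screening.
    import numpy as np
    from scipy.stats import norm

    def stopping_cutoff(scores, sampled_idx, sampled_labels, target_recall=0.95):
        """Score threshold below which manual screening may stop (illustrative)."""
        relevant = scores[sampled_idx][np.asarray(sampled_labels) == 1]
        if len(relevant) < 2:                 # too little feedback to fit a distribution
            return scores.min()               # degenerate case: review everything
        mu, sigma = relevant.mean(), relevant.std(ddof=1)
        # Threshold such that ~target_recall of the estimated relevant mass lies above it.
        return norm.ppf(1.0 - target_recall, loc=mu, scale=sigma)

    # Toy usage: screen articles in descending score order until the threshold is reached.
    rng = np.random.default_rng(0)
    scores = rng.normal(size=1000)
    sampled_idx = rng.choice(1000, size=30, replace=False)
    sampled_labels = (scores[sampled_idx] > 0.5).astype(int)   # synthetic feedback
    t = stopping_cutoff(scores, sampled_idx, sampled_labels)
    print(f"stop below score {t:.3f}; {(scores >= t).sum()} of {len(scores)} articles left to review")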

To appear in Proceedings of the International Conference of the Cross-Language Evaluation Forum for European Languages (CLEF), Dublin, Ireland, 2017

Neural Document Embeddings for Intensive Care Patient Mortality Prediction

The steadily growing amount of digitized clinical data such as health records, scholarly medical literature, systematic reviews of substances and procedures, or descriptions of clinical trials holds significant potential for exploitation by automatic inference and data mining techniques. Besides the wide range of clinical research questions such as drug-to-drug interactions or quantitative population studies of disease properties, there is a rich potential for applying data-driven methods in daily clinical practice for key tasks such as decision support or patient mortality prediction. The latter task is especially important in clinical practice when prioritizing allocation of scarce resources or determining the frequency and intensity of post-discharge care.

We present an automatic mortality prediction scheme based on the unstructured textual content of clinical notes. We propose a convolutional document embedding approach and, in an empirical investigation using the MIMIC-III intensive care database, show significant performance gains compared to previously employed methods such as latent topic distributions or generic doc2vec embeddings. These improvements are especially pronounced for the difficult problem of post-discharge mortality prediction.
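
A minimal sketch of a convolutional document classifier in this spirit (token embeddings, 1-D convolutions, max-over-time pooling, sigmoid output) is shown below; the architecture, dimensions, and vocabulary size are illustrative assumptions rather than the exact model used in the paper.

    import torch
    import torch.nn as nn

    class ConvDocClassifier(nn.Module):
        def __init__(self, vocab_size=50_000, emb_dim=128, n_filters=100, kernel_sizes=(3, 4, 5)):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
            self.convs = nn.ModuleList(nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes)
            self.out = nn.Linear(n_filters * len(kernel_sizes), 1)

        def forward(self, token_ids):                 # token_ids: (batch, seq_len)
            x = self.emb(token_ids).transpose(1, 2)   # (batch, emb_dim, seq_len)
            # Max-over-time pooling of each convolutional feature map.
            pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
            return torch.sigmoid(self.out(torch.cat(pooled, dim=1))).squeeze(1)

    # Toy forward pass on random token ids (hypothetical vocabulary and note length).
    model = ConvDocClassifier()
    notes = torch.randint(1, 50_000, (4, 512))        # 4 clinical notes, 512 tokens each
    print(model(notes))                               # predicted mortality probabilities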

This paper has been accepted for presentation at the NIPS 2016 Machine Learning for Health Workshop and has won a best abstract award at the Artificial Intelligence in Medicine Symposium.

Efficient Web Search Diversification via Approximate Graph Coverage

For ambiguous, underspecified queries, retrieval systems rely on result set diversification techniques in order to ensure an adequate coverage of underlying topics such that the average user will find at least one of the returned documents useful. Previous attempts at result set diversification employed computationally expensive analyses of document content and query intent. In this paper, we instead rely on the inherent structure of the Web graph. Drawing from the locally dense distribution of similar topics across the hyperlink graph, we cast the diversification problem as optimizing coverage of the Web graph. In order to reduce the computational burden, we rely on modern sketching techniques to obtain highly efficient yet accurate approximate solutions. Our experiments on a snapshot of Wikipedia as well as the ClueWeb’12 dataset show ranking performance and execution times competitive with the state of the art at dramatically reduced memory requirements.
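
The coverage view can be illustrated with a toy greedy selection over hyperlink neighbourhoods, shown below with plain Python sets; the paper's contribution is to replace such exact sets with compact probabilistic sketches, which this simplified example does not reproduce.

    def diversify(candidates, neighbourhood, k=5):
        """Greedy max-coverage: neighbourhood maps each doc id to a set of graph nodes."""
        covered, selected = set(), []
        remaining = set(candidates)
        for _ in range(min(k, len(remaining))):
            # Pick the document that adds the largest number of newly covered nodes.
            best = max(remaining, key=lambda d: len(neighbourhood[d] - covered))
            selected.append(best)
            covered |= neighbourhood[best]
            remaining.remove(best)
        return selected

    # Toy usage with a hypothetical link graph.
    neigh = {"d1": {1, 2, 3}, "d2": {2, 3, 4}, "d3": {7, 8}, "d4": {1, 7}}
    print(diversify(["d1", "d2", "d3", "d4"], neigh, k=2))   # -> ['d1', 'd3']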

This paper has been accepted for presentation at the ACM CIKM Workshop on Big Network Analytics.

Active Content-Based Crowdsourcing Task Selection

Crowdsourcing has long established itself as a viable alternative to corpus annotation by domain experts for tasks such as document relevance assessment. The crowdsourcing process traditionally relies on high degrees of label redundancy in order to mitigate the detrimental effects of individually noisy worker submissions. Such redundancy comes at the cost of increased label volume and, consequently, monetary cost. In practice, especially as dataset sizes increase, this is undesirable.
In this paper, we focus on an alternative method that instead exploits document information to infer relevance labels for unjudged documents. We present an active learning scheme for document selection that aims to maximise overall relevance label prediction accuracy for a given budget of relevance judgements by exploiting system-wide estimates of label variance and mutual information. Our experiments are based on TREC 2011 Crowdsourcing Track data and show that our method achieves state-of-the-art performance while requiring 17–25% less budget.
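
As a rough sketch of such a selection loop, the example below uses plain uncertainty sampling (Bernoulli label variance) with a logistic regression relevance model; the feature matrix, oracle, and budget are synthetic, and the paper's actual criterion additionally incorporates mutual information, which is omitted here.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def active_label(X, oracle, seed_idx, budget):
        """Iteratively query the crowd for the most uncertain unlabelled document."""
        labelled = list(seed_idx)
        labels = {i: oracle(i) for i in labelled}
        for _ in range(budget):
            clf = LogisticRegression(max_iter=1000).fit(X[labelled], [labels[i] for i in labelled])
            proba = clf.predict_proba(X)[:, 1]
            variance = proba * (1 - proba)       # Bernoulli label variance p(1 - p)
            variance[labelled] = -1.0            # never re-select judged documents
            pick = int(np.argmax(variance))
            labels[pick] = oracle(pick)          # spend one crowd judgement
            labelled.append(pick)
        final = LogisticRegression(max_iter=1000).fit(X[labelled], [labels[i] for i in labelled])
        return final.predict(X)                  # inferred labels for all documents

    # Toy usage with synthetic features and a perfect "crowd" oracle.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 5))
    true = (X[:, 0] > 0).astype(int)
    seed = list(np.flatnonzero(true == 0)[:2]) + list(np.flatnonzero(true == 1)[:2])
    preds = active_label(X, oracle=lambda i: true[i], seed_idx=seed, budget=20)
    print("label prediction accuracy:", (preds == true).mean())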

This paper has been accepted for presentation at the 25th ACM International Conference on Information and Knowledge Management (CIKM).

Implicit Negative Feedback in Clinical Information Retrieval

In this work, we reflect on ways to improve medical information retrieval accuracy by drawing implicit negative feedback from negated information in noisy natural language search queries. We begin by studying the extent to which negations occur in clinical texts and quantify their detrimental effect on retrieval performance. Subsequently, we present approaches to query reformulation and ranking that remedy these shortcomings by resolving natural language negations. Our experimental results are based on data collected in the course of the TREC Clinical Decision Support Track and show consistent improvements compared to state-of-the-art methods. Using one of our novel algorithms, we are able to alleviate the negative impact of negations on early precision.
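
A toy illustration of the negation-resolution step is given below: terms in the scope of a negation cue are separated from the positive query terms so that, for example, "no fever" stops promoting documents about fever. The cue list and the one-token scope heuristic are deliberate simplifications, not the algorithms evaluated in the paper.

    NEGATION_CUES = {"no", "not", "without", "denies", "absent"}

    def reformulate(query, scope=1):
        """Split a query into positive terms and negated terms (toy heuristic)."""
        tokens = query.lower().split()
        keep, exclude = [], []
        i = 0
        while i < len(tokens):
            if tokens[i] in NEGATION_CUES:
                exclude.extend(tokens[i + 1:i + 1 + scope])   # terms in the negation scope
                i += 1 + scope
            else:
                keep.append(tokens[i])
                i += 1
        return keep, exclude

    print(reformulate("persistent cough no fever"))
    # -> (['persistent', 'cough'], ['fever']); the excluded terms can then be dropped
    # from the query or used to penalise matching documents during ranking.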

This paper has been accepted for presentation at the ACM SIGIR Medical Information Retrieval Workshop (MedIR) in Pisa, Italy.

Privacy Leakage through Innocent Content Sharing in Online Social Networks

The increased popularity and ubiquitous availability of online social networks and globalised Internet access have affected the way in which people share content. The information that users willingly share in these platforms can be used for various purposes, from building consumer models for advertising, to inferring personal, potentially invasive, information.
In this work, we use Twitter, Instagram and Foursquare data to convey the idea that the content shared by users, especially when aggregated across platforms, can potentially disclose more information than was originally intended.
We perform two case studies. First, we address user de-anonymization by mimicking the scenario of finding the identity of a user making anonymous posts within a group of users. Empirical evaluation on a sample of real-world social network profiles suggests that cross-platform aggregation yields significant performance gains in user identification.
In the second task, we show that it is possible to infer a user’s physical location visits on the basis of shared Twitter and Instagram content. We present an informativeness scoring function that estimates the relevance and novelty of a shared piece of information with respect to an inference task. This measure is validated using an active learning framework that chooses the most informative content at each point in time. Based on a large-scale data sample, we show that this selection attains improved inference performance, in some cases even exceeding that obtained from the user’s full timeline.
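
A toy version of such a scoring function, combining cosine relevance to the inference target with novelty against already-selected content, is sketched below; the embeddings, the trade-off weight, and the greedy selection loop are illustrative assumptions rather than the exact measure used in the paper.

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def informativeness(post_vec, task_vec, selected_vecs, lam=0.7):
        relevance = cosine(post_vec, task_vec)
        novelty = 1.0 - max((cosine(post_vec, s) for s in selected_vecs), default=0.0)
        return lam * relevance + (1 - lam) * novelty

    def select_posts(post_vecs, task_vec, k=3):
        """Greedy active selection: pick the most informative unseen post at each step."""
        selected, chosen = [], []
        for _ in range(k):
            scores = [-1.0 if i in chosen else informativeness(v, task_vec, selected)
                      for i, v in enumerate(post_vecs)]
            best = int(np.argmax(scores))
            chosen.append(best)
            selected.append(post_vecs[best])
        return chosen

    rng = np.random.default_rng(2)
    posts = rng.normal(size=(50, 16))   # hypothetical embeddings of shared posts
    task = rng.normal(size=16)          # hypothetical location-inference target
    print(select_posts(posts, task))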

This paper has been accepted for presentation at the ACM SIGIR Workshop on Privacy-Preserving Information Retrieval (PIR) in Pisa, Italy.

Retrieval Techniques for Contextual Learning

Following constructivist models of contextual learning, knowledge acquisition goes beyond the mere absorption of isolated facts and is instead enabled, stimulated and supported by related existing knowledge and experiences. We discuss a range of query expansion and result list re-ranking techniques that aim to preserve contextual dependencies among retrieved documents and thereby enhance the performance of learning-centric search engines. Our empirical evaluation is based on a snapshot of Wikipedia; an interactive user study suggests significantly increased usability.
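
As a minimal, generic example of the query-expansion family referenced above (not the paper's contextual-learning techniques themselves), the sketch below blends frequent terms from the top-ranked documents into the query, pseudo-relevance-feedback style.

    from collections import Counter

    def expand_query(query_terms, top_docs, n_expansion=3):
        """top_docs: token lists of the initially retrieved documents (toy example)."""
        counts = Counter(t for doc in top_docs for t in doc if t not in query_terms)
        return list(query_terms) + [t for t, _ in counts.most_common(n_expansion)]

    docs = [["photosynthesis", "chlorophyll", "light", "energy"],
            ["chlorophyll", "plant", "light"],
            ["energy", "glucose", "plant"]]
    print(expand_query(["photosynthesis"], docs))
    # -> ['photosynthesis', 'chlorophyll', 'light', 'energy']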

This paper has been accepted for presentation at the ACM SIGIR Search as Learning Workshop (SAL) in Pisa, Italy.

Efficient Parallel Learning of Word2Vec

Since its introduction, Word2Vec and its variants have been widely used to learn semantics-preserving representations of words or entities in an embedding space, which can be used to produce state-of-the-art results for various natural language processing tasks. Existing implementations aim to learn efficiently by running multiple threads in parallel while operating on a single model in shared memory, ignoring incidental memory update collisions. We show that these collisions can degrade the efficiency of parallel learning, and propose a straightforward caching strategy that improves the efficiency by a factor of 4.
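
The gist of the caching idea can be sketched as follows: rather than writing every SGD update straight into the shared embedding matrix, where concurrent writes to hot rows collide, each worker buffers updates in a small private cache and flushes them in batches. Class and parameter names below are illustrative; the actual implementation operates inside the word2vec training loop.

    import numpy as np
    from collections import defaultdict

    class CachedUpdater:
        def __init__(self, shared_weights, flush_every=64):
            self.shared = shared_weights              # embedding matrix shared by all workers
            self.cache = defaultdict(lambda: 0.0)     # row index -> accumulated update
            self.pending = 0
            self.flush_every = flush_every

        def update(self, row, grad):
            self.cache[row] = self.cache[row] + grad  # buffer privately, no shared write yet
            self.pending += 1
            if self.pending >= self.flush_every:
                self.flush()

        def flush(self):
            for row, delta in self.cache.items():
                self.shared[row] += delta             # one shared write per touched row
            self.cache.clear()
            self.pending = 0

    # Toy usage: two workers (run sequentially here) each with a private cache.
    emb = np.zeros((5, 3))
    w1, w2 = CachedUpdater(emb, flush_every=2), CachedUpdater(emb, flush_every=2)
    w1.update(0, np.full(3, 0.1)); w2.update(0, np.full(3, 0.2))
    w1.update(1, np.full(3, 0.1)); w2.update(2, np.full(3, 0.2))
    w1.flush(); w2.flush()
    print(emb)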

This paper has been accepted for presentation at the ICML Machine Learning Systems Workshop in New York City, USA.

A Cross-Platform Collection of Social Network Profiles

The proliferation of Internet-enabled devices and services has led to a shifting balance between digital and analogue aspects of our everyday lives. In the face of this development there is a growing demand for the study of privacy hazards, the potential for unique user de-anonymization and information leakage between the various social media profiles many of us maintain. To enable the structured study of such adversarial effects, this paper presents a dedicated dataset of cross-platform social network personas (i.e., the same person has accounts on multiple platforms). The corpus comprises 850 users who generate predominantly English content. Each user entry contains the online footprint of the same natural person in three distinct social networks: Twitter, Instagram and Foursquare. In total, it encompasses over 2.5M tweets, 350k check-ins and 42k Instagram posts. We describe the collection methodology, characteristics of the dataset, and how to obtain it. Finally, we discuss one common use case, cross-platform user identification.

The dataset can be obtained in the data section, and its description has been accepted for presentation at ACM SIGIR 2016.

Probabilistic Bag-of-Hyperlinks Models for Entity Linking

The goal of entity linking is to map spans of text to canonical entity representations such as Freebase entries or Wikipedia articles. It provides a foundation for various natural language processing tasks, including text understanding, summarization and machine translation. Name ambiguity, word polysemy, context dependencies, and a heavy-tailed distribution of entities contribute to the complexity of this problem. We here propose a probabilistic approach that makes use of an effective graphical model for collective entity linking, which resolves entity links jointly across an entire document. Our model captures local information from linkable token spans (i.e., mentions) and their surrounding context and combines it with a document-level prior of entity co-occurrences. The model is acquired automatically from entity-linked text repositories with a lightweight computational step for parameter adaptation. Loopy belief propagation is then used as an efficient approximate inference algorithm. Our method does not require extensive feature engineering but relies on simple sufficient statistics extracted from data, thus making it sufficiently fast for real-time usage. We evaluate its performance on a wide range of well-known entity linking benchmark datasets, demonstrating that our approach matches, and in many cases outperforms, existing state-of-the-art methods.
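
The collective flavour of the model can be illustrated with the toy example below, in which each mention's belief over candidate entities combines a local mention-entity score with pairwise entity coherence across the document. The iterative update is a simplified mean-field-style stand-in for the paper's loopy belief propagation, and all scores are invented for the example.

    def collective_link(local, coherence, iters=10):
        """local: per-mention dicts {entity: score}; coherence: dict {(e1, e2): score}."""
        beliefs = [dict(cand) for cand in local]
        for _ in range(iters):
            new = []
            for i, cand in enumerate(local):
                scores = {}
                for e, s in cand.items():
                    coh = 1.0
                    for j, other in enumerate(beliefs):
                        if i == j:
                            continue
                        # expected coherence of e with the other mention's candidates
                        coh *= sum(p * coherence.get((e, e2), 0.1) for e2, p in other.items())
                    scores[e] = s * coh
                z = sum(scores.values()) or 1.0
                new.append({e: v / z for e, v in scores.items()})
            beliefs = new
        return [max(b, key=b.get) for b in beliefs]

    # Two ambiguous mentions; coherence pulls them towards the consistent entity pair.
    local = [{"Paris_city": 0.6, "Paris_Hilton": 0.4},
             {"France": 0.5, "Hilton_Hotels": 0.5}]
    coherence = {("Paris_city", "France"): 1.0, ("France", "Paris_city"): 1.0,
                 ("Paris_Hilton", "Hilton_Hotels"): 0.9, ("Hilton_Hotels", "Paris_Hilton"): 0.9}
    print(collective_link(local, coherence))   # -> ['Paris_city', 'France']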

This paper has been accepted for presentation at the World Wide Web Conference (WWW) 2016 in Montreal, Canada.