All posts by carsten

Ranking and Feedback-based Stopping for Recall-centric Document Retrieval

Medical systematic reviews require researchers to identify the entire body of relevant literature. Algorithms that filter the list for manual scanning with nearly perfect recall can significantly decrease the workload. This paper presents a novel stopping criterion that estimates the score distribution of relevant articles from relevance feedback on random articles (S-D Minimal Sampling). Using 20 training and 30 test topics, we achieve a mean recall of 93.3%, filtering out 59.1% of the articles. This approach achieves higher F2-scores at significantly reduced manual reviewing workloads. The method is especially suited for scenarios with sufficiently many relevant articles (> 5) that can be sampled and employed for relevance feedback.
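
For readers unfamiliar with the F2 measure mentioned above, here is a minimal sketch of the general F-beta score, the weighted harmonic mean of precision and recall. With beta = 2, recall counts more heavily than precision, which fits the recall-centric setting. The function name `f_beta` and the example precision value are illustrative assumptions, not taken from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily than precision; beta = 2 gives F2.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero: define the score as 0.
        return 0.0
    return (1 + b2) * precision * recall / denominator


if __name__ == "__main__":
    # Recall is the mean value reported in the abstract; the precision is
    # a hypothetical number chosen purely for illustration.
    recall = 0.933
    hypothetical_precision = 0.2
    print("F1:", f_beta(hypothetical_precision, recall, beta=1.0))
    print("F2:", f_beta(hypothetical_precision, recall, beta=2.0))
```

When recall is higher than precision, as in this illustration, the F2 value is larger than F1 because the recall term dominates.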

To appear in Proceedings of the International Conference of the Cross-Language Evaluation Forum for European Languages (CLEF), Dublin, Ireland, 2017

Neural Document Embeddings for Intensive Care Patient Mortality Prediction

The steadily growing amount of digitized clinical data such as health records, scholarly medical literature, systematic reviews of substances and procedures, or descriptions of clinical trials holds significant potential for exploitation by automatic inference and data mining techniques. Besides the wide range of clinical research questions such as drug-to-drug interactions or quantitative population studies of disease properties, there is a rich potential for applying data-driven methods in daily clinical practice for key tasks such as decision support or patient mortality prediction. The latter task is especially important in clinical practice when prioritizing allocation of scarce resources or determining the frequency and intensity of post-discharge care.

We present an automatic mortality prediction scheme based on the unstructured textual content of clinical notes. We propose a convolutional document embedding approach, and our empirical investigation using the MIMIC-III intensive care database shows significant performance gains compared to previously employed methods such as latent topic distributions or generic doc2vec embeddings. These improvements are especially pronounced for the difficult problem of post-discharge mortality prediction.

This paper has been accepted for presentation at the NIPS 2016 Machine Learning for Health Workshop and has won a best abstract award at the Artificial Intelligence in Medicine Symposium.

CIKM 2016 – Indianapolis

My personal highlights among the full paper presentations:

As well as some promising short papers:

Efficient Web Search Diversification via Approximate Graph Coverage

For ambiguous, underspecified queries, retrieval systems rely on result set diversification techniques in order to ensure an adequate coverage of underlying topics such that the average user will find at least one of the returned documents useful. Previous attempts at result set diversification employed computationally expensive analyses of document content and query intent. In this paper, we instead rely on the inherent structure of the Web graph. Drawing from the locally dense distribution of similar topics across the hyperlink graph, we cast the diversification problem as optimizing coverage of the Web graph. In order to reduce the computational burden, we rely on modern sketching techniques to obtain highly efficient yet accurate approximate solutions. Our experiments on a snapshot of Wikipedia as well as the ClueWeb’12 dataset show ranking performance and execution times competitive with the state of the art at dramatically reduced memory requirements.
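
To make the coverage view of diversification concrete, here is a generic greedy maximum-coverage heuristic over a link graph. It is only an illustration of the problem formulation and does not reproduce the sketch-based approximation used in the paper. The function `greedy_coverage`, the `neighbours` mapping and the toy graph are hypothetical.

```python
from typing import Dict, List, Set


def greedy_coverage(candidates: List[str],
                    neighbours: Dict[str, Set[str]],
                    k: int) -> List[str]:
    """Pick up to k documents, each time taking the one that adds the most
    not-yet-covered graph nodes (the document itself plus its neighbours).

    Ties are broken in favour of the earlier candidate, so the original
    ranking order acts as a tie-breaker.
    """
    covered: Set[str] = set()
    selected: List[str] = []
    remaining = list(candidates)

    def gain(doc: str) -> int:
        # Number of new nodes this document would add to the covered set.
        return len(({doc} | neighbours.get(doc, set())) - covered)

    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=gain)
        selected.append(best)
        covered |= {best} | neighbours.get(best, set())
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Toy hyperlink neighbourhoods (hypothetical data).
    links = {"a": {"x", "y", "z"}, "b": {"x", "y"}, "c": {"w"}}
    print(greedy_coverage(["b", "a", "c"], links, k=2))
```

In the toy graph, document "b" is skipped in the second round because its neighbourhood is already covered by "a", which is the redundancy-avoiding behaviour diversification aims for.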

This paper has been accepted for presentation at the ACM CIKM Workshop on Big Network Analytics.

Active Content-Based Crowdsourcing Task Selection

Crowdsourcing has long established itself as a viable alternative to corpus annotation by domain experts for tasks such as document relevance assessment. The crowdsourcing process traditionally relies on high degrees of label redundancy in order to mitigate the detrimental effects of individually noisy worker submissions. Such redundancy comes at the cost of increased label volume, and, subsequently, monetary requirements. In practice, especially as the size of datasets increases, this is undesirable.
In this paper, we focus on an alternative method that instead exploits document information to infer relevance labels for unjudged documents. We present an active learning scheme for document selection that aims at maximising the overall relevance label prediction accuracy for a given budget of available relevance judgements by exploiting system-wide estimates of label variance and mutual information. Our experiments are based on TREC 2011 Crowdsourcing Track data and show that our method is able to achieve state-of-the-art performance while requiring 17–25% less budget.

This paper has been accepted for presentation at the 25th ACM International Conference on Information and Knowledge Management (CIKM).

SIGIR 2016 – Pisa, Italy

My personal highlights among the full paper presentations:

As well as some promising short papers:

  • Tetsuya Sakai
    Two Sample T-tests for IR Evaluation: Student or Welch?
    The author contests the widely accepted notion that Welch’s t-test is unconditionally preferable to the two-sample Student’s t-test. The investigation concludes that, if sample sizes differ substantially and the larger sample has a substantially larger variance, Welch’s t-test may not be reliable.
  • Bevan Koopman et al.
    A Test Collection for Matching Patients to Clinical Trials
    The authors annotated 60 existing TREC CDS patient descriptions in terms of their eligibility for participation in a wide range of publicly advertised clinical trials.
  • Sumit Sidana et al.
    Health Monitoring on Social Media over Time
    Using a spatio-temporal topic modeling approach, the authors investigate which medical conditions people manifest on social media at different geographical locations as well as points in time (e.g., throughout the seasons).
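
As a reminder of how the two tests discussed in the first short paper differ, the following sketch computes both statistics with the Python standard library. Student's test pools the two sample variances, while Welch's test keeps them separate and adjusts the degrees of freedom (Welch-Satterthwaite). The function names and the score lists are hypothetical and not taken from the paper.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def student_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Two-sample Student t statistic (pooled variance) and its degrees of freedom."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, float(na + nb - 2)


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch t statistic (separate variances) with Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    sa, sb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(sa + sb)
    df = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
    return t, df


if __name__ == "__main__":
    # Hypothetical per-topic scores: a small, low-variance sample versus a
    # larger, high-variance one (the situation the paper singles out).
    small = [0.30, 0.32, 0.31]
    large = [0.10, 0.50, 0.20, 0.60, 0.40, 0.30, 0.70, 0.05]
    print("Student:", student_t(small, large))
    print("Welch:  ", welch_t(small, large))
```

With equal sample sizes the two t statistics coincide and only the degrees of freedom differ; with unequal sizes and variances, as in the example, the statistics themselves diverge.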

Implicit Negative Feedback in Clinical Information Retrieval

In this work, we reflect on ways to improve medical information retrieval accuracy by drawing implicit negative feedback from negated information in noisy natural language search queries. We begin by studying the extent to which negations occur in clinical texts and quantify their detrimental effect on retrieval performance. Subsequently, we present approaches to query reformulation and ranking that remedy these shortcomings by resolving natural language negations. Our experimental results are based on data collected in the course of the TREC Clinical Decision Support Track and show consistent improvements compared to state-of-the-art methods. Using one of our novel algorithms, we are able to alleviate the negative impact of negations on early precision.

This paper has been accepted for presentation at the ACM SIGIR Medical Information Retrieval Workshop (MedIR) in Pisa, Italy.

Privacy Leakage through Innocent Content Sharing in Online Social Networks

The increased popularity and ubiquitous availability of online social networks and globalised Internet access have affected the way in which people share content. The information that users willingly share on these platforms can be used for various purposes, from building consumer models for advertising to inferring personal, potentially invasive, information.
In this work, we use Twitter, Instagram and Foursquare data to convey the idea that the content shared by users, especially when aggregated across platforms, can potentially disclose more information than was originally intended.
We perform two case studies. First, we perform user de-anonymization by mimicking the scenario of finding the identity of a user making anonymous posts within a group of users. Empirical evaluation on a sample of real-world social network profiles suggests that cross-platform aggregation introduces significant performance gains in user identification.
In the second task, we show that it is possible to infer physical location visits of a user on the basis of shared Twitter and Instagram content. We present an informativeness scoring function which estimates the relevance and novelty of a shared piece of information with respect to an inference task. This measure is validated using an active learning framework which chooses the most informative content at each given point in time. Based on a large-scale data sample, we show that by doing this, we can attain an improved inference performance. In some cases this performance exceeds even the use of the user’s full timeline.

This paper has been accepted for presentation at the ACM SIGIR Workshop on Privacy-Preserving Information Retrieval (PIR) in Pisa, Italy.

Retrieval Techniques for Contextual Learning

Following constructivist models of contextual learning, knowledge acquisition goes beyond mere absorption of isolated facts and is instead enabled, stimulated and supported by related existing knowledge and experiences. We discuss a range of query expansion and result list re-ranking techniques that aim to preserve contextual dependencies among retrieved documents, thereby enhancing the performance of learning-centric search engines. Our empirical evaluation is based on a snapshot of Wikipedia and suggests significantly increased usability during an interactive user study.

This paper has been accepted for presentation at the ACM SIGIR Search as Learning Workshop (SAL) in Pisa, Italy.

Efficient Parallel Learning of Word2Vec

Since its introduction, Word2Vec and its variants have been widely used to learn semantics-preserving representations of words or entities in an embedding space, which can be used to produce state-of-the-art results for various Natural Language Processing tasks. Existing implementations aim to learn efficiently by running multiple threads in parallel while operating on a single model in shared memory, ignoring incidental memory update collisions. We show that these collisions can degrade the efficiency of parallel learning, and propose a straightforward caching strategy that improves the efficiency by a factor of 4.

This paper has been accepted for presentation at the ICML Machine Learning Systems Workshop in New York City, USA.