Many generative language and relevance models assume conditional independence between the likelihood of observing individual terms. This assumption is obviously naive, but also hard to replace or relax. There are only very few term pairs that actually show significant conditional dependencies while the vast majority of co-located terms has no implications on the document’s topical nature or relevance towards a given topic. It is exactly this situation that we capture in a formal framework: A limited number of meaningful dependencies in a system of largely independent observations. Making use of the formal copula framework, we describe the strength of causal dependency in terms of a number of established term co-occurrence metrics. Our experiments based on the well known ClueWeb’12 corpus and TREC 2013 topics indicate significant performance gains in terms of retrieval performance when we formally account for the dependency structure underlying pieces of natural language text.
The full version of this work has been accepted for presentation at the 38th Annual ACM SIGIR Conference in Santiago, Chile.