The “Where” in the Tweet

Twitter is a widely used social networking service that lets its users post short text messages, so-called tweets. POI (Point of Interest) tags on tweets provide human-readable, high-level information about a place that is more meaningful and easier to interpret than a pair of coordinates. We studied the prediction of POI tags based on a tweet’s textual content and time of posting. Potential applications include accurate positioning when GPS devices fail and disambiguating places located near each other. We treat this task as a ranking problem, i.e., we rank a set of candidate POIs for a given tweet using statistical models of language use and of the temporal distribution of tweets. To tackle the sparsity of tweets tagged with POIs, we use web pages retrieved by search engines as an additional source of evidence. Our experiments show that tweets are indeed related to their places of origin in both the textual and temporal dimensions.
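To make the ranking formulation concrete, here is a minimal sketch of how candidate POIs could be scored for a tweet. It is an illustration under our own assumptions, not the models used in the study: each POI gets an add-alpha smoothed unigram language model and an hour-of-day histogram, and the two log-probabilities are simply summed. The function names (`train_poi_model`, `score`, `rank_pois`), the smoothing constant and the toy data are all hypothetical, and the web-page evidence mentioned above is left out.

```python
import math
from collections import Counter


def train_poi_model(tweets):
    """Build word and hour-of-day counts from (text, hour) pairs of one POI."""
    words, hours = Counter(), Counter()
    for text, hour in tweets:
        words.update(text.lower().split())
        hours[hour] += 1
    return words, hours


def score(model, text, hour, vocab_size, alpha=1.0):
    """Log-probability of a tweet under one POI model (assumed add-alpha smoothing)."""
    words, hours = model
    total_words = sum(words.values())
    total_hours = sum(hours.values())
    log_p = 0.0
    # Textual evidence: smoothed unigram likelihood of each token.
    for token in text.lower().split():
        log_p += math.log((words[token] + alpha) / (total_words + alpha * vocab_size))
    # Temporal evidence: smoothed probability of the posting hour (24 bins).
    log_p += math.log((hours[hour] + alpha) / (total_hours + alpha * 24))
    return log_p


def rank_pois(models, text, hour):
    """Return candidate POI names ordered from best to worst match."""
    vocab = set()
    for words, _ in models.values():
        vocab.update(words)
    vocab_size = max(len(vocab), 1)
    return sorted(
        models,
        key=lambda poi: score(models[poi], text, hour, vocab_size),
        reverse=True,
    )


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    models = {
        "cafe": train_poi_model([("great coffee and cake", 9), ("morning espresso", 8)]),
        "stadium": train_poi_model([("what a goal tonight", 20), ("great match", 21)]),
    }
    print(rank_pois(models, "coffee this morning", 9))
```

Summing the two log-probabilities treats text and time as independent given the POI, which is a simplifying assumption made for the sketch.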

This initial exploratory study will be presented as a poster at the 20th ACM International Conference on Information and Knowledge Management (CIKM) in Glasgow, UK.

How much Spam can you take?

Crowdsourcing is frequently used to obtain relevance judgments for query/document pairs. To get accurate judgments, each pair is judged by several workers. Consensus is usually determined by majority voting, and malicious submissions are typically countered by injecting gold set questions with known answers. We put the performance of gold sets and majority voting to the test. After analyzing crowdsourcing results for a relevance judgment task, we design and evaluate an alternative method to reduce spam and increase accuracy. Using large-scale simulations, we compare the performance of different algorithms, inspecting accuracy and costs for different experimental settings. The results show that gold sets and majority voting are less robust to malicious submissions than many believe and can easily be outperformed.
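For readers unfamiliar with the two baselines, the following sketch shows one way majority voting and gold-set filtering can be combined. It illustrates only the baselines, not the alternative method evaluated in the study, and the function names, the `min_accuracy` threshold and the toy judgments are our own assumptions.

```python
from collections import Counter, defaultdict


def gold_accuracy(judgments, gold):
    """Fraction of gold questions each worker answered correctly."""
    correct, seen = defaultdict(int), defaultdict(int)
    for worker, item, label in judgments:
        if item in gold:
            seen[worker] += 1
            correct[worker] += int(label == gold[item])
    return {worker: correct[worker] / seen[worker] for worker in seen}


def majority_vote(judgments, gold=None, min_accuracy=0.5):
    """Aggregate (worker, item, label) triples into one label per item.

    If a gold set is given, workers scoring below min_accuracy on it are
    ignored and the gold items themselves are left out of the result.
    Ties are not handled specially in this sketch.
    """
    gold = gold or {}
    accuracy = gold_accuracy(judgments, gold)
    blocked = {w for w, acc in accuracy.items() if acc < min_accuracy}
    votes = defaultdict(Counter)
    for worker, item, label in judgments:
        if worker in blocked or item in gold:
            continue
        votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


if __name__ == "__main__":
    # Toy data: workers a and b are honest, s and t always answer 0,
    # and u answers 0 but never saw a gold question.
    judgments = [
        ("a", "g1", 1), ("b", "g1", 1), ("s", "g1", 0), ("t", "g1", 0),
        ("a", "q1", 1), ("b", "q1", 1), ("s", "q1", 0), ("t", "q1", 0), ("u", "q1", 0),
    ]
    print(majority_vote(judgments))
    print(majority_vote(judgments, gold={"g1": 1}))
```

In the toy data, a worker who never sees a gold question cannot be filtered out, which hints at why these baselines can be fragile when spam is frequent.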

The full study has been accepted for publication in the SIGIR 2011 Workshop on Crowdsourcing for Information Retrieval (CIR) in Beijing, China.