Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval 2010
DOI: 10.1145/1835449.1835542

Do user preferences and evaluation measures line up?

Abstract: This paper presents results comparing user preference for search engine rankings with measures of effectiveness computed from a test collection. It establishes that preferences and evaluation measures correlate: systems measured as better on a test collection are preferred by users. This correlation is established both for conventional relevance-based evaluation and for evaluation that emphasizes diverse results. The nDCG and ERR measures were found to correlate best with user preferences compared to a selection of other well-known measures. Unlike previous studies in…
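
As a rough illustration of the two measures named in the abstract, the sketch below computes nDCG and ERR for a single ranked list of graded relevance gains, using their standard exponential-gain formulations. The function names, gain scale, and cutoff handling are illustrative assumptions, not the paper's exact experimental configuration.

```python
import math

def dcg(gains, k=None):
    """Discounted cumulative gain of a ranked list of graded gains."""
    if k is not None:
        gains = gains[:k]
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains, pool_gains, k=None):
    """nDCG: DCG normalised by the DCG of an ideal (descending) ordering
    of all judged gains for the topic (pool_gains)."""
    ideal = dcg(sorted(pool_gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

def err(gains, g_max):
    """Expected Reciprocal Rank: probability-weighted reciprocal rank of
    the position where a simulated user stops, satisfied."""
    not_stopped, score = 1.0, 0.0
    for rank, g in enumerate(gains, start=1):
        r = (2 ** g - 1) / 2 ** g_max   # chance of being satisfied at this rank
        score += not_stopped * r / rank
        not_stopped *= 1 - r
    return score
```

For example, `ndcg([3, 2, 0, 1], [3, 3, 2, 1, 0], k=4)` scores one ranking against the ideal ordering of the judged pool, and `err([3, 2, 0, 1], g_max=3)` gives the corresponding ERR value.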

Cited by 114 publications (89 citation statements). References 26 publications (26 reference statements).
“…Users preferred results 1-10. Sanderson et al. [11] used a similar interface with Mechanical Turk to validate a set of test-collection-based metrics. NDCG agreed the most with user preferences (63% agreement overall and 82% for navigational queries).…”
Section: Related Work (mentioning)
confidence: 99%
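
The agreement figures quoted above can be computed with a simple tally: for each side-by-side comparison, check whether the ranking the metric scores higher is also the one the user preferred. A minimal sketch, assuming a flat list of (metric score A, metric score B, user-prefers-A) records rather than the cited paper's actual data format:

```python
def preference_agreement(comparisons):
    """comparisons: iterable of (score_a, score_b, user_prefers_a) tuples,
    where the scores come from one metric (e.g. nDCG) for the two rankings
    shown side by side. Ties, where the metric has no preference, are skipped.
    Returns the fraction of remaining comparisons on which the metric and
    the user agree."""
    decided = [(a > b) == prefers_a
               for a, b, prefers_a in comparisons if a != b]
    return sum(decided) / len(decided) if decided else 0.0
```

Per-category figures such as the 82% for navigational queries follow by filtering the comparisons before calling the function.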
“…Following Sanderson et al. [11], quality control was done by including 150 "trap" HITs (a Human Intelligence Task is a task associated with AMT). Each trap HIT consisted of a triplet (q, i, j) where either i or j was taken from a query other than q.…”
Section: Preference Judgements on Block-pairs (mentioning)
confidence: 99%
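
The trap-HIT scheme lends itself to a short sketch: build triplets (q, i, j) where one of the two result lists is deliberately taken from a different query, so workers who answer without reading can be detected. The data layout and the random left/right placement below are assumptions for illustration, not the cited paper's code:

```python
import random

def make_trap_hits(result_lists, n_traps=150, seed=0):
    """result_lists: dict mapping each query to its candidate result lists.
    Returns n_traps triplets (q, i, j) in which either i or j belongs to a
    query other than q."""
    rng = random.Random(seed)
    queries = list(result_lists)
    traps = []
    for _ in range(n_traps):
        q, other = rng.sample(queries, 2)
        genuine = rng.choice(result_lists[q])
        foreign = rng.choice(result_lists[other])
        # place the foreign list on the left or right at random
        i, j = (genuine, foreign) if rng.random() < 0.5 else (foreign, genuine)
        traps.append((q, i, j))
    return traps
```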
“…Research is ongoing to tackle a range of issues in information retrieval evaluation using test collections. For example, gathering relevance assessments efficiently (see Section 3.4), comparing system effectiveness and user utility (Hersh et al., 2000a; Sanderson et al., 2010), evaluating information retrieval systems over sessions rather than single queries (Kanoulas et al., 2011), the use of simulations (Azzopardi et al., 2010), and the development of new information retrieval evaluation measures (Yilmaz et al., 2010; Smucker and Clarke, 2012). Further information about the practical construction of test collections can be found in (Sanderson, 2010; Clough and Sanderson, 2013).…”
Section: Evaluation Using Test Collections (mentioning)
confidence: 99%
“…Using the Amazon Mechanical Turk framework and the TREC 2009 Web diversity test collection with binary relevance assessments, Sanderson et al. [26] examined the predictive power of diversity metrics such as α-nDCG: if a metric prefers one ranked list over another, does the user also prefer the same list? While our concordance test for quantifying the "relative intuitiveness" of diversity metrics was partially inspired by the side-by-side approach of Sanderson et al., their work and ours fundamentally differ in the following aspects: (1) while Sanderson et al. treated each subtopic (i.e., intent) as an independent topic to examine the relationship between user preferences and metric preferences, we aim to measure the intuitiveness of metrics with respect to the entire (ambiguous or underspecified) topic in terms of diversity and relevance; (2) while Sanderson et al. used Mechanical Turkers, we use very simple evaluation metrics that represent diversity or relevance as the gold standard in order to quantify intuitiveness.…”
Section: Comparing Diversity Metrics (mentioning)
confidence: 99%
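
A simplified version of the concordance test described above can be sketched as follows: over pairs of ranked lists for the same topic, count how often a candidate diversity metric (such as α-nDCG) prefers the same list as a simple gold-standard metric (such as subtopic recall), restricted to pairs where both express a preference. The signatures are assumptions; the cited work additionally pits two candidate metrics against each other on the pairs where they disagree.

```python
def concordance(candidate, gold, list_pairs):
    """candidate, gold: functions mapping a ranked list to a score.
    list_pairs: iterable of (list_a, list_b) pairs for the same topic.
    Returns the fraction of pairs, among those where both metrics express
    a preference, on which the candidate agrees with the gold standard."""
    agree = total = 0
    for a, b in list_pairs:
        c = candidate(a) - candidate(b)
        g = gold(a) - gold(b)
        if c == 0 or g == 0:
            continue  # at least one metric is indifferent
        total += 1
        agree += (c > 0) == (g > 0)
    return agree / total if total else 0.0
```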