Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval 2010
DOI: 10.1145/1835449.1835542

Do user preferences and evaluation measures line up?

Abstract: This paper presents results comparing user preference for search engine rankings with measures of effectiveness computed from a test collection. It establishes that preferences and evaluation measures correlate: systems measured as better on a test collection are preferred by users. This correlation is established both for conventional relevance-based evaluation and for evaluation that emphasizes diverse results. The nDCG and ERR measures were found to correlate best with user preferences compared to a selection of other well-known measures. Unlike previous studies in…
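
As a rough illustration of the two measures named in the abstract, the sketch below computes nDCG and ERR for a single ranked list of graded relevance gains, using their standard exponential-gain formulations. The function names, gain scale, and cutoff handling are illustrative assumptions, not the paper's exact experimental configuration.

```python
import math

def dcg(gains, k=None):
    """Discounted cumulative gain of a ranked list of graded gains."""
    if k is not None:
        gains = gains[:k]
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains, pool_gains, k=None):
    """nDCG: DCG normalised by the DCG of an ideal (descending) ordering
    of all judged gains for the topic (pool_gains)."""
    ideal = dcg(sorted(pool_gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

def err(gains, g_max):
    """Expected Reciprocal Rank: probability-weighted reciprocal rank of
    the position where a simulated user stops, satisfied."""
    not_stopped, score = 1.0, 0.0
    for rank, g in enumerate(gains, start=1):
        r = (2 ** g - 1) / 2 ** g_max   # chance of being satisfied at this rank
        score += not_stopped * r / rank
        not_stopped *= 1 - r
    return score
```

For example, `ndcg([3, 2, 0, 1], [3, 3, 2, 1, 0], k=4)` scores one ranking against the ideal ordering of the judged pool, and `err([3, 2, 0, 1], g_max=3)` gives the corresponding ERR value.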

Cited by 114 publications (89 citation statements). References 26 publications (26 reference statements).
“…Users preferred results 1-10. Sanderson et al. [11] used a similar interface with Mechanical Turk to validate a set of test-collection-based metrics. NDCG agreed the most with user preferences (63% agreement overall and 82% for navigational queries).…”
Section: Related Work (mentioning)
confidence: 99%
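
The agreement figures quoted above can be computed with a simple tally: for each side-by-side comparison, check whether the ranking the metric scores higher is also the one the user preferred. A minimal sketch, assuming a flat list of (metric score A, metric score B, user-prefers-A) records rather than the cited paper's actual data format:

```python
def preference_agreement(comparisons):
    """comparisons: iterable of (score_a, score_b, user_prefers_a) tuples,
    where the scores come from one metric (e.g. nDCG) for the two rankings
    shown side by side. Ties, where the metric has no preference, are skipped.
    Returns the fraction of remaining comparisons on which the metric and
    the user agree."""
    decided = [(a > b) == prefers_a
               for a, b, prefers_a in comparisons if a != b]
    return sum(decided) / len(decided) if decided else 0.0
```

Per-category figures such as the 82% for navigational queries follow by filtering the comparisons before calling the function.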
“…Following Sanderson et al. [11], quality control was done by including 150 "trap" HITs (a Human Intelligence Task is a task associated with AMT). Each trap HIT consisted of a triplet (q, i, j) where either i or j was taken from a query other than q.…”
Section: Preference Judgements on Block-pairs (mentioning)
confidence: 99%
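
The trap-HIT scheme lends itself to a short sketch: build triplets (q, i, j) where one of the two result lists is deliberately taken from a different query, so workers who answer without reading can be detected. The data layout and the random left/right placement below are assumptions for illustration, not the cited paper's code:

```python
import random

def make_trap_hits(result_lists, n_traps=150, seed=0):
    """result_lists: dict mapping each query to its candidate result lists.
    Returns n_traps triplets (q, i, j) in which either i or j belongs to a
    query other than q."""
    rng = random.Random(seed)
    queries = list(result_lists)
    traps = []
    for _ in range(n_traps):
        q, other = rng.sample(queries, 2)
        genuine = rng.choice(result_lists[q])
        foreign = rng.choice(result_lists[other])
        # place the foreign list on the left or right at random
        i, j = (genuine, foreign) if rng.random() < 0.5 else (foreign, genuine)
        traps.append((q, i, j))
    return traps
```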
“…Research is ongoing to tackle a range of issues in information retrieval evaluation using test collections. For example, gathering relevance assessments efficiently (see Section 3.4), comparing system effectiveness and user utility (Hersh et al., 2000a; Sanderson et al., 2010), evaluating information retrieval systems over sessions rather than single queries (Kanoulas et al., 2011), the use of simulations (Azzopardi et al., 2010), and the development of new information retrieval evaluation measures (Yilmaz et al., 2010; Smucker and Clarke, 2012). Further information about the practical construction of test collections can be found in (Sanderson, 2010; Clough and Sanderson, 2013).…”
Section: Evaluation Using Test Collections (mentioning)
confidence: 99%
“…Using the Amazon Mechanical Turk framework and the TREC 2009 Web diversity test collection with binary relevance assessments, Sanderson et al. [26] examined the predictive power of diversity metrics such as α-nDCG: if a metric prefers one ranked list over another, does the user also prefer the same list? While our concordance test for quantifying the "relative intuitiveness" of diversity metrics was partially inspired by the side-by-side approach of Sanderson et al., their work and ours fundamentally differ in the following aspects: (1) while Sanderson et al. treated each subtopic (i.e., intent) as an independent topic to examine the relationship between user preferences and metric preferences, we aim to measure the intuitiveness of metrics with respect to the entire (ambiguous or underspecified) topic in terms of diversity and relevance; (2) while Sanderson et al. used Mechanical Turkers, we use very simple evaluation metrics that represent diversity or relevance as the gold standard in order to quantify intuitiveness.…”
Section: Comparing Diversity Metrics (mentioning)
confidence: 99%
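
A simplified version of the concordance test described above can be sketched as follows: over pairs of ranked lists for the same topic, count how often a candidate diversity metric (such as α-nDCG) prefers the same list as a simple gold-standard metric (such as subtopic recall), restricted to pairs where both express a preference. The signatures are assumptions; the cited work additionally pits two candidate metrics against each other on the pairs where they disagree.

```python
def concordance(candidate, gold, list_pairs):
    """candidate, gold: functions mapping a ranked list to a score.
    list_pairs: iterable of (list_a, list_b) pairs for the same topic.
    Returns the fraction of pairs, among those where both metrics express
    a preference, on which the candidate agrees with the gold standard."""
    agree = total = 0
    for a, b in list_pairs:
        c = candidate(a) - candidate(b)
        g = gold(a) - gold(b)
        if c == 0 or g == 0:
            continue  # at least one metric is indifferent
        total += 1
        agree += (c > 0) == (g > 0)
    return agree / total if total else 0.0
```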