DOI: 10.1145/3397271.3401115

Good Evaluation Measures based on Document Preferences

Published: 25 July 2020
Abstract

For offline evaluation of IR systems, some researchers have proposed to utilise pairwise document preference assessments instead of relevance assessments of individual documents, as it may be easier for assessors to make relative decisions rather than absolute ones. Simple preference-based evaluation measures such as ppref and wpref have been proposed, but the past decade did not see any wide use of such measures. One reason for this may be that, while these new measures have been reported to behave more or less similarly to traditional measures based on absolute assessments, whether they actually align with the users' perception of search engine result pages (SERPs) has been unknown. The present study addresses exactly this question, after formally defining two classes of preference-based measures called Pref measures and Δ-measures. We show that the best of these measures perform at least as well as an average assessor in terms of agreement with users' SERP preferences, and that implicit document preferences (i.e., those suggested by a SERP that retrieves one document but not the other) play a much more important role than explicit preferences (i.e., those suggested by a SERP that retrieves one document above the other). We have released our data set containing 119,646 document preferences, so that the feasibility of document preference-based evaluation can be further pursued by the IR community.
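
To make the setting concrete, the short Python sketch below shows one simple way a ranking can be scored directly against pairwise document preferences: it counts the fraction of judged pairs that the SERP orders consistently, treating an unretrieved document as ranked below every retrieved one (which is where implicit preferences come in). This is an illustrative sketch only; the function name, data layout, and scoring rule are assumptions made for this example, not the paper's Pref or Δ-measure definitions.

```python
# Illustrative sketch only: a simple "fraction of satisfied preferences" score,
# NOT the Pref or Delta-measures defined in the paper. The function name and
# the (preferred_doc, other_doc) data layout are assumptions for this example.

def preference_agreement(ranking, preferences):
    """Score a SERP against pairwise document preferences.

    ranking      -- list of document IDs as shown on the SERP, rank 1 first.
    preferences  -- iterable of (preferred_doc, other_doc) pairs from assessors.

    Returns the fraction of judged pairs that the SERP orders consistently,
    i.e. the preferred document appears above the other one. An unretrieved
    document is treated as ranked below every retrieved one, so a pair where
    only one document is retrieved (an 'implicit' preference) still counts.
    """
    rank_of = {doc: r for r, doc in enumerate(ranking, start=1)}
    unranked = len(ranking) + 1  # pseudo-rank shared by all unretrieved documents
    satisfied, judged = 0, 0
    for preferred, other in preferences:
        r_pref = rank_of.get(preferred, unranked)
        r_other = rank_of.get(other, unranked)
        if r_pref == r_other:
            continue  # neither document retrieved: this pair says nothing about the SERP
        judged += 1
        if r_pref < r_other:
            satisfied += 1
    return satisfied / judged if judged else 0.0


if __name__ == "__main__":
    serp = ["d3", "d1", "d7"]                            # hypothetical SERP
    prefs = [("d1", "d3"), ("d3", "d9"), ("d7", "d1")]   # hypothetical assessor preferences
    print(preference_agreement(serp, prefs))             # 0.333..., only d3 > d9 is satisfied
```

The Pref measures and Δ-measures studied in the paper are defined formally in the full text; the snippet is only meant to illustrate the general idea of evaluating a ranking against preference assessments rather than against per-document relevance labels.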

    Supplementary Material

    MP4 File (3397271.3401115.mp4)
19-minute video presentation of "Good Evaluation Measures based on Document Preferences," a SIGIR 2020 full paper by Tetsuya Sakai and Zhaohao Zeng (Waseda University, Japan).





Information

    Published In

    SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2020
    2548 pages
    ISBN:9781450380164
    DOI:10.1145/3397271

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 July 2020

    Author Tags

    1. adhoc retrieval
    2. document preferences
    3. evaluation measures
    4. preference assessments
5. serp preferences

    Qualifiers

    • Research-article

    Conference

    SIGIR '20

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%


Article Metrics

• Downloads (last 12 months): 17
• Downloads (last 6 weeks): 2
Reflects downloads up to 29 Jul 2024

    Cited By

• (2023) A Preference Judgment Tool for Authoritative Assessment. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3100-3104. DOI: 10.1145/3539618.3591801. Online publication date: 19-Jul-2023.
• (2022) Preferences on a Budget: Prioritizing Document Pairs when Crowdsourcing Relevance Judgments. Proceedings of the ACM Web Conference 2022, 319-327. DOI: 10.1145/3485447.3511960. Online publication date: 25-Apr-2022.
• (2022) Human Preferences as Dueling Bandits. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 567-577. DOI: 10.1145/3477495.3531991. Online publication date: 6-Jul-2022.
• (2022) Shallow pooling for sparse labels. Information Retrieval Journal, 25(4), 365-385. DOI: 10.1007/s10791-022-09411-0. Online publication date: 20-Jul-2022.
• (2021) Evaluating Relevance Judgments with Pairwise Discriminative Power. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 261-270. DOI: 10.1145/3459637.3482428. Online publication date: 26-Oct-2021.
• (2021) Assessing Top-k Preferences. ACM Transactions on Information Systems, 39(3), 1-21. DOI: 10.1145/3451161. Online publication date: 5-May-2021.
• (2021) WWW3E8: 259,000 Relevance Labels for Studying the Effect of Document Presentation Order for Relevance Assessors. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2376-2382. DOI: 10.1145/3404835.3463236. Online publication date: 11-Jul-2021.
• (2021) On the Instability of Diminishing Return IR Measures. Advances in Information Retrieval, 572-586. DOI: 10.1007/978-3-030-72113-8_38. Online publication date: 27-Mar-2021.
• (2020) Retrieval Evaluation Measures that Agree with Users' SERP Preferences. ACM Transactions on Information Systems, 39(2), 1-35. DOI: 10.1145/3431813. Online publication date: 31-Dec-2020.
