DOI: 10.1145/3397271.3401115

Good Evaluation Measures based on Document Preferences

Published: 25 July 2020
Abstract

For offline evaluation of IR systems, some researchers have proposed to utilise pairwise document preference assessments instead of relevance assessments of individual documents, as it may be easier for assessors to make relative decisions rather than absolute ones. Simple preference-based evaluation measures such as ppref and wpref have been proposed, but the past decade did not see any wide use of such measures. One reason for this may be that, while these new measures have been reported to behave more or less similarly to traditional measures based on absolute assessments, whether they actually align with the users' perception of search engine result pages (SERPs) has been unknown. The present study addresses exactly this question, after formally defining two classes of preference-based measures called Pref measures and Δ-measures. We show that the best of these measures perform at least as well as an average assessor in terms of agreement with users' SERP preferences, and that implicit document preferences (i.e., those suggested by a SERP that retrieves one document but not the other) play a much more important role than explicit preferences (i.e., those suggested by a SERP that retrieves one document above the other). We have released our data set containing 119,646 document preferences, so that the feasibility of document preference-based evaluation can be further pursued by the IR community.
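
To make the setting concrete, the short Python sketch below shows one simple way a ranking can be scored directly against pairwise document preferences: it counts the fraction of judged pairs that the SERP orders consistently, treating an unretrieved document as ranked below every retrieved one (which is where implicit preferences come in). This is an illustrative sketch only; the function name, data layout, and scoring rule are assumptions made for this example, not the paper's Pref or Δ-measure definitions.

```python
# Illustrative sketch only: a simple "fraction of satisfied preferences" score,
# NOT the Pref or Delta-measures defined in the paper. The function name and
# the (preferred_doc, other_doc) data layout are assumptions for this example.

def preference_agreement(ranking, preferences):
    """Score a SERP against pairwise document preferences.

    ranking      -- list of document IDs as shown on the SERP, rank 1 first.
    preferences  -- iterable of (preferred_doc, other_doc) pairs from assessors.

    Returns the fraction of judged pairs that the SERP orders consistently,
    i.e. the preferred document appears above the other one. An unretrieved
    document is treated as ranked below every retrieved one, so a pair where
    only one document is retrieved (an 'implicit' preference) still counts.
    """
    rank_of = {doc: r for r, doc in enumerate(ranking, start=1)}
    unranked = len(ranking) + 1  # pseudo-rank shared by all unretrieved documents
    satisfied, judged = 0, 0
    for preferred, other in preferences:
        r_pref = rank_of.get(preferred, unranked)
        r_other = rank_of.get(other, unranked)
        if r_pref == r_other:
            continue  # neither document retrieved: this pair says nothing about the SERP
        judged += 1
        if r_pref < r_other:
            satisfied += 1
    return satisfied / judged if judged else 0.0


if __name__ == "__main__":
    serp = ["d3", "d1", "d7"]                            # hypothetical SERP
    prefs = [("d1", "d3"), ("d3", "d9"), ("d7", "d1")]   # hypothetical assessor preferences
    print(preference_agreement(serp, prefs))             # 0.333..., only d3 > d9 is satisfied
```

The Pref measures and Δ-measures studied in the paper are defined formally in the full text; the snippet is only meant to illustrate the general idea of evaluating a ranking against preference assessments rather than against per-document relevance labels.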

    Supplementary Material

    MP4 File (3397271.3401115.mp4)
19-minute video presentation of "Good Evaluation Measures based on Document Preferences," a SIGIR 2020 full paper by Tetsuya Sakai and Zhaohao Zeng (Waseda University, Japan).





Information

    Published In

    SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2020
    2548 pages
    ISBN:9781450380164
    DOI:10.1145/3397271

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 July 2020

    Author Tags

    1. adhoc retrieval
    2. document preferences
    3. evaluation measures
    4. preference assessments
5. serp preferences

    Qualifiers

    • Research-article

    Conference

    SIGIR '20

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%


Article Metrics

• Downloads (last 12 months): 17
• Downloads (last 6 weeks): 2
Reflects downloads up to 29 Jul 2024

    Cited By

• (2023) A Preference Judgment Tool for Authoritative Assessment. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3100-3104. DOI: 10.1145/3539618.3591801. Online publication date: 19-Jul-2023.
• (2022) Preferences on a Budget: Prioritizing Document Pairs when Crowdsourcing Relevance Judgments. Proceedings of the ACM Web Conference 2022, 319-327. DOI: 10.1145/3485447.3511960. Online publication date: 25-Apr-2022.
• (2022) Human Preferences as Dueling Bandits. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 567-577. DOI: 10.1145/3477495.3531991. Online publication date: 6-Jul-2022.
• (2022) Shallow pooling for sparse labels. Information Retrieval Journal, 25(4), 365-385. DOI: 10.1007/s10791-022-09411-0. Online publication date: 20-Jul-2022.
• (2021) Evaluating Relevance Judgments with Pairwise Discriminative Power. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 261-270. DOI: 10.1145/3459637.3482428. Online publication date: 26-Oct-2021.
• (2021) Assessing Top-k Preferences. ACM Transactions on Information Systems, 39(3), 1-21. DOI: 10.1145/3451161. Online publication date: 5-May-2021.
• (2021) WWW3E8: 259,000 Relevance Labels for Studying the Effect of Document Presentation Order for Relevance Assessors. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2376-2382. DOI: 10.1145/3404835.3463236. Online publication date: 11-Jul-2021.
• (2021) On the Instability of Diminishing Return IR Measures. Advances in Information Retrieval, 572-586. DOI: 10.1007/978-3-030-72113-8_38. Online publication date: 27-Mar-2021.
• (2020) Retrieval Evaluation Measures that Agree with Users' SERP Preferences. ACM Transactions on Information Systems, 39(2), 1-35. DOI: 10.1145/3431813. Online publication date: 31-Dec-2020.
