Knowledge-based trust: estimating the trustworthiness of web sources

Published: 01 May 2015

    Abstract

    The quality of web sources has traditionally been evaluated using exogenous signals such as the hyperlink structure of the web graph. We propose a new approach that relies on endogenous signals, namely the correctness of the factual information a source provides. A source with few false facts is considered trustworthy.
    The facts are automatically extracted from each source by information extraction methods commonly used to construct knowledge bases. We propose a way to distinguish errors made in the extraction process from factual errors in the web source per se, by using joint inference in a novel multi-layer probabilistic model.
    We call the resulting trustworthiness score Knowledge-Based Trust (KBT). On synthetic data, we show that our method reliably recovers the true trustworthiness levels of the sources. We then apply it to a database of 2.8B facts extracted from the web, and thereby estimate the trustworthiness of 119M webpages. Manual evaluation of a subset of the results confirms the effectiveness of the method.
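
    The intuition of scoring a source by the correctness of its facts, while discounting likely extraction errors, can be illustrated with a toy calculation. The sketch below is not the multi-layer probabilistic model or the joint inference described in the abstract: it assumes both probabilities are already known for every extracted fact, and all names (ExtractedFact, p_extracted, p_true, trust_scores) and the example data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedFact:
    source: str         # page or site the fact was extracted from
    p_extracted: float  # assumed probability that the source really states this fact
    p_true: float       # assumed probability that the fact itself is correct


def trust_scores(facts):
    """Return, per source, the extraction-weighted fraction of correct facts.

    Facts that were probably mis-extracted (low p_extracted) contribute
    little to the score, so the source is not blamed for extractor errors.
    """
    totals = {}
    for f in facts:
        num, den = totals.get(f.source, (0.0, 0.0))
        totals[f.source] = (num + f.p_extracted * f.p_true, den + f.p_extracted)
    # Sources with zero total extraction weight get no score.
    return {s: num / den for s, (num, den) in totals.items() if den > 0}


# Made-up example data: each source has one correct and one false fact,
# but the false fact on b.example is probably an extraction error.
facts = [
    ExtractedFact("a.example", p_extracted=1.0, p_true=1.0),
    ExtractedFact("a.example", p_extracted=1.0, p_true=0.0),
    ExtractedFact("b.example", p_extracted=1.0, p_true=1.0),
    ExtractedFact("b.example", p_extracted=0.25, p_true=0.0),
]
scores = trust_scores(facts)
print(scores)
```

    In this toy data, a.example scores 0.5 because its false fact was confidently extracted, while b.example scores 0.8 because its false fact is mostly attributed to the extractor rather than to the source.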

    Published In

    Proceedings of the VLDB Endowment, Volume 8, Issue 9
    May 2015
    76 pages

    Publisher

    VLDB Endowment

    Qualifiers

    • Research-article
