research-article

Entity resolution: theory, practice & open challenges

Published: 01 August 2012

Abstract

This tutorial brings together perspectives on entity resolution (ER) from a variety of fields, including databases, machine learning, natural language processing, and information retrieval, to provide, in one setting, a survey of a large body of work. We discuss both the practical aspects and the theoretical underpinnings of ER. We describe existing solutions, current challenges, and open research problems.

References

[1]
A. Arasu, M. Goetz, and R. Kaushik. On active learning of record matching packages. In SIGMOD, 2010.
[2]
O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom. Swoosh: a generic approach to entity resolution. VLDB Journal, 18(1), 2009.
[3]
I. Bhattacharya and L. Getoor. A latent Dirichlet model for unsupervised entity resolution. In SDM, 2006.
[4]
I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data, 1(1), 2007.
[5]
M. Bilenko, B. Kamath, and R. J. Mooney. Adaptive blocking: Learning to scale up record linkage and clustering. In ICDM, 2006.
[6]
S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD, 2003.
[7]
S. Chaudhuri, V. Ganti, and R. Motwani. Robust identification of fuzzy duplicates. In ICDE, 2005.
[8]
P. Christen. Data Matching. Springer, 2012.
[9]
W. Cohen and P. Ravikumar. A hierarchical graphical model for record linkage. In Proc. of UAI, 2004.
[10]
W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Proc. of IJCAI, 2003.
[11]
X. Dong, A. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In SIGMOD, 2005.
[12]
I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183-1210, 1969.
[13]
L. Gravano, P. Ipeirotis, N. Koudas, and D. Srivastava. Text joins for data cleansing and integration in an RDBMS. In ICDE, 2003.
[14]
M. A. Hernandez and S. J. Stolfo. The merge/purge problem for large databases. In SIGMOD, 1995.
[15]
D. V. Kalashnikov, S. Mehrotra, and Z. Chen. Exploiting relationships for domain-independent data cleaning. In SDM, 2005.
[16]
H. Köpcke, A. Thor, and E. Rahm. Evaluation of entity resolution approaches on real-world match problems. PVLDB, 3(1):484-493, 2010.
[17]
N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: Similarity measures and algorithms. Tutorial at SIGMOD, 2006.
[18]
A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD, 2000.
[19]
A. McCallum and B. Wellner. Conditional models of identity uncertainty with application to noun coreference. In NIPS, 2004.
[20]
D. Menestrina, S. E. Whang, and H. Garcia-Molina. Evaluating entity resolution results. In PVLDB, 2010.
[21]
M. Michelson and C. A. Knoblock. Learning blocking schemes for record linkage. In AAAI, 2006.
[22]
A. E. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997.
[23]
H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In NIPS, 2003.
[24]
V. Rastogi, N. Dalvi, and M. Garofalakis. Large-scale collective entity matching. In PVLDB, 2012.
[25]
E. S. Ristad and P. N. Yianilos. Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998.
[26]
S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In SIGKDD, 2002.
[27]
W. Shen, X. Li, and A. Doan. Constraint-based entity matching. In AAAI, 2005.
[28]
P. Singla and P. Domingos. Multi-relational record linkage. In KDD, 2004.
[29]
P. Singla and P. Domingos. Entity resolution with Markov logic. In ICDM, 2006.
[30]
S. E. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina. Entity resolution with iterative blocking. In SIGMOD, 2009.
[31]
W. E. Winkler. Methods for record linkage and Bayesian networks. Technical report, Statistical Research Division, U.S. Census Bureau, 2002.

Published In

Proceedings of the VLDB Endowment, Volume 5, Issue 12
August 2012
340 pages

Publisher

VLDB Endowment
