
research-article

Entity resolution: theory, practice & open challenges

Authors:

Ashwin Machanavajjhala

Proceedings of the VLDB Endowment, Volume 5, Issue 12

Pages 2018 - 2019

https://doi.org/10.14778/2367502.2367564

Published: 01 August 2012

Abstract

This tutorial brings together perspectives on entity resolution (ER) from a variety of fields, including databases, machine learning, natural language processing, and information retrieval, to provide a survey of a large body of work in one setting. We discuss both the practical aspects and the theoretical underpinnings of ER. We describe existing solutions, current challenges, and open research problems.


Published In

Proceedings of the VLDB Endowment Volume 5, Issue 12

August 2012

340 pages

ISSN:2150-8097

Publisher

VLDB Endowment
