Automating Duplicate Detection for Lexical Heterogeneous Web Databases

Anil Ahlawat; Kalpna Sagar

Abstract

Introduction: Ever-increasing technological advancement and the rapidly growing demand for data on the web have created a need for efficient search engines.

Method: Automated duplicate detection over query results identifies the records from multiple web databases that refer to the same real-world entity, so that only non-duplicate records are returned to end users. The algorithm proposed in this paper takes an unsupervised approach, using classifiers over heterogeneous web databases, and targets high precision, recall, and F-measure. Several experiments were carried out to assess how effectively the proposed algorithm identifies duplicates.

Result: The proposed algorithm achieves higher precision and F-measure than standard UDD (Unsupervised Duplicate Detection), with the same recall.
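
For readers unfamiliar with the three measures reported above, the following is a minimal sketch of how precision, recall, and F-measure are commonly computed for duplicate detection, treating each detected duplicate as a pair of record identifiers. The function name `pair_metrics` and the sample record IDs are illustrative assumptions and are not taken from the paper; the paper's own evaluation procedure may differ.

```python
def pair_metrics(predicted, actual):
    """Return (precision, recall, F-measure) over duplicate record pairs.

    `predicted` and `actual` are iterables of 2-tuples of record IDs.
    Hypothetical helper for illustration only.
    """
    # Normalize pairs so that (a, b) and (b, a) count as the same pair.
    pred = {frozenset(p) for p in predicted}
    gold = {frozenset(p) for p in actual}

    true_positives = len(pred & gold)

    # Precision: fraction of reported duplicate pairs that are correct.
    precision = true_positives / len(pred) if pred else 0.0
    # Recall: fraction of true duplicate pairs that were reported.
    recall = true_positives / len(gold) if gold else 0.0
    # F-measure: harmonic mean of precision and recall.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Made-up example: three reported pairs, two of which are true duplicates.
reported = [(1, 2), (3, 4), (5, 6)]
truth = [(2, 1), (3, 4)]
p, r, f = pair_metrics(reported, truth)
```

Under these definitions, an algorithm that reports fewer false duplicate pairs while still finding every true pair raises precision and F-measure and leaves recall unchanged, which is the pattern the result above describes.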

Discussion: This paper aims to introduce an algorithm that automates the process of duplicate detection for lexical heterogeneous web databases.

Conclusion: This paper concludes that the proposed algorithm outperforms the standard UDD.

Keywords: Duplicate detection, record linkage, weighted component similarity summing, data mining, web databases, web browser.


Cite As

Recent Advances in Computer Science and Communications
