An Improving Genetic Programming Approach Based Deduplication Using KFINDMR

P.Shanmugavadivu; N.Baskar

doi:10.14445/22312803/IJCTT-V3I5P106

Research Article | Open Access

Volume 3 | Issue 5 | Year 2012 | Article Id. IJCTT-V3I5P106 | DOI : https://doi.org/10.14445/22312803/IJCTT-V3I5P106

Citation:

P.Shanmugavadivu, N.Baskar, "An Improving Genetic Programming Approach Based Deduplication Using KFINDMR," International Journal of Computer Trends and Technology (IJCTT), vol. 3, no. 5, pp. 694-701, 2012. Crossref, https://doi.org/10.14445/22312803/IJCTT-V3I5P106

Abstract

Record deduplication is the task of identifying, in a data repository, records that refer to the same real-world entity or object in spite of misspelled words, typos, different writing styles, or even different schema representations or data types. The existing system provides an Unsupervised Duplication Detection (UDD) method that can be used to identify and remove duplicate records from different data sources. Starting from the non-duplicate set, two cooperating classifiers, a Weighted Component Similarity Summing (WCSS) classifier and a Support Vector Machine (SVM), are used to iteratively separate the duplicate records from the non-duplicate records. A genetic programming (GP) approach to record deduplication has also been presented; this GP-based approach is able to find effective deduplication functions automatically. Because the genetic programming approach is time consuming, we propose a new algorithm, KFINDMR (KFIND using Most Represented data samples), which finds the most represented data samples in order to improve the accuracy of the classifier. The proposed system calculates the mean value of the data samples as the centroid of the record members, computes the distance of each sample to that mean, and selects as the first most represented data sample the one closest to the mean, that is, the one at minimum distance. The system then removes the duplicate data samples and finds an optimized solution for the deduplication of records or data samples.
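The centroid step described in the abstract can be illustrated with a minimal sketch. This is not the authors' KFINDMR implementation: it assumes that records have already been converted to numeric feature vectors and that distance is Euclidean, neither of which the abstract specifies, and the function names `centroid` and `most_represented_sample` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def centroid(samples: Sequence[Sequence[float]]) -> List[float]:
    """Return the component-wise mean of the samples."""
    n = len(samples)
    dim = len(samples[0])
    return [sum(s[i] for s in samples) / n for i in range(dim)]


def most_represented_sample(
    samples: Sequence[Sequence[float]],
) -> Tuple[int, float]:
    """Return (index, distance) of the sample closest to the centroid.

    Assumption: Euclidean distance; the abstract only says
    "minimum distance" without naming a metric.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    mean = centroid(samples)
    distances = [math.dist(s, mean) for s in samples]
    best = min(range(len(samples)), key=distances.__getitem__)
    return best, distances[best]


if __name__ == "__main__":
    # Three toy feature vectors; the middle one coincides with the mean.
    records = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
    index, distance = most_represented_sample(records)
    print(index, distance)  # prints: 1 0.0
```

In this sketch ties are resolved in favor of the earliest sample, which is one possible reading of "the first most represented data sample".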

Keywords

Extracting data, identifying duplication, deduplication, genetic programming.

References

[1] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios, “Duplicate Record Detection: A Survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, January 2007.
[2] Gengxin Miao, Junichi Tatemura, Wang-Pin Hsiung, Arsany Sawires, and Louise E. Moser, “Extracting Data Records from the Web Using Tag Path Clustering”.
[3] Imran R. Mansuri and Sunita Sarawagi, “Integrating Unstructured Data into Relational Databases”.
[4] Jaehong Min, Daeyoung Yoon, and Youjip Won, “Efficient Deduplication Techniques for Modern Backup Operation,” IEEE Transactions on Computers, vol. 60, no. 6, June 2011.
[5] B. Liu. Mining data records in Web pages. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, pages 601-606, 2003.
[6] B. Liu and Y. Zhai. NET: System for extracting Web data from flat and nested data records. In Proceedings of the Conference on Web Information Systems Engineering, pages 487-495, 2005.
[7] Moisés G. de Carvalho, Alberto H.F. Laender, Marcos André Gonçalves, and Altigran S. da Silva, “A Genetic Programming Approach to Record Deduplication,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 3, March 2012.
[8] S. Sarawagi and A. Bhamidipaty, “Interactive Deduplication Using Active Learning,” Proc. Eighth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp. 269-278, 2002.
[9] Yihong Ding, “Semiautomatic Generation of Data-Extraction Ontologies,” thesis proposal presented to the Department of Computer Science, Brigham Young University, July 3, 2001.
[10] M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg, “Adaptive Name Matching in Information Integration,” IEEE Intelligent Systems, vol. 18, no. 5, pp. 16-23, Sept./Oct. 2003.
[11] M. Bilenko and R.J. Mooney, “Adaptive Duplicate Detection Using Learnable String Similarity Measures,” Proc. Ninth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp. 39-48, 2003.
[12] Weifeng Su, Jiying Wang, and Frederick H. Lochovsky, “Record Matching over Query Results from Multiple Web Databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 4, April 2010.
[13] M.G. de Carvalho, A.H.F. Laender, M.A. Gonçalves, and A.S. da Silva, “Replica Identification Using Genetic Programming,” Proc. 23rd Ann. ACM Symp. Applied Computing (SAC), pp. 1801-1806, 2008.
[14] T.P.C. Silva, E.S. de Moura, J.M.B. Cavalcanti, A.S. da Silva, M.G. de Carvalho, and M.A. Gonçalves, “An Evolutionary Approach for Combining Different Sources of Evidence in Search Engines,” Information Systems, vol. 34, no. 2, pp. 276-289, 2009.
[15] Bilal Khan, Azhar Rauf, Sajid H. Shah, and Shah Khusro, “Identification and Removal of Duplicated Records,” World Applied Sciences Journal 13(5):1187-1184, 2011.