An Improving Genetic Programming Approach Based Deduplication Using KFINDMR

International Journal of Computer Trends and Technology (IJCTT)          
© - Issue 2012 by IJCTT Journal
Volume-3 Issue-5                           
Year of Publication : 2012
Authors: P. Shanmugavadivu, N. Baskar.


P. Shanmugavadivu, N. Baskar. "An Improving Genetic Programming Approach Based Deduplication Using KFINDMR." International Journal of Computer Trends and Technology (IJCTT), V3(5):543-547, Issue 2012. ISSN Published by Seventh Sense Research Group.

Abstract: Record deduplication is the task of identifying, in a data repository, records that refer to the same real-world entity or object in spite of misspelled words, typos, different writing styles, or even different schema representations or data types. The existing system provides an Unsupervised Duplication Detection (UDD) method that can be used to identify and remove duplicate records from different data sources. Starting from a non-duplicate set, two cooperating classifiers, a Weighted Component Similarity Summing (WCSS) classifier and a Support Vector Machine (SVM), are used to iteratively identify duplicate records among the non-duplicate records. A genetic programming (GP) approach to record deduplication has also been presented; this GP-based approach is able to automatically find effective deduplication functions. Because the genetic programming approach is time-consuming, we propose a new algorithm, KFINDMR (KFIND using Most Represented data samples), which finds the most represented data samples in order to improve the accuracy of the classifier. The proposed system calculates the mean value (centroid) of the record members and selects, as the first most represented data sample, the one closest to that mean value, that is, the one at minimum distance from it. The system then removes the duplicate data samples and finds an optimized solution for the deduplication of records or data samples.
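
The centroid-based selection step described in the abstract can be illustrated with a minimal sketch. It assumes that records have already been converted to equal-length numeric feature vectors and that Euclidean distance is the distance measure; the abstract states neither. The function names are hypothetical, and this is not the authors' KFINDMR implementation.

```python
import math
from typing import List, Sequence


def centroid(samples: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of equal-length numeric vectors."""
    if not samples:
        raise ValueError("at least one sample is required")
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def most_represented_sample(samples: Sequence[Sequence[float]]) -> int:
    """Index of the sample at minimum Euclidean distance from the centroid.

    Assumption: 'most represented' is read here as 'closest to the mean'.
    """
    center = centroid(samples)
    distances = [math.dist(sample, center) for sample in samples]
    return min(range(len(samples)), key=distances.__getitem__)


def drop_exact_duplicates(samples: Sequence[Sequence[float]]) -> List[List[float]]:
    """Keep the first occurrence of each vector, preserving input order."""
    seen = set()
    unique = []
    for sample in samples:
        key = tuple(sample)
        if key not in seen:
            seen.add(key)
            unique.append(list(sample))
    return unique


if __name__ == "__main__":
    data = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
    unique = drop_exact_duplicates(data)
    print(unique[most_represented_sample(unique)])
```

Only exact duplicates are removed here. Detecting near-duplicates would need a similarity function, such as the deduplication functions that the GP approach evolves.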


[1] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios, “Duplicate Record Detection: A Survey”, IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, January 2007.
[2] Gengxin Miao, Junichi Tatemura, Wang-Pin Hsiung, Arsany Sawires, Louise E. Moser (ECE Dept., University of California, Santa Barbara, and NEC Laboratories America, Cupertino), “Extracting Data Records from the Web Using Tag Path Clustering”.
[3] Imran R., IIT Bombay, Sunita Sarawagi, IIT Bombay, “Integrating unstructured data into relational databases”.
[4] Jaehong Min, Daeyoung Yoon, and Youjip Won, “Efficient Deduplication Techniques for Modern Backup Operation”, IEEE Transactions on Computers, vol. 60, no. 6, June 2011.
[5] B. Liu. Mining data records in Web pages. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, pages 601-606, 2003.
[6] B. Liu and Y. Zhai. NET: System for extracting Web data from flat and nested data records. In Proceedings of the Conference on Web Information Systems Engineering, pages 487-495, 2005.
[7] Moisés G. de Carvalho, Alberto H.F. Laender, Marcos André Gonçalves, and Altigran S. da Silva, “A Genetic Programming Approach to Record Deduplication”, IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 3, March 2012.

Keywords: Extracting data, identifying duplication, deduplication, genetic programming.