Data Cleaning using a Matching Dependency Technique
Date
2019-01-02
Authors
Jain, Shashank
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
In today’s digital society, people are often required to enter their home or office addresses on forms available online. It is not uncommon for people to introduce some minor mistakes, such as misspelled addresses, or incorrect postal codes/zip codes. Such mistakes made by the user can be quite problematic when automated systems must process their request. For example, if a person orders something online providing the incorrect postal code in the entered address, this mistake could lead to delay in the delivery of the item or even worse, the item may remain undelivered. To avoid such situations, these systems often use a machine learning technique called ‘Matching Dependency’ which has been proven helpful in making recommendations for the correction of any incorrect value in the input address. This technique uses a binary search algorithm to reduce the number of cycles the process has to go through to make recommendations. Our exploration of one possible implementation of this algorithm uses our own synthesized sample data sets instead of real user input with the external data. External data has been used as the authenticated data source to verify the user input data. We compare our synthesized user input data with the external data that is considered to be completely trust worthy. The system then makes possible recommendations based on the correctness of the user input.
The evaluation was mainly done on two different sizes of data sets, 1000 and 15000. The results had zero false negatives, few false positives, and mostly relevant recommendations.