Data Cleaning using a Matching Dependency Technique

Jain, Shashank

Files

Jain_Shashank_MSc_2018.pdf (1.45 MB)

Date

2019-01-02

Authors

Jain, Shashank

Abstract

In today’s digital society, people are often required to enter their home or office addresses on online forms. It is not uncommon for them to introduce minor mistakes, such as misspelled addresses or incorrect postal/zip codes. Such mistakes can be quite problematic when automated systems must process the request. For example, if a person orders something online and provides an incorrect postal code in the address, the mistake could delay delivery of the item or, even worse, leave the item undelivered. To avoid such situations, these systems often use a machine learning technique called ‘Matching Dependency’, which has proven helpful in recommending corrections for incorrect values in the input address. This technique uses a binary search algorithm to reduce the number of cycles the process must go through to make recommendations. Our exploration of one possible implementation of this algorithm uses our own synthesized sample data sets in place of real user input, together with external data. The external data serves as the authenticated source against which the user input is verified: we compare our synthesized user input with the external data, which is considered completely trustworthy. The system then makes possible recommendations based on the correctness of the user input. The evaluation was mainly done on data sets of two different sizes, 1000 and 15000. The results had zero false negatives, few false positives, and mostly relevant recommendations.
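
As a rough illustration of the idea described in the abstract, the sketch below checks an entered address against a small trusted list. It is not the implementation from the project report. The trusted records, the `recommend` function, the similarity threshold, and the search window are all hypothetical choices made for this example. It assumes the trusted data is kept sorted by street, so that a binary search (Python's `bisect_left`) narrows the comparison to a few neighbouring records instead of scanning the whole list.

```python
from bisect import bisect_left
from difflib import SequenceMatcher

# Hypothetical trusted reference data (made-up addresses and postal codes),
# stored as (street, city, postal_code) and sorted by street.
TRUSTED = sorted([
    ("12 maple street", "springfield", "A1A 1A1"),
    ("340 oak avenue", "springfield", "B2B 2B2"),
    ("7 pine road", "shelbyville", "C3C 3C3"),
])
KEYS = [record[0] for record in TRUSTED]


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


def recommend(street, city, postal_code, threshold=0.85, window=2):
    """Suggest trusted records that closely match a possibly wrong address.

    Returns an empty list when the entered address already matches a
    trusted record exactly, or when nothing similar enough is found.
    """
    entered = (street.strip().lower(), city.strip().lower(),
               postal_code.strip().upper())
    # Binary search for the position where the entered street would sit.
    pos = bisect_left(KEYS, entered[0])
    # Only compare against records within `window` positions of that spot.
    lo, hi = max(0, pos - window), min(len(TRUSTED), pos + window)
    suggestions = []
    for record in TRUSTED[lo:hi]:
        if record == entered:
            return []  # exact match, nothing to correct
        # Matching-dependency style rule (an assumption for this sketch):
        # same city and similar street imply the trusted record applies.
        if record[1] == entered[1] and similarity(record[0], entered[0]) >= threshold:
            suggestions.append(record)
    return suggestions


# A misspelled street with a wrong postal code yields the trusted record.
print(recommend("340 Oak Avenu", "Springfield", "B2B 2B9"))
```

The neighbourhood search only catches misspellings that leave the entered street close to its correct position in sort order, so an error in the first few characters could be missed. A fuller treatment would need a more robust blocking or indexing strategy.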

URI

http://hdl.handle.net/1828/10477

Collections

Master's Projects
