INCREMENTAL DETECTION IN DISTRIBUTED DATA

Real-life data is often dirty. To clean the data, efficient algorithms for detecting errors have to be in place. Errors in the data are typically detected as violations of constraints (data quality rules), such as functional dependencies (FDs), denial constraints, and conditional functional dependencies (CFDs). When the data is in a centralized database, it is known that two SQL queries suffice to detect its violations of a set of CFDs. We investigate incremental detection of errors in distributed data. Given a distributed database D, a set Σ of CFDs, the set V of violations of the CFDs in D, and updates ΔD to D, the problem is to find, with minimum data shipment, the changes ΔV to V in response to ΔD. The need for the study is evident since real-life data is often dirty, distributed and frequently updated, and it is often prohibitively expensive to recompute the entire set of violations when D is updated. We show that the incremental detection problem is NP-complete for a database D that is partitioned either vertically or horizontally, even when Σ and D are fixed. Nevertheless, we show that it is bounded: there exist algorithms to detect errors such that their computational cost and data shipment are both linear in the size of ΔD and ΔV, independent of the size of the database D. We provide such incremental algorithms for vertically partitioned data and horizontally partitioned data, and show that the algorithms are optimal. We further propose optimization techniques for the incremental algorithm over vertical partitions to reduce data shipment. We verify experimentally, using real-life data on Amazon Elastic Compute Cloud (EC2), that our algorithms substantially outperform their batch counterparts.
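
To make the idea of incremental detection concrete, the following is a minimal single-site sketch in Python. It is not the algorithm of this work: it ignores partitioning and data shipment entirely, handles only one CFD, and the class name, method names and the example CFD ([CC = 44, zip] -> street) are assumptions introduced purely for illustration. It maintains a hash index over the left-hand-side values so that each inserted or deleted tuple yields only the change to the violation set, without rescanning the database.

```python
from collections import defaultdict


class IncrementalCFDChecker:
    """Hypothetical helper: tracks violations of ONE CFD on a single site.

    The CFD is (lhs -> rhs) restricted to tuples matching a constant pattern.
    A tuple is a violation when some other matching tuple agrees with it on
    lhs but disagrees on rhs.
    """

    def __init__(self, lhs, rhs, pattern=None):
        self.lhs = tuple(lhs)
        self.rhs = rhs
        self.pattern = dict(pattern or {})
        # LHS key -> RHS value -> ids of tuples carrying that RHS value
        self.groups = defaultdict(lambda: defaultdict(set))
        self.rows = {}  # only tuples that match the pattern are stored

    def _key(self, row):
        # Tuples that do not match the constant pattern are outside the CFD.
        if any(row.get(a) != v for a, v in self.pattern.items()):
            return None
        return tuple(row[a] for a in self.lhs)

    def insert(self, tid, row):
        """Insert a tuple; return the ids that newly become violations."""
        key = self._key(row)
        if key is None:
            return set()
        self.rows[tid] = row
        group = self.groups[key]
        was_conflicting = len(group) > 1
        group[row[self.rhs]].add(tid)
        if len(group) <= 1:
            return set()          # still a single RHS value: no conflict
        if was_conflicting:
            return {tid}          # group was already in conflict
        # The group just became conflicting: every member is a new violation.
        return set().union(*group.values())

    def delete(self, tid):
        """Delete a tuple; return the ids that stop being violations."""
        row = self.rows.pop(tid, None)
        if row is None:
            return set()
        key = self._key(row)
        group = self.groups[key]
        was_conflicting = len(group) > 1
        group[row[self.rhs]].discard(tid)
        if not group[row[self.rhs]]:
            del group[row[self.rhs]]
        if not was_conflicting:
            removed = set()
        elif len(group) > 1:
            removed = {tid}       # conflict persists among remaining tuples
        else:
            # Conflict resolved: the deleted tuple and all survivors leave V.
            removed = {tid}.union(*group.values())
        if not group:
            del self.groups[key]
        return removed


# Illustrative usage with made-up tuples.
checker = IncrementalCFDChecker(lhs=["zip"], rhs="street", pattern={"CC": "44"})
checker.insert(1, {"CC": "44", "zip": "EH4", "street": "Mayfield"})
added = checker.insert(2, {"CC": "44", "zip": "EH4", "street": "Crichton"})
```

In this sketch the work per update is a constant number of hash operations plus the size of the returned set, which mirrors, in a much simpler centralized setting, the goal of a cost that depends on the update and the change in violations rather than on the size of D.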