ANONYMIZE LARGE SCALE DATASETS USING MAPREDUCE ON CLOUD BASED ON TWO-PHASE TOP-DOWN SPECIALIZATION APPROACH

Data anonymization has been extensively studied and widely adopted for preserving data privacy in non-interactive data publishing and sharing scenarios. It refers to hiding the identity and/or sensitive data of the owners of data records, so that the privacy of an individual is effectively preserved while certain aggregate information remains exposed to data users for diverse analysis and mining. Anonymizing data sets via generalization to satisfy privacy requirements such as k-anonymity is a widely used category of privacy-preserving techniques. At present, the scale of the data sets that need anonymizing in many cloud applications is increasing tremendously in accordance with the cloud computing and Big Data trends, making it a challenge for commonly used software tools to capture, manage, and process such large-scale data within a tolerable elapsed time. As a result, existing anonymization approaches struggle to preserve privacy on privacy-sensitive large-scale data sets because they do not scale sufficiently. In this paper, we propose a scalable two-phase top-down specialization (TDS) approach to anonymize large-scale data sets using the MapReduce framework on cloud. In both phases of our approach, we deliberately design a group of innovative MapReduce jobs to carry out the specialization computation in a highly scalable way. Experimental evaluation results demonstrate that with this approach, the scalability and efficiency of TDS can be significantly improved over existing approaches.
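To make the notions of generalization and k-anonymity concrete, the following is a minimal single-machine sketch in Python. It is not the MapReduce jobs of the proposed approach: the taxonomy, the records, and the names `generalize`, `map_phase`, `reduce_phase`, and `is_k_anonymous` are hypothetical and chosen only for illustration. The sketch assumes that a "cut" is the set of taxonomy values currently allowed in the published data, and it counts records per generalized quasi-identifier group in a map/reduce style.

```python
from collections import Counter

# Hypothetical taxonomy tree: each value maps to its more general parent.
TAXONOMY = {
    "Engineer": "Professional",
    "Lawyer": "Professional",
    "Dancer": "Artist",
    "Writer": "Artist",
    "Professional": "Any",
    "Artist": "Any",
}

# Hypothetical records: (job, age band) is the quasi-identifier.
RECORDS = [
    ("Engineer", "30-39"),
    ("Lawyer", "30-39"),
    ("Dancer", "30-39"),
    ("Writer", "30-39"),
]


def generalize(value, cut):
    """Climb the taxonomy until the value belongs to the current cut.

    Assumes the cut covers every leaf, i.e. each leaf has an ancestor
    (or itself) in the cut.
    """
    while value not in cut:
        value = TAXONOMY[value]
    return value


def map_phase(records, cut):
    """Emit one (generalized quasi-identifier, 1) pair per record."""
    for job, age_band in records:
        yield (generalize(job, cut), age_band), 1


def reduce_phase(pairs):
    """Sum the counts for each generalized quasi-identifier group."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts


def is_k_anonymous(records, cut, k):
    """Check that every quasi-identifier group holds at least k records."""
    counts = reduce_phase(map_phase(records, cut))
    return all(n >= k for n in counts.values())
```

Under these assumptions, a top-down search would start from the most general cut `{"Any"}` and specialize it step by step, for example to `{"Professional", "Artist"}`, for as long as `is_k_anonymous` still holds. Specializing "Professional" further into "Engineer" and "Lawyer" would leave groups of size one and therefore violate 2-anonymity for the sample records.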