PDFS: Partially Dedupped File System for Primary Workloads

Primary storage dedup is difficult to be accomplished because of challenges to achieve low IO latency and high throughput while eliminating data redundancy effectively in the critical IO Path. In this paper, we design and implement the PDFS, a partially dedupped file system for primary workloads, which is built on a generalized framework using partial data lookup for efficient searching of redundant data in quickly chosen data subsets instead of the whole data. PDFS improves IO latency and throughput systematically by techniques including write path optimization, data dedup parallelization and write order preserving. Such design choices bring dedup to the masses for general primary workloads. Experimental results show that PDFS achieves 74-99 percent of the theoretical maximum dedup ratio with very small or even negative performance degradations compared with main stream file systems without dedup support. Discussions about varied configuring experiences of PDFS are also carried out.