Ferry: Toward Better Understanding of Input/Output Space for Data Wrangling Scripts
Zhongsu Luo - Zhejiang University, Hangzhou, China
Kai Xiong - Zhejiang University, Hangzhou, China
Jiajun Zhu - Zhejiang University, Hangzhou, Zhejiang, China
Ran Chen - Zhejiang University, Hangzhou, China
Xinhuan Shu - Newcastle University, Newcastle Upon Tyne, United Kingdom
Di Weng - Zhejiang University, Ningbo, China
Yingcai Wu - Zhejiang University, Hangzhou, China
Room: Bayshore V
2024-10-16T17:57:00Z
Keywords
Data wrangling, Visual analytics, Constraints, Program understanding
Abstract
Understanding the input and output of data wrangling scripts is crucial for tasks such as debugging code and onboarding new data. However, existing research on script understanding focuses primarily on revealing the process of data transformations and does not support analyzing a script's potential scope, i.e., the space of its inputs and outputs. Constructing the input/output space during script analysis is also challenging, because wrangling scripts can be semantically complex and diverse, and the associations between data objects are intricate. To help data workers understand the input and output space of wrangling scripts, we summarize ten types of constraints that express table space and build a mapping between data transformations and these constraints to guide the construction of the input/output space for individual transformations. We then propose a constraint generation model that integrates table constraints across multiple transformations. Based on this model, we develop Ferry, an interactive system that extracts and visualizes the data constraints describing the input and output space of data wrangling scripts, enabling users to grasp the high-level semantics of complex scripts and locate the origins of faulty data transformations. In addition, Ferry provides example input and output data to help users interpret the extracted constraints, and to check and resolve conflicts between these constraints and any uploaded dataset. We evaluate Ferry’s effectiveness and usability through two usage scenarios and two case studies, which cover understanding, debugging, and checking both single and multiple scripts, with and without executable data. Furthermore, an illustrative application demonstrates Ferry’s flexibility.
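To make the idea of integrating constraints across transformations more concrete, the following is a minimal toy sketch. It is not Ferry's constraint generation model. The `TableSpace` class, the `infer_spaces` function, the three operation names, and the two constraint kinds (column existence and non-null) are all hypothetical choices made for illustration, and they are not claimed to match the ten constraint types summarized in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class TableSpace:
    """Hypothetical summary of a table space (illustrative only)."""
    columns: set = field(default_factory=set)   # columns that must exist
    not_null: set = field(default_factory=set)  # columns known to hold no nulls


def infer_spaces(steps):
    """Derive toy input and output constraints from a list of (op, column) steps.

    Assumed toy operations: "dropna" (drop rows with nulls in a column),
    "drop" (remove a column), and "add" (create or overwrite a column).
    """
    source, current = TableSpace(), TableSpace()
    removed = set()
    for index, (op, column) in enumerate(steps):
        if op in ("dropna", "drop"):
            # These operations read an existing column.
            if column in removed:
                # The column was removed earlier, so this step cannot succeed.
                raise ValueError(f"step {index}: {op} reads removed column {column!r}")
            if column not in current.columns:
                # Not created by the script, so the input must provide it.
                source.columns.add(column)
                current.columns.add(column)
        if op == "dropna":
            current.not_null.add(column)
        elif op == "drop":
            current.columns.discard(column)
            current.not_null.discard(column)
            removed.add(column)
        elif op == "add":
            current.columns.add(column)
            current.not_null.discard(column)  # new values may contain nulls
            removed.discard(column)
        else:
            raise ValueError(f"step {index}: unknown operation {op!r}")
    return source, current


script = [("dropna", "age"), ("add", "age_group"), ("drop", "name")]
input_space, output_space = infer_spaces(script)
print(sorted(input_space.columns))
print(sorted(output_space.columns), sorted(output_space.not_null))
```

In this sketch, a step that reads a column removed by an earlier step raises an error naming the step index, which loosely mirrors the goal of locating the origin of a faulty transformation. A real system would need a far richer constraint vocabulary and a parser for actual wrangling code.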