HCI Thesis Proposal: Will Epperson
Description
Interactive Data Profiling Systems for Data Programming
Will Epperson
HCII PhD Thesis Proposal
Time & Location
Wednesday, January 22nd at 2:00PM EST
Newell Simon Hall (NSH) 3305
Zoom: https://cmu.zoom.us/j/
Committee
Adam Perer (co-chair), CMU
Dominik Moritz (co-chair), CMU & Apple
Sherry Tongshuang Wu, CMU
Aniket Kittur, CMU
Gagan Bansal, Microsoft Research
Abstract
Data is the backbone of many of the recent advances in AI and the growing emphasis on data-driven decision making. To be used effectively, data must be understood, modeled, cleaned, and presented; this iterative process relies on both programming and GUI-based tools. While powerful tools exist for authoring and understanding code, the process of iteratively understanding the data that code operates on is still cumbersome. Effective data understanding goes beyond merely understanding the code used to manipulate data; it involves analyzing data summaries, tracking how data evolves throughout analysis, and identifying potential issues and insights.
To address this problem, this thesis contributes the design of interactive data profiling systems for data programming. These tools help data workers understand their data through three design principles: designing for data programming workflows, providing automatic data profiles, and enabling data interaction. Interactive data profiling tools are designed for data programming workflows to facilitate fast feedback throughout the data analysis process while programming with data. This data feedback takes the form of automatic data profiles that show concise visual summaries of the dataset to help users perform exploratory data analysis. Finally, interactive data profiling tools are designed for iterative workflows where users view an initial data overview and can then express follow-up questions through code and GUI interactions such as filtering to data subsets, exploring anomalies, or editing the data.
This thesis introduces the design and evaluation of four systems for interactive data profiling across tabular and unstructured data that implement these design principles. The first two systems, AutoProfiler and Solas, target tabular data in computational notebook environments. These tools present mechanisms for lightweight data summaries that enable users to make sense of data results as they program in a notebook. The latter two systems consider interactive profiling while programming with unstructured datasets that require additional processing to produce profiling visualizations. AgDebugger enables users to profile and debug AI agent conversations by interactively editing agent messages to fix errors and steer the agents towards successful outcomes. Finally, in my proposed work I am developing and evaluating Texture, a general-purpose text profiling tool that allows users to derive structured representations of their text through code and then profile and explore the results in an interactive interface. I discuss my planned evaluation of Texture, where I will measure its effectiveness through expert case studies on diverse text analysis tasks.
Best,
Will Epperson