
HCII PhD Thesis Defense: Will Epperson



Description

Interactive Data Profiling

Will Epperson

HCII PhD Thesis Defense

Date & Time: Tuesday, June 3rd, 10 am ET

Location: NSH 4305
 

Zoom: https://cmu.zoom.us/j/96890910604?pwd=xmbcCTVa1YThch03C6EdDaJ1PfSatK.1&jst=2

Meeting ID: 96890910604, Passcode: 882527

Thesis Committee

Adam Perer (co-chair), CMU

Dominik Moritz (co-chair), CMU & Apple

Sherry Tongshuang Wu, CMU

Aniket Kittur, CMU

Gagan Bansal, Microsoft Research

Abstract

Data has been a key driver behind recent advances in science, engineering, and artificial intelligence. As datasets have grown larger and more complex, the primary bottleneck has shifted from access to data towards the human effort required to interpret it. Human expertise is essential to understand datasets; however, generating this understanding during analysis remains a time-consuming and manual process. Many AI modeling failures are, at their core, data problems: issues that might have been addressed earlier with better tools for understanding the data. Data visualization facilitates understanding through visual representations, but existing approaches to visual data exploration introduce friction that slows users down, requiring them to manually define charts and interactions through code or to switch context to a new analysis tool. How can we build flexible and lightweight systems to help people more quickly understand their data?
 

This thesis develops systems for Interactive Data Profiling that accelerate data exploration through lightweight and interactive interfaces that fit into current analysis workflows. We first motivate this problem through a large-scale interview study and survey of data scientists that reveals the potential for tools to help users manage the repetitive code used for data profiling. We then discuss the design, implementation, and evaluation of three systems that develop the approach of interactive data profiling. First, we describe AutoProfiler, a system that augments programming environments with automatic data profiles that summarize the data in memory and update as a user programs. We then extend this approach with Solas, which tracks the history of a user’s analysis code to create data profiles adapted to the current task and user interest. User evaluations demonstrate how the lightweight visualizations and fast feedback loops enabled by these systems help users quickly identify important patterns and data quality issues. Finally, we present Texture, a general-purpose text exploration tool that enables users to iterate on attributes for describing their text and then explore the results in an interactive UI. Expert user studies show how Texture enables more efficient exploration and helps users uncover new insights from their text datasets.
 

Together, these tools establish how to situate interactive data profiling within data science workflows to enable a fast feedback loop between manipulating data and inspecting the results. As data becomes an increasingly important component of modern work, interactive data profiling systems can play a critical role in enabling faster, more reliable understanding of the data behind models and decisions.

Thesis Document

https://willepperson.com/thesis.pdf 

Best,

Will Epperson