HCII PhD Thesis Proposal: Luke Guerdan
Principled Measurement and Evaluation of AI Systems Under Practical Constraints
Luke Guerdan
HCII PhD Thesis Proposal
Date & Time: Monday, June 2nd @ 10:00 a.m. ET
Location: Newell Simon Hall (NSH) 3305
Remote: Zoom Link (Meeting ID: 914 4855 0148; Passcode: 514044)
Committee:
Ken Holstein (co-chair), Carnegie Mellon University
Steven Wu (co-chair), Carnegie Mellon University
Jeffrey Bigham, Carnegie Mellon University
Alexandra Chouldechova, Microsoft Research
Abstract:
As artificial intelligence (AI) systems are introduced in a range of socially consequential settings, reliably measuring their capabilities, risks, and limitations has become crucial. Yet many properties of AI systems, such as the "helpfulness" of a chatbot response or the "fairness" of a predictive algorithm, are unobservable latent constructs. Obtaining valid, reliable, and informative measurements of such constructs is challenging in practice, given both their contested nature and the often limited resources available to measure them effectively. In response, this thesis introduces the concept of a measurement intervention: a targeted approach designed to help organizations improve the validity of AI system performance measurements under practical constraints.
I first explore measurement interventions in the context of predictive modeling for algorithmic decision support (ADS). I propose a conceptual framework that characterizes threats to the validity of predictive models used in ADS. I then report evidence from qualitative interviews with data scientists, illustrating the rich and adaptive trade-offs they make while balancing the validity of a predictive modeling formulation with competing desiderata, such as its resource requirements and portability across institutional contexts. Based on these findings, I devise two statistical approaches – "expert anchors" and "uncertainty cancellation" – that help teams more effectively repurpose existing organizational data for new measurement tasks.
The second portion of this thesis, to be expanded in proposed work, advances measurement interventions for generative AI (GenAI) evaluation. I identify simple modifications to the design of rating instructions – such as clarifying how raters should resolve ambiguous evaluation criteria – that can enable organizations to more effectively validate automated rating systems using limited and costly human ratings. My proposed work will advance GenAI evaluation on two fronts. First, I will work with AI practitioners to develop a toolkit that helps teams assess and improve the construct validity of evaluations by examining validity sub-criteria through interactive visualization modules. Second, I will develop a statistical framework that leverages doubly-robust estimation to improve the external validity of evaluations, or the extent to which they generalize from lab-based to real-world deployment settings.
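As background on the technique named above, a minimal sketch of the standard doubly-robust (augmented inverse-propensity-weighted) estimator is given below; the notation is generic and illustrative, not the specific framework proposed in the thesis.

\[
\hat{\psi}_{\mathrm{DR}} \;=\; \frac{1}{n}\sum_{i=1}^{n}\left[\, \hat{\mu}(x_i) \;+\; \frac{s_i}{\hat{\pi}(x_i)}\,\big(y_i - \hat{\mu}(x_i)\big) \right]
\]

Here $x_i$ denotes the features of example $i$, $y_i$ a human rating, $s_i \in \{0,1\}$ whether that rating was collected, $\hat{\mu}$ an outcome model (e.g., an automated rater), and $\hat{\pi}$ a model of the rating-collection probability. The estimate remains consistent if either $\hat{\mu}$ or $\hat{\pi}$ is well specified, which is the sense in which the estimator is "doubly robust."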
In sum, this thesis advances the conceptual and statistical foundations for measurement interventions: tools that help organizations rigorously evaluate AI systems under practical constraints.
Proposal Document: https://drive.google.com/file/d/1RlgOFKippL0nV2HZUi2aP2YUA6ew0nra/view?usp=sharing
Best,
Luke