
HCII Thesis Proposal: Yi-Hao Peng


When
Friday, November 21, 2025, 4:00-6:00 PM ET

Where
Newell-Simon Hall 3305

Description

Collaborative Accessible AI Agents
Yi-Hao Peng
HCII PhD Thesis Proposal
 
Date & Time: Friday, November 21, 2025 @ 4-6PM ET
Location: NSH 3305
Remote: Zoom Link
 
Thesis Committee:
Jeff Bigham (Co-Chair) - CMU HCII & LTI
Amy Pavel (Co-Chair) - UC Berkeley EECS
Jodi Forlizzi - CMU HCII
Graham Neubig - CMU LTI
Ming-Hsuan Yang - Google DeepMind & UC Merced EECS
 
Abstract:
AI agents powered by large language and multimodal models show promise for automating tasks and operating complex user interfaces from natural language commands. These agents can help people fill out forms, order items, and create multimedia content, but current systems rarely ask for clarification by default when a request is ambiguous, often hallucinate confident but incorrect descriptions, and only roughly match human judgments when comparing complex visual options. For blind and low-vision (BLV) users, such behavior directly affects both information access and control over interfaces. Many BLV users rely on screen readers that present each interface as a slow, linear sequence and hide layout and visual structure, which creates three collaboration gaps. First, users often do not know when an agent faces ambiguity, hidden defaults, or several equally valid actions, so the agent removes choice instead of surfacing it. Second, agents can misread the screen or tools and still report confident results, while BLV users have few ways to visually verify what happened. Finally, many decisions are visual in nature, such as choosing layouts or visual design styles, so BLV users need support to understand how options differ and which one better fits a goal.

In my dissertation, I develop collaborative accessible AI agents that address these collaboration gaps along three directions. First, I design a proactive UI agent that pauses automation at key decision points, asks questions when goals or preferences are unclear, gives audio feedback about agent state, and highlights key differences between options so BLV users can choose. Second, I create methods that give agents reliable multimodal grounding by generating and annotating interfaces and slide environments from code; these methods teach both local structure and global visual context and keep actions and descriptions aligned with actual interface states. Third, I build agents that explain how outcomes differ from the original inputs and how alternatives compare in content, layout, style, and quality, so BLV users can understand and refine visual decisions. My work shows how collaborative agents that surface choice, stay grounded in the real interface, and explain contrasts between options can keep automation accessible, observable, and revisable for BLV users, moving AI agents toward more genuinely collaborative roles with people.

Link to proposal document: