HCII Thesis Proposal: Yi-Hao Peng
When
-
Where
Newell-Simon Hall 3305
Description
AI agents powered by large language and multimodal models show promise for automating tasks and operating complex user interfaces from natural language commands. These agents can help people fill out forms, order items, and create multimedia content, but current systems, by default, rarely ask for clarification when a request is ambiguous, often hallucinate confident but incorrect descriptions, and only roughly match human judgments when comparing complex visual options. For blind and low-vision (BLV) users, such behavior directly affects both information access and control over interfaces. Many BLV users rely on screen readers that present each interface as a slow, linear sequence and hide layout and visual structure, which creates three collaboration gaps. First, users often do not know when an agent faces ambiguity, hidden defaults, or several equally valid actions, so the agent removes choice instead of surfacing it. Second, agents can misread the screen or tools and still report confident results, while BLV users have few ways to verify visually what actually happened. Third, many decisions are visual in nature, such as choosing layouts or visual styles, so BLV users need support to understand how options differ and which option better fits a goal.
In my dissertation, I develop collaborative accessible AI agents that address these collaboration gaps along three directions. First, I design a proactive UI agent that pauses automation at key decision points, asks questions when goals or preferences are unclear, gives audio feedback about the agent's state, and highlights key differences between options so BLV users can choose for themselves. Second, I create methods that give agents reliable multimodal grounding by generating and annotating interfaces and slide environments from code; these methods teach models both local structure and global visual context and keep actions and descriptions aligned with actual interface state. Third, I build agents that explain how outcomes differ from the original inputs and how alternatives compare in content, layout, style, and quality, so BLV users can understand and refine visual decisions. Together, this work shows how collaborative agents that surface choice, stay grounded in the real interface, and explain contrasts between options can keep automation accessible, observable, and revisable for BLV users, moving AI agents toward more genuinely collaborative roles with people.
