Robotics Thesis Defense

Speaker
MENGTIAN (MARTIN) LI
Ph.D. Student
Robotics Institute
Carnegie Mellon University

When
-

Where
Virtual Presentation (ET)

Description

We have witnessed rapid advances on major computer vision benchmarks in recent years. However, the hidden computation cost of the top solutions prevents them from being practically deployable. For example, training large models to convergence may be prohibitively expensive in practice, and autonomous driving or augmented reality may require a reaction time that rivals that of humans, typically 200 milliseconds for visual stimuli. Clearly, vision algorithms need to be adjusted or redesigned to meet resource constraints. This thesis argues that we should embrace resource constraints as first principles of algorithm design. We support this thesis with principled evaluation frameworks and novel constraint-aware solutions for various computer vision tasks.

This thesis first investigates the evaluation of vision algorithms in resource-constrained settings. Latency, the primary metric of computation cost, is typically evaluated independently of accuracy, making it hard to compare algorithms with different accuracy-latency tradeoffs. To address this issue, we propose an approach that coherently integrates latency and accuracy into a single metric that we call "streaming accuracy". We further show that we can build an evaluation framework on top of this metric and generalize it to arbitrary single-frame understanding tasks. This streaming perception framework yields several surprising conclusions and solutions; for example, latency is sometimes minimized by sitting idle and "doing nothing"! We also discuss a future extension of streaming perception to streaming forecasting, where the evaluation protocol is one step closer to real-world applications with full-stack perception. Additionally, we propose a formal setting for studying generic deep network training in the non-asymptotic, resource-constrained regime, i.e., budgeted training.

This thesis then explores novel task-specific solutions under resource constraints. Far-range LiDAR-based 3D object detection is a compute-intensive task. Contemporary solutions use 3D voxel representations, often encoded as a bird's-eye view (BEV) feature map. While intuitive, such representations scale quadratically with the spatial range of the map, making them ill-suited for far-field perception. We present a pyramidal representation that retains the benefits of BEV while remaining efficient by exploiting the following insight: near-field LiDAR measurements are dense and optimally encoded by small voxels, while far-field measurements are sparse and better encoded with large voxels. Additionally, this thesis proposes biologically inspired attentional warping for 2D object detection and discusses its future extension to arbitrary image-based tasks. We also propose a progressive distillation approach for learning lightweight detectors from a sequence of teacher models. To complete the perception stack, we propose future object detection with backcasting for end-to-end detection, tracking, and forecasting.
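The core of streaming accuracy is a pairing rule: each ground-truth frame is scored against whatever prediction the algorithm had actually finished computing by that wall-clock instant, so slow models are penalized by being matched to stale outputs. A minimal sketch of that pairing step, assuming the evaluator is handed sorted timestamp lists (function and variable names here are illustrative, not the thesis's actual API):

```python
from bisect import bisect_right

def streaming_pairs(gt_times, pred_finish_times):
    """For each ground-truth timestamp, find the index of the most
    recent prediction whose computation finished by that time.
    Returns (gt_index, pred_index or None) pairs; accuracy (e.g., AP)
    is then computed over these pairs instead of per-frame outputs."""
    pairs = []
    for i, t in enumerate(gt_times):
        j = bisect_right(pred_finish_times, t) - 1  # last finished prediction
        pairs.append((i, j if j >= 0 else None))    # None: nothing ready yet
    return pairs

# Example: ground truth every 33 ms, a detector that takes 100 ms per frame.
gt_times = [0.033 * k for k in range(6)]
pred_finish_times = [0.1, 0.2, 0.3]
print(streaming_pairs(gt_times, pred_finish_times))
```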
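For budgeted training, the key shift is that the annealing schedule is tied to a fixed iteration budget rather than to convergence. A schedule in that spirit, sketched under the assumption of a linear decay to zero at the end of the budget:

```python
def linear_budget_lr(base_lr, step, budget_steps):
    """Anneal the learning rate linearly from base_lr to zero over a
    fixed iteration budget, regardless of asymptotic convergence."""
    return base_lr * max(0.0, 1.0 - step / budget_steps)

# Example: the same recipe adapts to a 10k-step or a 1k-step budget.
for step in (0, 5000, 9999):
    print(step, linear_budget_lr(0.1, step, 10000))
```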
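The quadratic-scaling argument for BEV grids, and the payoff of range-dependent voxel sizes, can be made concrete with a back-of-the-envelope cell count. The voxel sizes and level structure below are illustrative assumptions, not the thesis's exact configuration:

```python
def bev_cells(range_m, voxel_m):
    """Cell count of a uniform square BEV grid centered on the ego
    vehicle: grows quadratically with spatial range."""
    side = int(2 * range_m / voxel_m)
    return side * side

def pyramid_cells(range_m, near_voxel_m, levels):
    """Each pyramid level doubles both its range and its voxel size,
    so every level has the same cell count: total cost grows with the
    number of levels, not with the square of the range."""
    total = 0
    r, v = range_m / (2 ** (levels - 1)), near_voxel_m
    for _ in range(levels):
        total += bev_cells(r, v)
        r, v = 2 * r, 2 * v
    return total

print(bev_cells(200, 0.1))         # uniform fine grid: 16,000,000 cells
print(pyramid_cells(200, 0.1, 4))  # multi-scale pyramid: 1,000,000 cells
```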
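Attentional warping resamples the input so that salient regions are magnified before detection, in the spirit of foveal vision. A one-dimensional sketch of saliency-driven resampling via inverse-CDF sampling (the 2D version, and where the saliency map comes from, are left out; all names are hypothetical):

```python
import numpy as np

def saliency_warp_1d(out_coords, saliency):
    """Map uniform output coordinates in [0, 1] to input coordinates
    so that high-saliency regions receive more output samples,
    i.e., appear magnified after resampling."""
    cdf = np.cumsum(saliency).astype(float)
    cdf /= cdf[-1]
    in_positions = np.linspace(0.0, 1.0, len(saliency))
    # Inverse-transform sampling: uniform steps in CDF space land
    # densely wherever the saliency density is high.
    return np.interp(out_coords, cdf, in_positions)

# Example: a salient bump at the center pulls samples toward it.
sal = np.exp(-((np.linspace(0, 1, 100) - 0.5) ** 2) / 0.01) + 0.05
print(saliency_warp_1d(np.linspace(0, 1, 9), sal).round(3))
```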
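Progressive distillation trains a lightweight student against a sequence of teachers, for example checkpoints ordered from early to late in the teacher's own training, so the target distribution evolves gradually rather than jumping straight to the hardest targets. A minimal PyTorch sketch using a classification-style soft-label loss as a stand-in for detection-specific losses (all names are assumptions):

```python
import torch
import torch.nn.functional as F

def progressive_distill(student, teachers, loader, optimizer,
                        epochs_per_teacher=1, T=4.0):
    """Distill a student from a sequence of teacher checkpoints,
    switching teachers as training progresses."""
    for teacher in teachers:  # e.g., early -> late checkpoints
        teacher.eval()
        for _ in range(epochs_per_teacher):
            for x, _ in loader:
                with torch.no_grad():
                    t_logits = teacher(x)
                s_logits = student(x)
                # Temperature-softened KL between student and teacher.
                loss = F.kl_div(
                    F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * T * T
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```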

Thesis Committee:
Deva Ramanan (Chair)
Martial Hebert
Mahadev Satyanarayanan
Raquel Urtasun (Waabi & University of Toronto)
Ross Girshick (Meta AI Research)

Additional Information

Zoom Participation. See announcement.