Francis Engelmann: Towards High-Fidelity Open-Vocabulary 3D Scene Understanding



Abstract: In this talk, I will present some of our recent work on open-vocabulary 3D scene understanding. We start with Transformer-based networks and demonstrate their use as general-purpose models for a variety of 3D scene understanding tasks, including object segmentation, human body part segmentation, and vectorized floorplan reconstruction. Despite their impressive ability to solve diverse tasks, these fully supervised models typically fail when applied to “in-the-wild” scenes. This motivates the need for open-vocabulary approaches that can operate in real-world scenarios. I will then present recent work on open-vocabulary 3D scene segmentation that makes use of foundation models such as CLIP and SAM. Although foundation models are revolutionizing the field in an exciting way, we will also look at current limitations and open challenges.

Short Bio: Francis Engelmann is a PostDoc with Prof. Marc Pollefeys at ETH Zurich, and a visiting researcher at Google with Federico Tombari. His research interests lie at the intersection of computer vision and 3D scene understanding. Francis is a Fellow of the ETH AI Center and the ELLIS Society, and the recipient of the ETHZ Career Seed Award.
