Exploring Open-Vocabulary Scene Understanding in XR... anyone tried it?

Hey everyone! I wanted to share our recent work, OpenMaskXR, which focuses on open-vocabulary scene understanding in extended reality. We demonstrate how commodity XR headsets can identify object instances from natural-language user queries, going beyond classification over a fixed set of object categories and enabling more dynamic interactions. You can check out our video here: https://youtu.be/rDraLkbDRW0. I’m curious if anyone has seen similar advancements in the industry. Happy to answer any questions!
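For anyone unfamiliar with the open-vocabulary idea, here is a minimal sketch of how this kind of query-to-instance matching is typically done: text queries and per-instance embeddings live in a shared vision-language embedding space (CLIP-style), and instances are ranked by cosine similarity. This is not the OpenMaskXR pipeline itself; the `instance_embeddings` tensor below is a random placeholder standing in for embeddings you would precompute from segmented instances, and `query_instances` is a hypothetical helper.

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder: per-instance embeddings you would normally precompute offline
# from masked instance crops, shape (num_instances, 512), L2-normalised.
instance_embeddings = torch.randn(20, 512, device=device)
instance_embeddings /= instance_embeddings.norm(dim=-1, keepdim=True)

def query_instances(prompt: str, top_k: int = 3):
    """Rank scene instances by cosine similarity to a free-form text query."""
    with torch.no_grad():
        tokens = clip.tokenize([prompt]).to(device)
        text_emb = model.encode_text(tokens).float()
        text_emb /= text_emb.norm(dim=-1, keepdim=True)
    scores = (instance_embeddings @ text_emb.T).squeeze(-1)
    return scores.topk(top_k)

values, indices = query_instances("a comfy armchair near the window")
print(indices.tolist(), values.tolist())
```

Because the query is just embedded text, the same scene can be searched with arbitrary phrases rather than a fixed label set, which is what "open-vocabulary" refers to here.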

Very interesting! I opened two issues on GitHub regarding the app and the WebXR client so I can test it more thoroughly. Thanks for sharing.

This sounds like a game changer for XR development. How do you handle the natural language processing aspect?

I love the idea of making computing invisible in XR. It’s about time we rethink UI. Any examples already in the industry?

What hardware do you recommend for testing OpenMaskXR? I’m curious if it works with lower-end devices.

I’m looking forward to trying this out. Any tips for getting started with your setup?