I just read a paper by Lin Duan, Yanming Xiu, and Maria Gorlatova discussing how Vision-Language Models (VLMs) might help evaluate AR-generated scenes. It's fascinating because while VLMs like GPT, Gemini, and Claude can often identify virtual content in an AR scene, they seem to struggle when that content is more complex or seamlessly integrated with the real environment. Has anyone else read this? What do you think about using VLMs for this purpose?
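To make the task concrete, here's a rough sketch of the kind of yes/no probe you could send a VLM for this. To be clear, this is my own toy example, not the paper's setup: the prompt wording, the ar_scene.jpg file, and the choice of OpenAI's gpt-4o are all my assumptions.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical local AR screenshot; not an image from the paper.
with open("ar_scene.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Ask the model whether the scene contains virtual (AR-generated) content.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Does this image contain AR-generated (virtual) content? "
                     "Answer 'yes' or 'no', then list each virtual object you see."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

You'd then compare the model's yes/no answers against ground-truth labels for each scene to score it.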
This sounds interesting. I wonder how they measured the VLMs' effectiveness in this context.
I find it cool that they report a True Positive Rate (TPR) of 93% for perception. Do you think that's high enough for practical applications?
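For context on what that number means: TPR = TP / (TP + FN), i.e., the fraction of scenes that actually contain virtual content that the model correctly flags. A quick sketch of the computation; the labels below are invented for illustration, not the paper's data:

```python
def true_positive_rate(y_true, y_pred):
    """TPR = TP / (TP + FN): share of actual positives correctly flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

# 1 = scene contains virtual content, 0 = it does not (made-up labels)
ground_truth = [1, 1, 1, 0, 1, 0]
vlm_answers  = [1, 1, 0, 0, 1, 0]
print(f"TPR: {true_positive_rate(ground_truth, vlm_answers):.2%}")  # 75.00%
```

At 93%, roughly 1 in 14 scenes that really contain virtual content would still be missed, which is why I'm asking whether that's good enough in practice.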