Abstract: Visual reasoning – the ability to interpret the visual world–is crucial for embodied agents that operate within three-dimensional scenes. Progress in AI has led to vision and language models ...