Poster in Workshop: Synthetic Data Generation with Generative AI
Evaluating VLMs for Property-Specific Annotation of 3D Objects
Rishabh Kabra · Loic Matthey · Alexander Lerchner · Niloy Mitra
Keywords: [ vision language models ] [ physical properties ] [ semantic annotation ] [ 3d objects ]
3D objects, which often lack clean text descriptions, present an opportunity to evaluate pretrained vision language models (VLMs) on a range of annotation tasks, from describing object semantics to inferring physical properties. An accurate response must take into account the full appearance of the object in 3D, the various ways of phrasing the question or prompt, and changes in other factors that affect the response. We present a method to marginalize over arbitrary factors varied across VLM queries, which relies on the VLM's scores for sampled responses. We first show that this aggregation method can outperform a language model (e.g., GPT-4) at summarization, for instance by avoiding hallucinations when responses contain contrasting details. Second, we show that aggregated annotations are useful for prompt chaining: they improve downstream VLM predictions (e.g., of object material when the object's type is supplied as an auxiliary input in the prompt). Such auxiliary inputs also allow ablating and measuring the contribution of visual reasoning over language-only reasoning. Using these evaluations, we show that VLMs approach the quality of human-verified annotations on both type and material inference on the large-scale Objaverse dataset.
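To make the aggregation idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes a hypothetical `vlm_score` function that returns the VLM's log-likelihood of a candidate response given a rendered view and a prompt, and it marginalizes over views and prompt phrasings by summing log-scores before picking the best-scoring response. The `material_prompt` helper illustrating prompt chaining is likewise a hypothetical example.

```python
from itertools import product


def vlm_score(view, prompt, response):
    """Hypothetical stand-in: the VLM's log-likelihood of `response`
    given a rendered `view` of the object and a `prompt`."""
    raise NotImplementedError


def aggregate_annotation(views, prompts, candidate_responses):
    """Marginalize over query factors (views x prompt phrasings) by summing
    log-scores, then keep the highest-scoring candidate as the annotation."""
    best_response, best_score = None, float("-inf")
    for response in candidate_responses:
        # Summing log-likelihoods corresponds to multiplying likelihoods
        # across all combinations of the varied factors.
        total = sum(
            vlm_score(view, prompt, response)
            for view, prompt in product(views, prompts)
        )
        if total > best_score:
            best_response, best_score = response, total
    return best_response


def material_prompt(object_type):
    """Prompt chaining: feed an aggregated type annotation back in as an
    auxiliary input when querying a downstream property such as material."""
    return f"This object is a {object_type}. What material is it made of?"
```

Because the auxiliary input (here, the object type) can be supplied with or without the image, this setup also makes it straightforward to compare visual reasoning against language-only reasoning.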