Poster
Unleash Region Understanding in Intermediate Layers for MLLM-based Referring Expression Generation
Yaoyuan Liang · Zhuojun Cai · Jian Xu · Guanbo Huang · Yiran Wang · Xiao Liang · Jiahao Liu · Ziran Li · Jingang Wang · Shao-Lun Huang
This paper studies Multi-modal Large Language Model (MLLM)-based Referring Expression Generation (REG), which aims to generate an unambiguous text description that applies to exactly one object or region in an image. MLLM-based REG models naturally inherit the hallucination issues of MLLMs. We find a trade-off between detailed description and accurate targeting of the referred object: precise object descriptions require generating sentences with more detail, which in turn increases the probability of introducing hallucinations. To address this issue, we propose a training-free method, named "unleash-then-eliminate", which first elicits the latent region information in intermediate layers and then applies a cycle-consistency-based decoding method to suppress hallucinations. Extensive experiments on the RefCOCOg and PHD benchmarks show that our method outperforms existing approaches on both semantic and hallucination-related metrics, demonstrating its effectiveness for this task. Code will be made available at https://anonymous.4open.science/status/NeurIPS24-DE0B.
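The abstract only names the two stages of "unleash-then-eliminate"; the Python sketch below illustrates one plausible reading of the pipeline, not the authors' released implementation. It assumes a hypothetical MLLM handle with a `generate_from_layer` helper (decoding candidate expressions from intermediate-layer hidden states) and a separate referring-expression-comprehension model with a `ground_expression` helper used as the cycle-consistency checker; both names, and the layer indices, are illustrative assumptions.

```python
# Minimal sketch of an "unleash-then-eliminate"-style pipeline, under the
# assumptions stated above. `mllm.generate_from_layer` and
# `rec_model.ground_expression` are hypothetical placeholders.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def unleash_then_eliminate(mllm, rec_model, image, target_box,
                           layers=(16, 20, 24, 28), iou_thresh=0.5):
    # Unleash: draw candidate expressions from intermediate layers, where
    # region-level detail is assumed to be better preserved than at the
    # final layer.
    candidates = []
    for layer in layers:
        # Hypothetical helper: decode a description of target_box from the
        # hidden states of `layer` projected through the LM head.
        candidates.append(mllm.generate_from_layer(image, target_box, layer=layer))

    # Eliminate: cycle-consistency check. Feed each candidate back to a
    # comprehension model; keep it only if the expression re-localizes
    # to the intended target region.
    scored = []
    for expr in candidates:
        pred_box = rec_model.ground_expression(image, expr)  # hypothetical
        score = iou(pred_box, target_box)
        if score >= iou_thresh:
            scored.append((score, expr))

    # Return the candidate that most unambiguously refers to the target;
    # fall back to the deepest-layer candidate if none pass the check.
    return max(scored)[1] if scored else candidates[-1]
```

The cycle-consistency step operationalizes "unambiguous": a description is kept only if an independent grounding model maps it back to the referred region rather than to some other object.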