Poster
Alleviating Attention Bias for Visual-Informed Text Generation
Miso Choi · Jinyoung Kim · Minseo Yoon · Ji Soo Lee · Hyunwoo Kim
Large Vision-Language Models (LVLMs) have shown remarkable performance in describing visual information with impressive linguistic ability, powering diverse applications. However, they often generate inaccurate descriptions of visual information, referred to as hallucination; resolving this issue is therefore important for employing LVLMs in real-world scenarios. Although various approaches have been proposed in the literature, mitigating hallucination in long-form generation remains challenging. We observed the Attention Bias phenomenon in LVLMs, where the model allocates a large amount of attention to a few specific tokens regardless of the input. Through a thorough analysis of the correlation between Attention Bias and hallucination, we attribute the cause of hallucination to the internal attention mechanism of Transformers. To ALLEviate hallucination in text GenerATOR (ALLEGATOR), we propose the Attention Moderator, which refines attention efficiently during training, and Attention Soft-Clipping, which guarantees a stable distribution for generating visually grounded text. We empirically show that our methods enable more accurate descriptions by adaptively attending to visual inputs with sufficient attention. ALLEGATOR achieves significant improvements on hallucination benchmarks, improving POPE Precision by up to 5.85% on the Popular split and 7.4% on the Adversarial split, and decreases the percentage of hallucinated objects by up to 13.65% under the challenging setting of long-sequence generation.
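The abstract does not specify the exact formulation of Attention Soft-Clipping. A minimal sketch of one plausible reading, assuming a tanh-based soft cap on attention logits before the softmax (the function name `soft_clip_attention` and the `clip_value` parameter are hypothetical, not from the paper):

```python
import torch
import torch.nn.functional as F

def soft_clip_attention(scores: torch.Tensor, clip_value: float = 5.0) -> torch.Tensor:
    """Softly bound attention logits before the softmax.

    `scores` holds raw attention logits of shape (..., seq_len).
    Rescaling through tanh keeps every logit inside
    (-clip_value, clip_value) while staying smooth and
    differentiable, so no single token can absorb an extreme
    share of attention mass.
    """
    clipped = clip_value * torch.tanh(scores / clip_value)
    return F.softmax(clipped, dim=-1)

# Toy example: one "attention sink" token with an extreme logit.
logits = torch.tensor([[12.0, 0.5, 0.3, 0.2]])
print(F.softmax(logits, dim=-1))    # nearly all mass on token 0
print(soft_clip_attention(logits))  # mass spread more evenly
```

The illustrated idea matches the abstract's stated goal of a stable attention distribution: a smooth cap on logits prevents the few biased tokens from dominating, leaving more attention mass available for visual tokens.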