

Poster in Workshop: Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning

LangDA: Adapting Visual Features with Instruction Tuning for Semantic Segmentation

Chang Liu · Saad Hossain · C Thomas · Kwei-Herng Lai · Raviteja Vemulapalli · Sirisha Rambhatla · Alexander Wong


Abstract: Existing methods for unsupervised domain adaptation for semantic segmentation (DASS) mitigate visual domain shifts by addressing image style, content, and context. However, despite recent advances in large vision-language foundation models (LLVMs), their role in bridging domain gaps remains under-explored in the context of DASS. This paper proposes a novel language-guided adaptation approach (LangDA) that leverages visual instruction tuning to augment source data with local image-level textual descriptions. LangDA uses a CLIP-based pre-trained contrastive vision-language model to steer source and target image features toward their domain-invariant textual representation. This is accomplished via a large vision-language captioning model and by introducing an additional language objective on top of the unsupervised and supervised losses. To the best of our knowledge, this is the first work that uses text to align vision domains in DASS. The proposed prompt-driven unsupervised domain adaptation (UDA) approach achieves 62.0\% mean Jaccard index on the standard Synthia $\to$ Cityscapes benchmark, outperforming the state of the art by 0.9\% with negligible parameter overhead.
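
The abstract does not spell out the language objective, so the following is only a minimal sketch of one plausible form: a cosine-distance term that pulls image features toward a caption embedding, added to the supervised and unsupervised losses. The function names, the cosine-distance choice, and the `language_weight` value are all assumptions made for illustration, not the authors' formulation. Plain Python lists stand in for the feature tensors a real pipeline would use.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def language_alignment_loss(image_features, text_embedding):
    """Mean cosine distance (1 - similarity) between each image feature
    and a single caption embedding. Hypothetical stand-in for the
    'language objective' mentioned in the abstract."""
    distances = [1.0 - cosine_similarity(f, text_embedding) for f in image_features]
    return sum(distances) / len(distances)


def total_loss(supervised, unsupervised, language, language_weight=0.1):
    """Combine the three terms; language_weight is an assumed hyperparameter."""
    return supervised + unsupervised + language_weight * language


# Toy example: two 2-D image features and one caption embedding.
features = [[1.0, 0.0], [0.0, 1.0]]
caption = [1.0, 0.0]
lang = language_alignment_loss(features, caption)
combined = total_loss(supervised=1.0, unsupervised=0.5, language=lang)
```

In this toy case the first feature matches the caption direction exactly and the second is orthogonal to it, so the alignment term is the average of a distance of 0 and a distance of 1. In a setup like the one described, the caption embedding would come from the text encoder of a contrastive vision-language model, and the same term could be applied to both source and target features.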
