

Poster in Workshop: 2nd Workshop on Touch Processing: From Data to Knowledge

Aligning Touch, Vision, and Language for Multimodal Perception

Max Fu · Gaurav Datta · Huang Huang · William Panitch · Jaimyn Drake · Joseph Ortiz · Mustafa Mukadam · Mike Lambeta · Roberto Calandra · Ken Goldberg


Abstract:

Touch, a crucial human sensing modality, has been absent from multimodal generative language models because tactile data is difficult to label. This work addresses the gap by collecting tactile and visual data simultaneously, so that GPT-4V can generate pseudo-labels from the visual observations alone. The resulting dataset comprises 44K vision-touch pairs with English labels (10% human-annotated, 90% GPT-4V pseudo-labels). A touch-vision-language (TVL) model trained on this dataset shows improved tactile-vision-language alignment (+29% classification accuracy) over existing models and outperforms GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark.
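
The abstract describes aligning a touch encoder with vision and language representations using paired data. Below is a minimal sketch, not the authors' released code, of one common way to realize such alignment: a trainable touch encoder is pulled toward frozen vision and text embeddings of the same scene with a symmetric InfoNCE loss. The encoder architectures, embedding dimension, input shapes, and the use of random tensors as stand-ins for sensor data and GPT-4V caption features are all illustrative assumptions.

```python
# Sketch of tri-modal contrastive alignment (illustrative, not the TVL implementation).
# A touch encoder is trained to match frozen vision/text embeddings of paired samples.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 512  # assumed shared embedding size


class ToyEncoder(nn.Module):
    """Stand-in encoder: flattens the input and projects it to the shared space."""

    def __init__(self, in_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.GELU(), nn.Linear(1024, EMB_DIM)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.proj(x.flatten(1))
        return F.normalize(z, dim=-1)  # unit-norm embeddings for cosine similarity


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: the i-th rows of `a` and `b` form the positive pair."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Hypothetical encoders; input sizes are placeholders. In practice the vision and
# text encoders would be a pretrained CLIP-style model kept frozen.
touch_enc = ToyEncoder(in_dim=3 * 64 * 64)      # e.g. a tactile sensor image
vision_enc = ToyEncoder(in_dim=3 * 224 * 224)   # frozen RGB encoder stand-in
text_enc = ToyEncoder(in_dim=768)               # frozen caption-feature stand-in

optimizer = torch.optim.AdamW(touch_enc.parameters(), lr=1e-4)

# One illustrative training step on a random "batch" of paired data.
touch = torch.randn(8, 3, 64, 64)
vision = torch.randn(8, 3, 224, 224)
caption_feats = torch.randn(8, 768)  # stands in for embedded (pseudo-)labels

with torch.no_grad():  # vision and text branches stay frozen
    z_vision = vision_enc(vision)
    z_text = text_enc(caption_feats)
z_touch = touch_enc(touch)

optimizer.zero_grad()
loss = info_nce(z_touch, z_vision) + info_nce(z_touch, z_text)
loss.backward()
optimizer.step()
```

Training only the touch branch against frozen vision/text embeddings is one plausible design for the reported tactile-vision-language alignment; the actual TVL training recipe may differ.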
