

Poster in Workshop: Workshop on Video-Language Models

Exploring In-Context Ensemble with Video-Language Models for Low-Level Workflow Understanding

Moucheng Xu · Evangelos Chatzaroulas · Luc McCutcheon · Abdul Ahad · Hamzah Azeem · Janusz Marecki · Ammar Anwar


Abstract:

A Standard Operating Procedure (SOP) is a step-by-step written guide for a business software workflow, derived from a video demonstration. SOPs are a crucial step toward automating end-to-end software workflows, but creating them manually can be time-consuming. Recent advances in large video-language models offer the potential to automate SOP generation by analyzing recordings of human demonstrations. However, current large video-language models struggle with zero-shot SOP generation. We explore in-context learning with video-language models for SOP generation and report that it sometimes helps. We then propose an in-context ensemble learning approach to further enhance the capabilities of the models in SOP generation.
