

Poster in Workshop: Workshop on Video-Language Models

Mobile OS Task Procedure Extraction from YouTube

Yunseok Jang · Yeda Song · Sungryull Sohn · Lajanugen Logeswaran · Tiange Luo · Honglak Lee


Abstract:

We present MOTIFY, a novel approach for predicting scene transitions and actions from mobile operating system (OS) task videos. By leveraging pretrained Vision-Language Models (VLMs), MOTIFY extracts task sequences from real-world YouTube videos without manual annotation. Our method addresses the limitations of existing approaches, which rely on manual data annotation or simulation environments. We demonstrate MOTIFY's effectiveness on a diverse set of mobile OS tasks across multiple platforms, outperforming baseline methods in scene transition detection and action prediction. This approach opens new possibilities for scalable, real-world mobile agent development and video understanding research.
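As a rough illustration of the scene transition detection task the abstract mentions, the sketch below flags hard cuts in a frame sequence using a simple pixel-difference heuristic. This is a hypothetical, minimal stand-in for intuition only; MOTIFY itself relies on pretrained VLMs, not raw pixel differencing, and the function name, threshold, and synthetic clip here are all assumptions.

```python
import numpy as np

def detect_transitions(frames, threshold=0.25):
    """Flag indices where consecutive frames differ strongly.

    frames: list of HxW (or HxWxC) arrays with values in [0, 1].
    Returns indices i such that frame i likely starts a new scene.
    A crude pixel-difference stand-in for learned transition detection.
    """
    transitions = []
    for i in range(1, len(frames)):
        # Mean absolute difference between consecutive frames.
        diff = np.abs(frames[i].astype(float) - frames[i - 1].astype(float)).mean()
        if diff > threshold:
            transitions.append(i)
    return transitions

# Synthetic clip: 4 dark frames followed by 4 bright frames (one hard cut).
clip = [np.zeros((8, 8)) for _ in range(4)] + [np.ones((8, 8)) for _ in range(4)]
print(detect_transitions(clip))  # → [4]
```

In a real mobile OS screen recording, such a detector would fire on app switches and full-screen navigations; subtler UI changes are exactly where a VLM-based approach would be needed instead.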
