

Spotlight
in
Workshop: Multimodal Algorithmic Reasoning Workshop

ENTER: Event Based Interpretable Reasoning for VideoQA

Hammad Ayyubi · Junzhang Liu · Zhecan Wang · Hani Alomari · Chia-Wei Tang · Ali Asgarov · Md. Atabuzzaman · Najibul Haque Sarker · Zaber Hakim · Shih-Fu Chang · Chris Thomas

Sun 15 Dec 12:10 p.m. PST — 12:15 p.m. PST
 
presentation: Multimodal Algorithmic Reasoning Workshop
Sun 15 Dec 8:25 a.m. PST — 5:05 p.m. PST

Abstract:

In this paper, we present an interpretable system for event-based video question answering that combines visual input processing with transparent reasoning. Existing top-down reasoning methods emphasize interpretability but often disregard visual data, while bottom-up approaches produce answers directly from visual data but lack interpretability in their decision making. Our proposed system, ENTER, bridges this gap by representing videos as event graphs, where nodes represent individual events and edges capture their hierarchical, temporal, or causal relationships. This structured representation allows us to use large language models to generate interpretable code that operates directly on the event graphs to answer complex questions, including those with long-range dependencies. Experimental results on NExT-QA demonstrate that our method outperforms existing top-down approaches and achieves competitive performance against bottom-up approaches, while offering superior interpretability and explainability in the reasoning process.
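
The abstract does not include implementation details, but the event-graph idea can be illustrated with a minimal sketch. The Python below is purely hypothetical: the class names, relation labels, and toy query are assumptions for illustration, not the authors' actual data structures or generated code.

```python
from dataclasses import dataclass, field

# Hypothetical, minimal event-graph representation; names and relation
# labels are illustrative and not taken from the ENTER paper.
@dataclass
class Event:
    event_id: str
    description: str

@dataclass
class EventGraph:
    events: dict = field(default_factory=dict)   # event_id -> Event
    edges: list = field(default_factory=list)    # (src_id, relation, dst_id)

    def add_event(self, event):
        self.events[event.event_id] = event

    def add_edge(self, src_id, relation, dst_id):
        # relation is assumed to be one of "hierarchical", "temporal", "causal"
        self.edges.append((src_id, relation, dst_id))

    def neighbors(self, event_id, relation):
        # Return events reachable from event_id via the given relation type.
        return [self.events[dst] for src, rel, dst in self.edges
                if src == event_id and rel == relation]

# Toy usage: two events linked by a causal edge, queried the way an
# LLM-generated program might when answering a "why" question.
g = EventGraph()
g.add_event(Event("e1", "the child drops the cup"))
g.add_event(Event("e2", "the cup shatters on the floor"))
g.add_edge("e1", "causal", "e2")
print([e.description for e in g.neighbors("e1", "causal")])
```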
