Spotlight in Workshop: Multimodal Algorithmic Reasoning Workshop
ENTER: Event Based Interpretable Reasoning for VideoQA
Hammad Ayyubi · Junzhang Liu · Zhecan Wang · Hani Alomari · Chia-Wei Tang · Ali Asgarov · Md. Atabuzzaman · Najibul Haque Sarker · Zaber Hakim · Shih-Fu Chang · Chris Thomas
Sun 15 Dec 8:25 a.m. PST — 5:05 p.m. PST
In this paper, we present an interpretable system for event-based video question answering that combines visual input processing with transparent reasoning. Existing top-down reasoning methods emphasize interpretability but often disregard visual data, while bottom-up approaches produce responses directly from visual data but offer little insight into their decision making. Our proposed system, ENTER, bridges this gap by representing videos as event graphs, where nodes represent individual events and edges capture their hierarchical, temporal, or causal relationships. This structured representation allows us to use large language models to generate interpretable code that operates directly on the event graphs to answer complex questions, including those with long-range dependencies. Experimental results on NextQA demonstrate that our method outperforms existing top-down approaches and is competitive with bottom-up approaches, while offering superior interpretability and explainability in the reasoning process.
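To make the event-graph idea concrete, the following is a minimal sketch of how a video might be stored as events with typed edges and then queried by a short program of the kind a language model could generate. It is not the ENTER implementation. The class `EventGraph`, the helpers `find_event`, `answer_why`, and `answer_what_next`, the edge-direction conventions, and the toy video are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical relation labels, taken from the three edge types named in the abstract.
RELATIONS = {"hierarchical", "temporal", "causal"}


@dataclass
class EventGraph:
    """Illustrative event graph: nodes are events, edges are typed relations."""
    events: dict = field(default_factory=dict)  # event id -> text description
    edges: list = field(default_factory=list)   # (source id, relation, target id)

    def add_event(self, event_id, description):
        self.events[event_id] = description

    def add_edge(self, source, relation, target):
        # Assumed conventions: "temporal" means source happens before target,
        # "causal" means source causes target, and "hierarchical" means
        # target is a sub-event of source.
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.edges.append((source, relation, target))

    def successors(self, event_id, relation):
        return [t for s, r, t in self.edges if s == event_id and r == relation]

    def predecessors(self, event_id, relation):
        return [s for s, r, t in self.edges if t == event_id and r == relation]


def find_event(graph, keyword):
    """Return the id of the first event whose description contains keyword."""
    for event_id, description in graph.events.items():
        if keyword in description:
            return event_id
    return None


def answer_why(graph, keyword):
    """'Why did X happen?': follow causal edges backwards from X."""
    target = find_event(graph, keyword)
    if target is None:
        return []
    return [graph.events[e] for e in graph.predecessors(target, "causal")]


def answer_what_next(graph, keyword):
    """'What happened after X?': follow temporal edges forwards from X."""
    target = find_event(graph, keyword)
    if target is None:
        return []
    return [graph.events[e] for e in graph.successors(target, "temporal")]


# Toy video, invented for this example.
graph = EventGraph()
graph.add_event("e1", "child drops toy")
graph.add_event("e2", "child cries")
graph.add_event("e3", "parent picks up child")
graph.add_edge("e1", "temporal", "e2")
graph.add_edge("e1", "causal", "e2")
graph.add_edge("e2", "temporal", "e3")
graph.add_edge("e2", "causal", "e3")

print(answer_why(graph, "cries"))
print(answer_what_next(graph, "cries"))
```

Because each answer is produced by walking named edges, the path from question to answer can be inspected step by step, which is the sense of interpretability the abstract describes.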