A Needle in a Haystack: Finding Contextual Knowledge for Video Question Answering

Nov 1, 2025 · Shunsuke Ichimiya, Yuta Nakashima
Abstract
We present a new video question answering (VideoQA) task, which we coin MAGQA, built on an existing VideoQA dataset and requiring the retrieval of relevant information from a given knowledge base. This task can be seen as a proxy for question answering about a person's daily life, where a historical record of their activities is available, e.g., as an egocentric video or vlog. Retrieval from the knowledge base is key to the task, yet annotations for training the retriever are practically unavailable. We propose a method, called MAGNet, with two-stage training that leverages large language models (LLMs) to address this task. The first stage pre-trains the knowledge base retriever with pseudo-questions. The second stage further fine-tunes the retriever using feedback from LLMs on the retrieval results. Our experimental results show that accurate knowledge retrieval is crucial in MAGQA, and that MAGNet achieves state-of-the-art performance even in the absence of annotations for retriever training.
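The two-stage training described above can be sketched in miniature. The following is a toy illustration, not the paper's implementation: the bag-of-words "retriever", the pseudo-question generator, and the feedback function are all hypothetical stand-ins for the learned retriever and the LLM judge used by MAGNet.

```python
from collections import Counter

def embed(text):
    """Bag-of-words vector (toy stand-in for a learned encoder)."""
    return Counter(text.lower().split())

def score(q_vec, d_vec, weights):
    """Weighted term overlap between a question and a knowledge-base entry."""
    return sum(weights.get(w, 1.0) * q_vec[w] * d_vec.get(w, 0) for w in q_vec)

def retrieve(question, kb, weights):
    """Return the highest-scoring knowledge-base entry for the question."""
    q_vec = embed(question)
    return max(kb, key=lambda d: score(q_vec, embed(d), weights))

def pretrain(kb, weights, lr=0.1):
    """Stage 1: pre-train with pseudo-questions generated from the KB itself,
    so each pseudo-question's source entry is a known positive."""
    for entry in kb:
        pseudo_q = " ".join(entry.split()[:3])  # crude pseudo-question generator
        for w in set(embed(pseudo_q)) & set(embed(entry)):
            weights[w] = weights.get(w, 1.0) + lr  # up-weight matched terms

def finetune(questions, kb, weights, llm_feedback, lr=0.1):
    """Stage 2: fine-tune from (simulated) LLM feedback on retrieval results."""
    for q in questions:
        doc = retrieve(q, kb, weights)
        reward = llm_feedback(q, doc)  # e.g., +1 if the LLM judges doc relevant
        for w in embed(doc):
            weights[w] = weights.get(w, 1.0) + lr * reward
```

The point of the sketch is the training structure, not the retriever itself: stage 1 needs no human annotations because pseudo-questions come with known positives, and stage 2 replaces missing relevance labels with an LLM's judgment of the retrieved result.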
Type
Publication
Proc. Asian Conference on Pattern Recognition (ACPR 2025)