A Needle in a Haystack: Finding Contextual Knowledge for Video Question Answering

Nov 1, 2025 · Shunsuke Ichimiya, Yuta Nakashima
Abstract
We present a new video question answering (VideoQA) task, which we coin MAGQA, built on an existing VideoQA dataset and requiring the retrieval of relevant information from a given knowledge base. This task can be seen as a proxy for question answering about a person's daily life, where a historical record of their activities is available, e.g., as an egocentric video or vlog. Retrieval from the knowledge base is key to the task, yet annotations for training the retriever are practically unavailable. We propose a method, called MAGNet, with two-stage training that leverages large language models (LLMs) to address this task. The first stage pre-trains the knowledge base retriever with pseudo-questions. The second stage further fine-tunes the retriever using feedback from LLMs on the retrieval results. Our experimental results show that accurate knowledge retrieval is crucial in MAGQA, and that MAGNet achieves state-of-the-art performance even in the absence of annotations for retriever training.
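The two-stage training described above can be sketched in miniature. The following is a toy illustration, not the paper's implementation: the bag-of-words "retriever", the pseudo-question generator, and the feedback function are all hypothetical stand-ins for the learned retriever and the LLM judge used by MAGNet.

```python
from collections import Counter

def embed(text):
    """Bag-of-words vector (toy stand-in for a learned encoder)."""
    return Counter(text.lower().split())

def score(q_vec, d_vec, weights):
    """Weighted term overlap between a question and a knowledge-base entry."""
    return sum(weights.get(w, 1.0) * q_vec[w] * d_vec.get(w, 0) for w in q_vec)

def retrieve(question, kb, weights):
    """Return the highest-scoring knowledge-base entry for the question."""
    q_vec = embed(question)
    return max(kb, key=lambda d: score(q_vec, embed(d), weights))

def pretrain(kb, weights, lr=0.1):
    """Stage 1: pre-train with pseudo-questions generated from the KB itself,
    so each pseudo-question's source entry is a known positive."""
    for entry in kb:
        pseudo_q = " ".join(entry.split()[:3])  # crude pseudo-question generator
        for w in set(embed(pseudo_q)) & set(embed(entry)):
            weights[w] = weights.get(w, 1.0) + lr  # up-weight matched terms

def finetune(questions, kb, weights, llm_feedback, lr=0.1):
    """Stage 2: fine-tune from (simulated) LLM feedback on retrieval results."""
    for q in questions:
        doc = retrieve(q, kb, weights)
        reward = llm_feedback(q, doc)  # e.g., +1 if the LLM judges doc relevant
        for w in embed(doc):
            weights[w] = weights.get(w, 1.0) + lr * reward
```

The point of the sketch is the training structure, not the retriever itself: stage 1 needs no human annotations because pseudo-questions come with known positives, and stage 2 replaces missing relevance labels with an LLM's judgment of the retrieved result.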
Type
Publication
Proc. Asian Conference on Pattern Recognition (ACPR 2025)