A Needle in a Haystack: Finding Contextual Knowledge for Video Question Answering

November 1, 2025

Shunsuke Ichimiya

Yuta Nakashima

Abstract

We present a new video question answering (VideoQA) task, which we call MAGQA. It is built on an existing VideoQA dataset and requires retrieving relevant information from a given knowledge base. The task can be seen as a proxy for question answering about a person's daily life, where a historical record of their activities is available, e.g., as an egocentric video or vlog. Retrieval from the knowledge base is the key to the task, yet annotations for training the retriever are practically unavailable. To address this, we propose MAGNet, a method with two-stage training that leverages large language models (LLMs). The first stage pre-trains the knowledge base retriever with pseudo-questions. The second stage fine-tunes the retriever using feedback from LLMs on the retrieval results. Our experimental results show that accurate knowledge retrieval is crucial in MAGQA, and that MAGNet achieves state-of-the-art performance even without annotations for retriever training.

Type

Conference paper

Publication

Proc. of the Asian Conference on Pattern Recognition (ACPR2025)

Last updated November 1, 2025
