A Needle in a Haystack: Finding Contextual Knowledge for Video Question Answering
Abstract
We present MAGQA, a new video question answering (VideoQA) task that is built on an existing VideoQA dataset and requires retrieving relevant information from a given knowledge base. This task can be seen as a proxy for question answering about one’s daily life, where a historical record of one’s activities is available, e.g., as an egocentric video or vlog. Retrieval from the knowledge base is the key to the task, yet annotations for training the retriever are practically unavailable. To address this task, we propose MAGNet, a method with two-stage training that leverages large language models (LLMs). The first stage pre-trains the knowledge base retriever with pseudo-questions. The second stage fine-tunes the retriever using feedback from LLMs on the retrieval results. Our experimental results show that accurate knowledge retrieval is crucial in MAGQA, and that MAGNet achieves state-of-the-art performance even in the absence of annotations for retriever training.
Type
Published in
Proc. of the Asian Conference on Pattern Recognition (ACPR 2025)