Vision and Language
Since the advent of deep learning, Vision and Language, the field of research that combines vision with natural language, has become one of the central topics in both computer vision and natural language processing. Understanding the meaning of images and videos, and being able to express that meaning in natural language, are considered closely related abilities. Below, we introduce several examples of our laboratory's work (parts of this page were translated with the help of ChatGPT).
Explain Me the Painting: Generating Descriptions for Paintings
Have you ever looked at a painting and wondered, “What kind of story lies behind this work?” In this research, we propose a framework for generating descriptions of fine art paintings in order to deepen the understanding of artworks and make art more accessible to people. Even with current AI technology, generating informative descriptions for artworks remains difficult. This is because doing so requires understanding and describing multiple aspects of a work, such as its style, content, and composition, while also adding knowledge about the painter, artistic influences, and historical background.
In this work, we introduce a multi-topic, knowledge-based framework. The framework structures the generated text around three artistic topics and further enriches each description by leveraging external knowledge. It achieved strong results in quantitative, qualitative, and human comparative evaluations, in terms of both topic diversity and informational accuracy.
For details and code, please visit this page.
The Problem of Spurious Correlations in Evaluating Moment Retrieval Performance
Moment retrieval with natural language queries is the task of identifying and extracting the segment of a video that corresponds to a given query. Because it requires understanding the semantics of both natural language and video, it is an extremely challenging task. As with many other tasks in computer vision and machine learning, progress in moment retrieval has been supported by benchmark datasets, and therefore the quality of those datasets has a major impact on the entire research community working on this task.
In moment retrieval, as in many other tasks, various models have been proposed, and benchmark rankings have been continuously updated. In this study, we experimentally examine how accurately these benchmark results reflect the true capabilities of the models. If a benchmark fails to evaluate models properly, that is a serious problem. Our experimental results revealed that widely used benchmark datasets contain substantial biases, and that the state-of-the-art models at the time behaved in ways suggesting they were exploiting those biases.
In addition, this study proposes new sanity-check experiments and approaches for visually understanding the results, as well as directions for improving the evaluation methodology of moment retrieval.
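To make the idea of such a sanity check concrete, here is a toy sketch (not the experiments from the paper) of a "blind" baseline: it ignores the video entirely and always predicts the most frequent ground-truth segment seen in training. If a baseline like this scores well under a standard metric such as Recall@1 at IoU ≥ 0.5, the benchmark likely contains exploitable biases. All data and names here are illustrative.

```python
from collections import Counter

def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def blind_baseline(train_segments):
    """Predict the most common training segment, ignoring query and video."""
    return Counter(train_segments).most_common(1)[0][0]

def recall_at_1(pred, test_segments, iou_threshold=0.5):
    """Fraction of test annotations the single prediction matches."""
    hits = sum(temporal_iou(pred, gt) >= iou_threshold for gt in test_segments)
    return hits / len(test_segments)

# Toy data: many annotations cluster at the start of the video,
# a bias that has been reported for real moment retrieval datasets.
train = [(0.0, 10.0)] * 7 + [(20.0, 30.0)] * 3
test = [(0.0, 10.0), (0.0, 12.0), (25.0, 30.0), (0.0, 9.0)]

pred = blind_baseline(train)               # (0.0, 10.0), no video needed
print(recall_at_1(pred, test))             # scores 0.75 despite seeing nothing
```

A model that never looks at the video "solving" most of the test set is exactly the kind of red flag such sanity checks are designed to surface.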
For details and code, please visit this page.
A Dataset for Question Answering about Paintings
Answering questions about artworks, especially paintings, is a difficult challenge for artificial intelligence. This is because, in many cases, answering a question about a painting requires not only understanding the visual information depicted in the work, but also understanding the contextual knowledge about the painting that is acquired through the study of art history.
In this study, as an initial attempt toward constructing a new dataset for question answering about art, we introduce a dataset called AQUA (Art QUestion Answering). The question-answer (QA) pairs in this dataset are automatically generated using state-of-the-art question generation techniques based on paintings and comments contained in existing art understanding datasets. The generated QA pairs are then cleaned through crowdsourcing according to criteria such as grammatical correctness, answerability of the question, and correctness of the generated answer, resulting in a high-quality dataset. This dataset includes both visual questions (based on paintings) and knowledge-based questions (based on comments).
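The crowdsourced cleaning step can be pictured as a simple filter over the auto-generated pairs. The sketch below is illustrative only (field names and data are hypothetical, not the AQUA schema): each candidate QA pair is kept only if workers judged it grammatical, answerable, and correctly answered.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    source: str          # "visual" (from the painting) or "knowledge" (from a comment)
    grammatical: bool    # crowdworker judgment: is the question well-formed?
    answerable: bool     # crowdworker judgment: can the question be answered?
    answer_correct: bool # crowdworker judgment: is the generated answer right?

def clean(pairs):
    """Keep only QA pairs that pass all three quality criteria."""
    return [p for p in pairs if p.grammatical and p.answerable and p.answer_correct]

candidates = [
    QAPair("What animal is shown?", "a horse", "visual", True, True, True),
    QAPair("Who painted this work?", "unknown", "knowledge", True, False, False),
]
print(len(clean(candidates)))  # 1: the second pair fails the checks
```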
Furthermore, we also propose baseline models that independently handle visual questions and knowledge-based questions. In this work, we compare these baseline models with state-of-the-art models in visual question answering and comprehensively discuss the challenges and future possibilities of question answering in the domain of art.
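The two-branch idea above can be sketched as a router plus two answering modules. This is a deliberately naive toy (keyword routing, stub modules), not the paper's baseline: a real system would use learned models for both the routing decision and the answers.

```python
def route(question):
    """Naive router: questions mentioning visual cues go to the visual
    branch; everything else defaults to the knowledge branch."""
    visual_cues = ("color", "colour", "depicted", "shown", "wearing")
    q = question.lower()
    return "visual" if any(cue in q for cue in visual_cues) else "knowledge"

def answer(question, visual_module, knowledge_module):
    """Dispatch the question to the branch chosen by the router."""
    branch = route(question)
    module = visual_module if branch == "visual" else knowledge_module
    return branch, module(question)

# Stub modules standing in for a VQA model and a comment-based reader.
visual = lambda q: "a horse"
knowledge = lambda q: "the Renaissance"

print(answer("What color is the sky?", visual, knowledge))
print(answer("Which period does this painting belong to?", visual, knowledge))
```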