MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Subtle Clue Dynamics in Video Dialogues

October 1, 2024 ·
Liyun Zhang, Zhaojie Luo, Shuqiong Wu, Yuta Nakashima
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable multimodal emotion recognition capabilities, integrating cues from the visual, acoustic, and linguistic contexts of a video to recognize human emotional states. However, existing methods have two limitations: they do not capture the dynamics of subtle local clues in facial features, and they do not leverage the contextual dependencies among utterance-aware temporal segments of a video. These shortcomings limit their effectiveness. In this work, we propose MicroEmo, a time-sensitive MLLM that directs attention to local facial clue dynamics and to the contextual dependencies of utterance-aware video clips. Our model makes two key architectural contributions: (1) a global-local attention visual encoder that integrates global frame-level, timestamp-bound image features with local facial features that capture the temporal dynamics of subtle clues; and (2) an utterance-aware video Q-Former that captures multi-scale and contextual dependencies by generating visual token sequences for each utterance segment and for the entire video, then combining them. Experiments on a new Explainable Multimodal Emotion Recognition (EMER) task, which exploits multimodal and multi-faceted clues to predict emotions in an open-vocabulary (OV) manner, show that MicroEmo is effective compared with the latest methods.
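The multi-scale idea behind the utterance-aware video Q-Former (one set of visual tokens per utterance segment, plus tokens for the whole video, then combined) can be illustrated with a toy sketch. This is not the authors' implementation: mean pooling stands in for the learned Q-Former, each scale yields a single token rather than a token sequence, and the names `mean_pool` and `utterance_aware_tokens` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(frames: Sequence[Vector]) -> Vector:
    """Average frame features. Assumption: a stand-in for a learned Q-Former."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]


def utterance_aware_tokens(
    frame_feats: Sequence[Vector],
    frame_times: Sequence[float],
    utterances: Sequence[Tuple[float, float]],
) -> List[Vector]:
    """Return one token per utterance segment followed by one global token."""
    tokens: List[Vector] = []
    for start, end in utterances:
        # Keep the frames whose timestamp falls inside this utterance.
        clip = [f for f, t in zip(frame_feats, frame_times) if start <= t < end]
        if clip:  # skip utterances that contain no sampled frame
            tokens.append(mean_pool(clip))
    # The global token summarizes the entire video.
    tokens.append(mean_pool(frame_feats))
    return tokens


# Four frames sampled at 1-second intervals, two utterances of 2 seconds each.
feats = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [6.0, 8.0]]
times = [0.0, 1.0, 2.0, 3.0]
segments = [(0.0, 2.0), (2.0, 4.0)]
tokens = utterance_aware_tokens(feats, times, segments)
print(tokens)  # [[1.0, 3.0], [5.0, 7.0], [3.0, 5.0]]
```

In this sketch the combination step is a simple concatenation of segment-level and video-level tokens; the abstract does not specify how MicroEmo combines them, so that choice is an assumption made for illustration.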
Published in
Proc. 2nd International Workshop on Multimodal and Responsible Affective Computing