🌏 WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

¹KAIST ²Nanyang Technological University ³DeepAuto.ai
* : equal contribution, † : equal advising
Concept figure
  1. A day-long video sampled at 1 fps yields far more frames than the context limits of video LLMs can accommodate.
  2. M3-Agent [1] relies on a textual representation of the video, which can underrepresent visual information.
  3. EgoRAG [2] retrieves both captions and the corresponding visual frames, but irrelevant frames may distract the model.
  4. WorldMM (Ours) constructs multiple memories, incorporating both textual and visual representations, and uses adaptive memory retrieval to effectively leverage multimodal information.

Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations.

To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.

Method

Method Figure

Multimodal Memory Construction

WorldMM builds a unified multimodal memory system that captures complementary aspects of long videos, enabling reasoning across diverse modalities and timescales (sketched below):

  • Episodic Memory: Multi-scale textual event graphs built from captions of fine-to-coarse video segments, allowing both detailed and long-range temporal understanding.
  • Semantic Memory: A continuously updated knowledge graph that accumulates high-level relationships and habits across the full video.
  • Visual Memory: A hybrid visual store combining feature embeddings for keyword-based semantic search and timestamped frames for precise visual grounding.
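
To make the structure of these stores concrete, here is a minimal Python sketch. The class and field names (EpisodicEvent, SemanticFact, VisualEntry, MultimodalMemory) are our assumptions for illustration, not the released implementation, and the captioning, embedding, and graph-extraction models that would populate the stores are omitted.

# Hypothetical data structures illustrating WorldMM's three memory stores;
# names and fields are assumptions made for this sketch, not the paper's code.
from dataclasses import dataclass, field

@dataclass
class EpisodicEvent:
    scale: str          # e.g. "clip", "segment", "hour" (fine-to-coarse)
    start: float        # start timestamp in seconds
    end: float          # end timestamp in seconds
    caption: str        # textual description of the event

@dataclass
class SemanticFact:
    subject: str        # e.g. "user"
    relation: str       # e.g. "usually_wipes_kitchenware_with"
    obj: str            # e.g. "paper towel"
    support: int = 1    # how many episodes support this fact

@dataclass
class VisualEntry:
    timestamp: float            # frame timestamp in seconds
    embedding: list[float]      # feature vector for semantic search
    frame_path: str             # pointer to the stored frame

@dataclass
class MultimodalMemory:
    episodic: list[EpisodicEvent] = field(default_factory=list)
    semantic: dict[tuple[str, str], SemanticFact] = field(default_factory=dict)
    visual: list[VisualEntry] = field(default_factory=list)

    def add_episode(self, scale, start, end, caption):
        self.episodic.append(EpisodicEvent(scale, start, end, caption))

    def update_semantic(self, subject, relation, obj):
        key = (subject, relation)
        fact = self.semantic.get(key)
        if fact and fact.obj == obj:
            fact.support += 1    # reinforce a recurring pattern across episodes
        else:
            self.semantic[key] = SemanticFact(subject, relation, obj)

    def add_frame(self, timestamp, embedding, frame_path):
        self.visual.append(VisualEntry(timestamp, embedding, frame_path))

In this view, the semantic store reinforces a fact each time a new episode supports it, which is how repeated behaviors accumulate into habitual, high-level knowledge.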

Adaptive Memory Retrieval

The retrieval agent iteratively decides which memory to access and what query to issue, conditioned on the user question and retrieval history. At each step, it evaluates the information gathered so far and identifies what additional evidence is needed. It then retrieves from the selected memory using a targeted search query. Through successive iterations, the agent progressively refines its retrieval strategy and accumulates a more precise and comprehensive set of knowledge across diverse multimodal memories.
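
The loop below is a minimal sketch of this procedure, assuming hypothetical decide and search callables that stand in for the LLM-based decision step and the per-memory search routines; the dictionary action format is our assumption, not the paper's interface.

# Minimal sketch of the adaptive retrieval loop; `decide` and `search` are
# hypothetical stand-ins for the LLM decision step and per-memory search.
def adaptive_retrieve(question, memory, decide, search, max_steps=5):
    """Iteratively pick a memory source and query until evidence suffices."""
    history = []                                # (source, query, results) per step
    for _ in range(max_steps):
        # The decision step inspects the question and what has been gathered so far,
        # then either stops or names the next memory source and search query.
        action = decide(question, history)      # e.g. {"stop": False, "source": "visual", "query": "baked item on tray"}
        if action.get("stop"):
            break
        results = search(memory, action["source"], action["query"])
        history.append((action["source"], action["query"], results))
    return history

Bounding the number of iterations keeps retrieval cost predictable; the stop signal corresponds to the agent judging that sufficient information has been gathered.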

Response Generation

Once sufficient information is collected, a response agent synthesizes the retrieved evidence and the reasoning history to produce a final, grounded answer.
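
As a rough sketch under the same assumptions, the response step can be viewed as formatting the question and the accumulated retrieval history into a grounded prompt for an answering model; the template below is illustrative, not the paper's exact prompt.

# Illustrative response step; the prompt template is an assumption and
# `answer_model` stands in for the underlying video LLM.
def generate_answer(question, history, answer_model):
    evidence = "\n".join(
        f"[{source} | query: {query}] {results}" for source, query, results in history
    )
    prompt = (
        "Answer the question using the retrieved evidence.\n"
        f"Question: {question}\nEvidence:\n{evidence}\nAnswer:"
    )
    return answer_model(prompt)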

Case Study

Case Study Figure
  1. Episodic memory alone is often insufficient for capturing the detailed visual context required for accurate reasoning. When the model relies solely on episodic memory, it overlooks fine-grained object attributes, such as the specific type of baked item, which leads to incorrect predictions. By contrast, the retrieval agent can dynamically query visual memory to fetch the corresponding video frames. By incorporating this visual context, the model can precisely interpret objects, activities, and their fine-grained characteristics.
  2. Episodic memory also struggles to represent patterns that extend beyond an individual event, failing to capture habitual behaviors such as what is regularly used to wipe kitchenware. The retrieval agent addresses this limitation by dynamically retrieving from semantic memory, which encodes long-term, repeated behaviors accumulated across episodes. This access to cumulative, habitual knowledge allows the model to perform more robust long-term reasoning, even when individual episodes lack explicit evidence.

Together, these two cases highlight the complementary roles of the multimodal memories: visual memory provides perceptual detail, while semantic memory encodes high-level knowledge about relationships and habits. The retrieval agent's ability to dynamically draw on these memories enables more accurate and contextually grounded reasoning than episodic memory alone.

Results

Main Results Table

Table 1. Performance of WorldMM and various baselines across long video QA benchmarks.


Memory Type Ablation Table

Table 2. Performance of WorldMM across multiple benchmarks using different memory types. E, S, and V denote episodic, semantic, and visual memories, respectively. Combinations with “+” indicate multiple memory types are used.

BibTeX

@article{yeo2025worldmm,
  title   = {WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning},
  author  = {Yeo, Woongyeong and Kim, Kangsan and Yoon, Jaehong and Hwang, Sung Ju},
  journal = {arXiv preprint arXiv:2512.02425},
  year    = {2025}
}