Skip to content

Performance

MIRIX delivers exceptional performance through intelligent memory consolidation, optimized search algorithms, and efficient data processing.

Experimental Setup

We evaluate MIRIX on two comprehensive datasets to demonstrate its superior performance across different scenarios:

LOCOMO Dataset

Following Mem0, we use the LOCOMO dataset for vertical comparison between MIRIX and existing memory systems. LOCOMO contains:

  • 10 conversations with 600 dialogues each
  • 26,000 tokens on average per conversation
  • 200 questions per conversation across multiple categories:
  • Single-hop questions
  • Multi-hop questions
  • Temporal questions
  • Open-domain questions

ScreenshotVQA Dataset

We collected a novel dataset containing three PhD students' computer activities:

  • Student 1: 5,886 screenshots (1 day, heavy usage)
  • Student 2: 18,178 screenshots (3 weeks, moderate usage)
  • Student 3: 5,349 screenshots (6 weeks, light usage)
  • Total questions: 87 manually created and verified questions

Screenshots were captured every second, with duplicate filtering (similarity > 0.99).

Evaluation Metrics

  • LLM-as-a-Judge: Using GPT-4.1 to evaluate response quality
  • Accuracy: Percentage of correctly answered questions
  • Storage: Memory footprint comparison

Experimental Results

LOCOMO Dataset Results

Method Single Hop Multi-Hop Open Domain Temporal Overall
GPT-4o-mini backbone
A-Mem 39.79 18.85 54.05 49.91 48.38
LangMem 62.23 47.92 71.12 23.43 58.10
OpenAI 63.79 42.92 62.29 21.71 52.90
Mem0 67.13 51.15 72.93 55.51 66.88
Mem0ᵍ 65.71 47.19 75.71 58.13 68.44
Memobase 63.83 52.08 71.82 80.37 70.91
Zep 74.11 66.04 67.71 79.76 75.14
GPT-4.1-mini backbone
LangMem 74.47 61.06 67.71 86.92 78.05
RAG-500 37.94 37.69 48.96 61.83 51.62
Zep 79.43 69.16 73.96 83.33 79.09
Mem0 62.41 57.32 44.79 66.47 62.47
MIRIX 85.11 83.70 65.62 88.39 85.38
Full-Context 88.53 77.70 71.88 92.70 87.52

LLM-as-a-Judge scores (%, higher is better) for each question type in the LOCOMO dataset. Full-Context represents the upper-bound performance.

ScreenshotVQA Results

Method Student 1 Student 2 Student 3 Overall
Acc ↑ Storage ↓ Acc ↑ Storage ↓ Acc ↑ Storage ↓ Acc ↑ Storage ↓
Gemini 0.00% 142.10MB 9.52% 438.86MB 25.45% 129.14MB 11.66% 236.70MB
SigLIP@50 36.36% 22.55GB 41.38% 19.88GB 54.55% 2.82GB 44.10% 15.07GB
MIRIX 54.55% 20.57MB 56.67% 19.83MB 67.27% 7.28MB 59.50% 15.89MB

Key Performance Insights

Superior Multi-Hop Reasoning

MIRIX shows the largest improvement in multi-hop questions, outperforming baselines by more than 22 points. This highlights the effectiveness of our hierarchical memory system at retrieving the most relevant information across complex reasoning chains.

Near Full-Context Performance

On single-hop and temporal tasks, MIRIX almost matches full-context performance while using only a small number of retrieved memory snippets. This validates our typed memory storage approach.

Exceptional Efficiency

In ScreenshotVQA, MIRIX achieves: - 59.50% overall accuracy (vs 44.10% for SigLIP@50) - 52.56MB average storage (vs 15.07GB for SigLIP@50) - 285x storage reduction compared to traditional approaches

Scalable Architecture

MIRIX's component-specific memory management scales efficiently across different data types and conversation lengths, maintaining performance as memory grows.

Implementation Details

LOCOMO Configuration

  • Backbone Model: GPT-4.1-mini (superior function calling: 29.75% vs 22.12% for GPT-4o-mini)
  • Memory Architecture: Hierarchical multi-agent system with specialized memory managers
  • Retrieval Strategy: Intelligent routing based on question type and memory content

ScreenshotVQA Configuration

  • Vision Model: Gemini-2.5-flash-preview-04-17
  • Cloud Integration: Asynchronous image processing via Google Cloud
  • Function Calls: 1-7 calls per processing step (1 for meta memory manager, 0-6 for specialized managers)

Comparative Analysis

vs. Traditional RAG Systems: MIRIX overcomes the global understanding limitations of simple RAG through its multi-agent memory architecture.

vs. Full-Context Methods: While maintaining 97% of full-context performance, MIRIX uses dramatically less computational resources and storage.

vs. Existing Memory Systems: MIRIX outperforms all tested memory systems (LangMem, Mem0, Zep, Memobase) across most evaluation categories.

What's Next?

Return to explore other advanced features:

Security & Privacy → Backup & Restore →