Performance
MIRIX combines memory consolidation, optimized search, and efficient data processing. This page reports how it performs on a conversational question-answering benchmark and on a multimodal screenshot benchmark.
Experimental Setup
We evaluate MIRIX on two datasets that cover different scenarios: long text conversations (LOCOMO) and multimodal computer activity (ScreenshotVQA).
LOCOMO Dataset
Following Mem0, we use the LOCOMO dataset to compare MIRIX directly against existing memory systems. LOCOMO contains:
- 10 conversations with 600 dialogues each
- 26,000 tokens on average per conversation
- 200 questions per conversation across multiple categories:
  - Single-hop questions
  - Multi-hop questions
  - Temporal questions
  - Open-domain questions
ScreenshotVQA Dataset
We collected a new dataset of screenshots recording the computer activity of three PhD students:
- Student 1: 5,886 screenshots (1 day, heavy usage)
- Student 2: 18,178 screenshots (3 weeks, moderate usage)
- Student 3: 5,349 screenshots (6 weeks, light usage)
- Total: 87 manually created and verified questions
Screenshots were captured once per second, and near-duplicates (similarity above 0.99) were filtered out.
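The duplicate-filtering step can be sketched as follows. This is a minimal illustration, not the pipeline used to build the dataset: the source does not say which similarity measure was used or which frames were compared, so the sketch assumes cosine similarity over per-frame feature vectors and compares each frame with the most recently kept one. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Two all-zero vectors count as identical; otherwise as dissimilar.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def filter_near_duplicates(
    frames: Sequence[Sequence[float]], threshold: float = 0.99
) -> List[int]:
    """Return the indices of frames kept after dropping near-duplicates.

    Assumption: a frame is dropped when its similarity to the most
    recently kept frame is strictly greater than `threshold`.
    """
    kept: List[int] = []
    for i, frame in enumerate(frames):
        if not kept or cosine_similarity(frames[kept[-1]], frame) <= threshold:
            kept.append(i)
    return kept


# Toy feature vectors standing in for screenshots captured one second apart.
frames = [[1, 0, 0], [1, 0, 0.001], [0, 1, 0], [0, 1, 0], [1, 0, 0]]
kept_indices = filter_near_duplicates(frames)
print(kept_indices)
```

Comparing against the last kept frame, rather than the immediately preceding one, stops a slow drift of tiny changes from being discarded entirely.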
Evaluation Metrics
- LLM-as-a-Judge: Using GPT-4.1 to evaluate response quality
- Accuracy: Percentage of correctly answered questions
- Storage: Memory footprint comparison
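As a rough sketch of how the accuracy metric can be derived from judge verdicts, the snippet below aggregates per-question results by category. It assumes each verdict has already been reduced to a boolean by the judge model; the function name and input shape are hypothetical, and the judge call itself is out of scope here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_category(verdicts: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Percentage of questions judged correct, per category and overall.

    `verdicts` yields (category, is_correct) pairs. The "overall" entry is
    computed over all questions, so larger categories weigh more.
    """
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for category, is_correct in verdicts:
        for key in (category, "overall"):
            totals[key] += 1
            if is_correct:
                correct[key] += 1
    return {key: 100.0 * correct[key] / totals[key] for key in totals}


# Toy verdicts, not taken from the benchmark.
sample = [
    ("single-hop", True),
    ("single-hop", False),
    ("temporal", True),
    ("temporal", True),
]
report = accuracy_by_category(sample)
print(report)
```

Computing the overall score over all questions, rather than averaging the category scores, is an assumption; it matches the results tables in that the Overall column is not the unweighted mean of the category columns.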
Experimental Results
LOCOMO Dataset Results
Method | Single-Hop | Multi-Hop | Open-Domain | Temporal | Overall |
---|---|---|---|---|---|
GPT-4o-mini backbone | | | | | |
A-Mem | 39.79 | 18.85 | 54.05 | 49.91 | 48.38 |
LangMem | 62.23 | 47.92 | 71.12 | 23.43 | 58.10 |
OpenAI | 63.79 | 42.92 | 62.29 | 21.71 | 52.90 |
Mem0 | 67.13 | 51.15 | 72.93 | 55.51 | 66.88 |
Mem0ᵍ | 65.71 | 47.19 | 75.71 | 58.13 | 68.44 |
Memobase | 63.83 | 52.08 | 71.82 | 80.37 | 70.91 |
Zep | 74.11 | 66.04 | 67.71 | 79.76 | 75.14 |
GPT-4.1-mini backbone | | | | | |
LangMem | 74.47 | 61.06 | 67.71 | 86.92 | 78.05 |
RAG-500 | 37.94 | 37.69 | 48.96 | 61.83 | 51.62 |
Zep | 79.43 | 69.16 | 73.96 | 83.33 | 79.09 |
Mem0 | 62.41 | 57.32 | 44.79 | 66.47 | 62.47 |
MIRIX | 85.11 | 83.70 | 65.62 | 88.39 | 85.38 |
Full-Context | 88.53 | 77.70 | 71.88 | 92.70 | 87.52 |
LLM-as-a-Judge scores (%, higher is better) for each question type in the LOCOMO dataset. Full-Context represents the upper-bound performance.
ScreenshotVQA Results
Method | Student 1 Acc ↑ | Student 1 Storage ↓ | Student 2 Acc ↑ | Student 2 Storage ↓ | Student 3 Acc ↑ | Student 3 Storage ↓ | Overall Acc ↑ | Overall Storage ↓ |
---|---|---|---|---|---|---|---|---|
Gemini | 0.00% | 142.10MB | 9.52% | 438.86MB | 25.45% | 129.14MB | 11.66% | 236.70MB |
SigLIP@50 | 36.36% | 22.55GB | 41.38% | 19.88GB | 54.55% | 2.82GB | 44.10% | 15.07GB |
MIRIX | 54.55% | 20.57MB | 56.67% | 19.83MB | 67.27% | 7.28MB | 59.50% | 15.89MB |
Key Performance Insights
Superior Multi-Hop Reasoning
MIRIX shows its largest improvement on multi-hop questions, scoring 83.70 against 69.16 for the strongest baseline on the same backbone (Zep) and 61.06 for LangMem, margins of more than 14 and 22 points respectively. This suggests that our hierarchical memory system is effective at retrieving the most relevant information across complex reasoning chains.
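The multi-hop margins can be checked directly against the LOCOMO table. The snippet below is a small worked calculation using the GPT-4.1-mini rows only; the variable names are illustrative.

```python
# Multi-hop LLM-as-a-Judge scores (%) copied from the LOCOMO results table,
# GPT-4.1-mini backbone rows only.
multi_hop = {
    "LangMem": 61.06,
    "RAG-500": 37.69,
    "Zep": 69.16,
    "Mem0": 57.32,
    "Full-Context": 77.70,
}
mirix_multi_hop = 83.70

# Margin of MIRIX over each method, in percentage points.
margins = {
    name: round(mirix_multi_hop - score, 2) for name, score in multi_hop.items()
}

for name, margin in sorted(margins.items(), key=lambda item: item[1]):
    print(f"MIRIX vs {name}: +{margin:.2f} points")
```

Multi-hop is also the one category in the table where MIRIX scores above the Full-Context row.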
Near Full-Context Performance
On single-hop and temporal questions, MIRIX comes close to full-context performance (85.11 vs 88.53 and 88.39 vs 92.70) while using only a small number of retrieved memory snippets. This supports our typed memory storage approach.
Exceptional Efficiency
In ScreenshotVQA, MIRIX achieves:
- 59.50% overall accuracy (vs 44.10% for SigLIP@50)
- 15.89MB average storage (vs 15.07GB for SigLIP@50), as reported in the results table
- A storage reduction of roughly 950x (about 99.9%) compared with SigLIP@50
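The storage comparison can be reproduced from the ScreenshotVQA table. The sketch below recomputes the MIRIX average from the three per-student figures and derives reduction factors against both baselines; it assumes decimal units (1GB = 1000MB), which the source does not specify.

```python
# Storage figures copied from the ScreenshotVQA results table.
mirix_per_student_mb = [20.57, 19.83, 7.28]
siglip_overall_gb = 15.07
gemini_overall_mb = 236.70

MB_PER_GB = 1000  # assumption: decimal units; use 1024 for binary units

mirix_avg_mb = sum(mirix_per_student_mb) / len(mirix_per_student_mb)


def reduction_factor(baseline_mb: float, method_mb: float) -> float:
    """How many times smaller `method_mb` is than `baseline_mb`."""
    return baseline_mb / method_mb


siglip_factor = reduction_factor(siglip_overall_gb * MB_PER_GB, mirix_avg_mb)
gemini_factor = reduction_factor(gemini_overall_mb, mirix_avg_mb)
gemini_saving_pct = 100.0 * (1.0 - mirix_avg_mb / gemini_overall_mb)

print(f"MIRIX average storage: {mirix_avg_mb:.2f} MB")
print(f"Reduction vs SigLIP@50: {siglip_factor:.0f}x")
print(f"Reduction vs Gemini: {gemini_factor:.1f}x ({gemini_saving_pct:.1f}% less)")
```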
Scalable Architecture
MIRIX's component-specific memory management scales efficiently across different data types and conversation lengths, maintaining performance as memory grows.
Implementation Details
LOCOMO Configuration
- Backbone Model: GPT-4.1-mini (superior function calling: 29.75% vs 22.12% for GPT-4o-mini)
- Memory Architecture: Hierarchical multi-agent system with specialized memory managers
- Retrieval Strategy: Intelligent routing based on question type and memory content
ScreenshotVQA Configuration
- Vision Model: Gemini-2.5-flash-preview-04-17
- Cloud Integration: Asynchronous image processing via Google Cloud
- Function Calls: 1-7 calls per processing step (1 for meta memory manager, 0-6 for specialized managers)
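The 1-7 call budget follows from one meta memory manager call plus one call per specialized manager it selects. The sketch below illustrates only that counting. The manager names, the keyword cues, and the function names are all assumptions made for illustration; a simple keyword lookup stands in for whatever decision the meta memory manager actually makes.

```python
from typing import Dict, List

# Assumption: six specialized managers (names are illustrative, not taken
# from this page), each paired with hypothetical keyword cues.
ROUTING_CUES: Dict[str, List[str]] = {
    "core": ["my name is", "i prefer"],
    "episodic": ["yesterday", "today", "meeting"],
    "semantic": ["is a", "means"],
    "procedural": ["how to", "step"],
    "resource": ["file", "document"],
    "knowledge_vault": ["password", "api key"],
}


def meta_memory_manager(observation: str) -> List[str]:
    """Stand-in for the single meta-manager call: choose managers to invoke."""
    text = observation.lower()
    return [
        name
        for name, cues in ROUTING_CUES.items()
        if any(cue in text for cue in cues)
    ]


def process_step(observation: str) -> Dict[str, object]:
    """Count function calls for one step: 1 meta call + 0-6 specialized calls."""
    selected = meta_memory_manager(observation)
    return {"selected": selected, "function_calls": 1 + len(selected)}


result = process_step("Yesterday I opened the budget file")
print(result)
```

A step that triggers no specialized manager costs a single call, and a step that touches every memory type costs seven, which gives the range quoted above.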
Comparative Analysis
vs. Traditional RAG Systems: MIRIX overcomes the global understanding limitations of simple RAG through its multi-agent memory architecture.
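vs. Full-Context Methods: MIRIX retains about 97% of full-context overall performance on LOCOMO (85.38 vs 87.52) while using far fewer computational resources and much less storage.
vs. Existing Memory Systems: MIRIX outperforms all tested memory systems (LangMem, Mem0, Zep, Memobase) on single-hop, multi-hop, and temporal questions and on the overall score. Open-domain questions are the exception, where several baselines score higher.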
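The comparison with full-context and with the other memory systems can be verified from the LOCOMO table. The snippet below is a worked check restricted to the GPT-4.1-mini rows, so Memobase (reported only with GPT-4o-mini) is left out; the variable names are illustrative.

```python
# LLM-as-a-Judge scores (%) copied from the LOCOMO table, GPT-4.1-mini rows.
CATEGORIES = ["Single Hop", "Multi-Hop", "Open Domain", "Temporal", "Overall"]
scores = {
    "LangMem": [74.47, 61.06, 67.71, 86.92, 78.05],
    "Zep": [79.43, 69.16, 73.96, 83.33, 79.09],
    "Mem0": [62.41, 57.32, 44.79, 66.47, 62.47],
    "MIRIX": [85.11, 83.70, 65.62, 88.39, 85.38],
    "Full-Context": [88.53, 77.70, 71.88, 92.70, 87.52],
}

# Share of full-context overall performance retained by MIRIX.
overall = CATEGORIES.index("Overall")
retained_pct = 100.0 * scores["MIRIX"][overall] / scores["Full-Context"][overall]

# Categories where MIRIX beats every other memory system on this backbone.
memory_systems = ["LangMem", "Zep", "Mem0"]
mirix_leads = [
    category
    for i, category in enumerate(CATEGORIES)
    if all(scores["MIRIX"][i] > scores[name][i] for name in memory_systems)
]

print(f"Retained vs Full-Context: {retained_pct:.1f}%")
print("MIRIX leads in:", mirix_leads)
```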