A pluggable benchmarking framework for evaluating memory and context systems.
- π Interoperable: mix and match any provider with any benchmark
- π§© Bring your own benchmarks: plug in custom datasets and tasks
- β»οΈ Checkpointed runs: resume from any pipeline stage (ingest β index β search β answer β evaluate)
- π Multiβprovider comparison: run the same benchmark across providers sideβbyβside
- π§ͺ Judgeβagnostic: swap GPTβ4o, Claude, Gemini, etc. without code changes
- π Structured reports: export run status, failures, and metrics for analysis
- π₯οΈ Web UI: inspect runs, questions, and failures interactively, in real-time!
βββββββββββββββ βββββββββββββββ βββββββββββββββ
β Benchmarks β β Providers β β Judges β
β (LoCoMo, β β (Supermem, β β (GPT-4o, β
β LongMem..) β β Mem0, Zep) β β Claude..) β
ββββββββ¬βββββββ ββββββββ¬βββββββ ββββββββ¬βββββββ
ββββββββββββββββββββΌβββββββββββββββββββ
βΌ
βββββββββββββββββββββββββ
β MemoryBench β
βββββββββββββ¬ββββββββββββ
βΌ
ββββββββββ¬ββββββββββ¬βββββββββ¬βββββββββββ¬βββββββββ
β Ingest β Indexingβ Search β Answer βEvaluateβ
ββββββββββ΄ββββββββββ΄βββββββββ΄βββββββββββ΄βββββββββ
bun install
cp .env.example .env.local # Add your API keys
bun run src/index.ts run -p supermemory -b locomo# Providers (at least one)
SUPERMEMORY_API_KEY=
MEM0_API_KEY=
ZEP_API_KEY=
# Judges (at least one)
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GOOGLE_API_KEY=| Command | Description |
|---|---|
run |
Full pipeline: ingest β index β search β answer β evaluate β report |
compare |
Run benchmark across multiple providers simultaneously |
ingest |
Ingest benchmark data into provider |
search |
Run search phase only |
test |
Test single question |
status |
Check run progress |
list-questions |
Browse benchmark questions |
show-failures |
Debug failed questions |
serve |
Start web UI |
help |
Show help (help providers, help models, help benchmarks) |
-p, --provider Memory provider (supermemory, mem0, zep)
-b, --benchmark Benchmark (locomo, longmemeval, convomem)
-j, --judge Judge model (gpt-4o, sonnet-4, gemini-2.5-flash, etc.)
-r, --run-id Run identifier (auto-generated if omitted)
-m, --answering-model Model for answer generation (default: gpt-4o)
-l, --limit Limit number of questions
-q, --question-id Specific question (for test command)
--force Clear checkpoint and restart
# Full run
bun run src/index.ts run -p mem0 -b locomo
# With custom run ID
bun run src/index.ts run -p mem0 -b locomo -r my-test
# Resume existing run
bun run src/index.ts run -r my-test
# Limited questions
bun run src/index.ts run -p supermemory -b locomo -l 10
# Different models
bun run src/index.ts run -p zep -b longmemeval -j sonnet-4 -m gemini-2.5-flash
# Compare multiple providers
bun run src/index.ts compare -p supermemory,mem0,zep -b locomo -s 5
# Test single question
bun run src/index.ts test -r my-test -q question_42
# Debug
bun run src/index.ts status -r my-test
bun run src/index.ts show-failures -r my-test1. INGEST Load benchmark sessions β Push to provider
2. INDEX Wait for provider indexing
3. SEARCH Query provider β Retrieve context
4. ANSWER Build prompt β Generate answer via LLM
5. EVALUATE Compare to ground truth β Score via judge
6. REPORT Aggregate scores β Output accuracy + latency
Each phase checkpoints independently. Failed runs resume from last successful point.
MemScore is a composite metric that captures three dimensions of provider performance in a single line:
accuracy% / latencyMs / contextTokens
| Component | What it measures |
|---|---|
| Quality | Answer accuracy β (correct / total) * 100 from judge evaluations |
| Latency | Average search response time in milliseconds |
| Tokens | Average context tokens sent to the answering model (counted client-side) |
After a run completes, MemScore appears in the CLI summary:
Summary:
Total Questions: 50
Correct: 43
Accuracy: 86.00%
MemScore: 86% / 145ms / 1823tok
MemScore is intentionally a triple, not a single number β collapsing quality, latency, and cost into one score hides important tradeoffs. Use it to compare providers side-by-side on the same benchmark:
bun run src/index.ts compare -p supermemory,mem0,zep -b locomo -j gpt-4oThe report.json includes both a display string and structured memscoreComponents for programmatic use.
Runs persist to data/runs/{runId}/:
checkpoint.json- Run state and progressresults/- Search results per questionreport.json- Final report
Re-running same ID resumes. Use --force to restart.
| Component | Guide |
|---|---|
| Add Provider | src/providers/README.md |
| Add Benchmark | src/benchmarks/README.md |
| Add Judge | src/judges/README.md |
| Project Structure | src/README.md |
MIT