LLM benchmarking leaderboard that evaluates how well AI models understand Appwrite services. Compare model performance with and without skill file context across 70 questions spanning 7 Appwrite product categories.
Live at arena.appwrite.network
The benchmark tests leading AI models on their knowledge of Appwrite through two modes:
- With skills: Models receive comprehensive Appwrite documentation as context
- Without skills: Models answer based solely on their training data
Questions are split into 57 multiple-choice (auto-scored) and 13 free-form (AI-judged by Claude Sonnet 4.6) across these categories:
| Category | Topics |
|---|---|
| Foundation | Core Appwrite concepts and architecture |
| Auth | Authentication, users, teams, OAuth |
| TablesDB | Tables, rows, queries, permissions |
| Functions | Serverless functions, runtimes, triggers |
| Storage | File uploads, buckets, previews |
| Sites | Web hosting and deployment |
| Messaging | Email, SMS, push notifications |
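As a rough illustration of the auto-scored path for multiple-choice questions, the sketch below compares the leading choice letter of a model reply against the expected answer. The `MultipleChoiceQuestion` shape and the `scoreMultipleChoice` helper are hypothetical names chosen for this example; the real question format lives in `benchmark/src/questions/` and may differ. It also assumes the model was asked to reply with a choice letter first.

```typescript
// Hypothetical question shape, for illustration only.
type Category =
  | "Foundation" | "Auth" | "TablesDB" | "Functions"
  | "Storage" | "Sites" | "Messaging";

interface MultipleChoiceQuestion {
  id: string;
  category: Category;
  prompt: string;
  choices: Record<string, string>; // e.g. { A: "...", B: "..." }
  answer: string;                  // letter of the correct choice
}

// Returns true when the reply starts with the correct choice letter.
// Accepts forms such as "B", "b) ...", or "(B) ..."; a reply that starts
// with a longer word (e.g. "Buckets") is treated as not answering.
function scoreMultipleChoice(q: MultipleChoiceQuestion, reply: string): boolean {
  const match = reply.trim().match(/^\(?([A-Za-z])\)?(?![A-Za-z])/);
  return match !== null && match[1].toUpperCase() === q.answer.toUpperCase();
}
```

Free-form answers cannot be checked this way, which is why they are handed to an AI judge instead.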
- Claude Opus 4.7 - Anthropic
- GPT 5.5 - OpenAI
- Gemini 3.1 Pro (Preview) and Gemini 3.1 Flash Lite (Preview) - Google
- Grok 4.3 - xAI
- DeepSeek V4 Flash - DeepSeek
- Qwen 3.6 Plus - Alibaba
- GLM 5.1 - Zhipu
- MiniMax M2.7 - MiniMax
- Mistral Large 3 2512 - Mistral
- Kimi K2.6 - MoonshotAI
All models are accessed via OpenRouter with temperature set to 0 for deterministic results.
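A minimal sketch of what such a request might look like, assuming OpenRouter's OpenAI-compatible chat completions endpoint. The helper names `buildRequest` and `askModel`, the choice to pass the skill file as a system message, and the way the API key is supplied are assumptions made for this example, not the benchmark's actual code (see `benchmark/src/runner.ts` for that).

```typescript
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Builds the request body. When `skills` is provided (the "with skills" mode),
// the documentation is sent as a system message ahead of the question.
function buildRequest(model: string, question: string, skills?: string) {
  const messages: ChatMessage[] = [];
  if (skills) messages.push({ role: "system", content: skills });
  messages.push({ role: "user", content: question });
  return { model, messages, temperature: 0 };
}

// Sends one question to one model and returns the text of the reply.
async function askModel(
  apiKey: string,
  model: string,
  question: string,
  skills?: string,
): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(buildRequest(model, question, skills)),
  });
  if (!res.ok) throw new Error(`OpenRouter request failed: ${res.status}`);
  const data = (await res.json()) as {
    choices: { message: { content: string } }[];
  };
  return data.choices[0].message.content;
}
```

Keeping the body construction separate from the network call makes it easy to verify that both modes send identical settings apart from the added context.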
Frontend: React, TanStack Start, Tailwind CSS, Vite, TypeScript
Benchmark: Bun, OpenRouter
- Node.js 18+
- Bun (for benchmark scripts and pre-build step)
npm install
npm run dev
The app runs at http://localhost:3000.
npm run build
npm run preview
npm run lint
npm run format
npm run check
npm run test
The benchmark suite lives in the benchmark/ directory and requires an OpenRouter API key.
cd benchmark
cp .env.example .env
# Fill in your API key in .env
# Run both modes
bun run bench:all
# Or run individually
bun run bench:with-skills
bun run bench:without-skills
To minimize cost, the benchmark only fills in data that is missing from the result JSON files. To re-run it against existing results, delete the contents of those JSON files first.
Results are saved to src/data/results-with-skills.json and src/data/results-without-skills.json, which the frontend reads at build time.
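The fill-only-missing behavior can be pictured roughly as follows. This is a sketch under assumptions: the `ResultEntry` shape and the `pendingPairs` helper are hypothetical, and the actual layout of the result JSON files may be different.

```typescript
// Assumed result shape: one entry per (model, question) pair.
interface ResultEntry {
  modelId: string;
  questionId: string;
  correct: boolean;
}

// Returns the (modelId, questionId) pairs that have no stored result yet,
// so only those need to be sent to the API.
function pendingPairs(
  existing: ResultEntry[],
  modelIds: string[],
  questionIds: string[],
): Array<[string, string]> {
  const done = new Set(existing.map((r) => `${r.modelId}::${r.questionId}`));
  const pending: Array<[string, string]> = [];
  for (const modelId of modelIds) {
    for (const questionId of questionIds) {
      if (!done.has(`${modelId}::${questionId}`)) {
        pending.push([modelId, questionId]);
      }
    }
  }
  return pending;
}
```

With an empty result file every pair is pending, which is why clearing the JSON contents forces a full re-run.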
├── src/                    # Frontend application
│   ├── components/         # React UI components
│   ├── routes/             # File-based routes (TanStack Router)
│   ├── data/               # Static benchmark result JSON files
│   └── lib/                # Types, utilities, site config
├── benchmark/              # Benchmark suite
│   ├── src/
│   │   ├── questions/      # 70 questions across 7 categories
│   │   ├── skills/         # Appwrite documentation for context mode
│   │   ├── runner.ts       # Test execution logic
│   │   ├── judge.ts        # AI judge for free-form answers
│   │   └── config.ts       # Model definitions and settings
│   └── package.json
├── scripts/                # Build-time scripts (GitHub stars fetcher)
└── public/                 # Static assets
MIT. See LICENSE.