# Benchmarking
Calor is evaluated against C# across 11 metrics designed to measure what matters for AI coding agents.
## The Metrics
### Static Analysis Metrics
| Category | What It Measures | Why It Matters |
|---|---|---|
| Comprehension | Structural clarity, semantic extractability | Can agents understand code without deep analysis? |
| Error Detection | Bug identification, contract violation detection | Can agents find issues using explicit semantics? |
| Edit Precision | Targeting accuracy, change isolation | Can agents make precise edits using unique IDs? |
| Generation Accuracy | Compilation success, structural correctness | Can agents produce valid code? |
| Token Economics | Tokens required to represent logic | How much context window does code consume? |
| Information Density | Semantic elements per token | How much meaning per token? |
| Refactoring Stability | ID-based reference preservation | Do unique IDs survive code transformations? |
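The edit-precision and refactoring-stability rows both hinge on unique IDs. A minimal Python sketch of the idea (a hypothetical model with invented IDs, not Calor's actual representation): an ID-based reference still resolves after a rename, while a name-based reference goes stale.

```python
# Hypothetical model of ID-based vs name-based references.
# The IDs and names below are invented; Calor's real scheme may differ.

def rename(symbols, old_name, new_name):
    """Rename a declaration in place, keeping its stable ID."""
    for node_id, name in symbols.items():
        if name == old_name:
            symbols[node_id] = new_name

# Symbol table: stable ID -> current name.
symbols = {"fn_0042": "calc_total", "fn_0043": "apply_discount"}

id_ref = "fn_0042"          # ID-based reference (Calor-style)
name_ref = "calc_total"     # name-based reference (text-style)

rename(symbols, "calc_total", "compute_total")

print(symbols[id_ref])               # ID still resolves: compute_total
print(name_ref in symbols.values())  # name reference is now stale: False
```

A text-based agent editing by name must re-locate the symbol after every rename; an agent targeting stable IDs does not.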
### LLM-Based Metrics
These metrics use actual LLM code generation to measure real-world effectiveness:
| Category | What It Measures | Why It Matters |
|---|---|---|
| Task Completion | LLM code generation success | Can AI complete tasks with Calor vs C#? |
| Safety | Contract enforcement effectiveness | Does code catch bugs with informative errors? |
| Effect Discipline | Side effect management, bug prevention | Does code prevent flaky tests, security violations? |
| Correctness | Edge case handling, bug prevention | Does code produce correct results for edge cases? |
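The Safety row can be made concrete with a small Python stand-in for contracts (Calor's real contract syntax is not shown in this document; the decorator below is purely illustrative): a violated precondition fails fast with an informative error rather than returning a silently wrong result.

```python
# Minimal precondition sketch; a stand-in for Calor's contract
# syntax, which is not shown here.

def requires(pred, msg):
    """Precondition decorator: raise an informative error on violation."""
    def wrap(fn):
        def inner(*args):
            if not pred(*args):
                raise ValueError(f"{fn.__name__}: precondition failed: {msg}")
            return fn(*args)
        return inner
    return wrap

@requires(lambda price, rate: 0.0 <= rate <= 1.0, "0 <= rate <= 1")
def apply_discount(price, rate):
    return price * (1.0 - rate)

print(apply_discount(100.0, 0.2))   # 80.0
try:
    apply_discount(100.0, 1.5)      # violates the contract
except ValueError as e:
    print(e)  # named function, named invariant -- not a silent wrong answer
```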
## Summary Results
| Category | Calor vs C# | Winner |
|---|---|---|
| Comprehension | 2.22x | Calor |
| Error Detection | 1.83x | Calor |
| Edit Precision | 1.39x | Calor |
| Refactoring Stability | 1.52x | Calor |
| Correctness | 1.30x | Calor |
| Generation Accuracy | 1.02x | Calor |
| Token Economics | 0.79x | C# |
| Information Density | 1.15x | Calor |
Pattern: Calor wins on comprehension and precision metrics. C# wins on token economics, the one efficiency metric where verbosity is a cost.
## Key Insight
Calor excels where explicitness matters:
- Comprehension (1.51x) - Explicit structure aids understanding
- Error Detection (1.22x) - Contracts surface invariant violations
- Edit Precision (1.37x) - Unique IDs enable targeted changes
- Refactoring Stability (1.36x) - ID-based references survive transformations
- Safety (1.59x) - Contracts catch more bugs with better error messages
- Task Completion (1.22x) - LLMs complete tasks better with Calor's explicit contracts
Calor and C# are equivalent on:
- Correctness (1.0x) - Both languages achieve 100% on edge case handling when benchmarked fairly
- Effect Discipline (1.0x) - Both languages can write deterministic, side-effect-free code when benchmarked fairly
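A minimal Python illustration of the effect-discipline point (hypothetical example; Calor's actual effect annotations are not shown here): a hidden clock read makes a function's result time-dependent and its tests flaky, while passing the clock in explicitly keeps it deterministic.

```python
import datetime

# Hypothetical illustration of effect discipline; not Calor code.

def age_impure(birth_year):
    # Hidden clock read: the result drifts over time, so tests
    # asserting a fixed value eventually go flaky.
    return datetime.date.today().year - birth_year

def age_pure(birth_year, current_year):
    # Effect-free: the clock is an explicit input, so the result
    # is deterministic and trivially testable.
    return current_year - birth_year

print(age_pure(1990, 2024))  # always 34
```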
C# wins on efficiency:
- Token Economics (0.83x) - Calor's explicit syntax uses more tokens
- Generation Accuracy (0.97x) - C# has broader training data
This reflects a fundamental tradeoff: explicit semantics require more tokens but enable better agent reasoning, safer code, and higher task completion rates.
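A rough way to see the token side of that tradeoff (illustrative only: whitespace splitting stands in for a real LLM tokenizer, and both snippets are invented rather than real C# or Calor code):

```python
# Illustrative only: whitespace splitting is a crude stand-in for a
# real LLM tokenizer, and both snippets below are invented.

terse = "int Add(int a, int b) => a + b;"
explicit = ("fn add (a: Int, b: Int) -> Int "
            "ensures result == a + b "
            "{ return a + b }")

def tokens(code):
    return len(code.split())

ratio = tokens(terse) / tokens(explicit)
print(tokens(terse), tokens(explicit), round(ratio, 2))  # 9 20 0.45
```

The explicit form spends extra tokens on the contract, but those tokens carry semantics (the `ensures` clause) that an agent would otherwise have to infer from the body.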
## Agent Task Benchmark
The Agent Task Benchmark tests Claude's ability to generate correct Calor code from natural-language prompts. Because Claude learns Calor syntax from the skill documentation, the pass rate also serves as a measure of that documentation's quality.
Current Pass Rate: 86% across 89 tasks in 17 categories.
## Learn More
- Methodology - How benchmarks work
- Results - Detailed results table
- Agent Tasks - Claude code generation benchmark
- Individual Metrics - Deep dive into each metric