# Benchmarking
Calor is evaluated against C# across 11 metrics designed to measure what matters for AI coding agents.
## The Metrics
### Static Analysis Metrics
| Category | What It Measures | Why It Matters |
|---|---|---|
| Comprehension | Structural clarity, semantic extractability | Can agents understand code without deep analysis? |
| Error Detection | Bug identification, contract violation detection | Can agents find issues using explicit semantics? |
| Edit Precision | Targeting accuracy, change isolation | Can agents make precise edits using unique IDs? |
| Generation Accuracy | Compilation success, structural correctness | Can agents produce valid code? |
| Token Economics | Tokens required to represent logic | How much context window does code consume? |
| Information Density | Semantic elements per token | How much meaning per token? |
| Refactoring Stability | ID-based reference preservation | Do unique IDs survive code transformations? |
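The edit-precision and refactoring-stability rows both hinge on unique IDs. A minimal Python sketch of the idea (a hypothetical model with invented IDs, not Calor's actual representation): an ID-based reference still resolves after a rename, while a name-based reference goes stale.

```python
# Hypothetical model of ID-based vs name-based references.
# The IDs and names below are invented; Calor's real scheme may differ.

def rename(symbols, old_name, new_name):
    """Rename a declaration in place, keeping its stable ID."""
    for node_id, name in symbols.items():
        if name == old_name:
            symbols[node_id] = new_name

# Symbol table: stable ID -> current name.
symbols = {"fn_0042": "calc_total", "fn_0043": "apply_discount"}

id_ref = "fn_0042"          # ID-based reference (Calor-style)
name_ref = "calc_total"     # name-based reference (text-style)

rename(symbols, "calc_total", "compute_total")

print(symbols[id_ref])               # ID still resolves: compute_total
print(name_ref in symbols.values())  # name reference is now stale: False
```

A text-based agent editing by name must re-locate the symbol after every rename; an agent targeting stable IDs does not.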
### LLM-Based Metrics
These metrics use actual LLM code generation to measure real-world effectiveness:
| Category | What It Measures | Why It Matters |
|---|---|---|
| Task Completion | LLM code generation success | Can AI complete tasks with Calor vs C#? |
| Safety | Contract enforcement effectiveness | Does code catch bugs with informative errors? |
| Effect Discipline | Side effect management, bug prevention | Does code prevent flaky tests, security violations? |
| Correctness | Edge case handling, bug prevention | Does code produce correct results for edge cases? |
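The Safety row can be made concrete with a small Python stand-in for contracts (Calor's real contract syntax is not shown in this document; the decorator below is purely illustrative): a violated precondition fails fast with an informative error rather than returning a silently wrong result.

```python
# Minimal precondition sketch; a stand-in for Calor's contract
# syntax, which is not shown here.

def requires(pred, msg):
    """Precondition decorator: raise an informative error on violation."""
    def wrap(fn):
        def inner(*args):
            if not pred(*args):
                raise ValueError(f"{fn.__name__}: precondition failed: {msg}")
            return fn(*args)
        return inner
    return wrap

@requires(lambda price, rate: 0.0 <= rate <= 1.0, "0 <= rate <= 1")
def apply_discount(price, rate):
    return price * (1.0 - rate)

print(apply_discount(100.0, 0.2))   # 80.0
try:
    apply_discount(100.0, 1.5)      # violates the contract
except ValueError as e:
    print(e)  # named function, named invariant -- not a silent wrong answer
```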
## Summary Results
| Category | Calor vs C# | Winner |
|---|---|---|
| Comprehension | 2.22x | Calor |
| Error Detection | 1.83x | Calor |
| Edit Precision | 1.39x | Calor |
| Refactoring Stability | 1.52x | Calor |
| Correctness | 1.30x | Calor |
| Generation Accuracy | 1.02x | Calor |
| Token Economics | 0.79x | C# |
| Information Density | 1.15x | Calor |
Pattern: Calor wins on comprehension and precision metrics. C# wins on token economics, the one efficiency metric where verbosity is a cost.
## Key Insight
Calor excels where explicitness matters:
- Comprehension (1.51x) - Explicit structure aids understanding
- Error Detection (1.22x) - Contracts surface invariant violations
- Edit Precision (1.37x) - Unique IDs enable targeted changes
- Refactoring Stability (1.36x) - ID-based references survive transformations
- Safety (1.59x) - Contracts catch more bugs with better error messages
- Task Completion (1.22x) - LLMs complete tasks better with Calor's explicit contracts
Calor and C# are equivalent on:
- Correctness (1.0x) - Both languages achieve 100% on edge case handling when benchmarked fairly
- Effect Discipline (1.0x) - Both languages can write deterministic, side-effect-free code when benchmarked fairly
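A minimal Python illustration of the effect-discipline point (hypothetical example; Calor's actual effect annotations are not shown here): a hidden clock read makes a function's result time-dependent and its tests flaky, while passing the clock in explicitly keeps it deterministic.

```python
import datetime

# Hypothetical illustration of effect discipline; not Calor code.

def age_impure(birth_year):
    # Hidden clock read: the result drifts over time, so tests
    # asserting a fixed value eventually go flaky.
    return datetime.date.today().year - birth_year

def age_pure(birth_year, current_year):
    # Effect-free: the clock is an explicit input, so the result
    # is deterministic and trivially testable.
    return current_year - birth_year

print(age_pure(1990, 2024))  # always 34
```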
C# wins on efficiency:
- Token Economics (0.83x) - Calor's explicit syntax uses more tokens
- Generation Accuracy (0.97x) - C# has broader training data
This reflects a fundamental tradeoff: explicit semantics require more tokens but enable better agent reasoning, safer code, and higher task completion rates.
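A rough way to see the token side of that tradeoff (illustrative only: whitespace splitting stands in for a real LLM tokenizer, and both snippets are invented rather than real C# or Calor code):

```python
# Illustrative only: whitespace splitting is a crude stand-in for a
# real LLM tokenizer, and both snippets below are invented.

terse = "int Add(int a, int b) => a + b;"
explicit = ("fn add (a: Int, b: Int) -> Int "
            "ensures result == a + b "
            "{ return a + b }")

def tokens(code):
    return len(code.split())

ratio = tokens(terse) / tokens(explicit)
print(tokens(terse), tokens(explicit), round(ratio, 2))  # 9 20 0.45
```

The explicit form spends extra tokens on the contract, but those tokens carry semantics (the `ensures` clause) that an agent would otherwise have to infer from the body.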
## Agent Task Benchmark
The Agent Task Benchmark tests Claude's ability to generate correct Calor code from natural-language prompts. Because Claude learns Calor syntax from the skill documentation, the pass rate also serves as a measure of that documentation's quality.
Current Pass Rate: 86% across 89 tasks in 17 categories.
## Learn More
- Methodology - How benchmarks work
- Results - Detailed results table
- Agent Tasks - Claude code generation benchmark
- Individual Metrics - Deep dive into each metric