Benchmarking

Calor is evaluated against C# across 11 metrics designed to measure what matters for AI coding agents.


The Metrics

Static Analysis Metrics

| Category | What It Measures | Why It Matters |
|---|---|---|
| Comprehension | Structural clarity, semantic extractability | Can agents understand code without deep analysis? |
| Error Detection | Bug identification, contract violation detection | Can agents find issues using explicit semantics? |
| Edit Precision | Targeting accuracy, change isolation | Can agents make precise edits using unique IDs? |
| Generation Accuracy | Compilation success, structural correctness | Can agents produce valid code? |
| Token Economics | Tokens required to represent logic | How much context window does code consume? |
| Information Density | Semantic elements per token | How much meaning per token? |
| Refactoring Stability | ID-based reference preservation | Do unique IDs survive code transformations? |
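
As a rough illustration of how the two token-based rows relate, the sketch below computes a token count and a semantic-elements-per-token figure for one snippet. The whitespace tokenizer, the `count_tokens` and `information_density` helpers, and the hand-supplied element count are all assumptions made for illustration; the benchmark's actual tokenizer and its definition of a semantic element are not given here.

```python
def count_tokens(source: str) -> int:
    """Hypothetical tokenizer: splits on whitespace.

    A real benchmark would more likely use an LLM tokenizer.
    """
    return len(source.split())


def information_density(semantic_elements: int, source: str) -> float:
    """Semantic elements per token; 0.0 when the source has no tokens."""
    tokens = count_tokens(source)
    return semantic_elements / tokens if tokens else 0.0


# Example input. The element count of 4 is a made-up value, not a measurement.
snippet = "int Add(int a, int b) { return a + b; }"
tokens = count_tokens(snippet)
density = information_density(semantic_elements=4, source=snippet)
```

Under these assumptions, a language can cost more tokens for the same logic (worse Token Economics) and still carry more meaning per token (better Information Density), because the two figures divide different quantities.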

LLM-Based Metrics

These metrics use actual LLM code generation to measure real-world effectiveness:

| Category | What It Measures | Why It Matters |
|---|---|---|
| Task Completion | LLM code generation success | Can AI complete tasks with Calor vs C#? |
| Safety | Contract enforcement effectiveness | Does code catch bugs with informative errors? |
| Effect Discipline | Side effect management, bug prevention | Does code prevent flaky tests, security violations? |
| Correctness | Edge case handling, bug prevention | Does code produce correct results for edge cases? |

Summary Results

| Category | Calor vs C# | Winner |
|---|---|---|
| Comprehension | 2.22x | Calor |
| Error Detection | 1.83x | Calor |
| Edit Precision | 1.39x | Calor |
| Refactoring Stability | 1.52x | Calor |
| Correctness | 1.30x | Calor |
| Generation Accuracy | 1.02x | Calor |
| Token Economics | 0.79x | C# |
| Information Density | 1.15x | Calor |

Pattern: In this table, Calor wins on the comprehension and precision metrics, while C# wins only on Token Economics.
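
The Winner column is consistent with reading each figure as a Calor-to-C# ratio, where values above 1.0 favor Calor and values below 1.0 favor C#. The sketch below applies that reading to the figures from the Summary Results table. The interpretation is an assumption (the source does not define how the ratio is computed), and `winner` is a hypothetical helper.

```python
# Ratios copied from the Summary Results table.
SUMMARY = {
    "Comprehension": 2.22,
    "Error Detection": 1.83,
    "Edit Precision": 1.39,
    "Refactoring Stability": 1.52,
    "Correctness": 1.30,
    "Generation Accuracy": 1.02,
    "Token Economics": 0.79,
    "Information Density": 1.15,
}


def winner(ratio: float, tolerance: float = 0.0) -> str:
    """Assumed reading: ratio > 1 favors Calor, ratio < 1 favors C#."""
    if abs(ratio - 1.0) <= tolerance:
        return "Tie"
    return "Calor" if ratio > 1.0 else "C#"


winners = {category: winner(ratio) for category, ratio in SUMMARY.items()}
csharp_wins = [c for c, w in winners.items() if w == "C#"]
```

Setting a non-zero `tolerance` would treat near-parity results such as 1.02x as a tie rather than a win, which is a design choice the source does not specify.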


Key Insight

Calor excels where explicitness matters:

  • Comprehension (1.51x) - Explicit structure aids understanding
  • Error Detection (1.22x) - Contracts surface invariant violations
  • Edit Precision (1.37x) - Unique IDs enable targeted changes
  • Refactoring Stability (1.36x) - ID-based references survive transformations
  • Safety (1.59x) - Contracts catch more bugs with better error messages
  • Task Completion (1.22x) - LLMs complete tasks better with Calor's explicit contracts

Calor and C# are equivalent on:

  • Correctness (1.0x) - Both languages achieve 100% on edge case handling when benchmarked fairly
  • Effect Discipline (1.0x) - Both languages can write deterministic, side-effect-free code when benchmarked fairly

C# wins on efficiency:

  • Token Economics (0.83x) - Calor's explicit syntax uses more tokens
  • Generation Accuracy (0.97x) - C# has broader training data

This reflects a fundamental tradeoff: explicit semantics require more tokens but enable better agent reasoning, safer code, and higher task completion rates.


Agent Task Benchmark

The Agent Task Benchmark tests Claude's ability to generate correct Calor code from natural language prompts. It validates skill documentation quality by measuring how well Claude can learn Calor syntax.

Current Pass Rate: 86% across 89 tasks in 17 categories.
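
A pass rate of this kind is commonly computed as passed tasks divided by total tasks, optionally broken down by category. The sketch below assumes that definition, which the source does not confirm. The `TaskResult` record, the category names, and the sample results are hypothetical and are not taken from the benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical record for one benchmark task."""
    category: str
    passed: bool


def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty list."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def pass_rate_by_category(results: list[TaskResult]) -> dict[str, float]:
    """Group results by category and compute each group's pass rate."""
    grouped: dict[str, list[TaskResult]] = defaultdict(list)
    for r in results:
        grouped[r.category].append(r)
    return {category: pass_rate(rs) for category, rs in grouped.items()}


# Made-up sample data for illustration only.
results = [
    TaskResult("contracts", True),
    TaskResult("contracts", False),
    TaskResult("effects", True),
    TaskResult("effects", True),
]
overall = pass_rate(results)
per_category = pass_rate_by_category(results)
```

A per-category breakdown like this helps show whether failures cluster in a few categories or are spread evenly across them.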

