Methodology

How the Calor evaluation framework measures language effectiveness for AI agents.


Evaluation Approach

The framework compares Calor and C# implementations of the same programs across multiple dimensions. It combines deterministic static analysis with optional LLM-based evaluation for a comprehensive assessment.

Statistical Rigor

All metrics support statistical analysis mode:

  • Multiple runs (default: n=30) for variance measurement
  • 95% confidence intervals for all ratios
  • Cohen's d effect size calculations
  • Paired t-tests for statistical significance (p < 0.05)
Bash
# Run with statistical analysis
dotnet run --project tests/Calor.Evaluation -- --statistical --runs 30
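The statistics listed above can be sketched as follows. This is an illustrative Python implementation, not the framework's actual code; the critical value `t_crit` depends on the degrees of freedom (2.045 corresponds to df=29, i.e. the default n=30).

```python
# Sketch of the paired statistics the evaluation reports for
# Calor/C# measurement pairs (illustrative, not the framework's code).
from statistics import mean, stdev

def paired_stats(calor, csharp, t_crit=2.045):
    # Per-pair differences drive the paired t-test.
    diffs = [a - b for a, b in zip(calor, csharp)]
    n = len(diffs)
    d_mean, d_sd = mean(diffs), stdev(diffs)
    se = d_sd / n ** 0.5
    t_stat = d_mean / se                               # paired t statistic
    ci = (d_mean - t_crit * se, d_mean + t_crit * se)  # 95% confidence interval
    cohens_d = d_mean / d_sd                           # paired-samples effect size
    return t_stat, ci, cohens_d
```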

Test Corpus

Current Scale

  • 40 programs across multiple categories
  • Each program has paired Calor (.calr) and C# (.cs) implementations
  • Complexity levels 1-5 (simple to advanced)

Categories

Category              | Count | Examples
----------------------|-------|---------
Basic Functions       | 10    | HelloWorld, Calculator, FizzBuzz
Data Structures       | 4     | Stack, Queue, LinkedList, BinaryTree
Design Patterns       | 4     | Singleton, Factory, ShoppingCart
Algorithms            | 3     | BinarySearch, QuickSort, Fibonacci
Contract Verification | 7     | ProvableContracts, OverflowSafe, StringContracts
Effect Soundness      | 3     | CorrectEffects, MissingEffects, HiddenNetworkEffect
Interop Effects       | 2     | BclCoverage, PureFunctions


Requirements

Both implementations must:

  1. Compile successfully
  2. Produce identical output for the same inputs
  3. Have the same logical structure

The Metrics

1. Token Economics

Measures: Tokens required to represent equivalent logic.

Method:

  • Simple tokenization (split on whitespace and punctuation)
  • Character count (excluding whitespace)
  • Line count
  • Composite ratio

Interpretation: Lower is better (less context window usage).
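The "simple tokenization" described above can be sketched like this. This is an assumption about the splitter's behavior, not the framework's exact code:

```python
# Sketch of the simple tokenizer: split on whitespace and punctuation,
# then collect the per-dimension counts (illustrative, not the
# framework's exact implementation).
import re

def simple_tokens(source: str) -> list:
    # \w+ captures identifiers and numbers; [^\w\s] captures each
    # punctuation character as its own token.
    return re.findall(r"\w+|[^\w\s]", source)

def token_metrics(source: str) -> dict:
    return {
        "tokens": len(simple_tokens(source)),
        "chars": sum(1 for c in source if not c.isspace()),  # excl. whitespace
        "lines": source.count("\n") + 1,
    }
```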


2. Generation Accuracy

Measures: Ability to generate valid code.

Factors:

  • Compilation success (50%)
  • Structural completeness (30%)
  • Error count (20%)

Structural completeness:

  • Calor: Module, functions, bodies present
  • C#: Namespace, class, methods present
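The 50/30/20 weighting can be sketched as a composite score. The error-count decay below is an assumption for illustration; the framework's exact error scoring is not specified here:

```python
# Illustrative composite for the Generation Accuracy weighting above.
def generation_accuracy(compiled: bool, completeness: float, errors: int) -> float:
    # Assumed decay: more errors -> lower error score (not the
    # framework's documented formula).
    error_score = 1.0 / (1 + errors)
    return (0.5 * (1.0 if compiled else 0.0)   # compilation success (50%)
            + 0.3 * completeness               # structural completeness (30%)
            + 0.2 * error_score)               # error count (20%)
```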

3. Comprehension

Measures: How easily an agent can understand code structure.

Calor factors:

  • Module declarations (§M{)
  • Function declarations (§F{)
  • Input/output annotations (§I{, §O{)
  • Effect declarations (§E{)
  • Contracts (§REQ, §ENS)
  • Closing tags (§/)

C# factors:

  • Namespace declarations
  • Class declarations
  • Documentation comments (///)
  • Type annotations
  • Contract patterns

Scoring: Weighted sum of factors present, normalized to 0-1.


4. Edit Precision

Measures: Ability to target specific code elements accurately.

Approach: Simulates actual edit tasks and measures success rate:

  • Change loop bounds by ID
  • Add preconditions to functions
  • Rename functions (ID vs name-based)
  • Change function signatures
  • Modify return types

Calor advantages:

  • Unique IDs enable precise targeting (§F{f001:)
  • Closing tags define clear boundaries
  • ID-based references don't break on rename

C# challenges:

  • Name collisions reduce targeting accuracy
  • Brace nesting creates ambiguity
  • Cascading changes needed for renames

Scoring: 40% structural analysis + 60% simulated edit success rate


5. Error Detection

Measures: Bug detection capability through explicit contracts.

Calor factors:

  • Preconditions (§REQ) - +0.25
  • Postconditions (§ENS) - +0.20
  • Invariants (§INV) - +0.15
  • Effect declarations - +0.10

C# factors:

  • Debug.Assert statements
  • Contract.Requires / Contract.Ensures
  • Null checks
  • Exception handling
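The additive Calor weighting listed above can be sketched directly; capping the total at 1.0 is an assumption:

```python
# Sketch of the additive Calor error-detection score: +0.25 for §REQ,
# +0.20 for §ENS, +0.15 for §INV, +0.10 for effect declarations.
def calor_error_detection(has_req: bool, has_ens: bool,
                          has_inv: bool, has_effects: bool) -> float:
    score = 0.0
    if has_req:     score += 0.25   # preconditions
    if has_ens:     score += 0.20   # postconditions
    if has_inv:     score += 0.15   # invariants
    if has_effects: score += 0.10   # effect declarations
    return min(score, 1.0)          # cap is an assumption
```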

6. Information Density

Measures: Semantic content per token.

Semantic elements counted:

  • Calor: Modules, functions, variables, type annotations, contracts, effects, control flow, expressions
  • C#: Namespaces, classes, methods, variables, type annotations, control flow, expressions

Density formula: Total semantic elements / token count
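The density formula above as a one-liner (the element and token counts come from the static analysis described earlier):

```python
# Information density = total semantic elements / token count.
def information_density(semantic_elements: int, tokens: int) -> float:
    return semantic_elements / tokens if tokens else 0.0
```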


7. Refactoring Stability

Measures: How well unique IDs preserve references during code transformations.

Scenarios tested:

Scenario         | What's Measured
-----------------|----------------
Rename function  | Does ID survive when name changes?
Extract method   | Is new ID assigned, original preserved?
Move function    | Does ID survive cross-module move?
Change signature | Do callers update correctly?
Inline variable  | Do references resolve correctly?


Scoring weights:

  • ID preservation: 30%
  • Reference validity after edit: 25%
  • Minimal diff size: 20%
  • Semantic equivalence: 25%
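The four weights combine into a single stability score; the sub-score names below are illustrative, each a 0-1 value:

```python
# Sketch of the refactoring-stability weighting (30/25/20/25).
def refactoring_stability(id_preserved: float, refs_valid: float,
                          diff_minimal: float, sem_equiv: float) -> float:
    return (0.30 * id_preserved    # ID preservation
            + 0.25 * refs_valid    # reference validity after edit
            + 0.20 * diff_minimal  # minimal diff size
            + 0.25 * sem_equiv)    # semantic equivalence
```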

LLM-Based Metrics

These metrics use actual LLM code generation to measure real-world effectiveness:

8. Task Completion

Measures: LLM code generation success using language-neutral prompts.

Method:

  • 50 programming tasks across 4 categories
  • Same prompt given to both Calor and C# (no language-specific hints)
  • Calor generation uses a skills file teaching Contract-First Methodology
  • Generated code is compiled and tested against test suites

Scoring: 40% compilation success + 60% test pass rate

See the LLM Task Completion Benchmark section below for details.


9. Safety

Measures: Contract enforcement effectiveness and error quality for catching bugs.

Method:

  • Tests whether Calor contracts catch more bugs than C# guard clauses
  • Evaluates error message quality (informative vs cryptic)
  • Measures violation detection rate across input categories

What it catches:

  • Division by zero
  • Array bounds violations
  • Integer overflow
  • Null dereferences
  • Invalid argument values

10. Effect Discipline

Measures: Side effect management quality and bug prevention.

Method:

  • Tests whether code prevents real-world bugs caused by hidden side effects
  • Evaluates effect declaration completeness
  • Measures bug prevention rate for effect-related issues

What it catches:

  • Flaky tests (non-determinism from hidden state)
  • Security violations (unauthorized I/O)
  • Side effect transparency issues
  • Cache safety problems (memoization correctness)

Calor-Only Metrics

11. Interop Effect Coverage

Measures: BCL methods covered by effect manifests.

The BCL effect manifest tracks which .NET methods have effects, enabling verification even when calling external code. This has no C# equivalent since C# lacks an effect system.


These metrics explain why Calor achieves its overall advantage despite losing on token efficiency: the semantic verification pipeline and effect system catch classes of bugs that syntax-only tools cannot detect.


LLM Task Completion Benchmark

The Task Completion metric uses a dedicated LLM benchmark that directly measures AI code generation success using language-neutral prompts.

Design Philosophy: Neutral Prompts

The benchmark uses the same functional requirements for both languages without syntax hints:

Plain Text
Write a public function named Factorial that computes the factorial
of an integer n. The function must only accept non-negative values
of n. The result is always at least 1.

This tests whether an LLM can learn Calor and translate requirements into idiomatic code, rather than simply following syntax instructions.

How It Works

  1. 50 programming tasks defined across 4 categories
  2. Same prompt given to both Calor and C# (no language-specific hints)
  3. Calor generation uses a skills file teaching Contract-First Methodology
  4. Generated code is compiled and tested against test suites
  5. Scores computed based on compilation success and test pass rate

Task Categories

Category         | Count | Examples
-----------------|-------|---------
basic-algorithms | 15    | Factorial, Fibonacci, IsPrime, GCD, Power
safety           | 10    | SafeDivide, Clamp, SafeModulo, NormalizeScore
data-structures  | 10    | Sum, Max, Min, Average, Median
logic            | 15    | BoolToInt, LogicalAnd, IsMultipleOf, SameSign

Scoring Formula

Plain Text
Score = 0.4 × compilation + 0.6 × tests
Factor      | Weight | Description
------------|--------|------------
Compilation | 40%    | Does the generated code compile?
Test Cases  | 60%    | Percentage of test cases passing

No bonus is awarded for contracts; the metric measures task completion alone.
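The scoring formula above, applied per task. `compiled` and the test counts come from the harness described in the steps above; the function name is illustrative:

```python
# Per-task score: 0.4 x compilation + 0.6 x test pass rate.
def task_score(compiled: bool, tests_passed: int, tests_total: int) -> float:
    compilation = 1.0 if compiled else 0.0
    tests = tests_passed / tests_total if tests_total else 0.0
    return 0.4 * compilation + 0.6 * tests
```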

Results Summary

Metric           | Calor       | C#
-----------------|-------------|-----
Average Score    | 1.00        | 1.00
Advantage Ratio  | 1.00x (Tie) | -
Compilation Rate | 100%        | 100%
Test Pass Rate   | 100%        | 100%

The 1.00x ratio shows that Calor is as learnable as C#: LLMs write correct code in both languages equally well.

Contract Usage (Qualitative)

While task completion scores are equal, examining the generated code shows Calor consistently produces contracts:

Requirement                | Calor                     | C#
---------------------------|---------------------------|---------------------
"must not be zero"         | §Q (!= b 0)               | if (b == 0) throw...
"result is never negative" | §S (>= result 0)          | (no equivalent)
"must be between X and Y"  | §Q (>= n X) + §Q (<= n Y) | Two guard clauses

Calor includes postconditions that C# cannot express - a qualitative advantage not captured by task completion scores.

Running Locally

Bash
# Run LLM benchmark with neutral prompts
dotnet run --project tests/Calor.Evaluation -- llm-tasks \
  --manifest tests/Calor.Evaluation/Tasks/task-manifest-neutral.json \
  --verbose

# Refresh cache (re-run all tasks)
dotnet run --project tests/Calor.Evaluation -- llm-tasks \
  --manifest tests/Calor.Evaluation/Tasks/task-manifest-neutral.json \
  --refresh-cache

Cost Controls

  • Caching: Results cached to avoid redundant API calls
  • Budget caps: Maximum spend per run (~$0.76 for 50 tasks)
  • Estimated cost displayed before running

LLM-Based Evaluation

Multi-LLM Validation

To ensure unbiased results, we test with multiple LLMs:

Provider  | Model             | Purpose
----------|-------------------|-------------------
Anthropic | Claude 3.5 Sonnet | Primary evaluation
OpenAI    | GPT-4o            | Cross-validation

Comprehension Questions

LLMs answer questions about code understanding:

  1. What is the main purpose of this code?
  2. What are the input parameters and their constraints?
  3. What does this function return?
  4. What side effects does this code have?
  5. What would happen if [edge case]?

Scoring

  • Correctness of answers (0-1) graded against ground truth
  • Tokens used to formulate answer
  • Consistency across multiple runs
  • Cross-model agreement (higher = more reliable)

Running the Evaluation

Basic Run

Bash
# JSON output (default)
dotnet run --project tests/Calor.Evaluation -- run --output report.json

# Markdown output
dotnet run --project tests/Calor.Evaluation -- run --format markdown --output report.md

# Website dashboard format
dotnet run --project tests/Calor.Evaluation -- run --format website --output results.json

# HTML dashboard
dotnet run --project tests/Calor.Evaluation -- run --format html --output dashboard.html

Statistical Analysis

Bash
# Run with 30 statistical samples
dotnet run --project tests/Calor.Evaluation -- run --statistical --runs 30

# Custom run count
dotnet run --project tests/Calor.Evaluation -- run --statistical --runs 50

Specific Metrics

Bash
# Run only specific categories
dotnet run --project tests/Calor.Evaluation -- run \
  --category Comprehension \
  --category EditPrecision \
  --category RefactoringStability

Interpreting Results

Winner Determination

For each metric:

  • Higher is better (Comprehension, Error Detection, etc.): Calor wins if ratio > 1.0
  • Lower is better (Token Economics): C# wins if ratio < 1.0

Statistical Significance

With statistical mode enabled:

  • p < 0.05: Result is statistically significant
  • Cohen's d: Effect size interpretation
    • d < 0.2: Negligible
    • d < 0.5: Small
    • d < 0.8: Medium
    • d ≥ 0.8: Large
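The effect-size thresholds above, as a small helper:

```python
# Map Cohen's d to the interpretation bands listed above.
def effect_size_label(d: float) -> str:
    d = abs(d)
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"
```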

The Tradeoff

Results consistently show:

  • Calor wins on comprehension metrics (structure, contracts, effects, refactoring)
  • C# wins on efficiency metrics (tokens, density)

This reflects the designed tradeoff: explicit semantics require more tokens but enable better agent reasoning and safer refactoring.


CI/CD Integration

Automated Benchmarks

Benchmarks run automatically via GitHub Actions:

  • On every push: Quick validation run
  • Weekly (Sunday): Full statistical analysis with LLM evaluation
  • Manual trigger: On-demand with custom parameters

Regression Detection

The CI pipeline checks for:

  • Any metric dropping by more than 10%
  • Statistical significance of changes
  • Failing compilation in any benchmark

Limitations

  1. Corpus size: 40 programs may not cover all patterns
  2. C# baseline: Other languages might perform differently
  3. Static analysis: Some metrics don't capture runtime behavior
  4. LLM variance: Model updates can affect evaluation scores
