Methodology

How the Calor evaluation framework measures language effectiveness for AI agents.


Evaluation Approach

The framework compares Calor and C# implementations of the same programs across multiple dimensions. It combines deterministic static analysis with optional LLM-based evaluation for a comprehensive assessment.

Statistical Rigor

All metrics support statistical analysis mode:

  • Multiple runs (default: n=30) for variance measurement
  • 95% confidence intervals for all ratios
  • Cohen's d effect size calculations
  • Paired t-tests for statistical significance (p < 0.05)
Bash
# Run with statistical analysis
dotnet run --project tests/Calor.Evaluation -- --statistical --runs 30
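The statistics listed above can be sketched as follows. This is an illustrative Python implementation, not the framework's actual code; the critical value `t_crit` depends on the degrees of freedom (2.045 corresponds to df=29, i.e. the default n=30).

```python
# Sketch of the paired statistics the evaluation reports for
# Calor/C# measurement pairs (illustrative, not the framework's code).
from statistics import mean, stdev

def paired_stats(calor, csharp, t_crit=2.045):
    # Per-pair differences drive the paired t-test.
    diffs = [a - b for a, b in zip(calor, csharp)]
    n = len(diffs)
    d_mean, d_sd = mean(diffs), stdev(diffs)
    se = d_sd / n ** 0.5
    t_stat = d_mean / se                               # paired t statistic
    ci = (d_mean - t_crit * se, d_mean + t_crit * se)  # 95% confidence interval
    cohens_d = d_mean / d_sd                           # paired-samples effect size
    return t_stat, ci, cohens_d
```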

Test Corpus

Current Scale

  • 40 programs across multiple categories
  • Each program has paired Calor (.calr) and C# (.cs) implementations
  • Complexity levels 1-5 (simple to advanced)

Categories

Category              | Count | Examples
----------------------|-------|---------
Basic Functions       | 10    | HelloWorld, Calculator, FizzBuzz
Data Structures       | 4     | Stack, Queue, LinkedList, BinaryTree
Design Patterns       | 4     | Singleton, Factory, ShoppingCart
Algorithms            | 3     | BinarySearch, QuickSort, Fibonacci
Contract Verification | 7     | ProvableContracts, OverflowSafe, StringContracts
Effect Soundness      | 3     | CorrectEffects, MissingEffects, HiddenNetworkEffect
Interop Effects       | 2     | BclCoverage, PureFunctions


Requirements

Both implementations must:

  1. Compile successfully
  2. Produce identical output for the same inputs
  3. Have the same logical structure

The Metrics

1. Token Economics

Measures: Tokens required to represent equivalent logic.

Method:

  • Simple tokenization (split on whitespace and punctuation)
  • Character count (excluding whitespace)
  • Line count
  • Composite ratio

Interpretation: Lower is better (less context window usage).
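The "simple tokenization" described above can be sketched like this. This is an assumption about the splitter's behavior, not the framework's exact code:

```python
# Sketch of the simple tokenizer: split on whitespace and punctuation,
# then collect the per-dimension counts (illustrative, not the
# framework's exact implementation).
import re

def simple_tokens(source: str) -> list:
    # \w+ captures identifiers and numbers; [^\w\s] captures each
    # punctuation character as its own token.
    return re.findall(r"\w+|[^\w\s]", source)

def token_metrics(source: str) -> dict:
    return {
        "tokens": len(simple_tokens(source)),
        "chars": sum(1 for c in source if not c.isspace()),  # excl. whitespace
        "lines": source.count("\n") + 1,
    }
```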


2. Generation Accuracy

Measures: Ability to generate valid code.

Factors:

  • Compilation success (50%)
  • Structural completeness (30%)
  • Error count (20%)

Structural completeness:

  • Calor: Module, functions, bodies present
  • C#: Namespace, class, methods present
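The 50/30/20 weighting can be sketched as a composite score. The error-count decay below is an assumption for illustration; the framework's exact error scoring is not specified here:

```python
# Illustrative composite for the Generation Accuracy weighting above.
def generation_accuracy(compiled: bool, completeness: float, errors: int) -> float:
    # Assumed decay: more errors -> lower error score (not the
    # framework's documented formula).
    error_score = 1.0 / (1 + errors)
    return (0.5 * (1.0 if compiled else 0.0)   # compilation success (50%)
            + 0.3 * completeness               # structural completeness (30%)
            + 0.2 * error_score)               # error count (20%)
```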

3. Comprehension

Measures: How easily an agent can understand code structure.

Calor factors:

  • Module declarations (§M{)
  • Function declarations (§F{)
  • Input/output annotations (§I{, §O{)
  • Effect declarations (§E{)
  • Contracts (§REQ, §ENS)
  • Closing tags (§/)

C# factors:

  • Namespace declarations
  • Class declarations
  • Documentation comments (///)
  • Type annotations
  • Contract patterns

Scoring: Weighted sum of factors present, normalized to 0-1.


4. Edit Precision

Measures: Ability to target specific code elements accurately.

Approach: Simulates actual edit tasks and measures success rate:

  • Change loop bounds by ID
  • Add preconditions to functions
  • Rename functions (ID vs name-based)
  • Change function signatures
  • Modify return types

Calor advantages:

  • Unique IDs enable precise targeting (§F{f001:)
  • Closing tags define clear boundaries
  • ID-based references don't break on rename

C# challenges:

  • Name collisions reduce targeting accuracy
  • Brace nesting creates ambiguity
  • Cascading changes needed for renames

Scoring: 40% structural analysis + 60% simulated edit success rate


5. Error Detection

Measures: Bug detection capability through explicit contracts.

Calor factors:

  • Preconditions (§REQ) - +0.25
  • Postconditions (§ENS) - +0.20
  • Invariants (§INV) - +0.15
  • Effect declarations - +0.10

C# factors:

  • Debug.Assert statements
  • Contract.Requires / Contract.Ensures
  • Null checks
  • Exception handling
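The additive Calor weighting listed above can be sketched directly; capping the total at 1.0 is an assumption:

```python
# Sketch of the additive Calor error-detection score: +0.25 for §REQ,
# +0.20 for §ENS, +0.15 for §INV, +0.10 for effect declarations.
def calor_error_detection(has_req: bool, has_ens: bool,
                          has_inv: bool, has_effects: bool) -> float:
    score = 0.0
    if has_req:     score += 0.25   # preconditions
    if has_ens:     score += 0.20   # postconditions
    if has_inv:     score += 0.15   # invariants
    if has_effects: score += 0.10   # effect declarations
    return min(score, 1.0)          # cap is an assumption
```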

6. Information Density

Measures: Semantic content per token.

Semantic elements counted:

  • Calor: Modules, functions, variables, type annotations, contracts, effects, control flow, expressions
  • C#: Namespaces, classes, methods, variables, type annotations, control flow, expressions

Density formula: Total semantic elements / token count
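The density formula above as a one-liner (the element and token counts come from the static analysis described earlier):

```python
# Information density = total semantic elements / token count.
def information_density(semantic_elements: int, tokens: int) -> float:
    return semantic_elements / tokens if tokens else 0.0
```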


7. Refactoring Stability

Measures: How well unique IDs preserve references during code transformations.

Scenarios tested:

Scenario         | What's Measured
-----------------|----------------
Rename function  | Does ID survive when name changes?
Extract method   | Is new ID assigned, original preserved?
Move function    | Does ID survive cross-module move?
Change signature | Do callers update correctly?
Inline variable  | Do references resolve correctly?


Scoring weights:

  • ID preservation: 30%
  • Reference validity after edit: 25%
  • Minimal diff size: 20%
  • Semantic equivalence: 25%
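The four weights combine into a single stability score; the sub-score names below are illustrative, each a 0-1 value:

```python
# Sketch of the refactoring-stability weighting (30/25/20/25).
def refactoring_stability(id_preserved: float, refs_valid: float,
                          diff_minimal: float, sem_equiv: float) -> float:
    return (0.30 * id_preserved    # ID preservation
            + 0.25 * refs_valid    # reference validity after edit
            + 0.20 * diff_minimal  # minimal diff size
            + 0.25 * sem_equiv)    # semantic equivalence
```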

LLM-Based Metrics

These metrics use actual LLM code generation to measure real-world effectiveness:

8. Task Completion

Measures: LLM code generation success using language-neutral prompts.

Method:

  • 50 programming tasks across 4 categories
  • Same prompt given to both Calor and C# (no language-specific hints)
  • Calor generation uses a skills file teaching Contract-First Methodology
  • Generated code is compiled and tested against test suites

Scoring: 40% compilation success + 60% test pass rate

See the LLM Task Completion Benchmark section below for details.


9. Safety

Measures: Contract enforcement effectiveness and error quality for catching bugs.

Method:

  • Tests whether Calor contracts catch more bugs than C# guard clauses
  • Evaluates error message quality (informative vs cryptic)
  • Measures violation detection rate across input categories

What it catches:

  • Division by zero
  • Array bounds violations
  • Integer overflow
  • Null dereferences
  • Invalid argument values

10. Effect Discipline

Measures: Side effect management quality and bug prevention.

Method:

  • Tests whether code prevents real-world bugs caused by hidden side effects
  • Evaluates effect declaration completeness
  • Measures bug prevention rate for effect-related issues

What it catches:

  • Flaky tests (non-determinism from hidden state)
  • Security violations (unauthorized I/O)
  • Side effect transparency issues
  • Cache safety problems (memoization correctness)

Calor-Only Metrics

11. Interop Effect Coverage

Measures: BCL methods covered by effect manifests.

The BCL effect manifest tracks which .NET methods have effects, enabling verification even when calling external code. This has no C# equivalent since C# lacks an effect system.


These metrics explain why Calor achieves its overall advantage despite losing on token efficiency: the semantic verification pipeline and effect system catch classes of bugs that syntax-only tools cannot detect.


LLM Task Completion Benchmark

The Task Completion metric uses a dedicated LLM benchmark that directly measures AI code generation success using language-neutral prompts.

Design Philosophy: Neutral Prompts

The benchmark uses the same functional requirements for both languages without syntax hints:

Plain Text
Write a public function named Factorial that computes the factorial
of an integer n. The function must only accept non-negative values
of n. The result is always at least 1.

This tests whether an LLM can learn Calor and translate requirements into idiomatic code, rather than simply following syntax instructions.

How It Works

  1. 50 programming tasks defined across 4 categories
  2. Same prompt given to both Calor and C# (no language-specific hints)
  3. Calor generation uses a skills file teaching Contract-First Methodology
  4. Generated code is compiled and tested against test suites
  5. Scores computed based on compilation success and test pass rate

Task Categories

Category         | Count | Examples
-----------------|-------|---------
basic-algorithms | 15    | Factorial, Fibonacci, IsPrime, GCD, Power
safety           | 10    | SafeDivide, Clamp, SafeModulo, NormalizeScore
data-structures  | 10    | Sum, Max, Min, Average, Median
logic            | 15    | BoolToInt, LogicalAnd, IsMultipleOf, SameSign

Scoring Formula

Plain Text
Score = 0.4 × compilation + 0.6 × tests
Factor      | Weight | Description
------------|--------|------------
Compilation | 40%    | Does the generated code compile?
Test Cases  | 60%    | Percentage of test cases passing

No bonus is awarded for contracts; the metric measures task completion alone.
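The scoring formula above, applied per task. `compiled` and the test counts come from the harness described in the steps above; the function name is illustrative:

```python
# Per-task score: 0.4 x compilation + 0.6 x test pass rate.
def task_score(compiled: bool, tests_passed: int, tests_total: int) -> float:
    compilation = 1.0 if compiled else 0.0
    tests = tests_passed / tests_total if tests_total else 0.0
    return 0.4 * compilation + 0.6 * tests
```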

Results Summary

Metric           | Calor       | C#
-----------------|-------------|-----
Average Score    | 1.00        | 1.00
Advantage Ratio  | 1.00x (Tie) | -
Compilation Rate | 100%        | 100%
Test Pass Rate   | 100%        | 100%

The 1.00x ratio shows that Calor is as learnable as C#: LLMs write correct code in both languages equally well.

Contract Usage (Qualitative)

While task completion scores are equal, examining the generated code shows Calor consistently produces contracts:

Requirement                | Calor                     | C#
---------------------------|---------------------------|---------------------
"must not be zero"         | §Q (!= b 0)               | if (b == 0) throw...
"result is never negative" | §S (>= result 0)          | (no equivalent)
"must be between X and Y"  | §Q (>= n X) + §Q (<= n Y) | Two guard clauses

Calor includes postconditions that C# cannot express - a qualitative advantage not captured by task completion scores.

Running Locally

Bash
# Run LLM benchmark with neutral prompts
dotnet run --project tests/Calor.Evaluation -- llm-tasks \
  --manifest tests/Calor.Evaluation/Tasks/task-manifest-neutral.json \
  --verbose

# Refresh cache (re-run all tasks)
dotnet run --project tests/Calor.Evaluation -- llm-tasks \
  --manifest tests/Calor.Evaluation/Tasks/task-manifest-neutral.json \
  --refresh-cache

Cost Controls

  • Caching: Results cached to avoid redundant API calls
  • Budget caps: Maximum spend per run (~$0.76 for 50 tasks)
  • Estimated cost displayed before running

LLM-Based Evaluation

Multi-LLM Validation

To ensure unbiased results, we test with multiple LLMs:

Provider  | Model             | Purpose
----------|-------------------|-------------------
Anthropic | Claude 3.5 Sonnet | Primary evaluation
OpenAI    | GPT-4o            | Cross-validation

Comprehension Questions

LLMs answer questions about code understanding:

  1. What is the main purpose of this code?
  2. What are the input parameters and their constraints?
  3. What does this function return?
  4. What side effects does this code have?
  5. What would happen if [edge case]?

Scoring

  • Correctness of answers (0-1) graded against ground truth
  • Tokens used to formulate answer
  • Consistency across multiple runs
  • Cross-model agreement (higher = more reliable)

Running the Evaluation

Basic Run

Bash
# JSON output (default)
dotnet run --project tests/Calor.Evaluation -- run --output report.json

# Markdown output
dotnet run --project tests/Calor.Evaluation -- run --format markdown --output report.md

# Website dashboard format
dotnet run --project tests/Calor.Evaluation -- run --format website --output results.json

# HTML dashboard
dotnet run --project tests/Calor.Evaluation -- run --format html --output dashboard.html

Statistical Analysis

Bash
# Run with 30 statistical samples
dotnet run --project tests/Calor.Evaluation -- run --statistical --runs 30

# Custom run count
dotnet run --project tests/Calor.Evaluation -- run --statistical --runs 50

Specific Metrics

Bash
# Run only specific categories
dotnet run --project tests/Calor.Evaluation -- run \
  --category Comprehension \
  --category EditPrecision \
  --category RefactoringStability

Interpreting Results

Winner Determination

For each metric:

  • Higher is better (Comprehension, Error Detection, etc.): Calor wins if ratio > 1.0
  • Lower is better (Token Economics): C# wins if ratio < 1.0

Statistical Significance

With statistical mode enabled:

  • p < 0.05: Result is statistically significant
  • Cohen's d: Effect size interpretation
    • d < 0.2: Negligible
    • d < 0.5: Small
    • d < 0.8: Medium
    • d ≥ 0.8: Large
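The effect-size thresholds above, as a small helper:

```python
# Map Cohen's d to the interpretation bands listed above.
def effect_size_label(d: float) -> str:
    d = abs(d)
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"
```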

The Tradeoff

Results consistently show:

  • Calor wins on comprehension metrics (structure, contracts, effects, refactoring)
  • C# wins on efficiency metrics (tokens, density)

This reflects the designed tradeoff: explicit semantics require more tokens but enable better agent reasoning and safer refactoring.


CI/CD Integration

Automated Benchmarks

Benchmarks run automatically via GitHub Actions:

  • On every push: Quick validation run
  • Weekly (Sunday): Full statistical analysis with LLM evaluation
  • Manual trigger: On-demand with custom parameters

Regression Detection

The CI pipeline checks for:

  • Any metric dropping by more than 10%
  • Statistical significance of changes
  • Failing compilation in any benchmark

Limitations

  1. Corpus size: 40 programs may not cover all patterns
  2. C# baseline: Other languages might perform differently
  3. Static analysis: Some metrics don't capture runtime behavior
  4. LLM variance: Model updates can affect evaluation scores
