Methodology
How the Calor evaluation framework measures language effectiveness for AI agents.
Evaluation Approach
The framework compares Calor and C# implementations of the same programs across multiple dimensions. It combines deterministic static analysis with optional LLM-based evaluation for a comprehensive assessment.
Statistical Rigor
All metrics support statistical analysis mode:
- Multiple runs (default: n=30) for variance measurement
- 95% confidence intervals for all ratios
- Cohen's d effect size calculations
- Paired t-tests for statistical significance (p < 0.05)
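The core paired-sample statistics can be sketched in a few lines (a pure-Python illustration; the framework itself is implemented in C#):

```python
import math
from statistics import mean, stdev

def paired_stats(calor_scores, csharp_scores):
    """Cohen's d for paired samples is mean(differences) / stdev(differences);
    the paired t statistic is then d * sqrt(n)."""
    diffs = [a - b for a, b in zip(calor_scores, csharp_scores)]
    d = mean(diffs) / stdev(diffs)
    t = d * math.sqrt(len(diffs))
    return d, t
```

The t statistic is compared against the critical value for n-1 degrees of freedom to decide significance at p < 0.05; the 95% confidence intervals are built from the same per-run samples.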
```shell
# Run with statistical analysis
dotnet run --project tests/Calor.Evaluation -- --statistical --runs 30
```

Test Corpus
Current Scale
- 40 programs across multiple categories
- Each program has paired Calor (.calr) and C# (.cs) implementations
- Complexity levels 1-5 (simple to advanced)
Categories
| Category | Count | Examples |
|---|---|---|
| Basic Functions | 10 | HelloWorld, Calculator, FizzBuzz |
| Data Structures | 4 | Stack, Queue, LinkedList, BinaryTree |
| Design Patterns | 4 | Singleton, Factory, ShoppingCart |
| Algorithms | 3 | BinarySearch, QuickSort, Fibonacci |
| Contract Verification | 7 | ProvableContracts, OverflowSafe, StringContracts |
| Effect Soundness | 3 | CorrectEffects, MissingEffects, HiddenNetworkEffect |
| Interop Effects | 2 | BclCoverage, PureFunctions |
Requirements
Both implementations must:
- Compile successfully
- Produce identical output for the same inputs
- Have the same logical structure
The Metrics
1. Token Economics
Measures: Tokens required to represent equivalent logic.
Method:
- Simple tokenization (split on whitespace and punctuation)
- Character count (excluding whitespace)
- Line count
- Composite ratio
Interpretation: Lower is better (less context window usage).
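A tokenizer of the kind described can be sketched as follows (illustrative; the framework's exact splitting rules may differ):

```python
import re

def count_tokens(source: str) -> int:
    """Naive tokenization: identifiers/numbers and individual punctuation
    characters count as tokens; whitespace is discarded."""
    return len(re.findall(r"\w+|[^\w\s]", source))

def token_ratio(calor_source: str, csharp_source: str) -> float:
    """Calor/C# token ratio; values below 1.0 favor Calor on this metric."""
    return count_tokens(calor_source) / count_tokens(csharp_source)
```

For example, `int x = 1;` yields five tokens: `int`, `x`, `=`, `1`, `;`.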
2. Generation Accuracy
Measures: Ability to generate valid code.
Factors:
- Compilation success (50%)
- Structural completeness (30%)
- Error count (20%)
Structural completeness:
- Calor: Module, functions, bodies present
- C#: Namespace, class, methods present
3. Comprehension
Measures: How easily an agent can understand code structure.
Calor factors:
- Module declarations (`§M{`)
- Function declarations (`§F{`)
- Input/output annotations (`§I{`, `§O{`)
- Effect declarations (`§E{`)
- Contracts (`§REQ`, `§ENS`)
- Closing tags (`§/`)
C# factors:
- Namespace declarations
- Class declarations
- Documentation comments (`///`)
- Type annotations
- Contract patterns
Scoring: Weighted sum of factors present, normalized to 0-1.
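The scoring rule can be sketched as a normalized weighted sum; the factor names and weights below are illustrative placeholders, not the framework's actual values:

```python
def comprehension_score(present: set, weights: dict) -> float:
    """Weighted sum of the structural factors found, normalized to 0-1."""
    total = sum(weights.values())
    found = sum(w for factor, w in weights.items() if factor in present)
    return found / total

# Hypothetical weights for the Calor factors listed above
calor_weights = {"module": 0.15, "function": 0.20, "io_annotations": 0.20,
                 "effects": 0.15, "contracts": 0.20, "closing_tags": 0.10}
```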
4. Edit Precision
Measures: Ability to target specific code elements accurately.
Approach: Simulates actual edit tasks and measures success rate:
- Change loop bounds by ID
- Add preconditions to functions
- Rename functions (ID vs name-based)
- Change function signatures
- Modify return types
Calor advantages:
- Unique IDs enable precise targeting (`§F{f001:`)
- Closing tags define clear boundaries
- ID-based references don't break on rename
C# challenges:
- Name collisions reduce targeting accuracy
- Brace nesting creates ambiguity
- Cascading changes needed for renames
Scoring: 40% structural analysis + 60% simulated edit success rate
5. Error Detection
Measures: Bug detection capability through explicit contracts.
Calor factors:
- Preconditions (`§REQ`): +0.25
- Postconditions (`§ENS`): +0.20
- Invariants (`§INV`): +0.15
- Effect declarations: +0.10
C# factors:
- `Debug.Assert` statements
- `Contract.Requires` / `Contract.Ensures`
- Null checks
- Exception handling
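Using the Calor weights listed above, the aggregation can be sketched as follows (the presence check and the cap at 1.0 are assumptions, not the framework's confirmed behavior):

```python
CONTRACT_WEIGHTS = {
    "precondition": 0.25,   # §REQ
    "postcondition": 0.20,  # §ENS
    "invariant": 0.15,      # §INV
    "effects": 0.10,        # effect declarations
}

def error_detection_score(counts: dict) -> float:
    """Credit each contract kind that appears at least once, capped at 1.0."""
    score = sum(w for kind, w in CONTRACT_WEIGHTS.items() if counts.get(kind, 0) > 0)
    return min(score, 1.0)
```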
6. Information Density
Measures: Semantic content per token.
Semantic elements counted:
- Calor: Modules, functions, variables, type annotations, contracts, effects, control flow, expressions
- C#: Namespaces, classes, methods, variables, type annotations, control flow, expressions
Density formula: Total semantic elements / token count
7. Refactoring Stability
Measures: How well unique IDs preserve references during code transformations.
Scenarios tested:
| Scenario | What's Measured |
|---|---|
| Rename function | Does ID survive when name changes? |
| Extract method | Is new ID assigned, original preserved? |
| Move function | Does ID survive cross-module move? |
| Change signature | Do callers update correctly? |
| Inline variable | Do references resolve correctly? |
Scoring weights:
- ID preservation: 30%
- Reference validity after edit: 25%
- Minimal diff size: 20%
- Semantic equivalence: 25%
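The composite follows directly from those weights, assuming each component is scored 0-1 per scenario:

```python
STABILITY_WEIGHTS = {
    "id_preservation": 0.30,
    "reference_validity": 0.25,
    "minimal_diff": 0.20,
    "semantic_equivalence": 0.25,
}

def stability_score(components: dict) -> float:
    """Weighted combination of per-scenario component scores (each 0-1)."""
    return sum(w * components.get(k, 0.0) for k, w in STABILITY_WEIGHTS.items())
```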
LLM-Based Metrics
These metrics use actual LLM code generation to measure real-world effectiveness:
8. Task Completion
Measures: LLM code generation success using language-neutral prompts.
Method:
- 50 programming tasks across 4 categories
- Same prompt given to both Calor and C# (no language-specific hints)
- Calor generation uses a skills file teaching Contract-First Methodology
- Generated code is compiled and tested against test suites
Scoring: 40% compilation success + 60% test pass rate
See the LLM Task Completion Benchmark section below for details.
9. Safety
Measures: Contract enforcement effectiveness and error quality for catching bugs.
Method:
- Tests whether Calor contracts catch more bugs than C# guard clauses
- Evaluates error message quality (informative vs cryptic)
- Measures violation detection rate across input categories
What it catches:
- Division by zero
- Array bounds violations
- Integer overflow
- Null dereferences
- Invalid argument values
10. Effect Discipline
Measures: Side effect management quality and bug prevention.
Method:
- Tests whether code prevents real-world bugs caused by hidden side effects
- Evaluates effect declaration completeness
- Measures bug prevention rate for effect-related issues
What it catches:
- Flaky tests (non-determinism from hidden state)
- Security violations (unauthorized I/O)
- Side effect transparency issues
- Cache safety problems (memoization correctness)
Calor-Only Metrics
11. Interop Effect Coverage
Measures: BCL methods covered by effect manifests.
The BCL effect manifest tracks which .NET methods have effects, enabling verification even when calling external code. This has no C# equivalent since C# lacks an effect system.
These metrics explain why Calor achieves its overall advantage despite losing on token efficiency: the semantic verification pipeline and effect system catch classes of bugs that syntax-only tools cannot detect.
LLM Task Completion Benchmark
The Task Completion metric uses a dedicated LLM benchmark that directly measures AI code generation success using language-neutral prompts.
Design Philosophy: Neutral Prompts
The benchmark uses the same functional requirements for both languages without syntax hints:
```
Write a public function named Factorial that computes the factorial
of an integer n. The function must only accept non-negative values
of n. The result is always at least 1.
```

This tests whether an LLM can learn Calor and translate requirements into idiomatic code, rather than simply following syntax instructions.
How It Works
- 50 programming tasks defined across 4 categories
- Same prompt given to both Calor and C# (no language-specific hints)
- Calor generation uses a skills file teaching Contract-First Methodology
- Generated code is compiled and tested against test suites
- Scores computed based on compilation success and test pass rate
Task Categories
| Category | Count | Examples |
|---|---|---|
| basic-algorithms | 15 | Factorial, Fibonacci, IsPrime, GCD, Power |
| safety | 10 | SafeDivide, Clamp, SafeModulo, NormalizeScore |
| data-structures | 10 | Sum, Max, Min, Average, Median |
| logic | 15 | BoolToInt, LogicalAnd, IsMultipleOf, SameSign |
Scoring Formula
Score = 0.4 × compilation + 0.6 × tests

| Factor | Weight | Description |
|---|---|---|
| Compilation | 40% | Does the generated code compile? |
| Test Cases | 60% | Percentage of test cases passing |
No bonus for contracts - we measure task completion fairly.
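The formula is straightforward to state in code (a sketch; parameter names are illustrative):

```python
def task_score(compiled: bool, tests_passed: int, tests_total: int) -> float:
    """Score = 0.4 × compilation + 0.6 × test pass rate."""
    pass_rate = tests_passed / tests_total if tests_total else 0.0
    return 0.4 * (1.0 if compiled else 0.0) + 0.6 * pass_rate
```

A task that compiles but passes no tests still scores 0.4; one that somehow passes tests without compiling would cap at 0.6.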
Results Summary
| Metric | Calor | C# |
|---|---|---|
| Average Score | 1.00 | 1.00 |
| Advantage Ratio | 1.00x (Tie) | - |
| Compilation Rate | 100% | 100% |
| Test Pass Rate | 100% | 100% |
The 1.00x ratio demonstrates that Calor is as learnable as C# - LLMs can write correct code in both languages equally well.
Contract Usage (Qualitative)
While task completion scores are equal, examining the generated code shows Calor consistently produces contracts:
| Requirement | Calor | C# |
|---|---|---|
| "must not be zero" | §Q (!= b 0) | if (b == 0) throw... |
| "result is never negative" | §S (>= result 0) | (no equivalent) |
| "must be between X and Y" | §Q (>= n X) + §Q (<= n Y) | Two guard clauses |
Calor includes postconditions that C# cannot express - a qualitative advantage not captured by task completion scores.
Running Locally
```shell
# Run LLM benchmark with neutral prompts
dotnet run --project tests/Calor.Evaluation -- llm-tasks \
  --manifest tests/Calor.Evaluation/Tasks/task-manifest-neutral.json \
  --verbose

# Refresh cache (re-run all tasks)
dotnet run --project tests/Calor.Evaluation -- llm-tasks \
  --manifest tests/Calor.Evaluation/Tasks/task-manifest-neutral.json \
  --refresh-cache
```

Cost Controls
- Caching: Results cached to avoid redundant API calls
- Budget caps: Maximum spend per run (~$0.76 for 50 tasks)
- Estimated cost displayed before running
LLM-Based Evaluation
Multi-LLM Validation
To ensure unbiased results, we test with multiple LLMs:
| Provider | Model | Purpose |
|---|---|---|
| Anthropic | Claude 3.5 Sonnet | Primary evaluation |
| OpenAI | GPT-4o | Cross-validation |
Comprehension Questions
LLMs answer questions about code understanding:
- What is the main purpose of this code?
- What are the input parameters and their constraints?
- What does this function return?
- What side effects does this code have?
- What would happen if [edge case]?
Scoring
- Correctness of answers (0-1) graded against ground truth
- Tokens used to formulate answer
- Consistency across multiple runs
- Cross-model agreement (higher = more reliable)
Running the Evaluation
Basic Run
```shell
# JSON output (default)
dotnet run --project tests/Calor.Evaluation -- run --output report.json

# Markdown output
dotnet run --project tests/Calor.Evaluation -- run --format markdown --output report.md

# Website dashboard format
dotnet run --project tests/Calor.Evaluation -- run --format website --output results.json

# HTML dashboard
dotnet run --project tests/Calor.Evaluation -- run --format html --output dashboard.html
```

Statistical Analysis
```shell
# Run with 30 statistical samples
dotnet run --project tests/Calor.Evaluation -- run --statistical --runs 30

# Custom run count
dotnet run --project tests/Calor.Evaluation -- run --statistical --runs 50
```

Specific Metrics
```shell
# Run only specific categories
dotnet run --project tests/Calor.Evaluation -- run \
  --category Comprehension \
  --category EditPrecision \
  --category RefactoringStability
```

Interpreting Results
Winner Determination
For each metric:
- Higher is better (Comprehension, Error Detection, etc.): Calor wins if ratio > 1.0
- Lower is better (Token Economics): C# wins if ratio < 1.0
Statistical Significance
With statistical mode enabled:
- p < 0.05: Result is statistically significant
- Cohen's d: Effect size interpretation
  - d < 0.2: Negligible
  - d < 0.5: Small
  - d < 0.8: Medium
  - d ≥ 0.8: Large
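The interpretation bands above map to a simple classifier (a sketch):

```python
def effect_size_label(d: float) -> str:
    """Map Cohen's d onto the interpretation bands used by the framework."""
    d = abs(d)
    if d < 0.2:
        return "Negligible"
    if d < 0.5:
        return "Small"
    if d < 0.8:
        return "Medium"
    return "Large"
```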
The Tradeoff
Results consistently show:
- Calor wins on comprehension metrics (structure, contracts, effects, refactoring)
- C# wins on efficiency metrics (tokens, density)
This reflects the designed tradeoff: explicit semantics require more tokens but enable better agent reasoning and safer refactoring.
CI/CD Integration
Automated Benchmarks
Benchmarks run automatically via GitHub Actions:
- On every push: Quick validation run
- Weekly (Sunday): Full statistical analysis with LLM evaluation
- Manual trigger: On-demand with custom parameters
Regression Detection
The CI pipeline checks for:
- Any metric dropping by more than 10%
- Statistical significance of changes
- Failing compilation in any benchmark
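The 10% drop check can be sketched as follows (a minimal illustration, not the pipeline's actual code):

```python
def regressed(baseline: float, current: float, threshold: float = 0.10) -> bool:
    """True when a metric drops by more than `threshold` relative to baseline."""
    return baseline > 0 and (baseline - current) / baseline > threshold
```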
Limitations
- Corpus size: 40 programs may not cover all patterns
- C# baseline: Other languages might perform differently
- Static analysis: Some metrics don't capture runtime behavior
- LLM variance: Model updates can affect evaluation scores
Next
- Results - Detailed results table
- Individual Metrics - Deep dive into each metric