Correctness

Category: Code Correctness
Result: 1.0x (tie); both languages achieve 100%
What it measures: Bug prevention through correct edge case handling

Overview

The Correctness Benchmark measures how well code handles edge cases and prevents bugs, using the same prompts, tests, and scoring for Calor and C#. Unlike feature-specific benchmarks that measure mechanisms, this benchmark measures outcomes: did the code produce the correct result?

Key principle: A bug prevented is a bug prevented, regardless of HOW it was prevented.

Calor contracts catching a bug = pass
C# guard clauses catching a bug = pass
Test pass/fail is the only metric

Both languages can achieve 100% on this benchmark through correct code.

What This Benchmark Answers

Primary Question: Which language produces code that handles edge cases correctly?

Scenario	Calor Approach	C# Approach
Null input	Precondition `§Q (!= x null)`	Guard clause `x ?? throw`
Division by zero	Precondition `§Q (!= b 0)`	Guard clause `if (b == 0)`
Empty array	Precondition `§Q (> len 0)`	`if (arr.Length == 0)`
Integer overflow	Postcondition on result	`checked { }` block

Both approaches are valid. The benchmark measures outcomes, not mechanisms.
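
As a rough illustration of the C# column, the guard mechanisms might look like the sketch below. The class and method names (`GuardExamples`, `NameLength`, `Divide`, `Add`) are hypothetical and are not taken from the benchmark tasks.

```csharp
using System;

// Hypothetical helpers illustrating the C# mechanisms named in the table.
public static class GuardExamples
{
    // Null input: the null-coalescing throw rejects null before any work is done.
    public static int NameLength(string name)
    {
        string value = name ?? throw new ArgumentNullException(nameof(name));
        return value.Length;
    }

    // Division by zero: an explicit check replaces the runtime DivideByZeroException.
    public static int Divide(int a, int b)
    {
        if (b == 0)
            throw new ArgumentException("Divisor cannot be zero", nameof(b));
        return a / b;
    }

    // Integer overflow: checked arithmetic throws OverflowException instead of wrapping.
    public static int Add(int a, int b)
    {
        checked
        {
            return a + b;
        }
    }
}
```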

Methodology

Pure Pass/Fail Scoring

Unlike other benchmarks that weight multiple factors, Correctness uses simple pass/fail:

Plain Text

Score = Tests Passed / Total Tests

No style points. No partial credit. Either the code produces the correct output or it doesn't.

Test Case Categories

Each task includes two types of tests:

Normal cases - Standard inputs that both languages should handle
Edge cases - Boundary conditions that often cause bugs

JSON

{
  "id": "correct-001",
  "name": "SafeDivide",
  "testCases": [
    { "input": [10, 2], "expected": 5 },
    { "input": [100, 10], "expected": 10 },
    { "input": [10, 0], "expected": 0, "isEdgeCase": true },
    { "input": [0, 5], "expected": 0, "isEdgeCase": true }
  ]
}
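
For illustration, a C# implementation that would pass all four sample test cases might look like this. It assumes the task defines division by zero as returning the default `0`, as the third test case expects. The class name `SafeMath` and the method signature are hypothetical.

```csharp
// Hypothetical implementation of the SafeDivide task shown above.
public static class SafeMath
{
    // Returns a / b, or 0 when b is zero (the default the sample test cases expect).
    public static int SafeDivide(int a, int b)
    {
        if (b == 0)
            return 0;
        return a / b;
    }
}
```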

Scoring Breakdown

Metric	What It Measures
Overall Score	All tests passed / All tests
Edge Case Score	Edge case tests passed / Edge case tests
Normal Case Score	Normal tests passed / Normal tests

The edge case score is particularly important as it reveals how well each language handles boundary conditions.
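
The three scores can be sketched as the same ratio applied to different subsets of the test results. This is an illustrative sketch, not the code in `CorrectnessCalculator.cs`; the `TestResult` record and `ScoreCalculator` class are assumed names.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical result record: one entry per executed test case.
public record TestResult(bool Passed, bool IsEdgeCase);

public static class ScoreCalculator
{
    // Tests passed / total tests. An empty set scores 0 to avoid dividing by zero.
    public static double Score(IEnumerable<TestResult> results)
    {
        var list = results.ToList();
        return list.Count == 0 ? 0.0 : (double)list.Count(r => r.Passed) / list.Count;
    }

    // Same ratio, restricted to tests flagged as edge cases.
    public static double EdgeCaseScore(IEnumerable<TestResult> results) =>
        Score(results.Where(r => r.IsEdgeCase));

    // Same ratio, restricted to tests not flagged as edge cases.
    public static double NormalCaseScore(IEnumerable<TestResult> results) =>
        Score(results.Where(r => !r.IsEdgeCase));
}
```

For example, with two passing normal tests, one passing edge case, and one failing edge case, the overall score is 0.75, the edge case score is 0.5, and the normal case score is 1.0.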

Task Categories

Null Handling (5 tasks)

Tests proper null checks and default returns:

Null string processing
Null array handling
Optional parameter defaults
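
A minimal sketch of what these three kinds of checks can look like in C#, assuming the task asks for a default return rather than an exception. All names here are hypothetical and not taken from the task manifest.

```csharp
using System;

public static class NullHandling
{
    // Null string processing: null, empty, and whitespace-only input all count as 0 words.
    public static int CountWords(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
            return 0;
        return text.Split(new[] { ' ', '\t', '\n' }, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    // Null array handling: a null array sums to 0, the same as an empty one.
    public static int Sum(int[] values)
    {
        if (values == null)
            return 0;
        int total = 0;
        foreach (int v in values)
            total += v;
        return total;
    }

    // Optional parameter defaults: an omitted or null name falls back to "world".
    public static string Greet(string name = null) => $"Hello, {name ?? "world"}!";
}
```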

Arithmetic Safety (5 tasks)

Tests numeric edge cases:

Division by zero
Integer overflow
Negative number handling
Boundary arithmetic
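
Two of these cases are easy to get wrong in C#, so a short sketch may help. The helper names `Mod` and `SafeAbs` are hypothetical and only illustrate the kind of behavior such tasks probe.

```csharp
using System;

public static class ArithmeticEdgeCases
{
    // Negative number handling: C#'s % keeps the sign of the dividend (-7 % 3 == -1),
    // so a non-negative result needs an adjustment.
    public static int Mod(int value, int modulus)
    {
        if (modulus <= 0)
            throw new ArgumentOutOfRangeException(nameof(modulus));
        int r = value % modulus;
        return r < 0 ? r + modulus : r;
    }

    // Boundary arithmetic: Math.Abs(int.MinValue) throws OverflowException because
    // 2147483648 does not fit in an int. Widening to long first avoids that.
    public static long SafeAbs(int value) => Math.Abs((long)value);
}
```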

String Processing (5 tasks)

Tests string edge cases:

Empty string handling
Whitespace-only strings
Very long strings
Unicode edge cases
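
As an illustration of why these inputs cause bugs, the sketch below normalizes blank strings and counts user-visible characters rather than UTF-16 code units. The helper names are hypothetical, and the specific behavior each benchmark task expects may differ.

```csharp
using System.Globalization;

public static class StringEdgeCases
{
    // Empty and whitespace-only strings: null, "", and "   " all normalize to "".
    public static string NormalizeBlank(string s) =>
        string.IsNullOrWhiteSpace(s) ? string.Empty : s.Trim();

    // Unicode edge case: string.Length counts UTF-16 code units, so a character
    // outside the Basic Multilingual Plane counts as 2. StringInfo counts text elements.
    // The caller is expected to pass a non-null string.
    public static int TextElementCount(string s) =>
        new StringInfo(s).LengthInTextElements;
}
```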

Collection Operations (5 tasks)

Tests collection edge cases:

Empty array/list operations
Single-element collections
Index boundary conditions
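
A sketch of how these cases might be handled in C#, assuming a task that wants a default value for empty input. `TryGetAt` and `Average` are hypothetical helpers, not benchmark tasks.

```csharp
using System.Collections.Generic;

public static class CollectionEdgeCases
{
    // Index boundary conditions: valid indexes are 0 through Count - 1, so both ends are checked.
    public static bool TryGetAt(IReadOnlyList<int> items, int index, out int value)
    {
        if (items != null && index >= 0 && index < items.Count)
        {
            value = items[index];
            return true;
        }
        value = default;
        return false;
    }

    // Empty collections: the average of zero items is defined as 0 in this sketch.
    // A single-element collection returns that element.
    public static double Average(IReadOnlyList<int> items)
    {
        if (items == null || items.Count == 0)
            return 0.0;
        long sum = 0; // long avoids overflow while summing ints
        foreach (int item in items)
            sum += item;
        return (double)sum / items.Count;
    }
}
```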

Boundary Conditions (5 tasks)

Tests general boundary handling:

Min/max value inputs
Zero vs non-zero behavior
Off-by-one prevention
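
The sketch below shows two classic boundary bugs and one way to avoid each. The helper names are hypothetical and the examples are illustrative rather than drawn from the task manifest.

```csharp
public static class BoundaryEdgeCases
{
    // Min/max value inputs: (low + high) / 2 can overflow int for large inputs.
    // Doing the addition in long keeps the intermediate sum in range.
    public static int Midpoint(int low, int high) => (int)(((long)low + high) / 2);

    // Off-by-one prevention: an inclusive range [first, last] holds last - first + 1 values.
    // The result is a long because the full int range holds more values than int can count.
    public static long InclusiveCount(int first, int last) =>
        last < first ? 0 : (long)last - first + 1;
}
```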

Example Task

Task: Find Maximum Value

Prompt: Write a function that finds the maximum value in an array of integers.

Test Cases:

Input	Expected	Type
`[1, 5, 3, 9, 2]`	`9`	Normal
`[42]`	`42`	Edge (single element)
`[]`	`0` or exception	Edge (empty array)
`[-5, -1, -10]`	`-1`	Edge (all negative)

Calor Implementation:

Plain Text

§M{m001:Math}
§F{f001:FindMax:pub}
  §I{[i32]:arr}
  §O{i32}
  §Q (!= arr null)             // First: null check
  §Q (> (len arr) 0)           // Then: requires non-empty array

  §B{max} §IDX arr 0
  §B{i} 1
  §WH{wh1} (< i (len arr))
    §B{current} §IDX arr i
    §IF{if1} (> current max)
      §ASSIGN max current
    §/I{if1}
    §ASSIGN i (+ i 1)
  §/WH{wh1}
  §R max
§/F{f001}
§/M{m001}

C# Implementation:

public static int FindMax(int[] arr)
{
    if (arr == null)
        throw new ArgumentNullException(nameof(arr));
    if (arr.Length == 0)
        throw new ArgumentException("Array cannot be empty", nameof(arr));

    int max = arr[0];
    for (int i = 1; i < arr.Length; i++)
    {
        if (arr[i] > max)
            max = arr[i];
    }
    return max;
}

Both implementations achieve 100% correctness. The benchmark recognizes both Calor's ContractViolationException and C#'s ArgumentException as valid contract enforcement.
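
For an edge case whose expected result is listed as a value or an exception, the acceptance rule could be sketched as follows. This is an assumption about how such a check might be written, not the benchmark runner's actual code. `ContractViolationException` is declared locally as a stand-in for Calor's type so the sketch compiles on its own, and `EdgeCaseCheck` is a hypothetical name.

```csharp
using System;

// Stand-in for Calor's exception type, defined here only so the sketch compiles.
public class ContractViolationException : Exception { }

public static class EdgeCaseCheck
{
    // Passes when the call returns the expected value, or when it rejects the input
    // with an exception type treated as contract enforcement.
    // Any other exception propagates to the caller and counts as a failure.
    public static bool PassesOrRejects(Func<int> call, int expected)
    {
        try
        {
            return call() == expected;
        }
        catch (ArgumentException) // also covers ArgumentNullException
        {
            return true;
        }
        catch (ContractViolationException)
        {
            return true;
        }
    }
}
```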

Why This Benchmark Is Fair

No Language Bias

Same prompts - Both languages receive identical task descriptions
Same tests - Identical test cases run against both implementations
Same scoring - Pure pass/fail, no subjective evaluation

Real-World Relevance

Edge cases tested are common sources of production bugs:

Null pointer exceptions
Division by zero
Index out of bounds
Empty collection errors
Integer overflow

Developers encounter these daily regardless of language.

Both Languages Can Win

Scenario	Calor Wins When	C# Wins When
Null handling	Precondition catches it	Guard clause catches it
Empty array	Contract requires non-empty	`.Length` check prevents error
Overflow	Postcondition bounds result	`checked` block catches it

The benchmark rewards correctness, not specific mechanisms.

Benchmark Execution

Running Locally

Bash

# Run the correctness benchmark
dotnet run --project tests/Calor.Evaluation -- run \
  --category Correctness \
  --verbose

# View detailed results
jq '.metrics.Correctness' report.json

Using the LLM-Based Runner

To compare actual LLM-generated code (requires an API key):

Bash

# Run with Claude provider
ANTHROPIC_API_KEY=your-key dotnet run --project tests/Calor.Evaluation -- \
  correctness-benchmark \
  --manifest tests/Calor.Evaluation/Tasks/task-manifest-correctness.json \
  --verbose

# Dry run to estimate costs
dotnet run --project tests/Calor.Evaluation -- correctness-benchmark --dry-run

Current Results

Summary

Metric	Calor	C#
Overall Score	100%	100%
Edge Case Score	100%	100%
Normal Case Score	100%	100%
Advantage Ratio	1.0x	1.0x

By Category

Category	Calor Score	C# Score	Winner
Precondition Enforcement	100%	100%	Tie
Postcondition Guarantee	100%	100%	Tie
Combined Contracts	100%	100%	Tie

Key Finding

Both languages achieve equal correctness when benchmarked fairly:

Calor: Preconditions (§Q) and postconditions (§S) provide contract-based protection
C#: Guard clauses (ArgumentException, ArgumentNullException) provide equivalent protection

The benchmark accepts both mechanisms as valid contract enforcement. This ensures a fair comparison of actual code quality rather than language-specific features.

Transparency

Reproducibility

All benchmark data is available:

Task definitions: tests/Calor.Evaluation/Tasks/task-manifest-correctness.json
Runner: tests/Calor.Evaluation/LlmTasks/CorrectnessBenchmarkRunner.cs
Calculator: tests/Calor.Evaluation/Metrics/CorrectnessCalculator.cs

Limitations

Estimation mode - Without an LLM provider, the benchmark falls back to heuristic scoring
Sample size - 25 tasks across 5 categories
LLM variability - Different runs may produce different generated code

What This Benchmark Does NOT Measure

Code style or readability
Execution performance
Token efficiency
Contract enforcement quality (see Safety benchmark)

This benchmark measures only one thing: does the code produce correct results?

Comparison with Safety Benchmark

Aspect	Correctness	Safety
Focus	Correct output	Bug detection
Scoring	Pass/fail only	Error quality weighted
Edge cases	Return correct default or reject the input	Throw informative exception
C# can win	Yes, with guard clauses	Yes, with good exceptions

Use Correctness to evaluate "does it work?" and Safety to evaluate "does it catch bugs well?"

See Also

Safety - Contract enforcement benchmark
Task Completion - LLM code generation success
Methodology - Full benchmark methodology