Correctness

Category: Code Correctness
Result: 1.0x (tie); both languages achieve 100%
What it measures: Bug prevention through correct edge case handling


Overview

The Correctness Benchmark is a fair, unbiased comparison that measures how well code handles edge cases and prevents bugs. Unlike feature-specific benchmarks that measure mechanisms, this benchmark measures outcomes: did the code produce the correct result?

Key principle: A bug prevented is a bug prevented, regardless of HOW it was prevented.

  • Calor contracts catching a bug = pass
  • C# guard clauses catching a bug = pass
  • Test pass/fail is the only metric

Both languages can achieve 100% on this benchmark through correct code.


What This Benchmark Answers

Primary Question: Which language produces code that handles edge cases correctly?

| Scenario         | Calor Approach              | C# Approach              |
|------------------|-----------------------------|--------------------------|
| Null input       | Precondition §Q (!= x null) | Guard clause x ?? throw  |
| Division by zero | Precondition §Q (!= b 0)    | Guard clause if (b == 0) |
| Empty array      | Precondition §Q (> len 0)   | if (arr.Length == 0)     |
| Integer overflow | Postcondition on result     | checked { } block        |

Both approaches are valid. The benchmark measures outcomes, not mechanisms.
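
As a minimal sketch of the C# column, the class below shows one guard per scenario. The names GuardSketch, TextLength, Divide, First, and Add are hypothetical and do not come from the benchmark tasks. Whether a given task expects an exception or a default value is decided by its test cases, so throwing here is an assumption.

```csharp
using System;

// Hypothetical illustration; not taken from the benchmark's task set.
public static class GuardSketch
{
    // Null input: `x ?? throw` rejects null before any member access.
    public static int TextLength(string text)
    {
        string value = text ?? throw new ArgumentNullException(nameof(text));
        return value.Length;
    }

    // Division by zero: explicit check before dividing.
    public static int Divide(int a, int b)
    {
        if (b == 0)
            throw new ArgumentException("Divisor must be non-zero.", nameof(b));
        return a / b;
    }

    // Empty array: length check before indexing.
    public static int First(int[] arr)
    {
        if (arr == null)
            throw new ArgumentNullException(nameof(arr));
        if (arr.Length == 0)
            throw new ArgumentException("Array cannot be empty.", nameof(arr));
        return arr[0];
    }

    // Integer overflow: checked arithmetic throws OverflowException
    // instead of silently wrapping around.
    public static int Add(int a, int b)
    {
        checked { return a + b; }
    }
}
```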


Methodology

Pure Pass/Fail Scoring

Unlike other benchmarks that weight multiple factors, Correctness uses simple pass/fail:

Plain Text
Score = Tests Passed / Total Tests

No style points. No partial credit. Either the code produces the correct output or it doesn't.

Test Case Categories

Each task includes two types of tests:

  1. Normal cases - Standard inputs that both languages should handle
  2. Edge cases - Boundary conditions that often cause bugs
JSON
{
  "id": "correct-001",
  "name": "SafeDivide",
  "testCases": [
    { "input": [10, 2], "expected": 5 },
    { "input": [100, 10], "expected": 10 },
    { "input": [10, 0], "expected": 0, "isEdgeCase": true },
    { "input": [0, 5], "expected": 0, "isEdgeCase": true }
  ]
}
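
A task definition in this shape could be loaded with System.Text.Json roughly as follows. The TaskDefinition and TestCase classes and the TaskLoader helper are hypothetical, and typing input as int[] is an assumption based only on the SafeDivide example. The actual manifest may carry more fields or non-integer inputs.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical model of one test case; fields mirror the JSON example only.
public sealed class TestCase
{
    public int[] Input { get; set; } = Array.Empty<int>();
    public int Expected { get; set; }
    public bool IsEdgeCase { get; set; }   // absent in JSON means false
}

// Hypothetical model of one task entry.
public sealed class TaskDefinition
{
    public string Id { get; set; } = "";
    public string Name { get; set; } = "";
    public List<TestCase> TestCases { get; set; } = new List<TestCase>();
}

public static class TaskLoader
{
    // The JSON uses camelCase keys, so match property names case-insensitively.
    private static readonly JsonSerializerOptions Options =
        new JsonSerializerOptions { PropertyNameCaseInsensitive = true };

    public static TaskDefinition Parse(string json) =>
        JsonSerializer.Deserialize<TaskDefinition>(json, Options)
        ?? throw new JsonException("Task definition was null.");
}
```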

Scoring Breakdown

| Metric            | What It Measures                        |
|-------------------|-----------------------------------------|
| Overall Score     | All tests passed / All tests            |
| Edge Case Score   | Edge case tests passed / Edge case tests |
| Normal Case Score | Normal tests passed / Normal tests      |

The edge case score is particularly important as it reveals how well each language handles boundary conditions.
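
The three ratios can be computed from a flat list of test outcomes. The sketch below is illustrative only: TestOutcome, ScoreBreakdown, and ScoreSketch are hypothetical names, not the repository's calculator, and returning 0 for an empty subset is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record of one executed test.
public sealed record TestOutcome(bool Passed, bool IsEdgeCase);

// Hypothetical container for the three scores, each in the range 0..1.
public sealed record ScoreBreakdown(double Overall, double EdgeCase, double NormalCase);

public static class ScoreSketch
{
    // Score = Tests Passed / Total Tests; an empty set scores 0 (assumption).
    public static double Ratio(int passed, int total) =>
        total == 0 ? 0.0 : (double)passed / total;

    public static ScoreBreakdown Compute(IReadOnlyCollection<TestOutcome> outcomes)
    {
        var edge = outcomes.Where(o => o.IsEdgeCase).ToList();
        var normal = outcomes.Where(o => !o.IsEdgeCase).ToList();

        return new ScoreBreakdown(
            Ratio(outcomes.Count(o => o.Passed), outcomes.Count),
            Ratio(edge.Count(o => o.Passed), edge.Count),
            Ratio(normal.Count(o => o.Passed), normal.Count));
    }
}
```

Keeping the edge-case ratio separate makes a boundary-condition failure visible even when most normal cases pass.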


Task Categories

Null Handling (5 tasks)

Tests proper null checks and default returns:

  • Null string processing
  • Null array handling
  • Optional parameter defaults

Arithmetic Safety (5 tasks)

Tests numeric edge cases:

  • Division by zero
  • Integer overflow
  • Negative number handling
  • Boundary arithmetic

String Processing (5 tasks)

Tests string edge cases:

  • Empty string handling
  • Whitespace-only strings
  • Very long strings
  • Unicode edge cases
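
To make these string cases concrete, the following sketch handles null, empty, and whitespace-only input and counts user-perceived characters rather than UTF-16 code units. StringSketch and its methods are hypothetical examples, not benchmark tasks, and returning 0 for null input is an assumption.

```csharp
using System;
using System.Globalization;

// Hypothetical illustration of the string edge cases listed above.
public static class StringSketch
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Null, empty, and whitespace-only input all count as zero words.
    public static int WordCount(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
            return 0;
        return text.Split(Whitespace, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    // string.Length counts UTF-16 code units; StringInfo counts text elements,
    // so a base letter plus a combining mark is reported as one character.
    public static int VisibleLength(string text) =>
        text == null ? 0 : new StringInfo(text).LengthInTextElements;
}
```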

Collection Operations (5 tasks)

Tests collection edge cases:

  • Empty array/list operations
  • Single-element collections
  • Index boundary conditions

Boundary Conditions (5 tasks)

Tests general boundary handling:

  • Min/max value inputs
  • Zero vs non-zero behavior
  • Off-by-one prevention
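
Two boundary bugs of this kind are sketched below: negating the minimum integer and overflowing a midpoint calculation. BoundarySketch and its methods are hypothetical and are not drawn from the task manifest.

```csharp
using System;

// Hypothetical illustration of min/max boundary handling.
public static class BoundarySketch
{
    // Math.Abs(int.MinValue) throws OverflowException because the result
    // does not fit in an int, so widen to long before taking the absolute value.
    public static long AbsWide(int value) => Math.Abs((long)value);

    // (low + high) / 2 can overflow for large indices; low + (high - low) / 2
    // cannot when 0 <= low <= high.
    public static int Midpoint(int low, int high)
    {
        if (low < 0 || high < low)
            throw new ArgumentOutOfRangeException(nameof(low));
        return low + (high - low) / 2;
    }
}
```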

Example Task

Task: Find Maximum Value

Prompt: Write a function that finds the maximum value in an array of integers.

Test Cases:

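| Input           | Expected       | Type                  |
|-----------------|----------------|-----------------------|
| [1, 5, 3, 9, 2] | 9              | Normal                |
| [42]            | 42             | Edge (single element) |
| []              | 0 or exception | Edge (empty array)    |
| [-5, -1, -10]   | -1             | Edge (all negative)   |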

Calor Implementation:

Plain Text
§M{m001:Math}
§F{f001:FindMax:pub}
  §I{[i32]:arr}
  §O{i32}
  §Q (!= arr null)             // First: null check
  §Q (> (len arr) 0)           // Then: requires non-empty array

  §B{max} §IDX arr 0
  §B{i} 1
  §WH{wh1} (< i (len arr))
    §B{current} §IDX arr i
    §IF{if1} (> current max)
      §ASSIGN max current
    §/I{if1}
    §ASSIGN i (+ i 1)
  §/WH{wh1}
  §R max
§/F{f001}
§/M{m001}

C# Implementation:

C#
public static int FindMax(int[] arr)
{
    if (arr == null)
        throw new ArgumentNullException(nameof(arr));
    if (arr.Length == 0)
        throw new ArgumentException("Array cannot be empty", nameof(arr));

    int max = arr[0];
    for (int i = 1; i < arr.Length; i++)
    {
        if (arr[i] > max)
            max = arr[i];
    }
    return max;
}

Both implementations achieve 100% correctness. The benchmark recognizes both Calor's ContractViolationException and C#'s ArgumentException as valid contract enforcement.
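
A harness could accept either outcome for a row such as the empty-array case along these lines. This is a sketch under assumptions, not the benchmark's runner: OutcomeCheck and its parameters are hypothetical, and the accepted exception types are supplied by the caller rather than hard-coded.

```csharp
using System;
using System.Linq;

// Hypothetical pass/fail check for a single test case.
public static class OutcomeCheck
{
    // Passes when the call returns the expected value, or when it throws
    // an exception whose type is one of the accepted contract exceptions.
    public static bool Passes(Func<int> call, int expected, params Type[] acceptedExceptions)
    {
        try
        {
            return call() == expected;
        }
        catch (Exception ex)
        {
            // IsInstanceOfType also matches derived types, so accepting
            // ArgumentException covers ArgumentNullException as well.
            return acceptedExceptions.Any(t => t.IsInstanceOfType(ex));
        }
    }
}
```

Passing typeof(ArgumentException) accepts both guard-clause exceptions thrown by the C# implementation above; a Calor run would pass its contract exception type instead.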


Why This Benchmark Is Fair

No Language Bias

  1. Same prompts - Both languages receive identical task descriptions
  2. Same tests - Identical test cases run against both implementations
  3. Same scoring - Pure pass/fail, no subjective evaluation

Real-World Relevance

Edge cases tested are common sources of production bugs:

  • Null pointer exceptions
  • Division by zero
  • Index out of bounds
  • Empty collection errors
  • Integer overflow

Developers encounter these daily regardless of language.

Both Languages Can Win

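| Scenario      | Calor Wins When             | C# Wins When                 |
|---------------|-----------------------------|------------------------------|
| Null handling | Precondition catches it     | Guard clause catches it      |
| Empty array   | Contract requires non-empty | .Length check prevents error |
| Overflow      | Postcondition bounds result | checked block catches it     |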

The benchmark rewards correctness, not specific mechanisms.


Benchmark Execution

Running Locally

Bash
# Run the correctness benchmark
dotnet run --project tests/Calor.Evaluation -- run \
  --category Correctness \
  --verbose

# View detailed results
jq '.metrics.Correctness' report.json

Using the LLM-Based Runner

To compare actual generated code (requires an API key):

Bash
# Run with Claude provider
ANTHROPIC_API_KEY=your-key dotnet run --project tests/Calor.Evaluation -- \
  correctness-benchmark \
  --manifest tests/Calor.Evaluation/Tasks/task-manifest-correctness.json \
  --verbose

# Dry run to estimate costs
dotnet run --project tests/Calor.Evaluation -- correctness-benchmark --dry-run

Current Results

Summary

| Metric            | Calor | C#   |
|-------------------|-------|------|
| Overall Score     | 100%  | 100% |
| Edge Case Score   | 100%  | 100% |
| Normal Case Score | 100%  | 100% |
| Advantage Ratio   | 1.0x  | 1.0x |

By Category

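| Category                 | Calor Score | C# Score | Winner |
|--------------------------|-------------|----------|--------|
| Precondition Enforcement | 100%        | 100%     | Tie    |
| Postcondition Guarantee  | 100%        | 100%     | Tie    |
| Combined Contracts       | 100%        | 100%     | Tie    |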

Key Finding

Both languages achieve equal correctness when benchmarked fairly:

  • Calor: Preconditions (§Q) and postconditions (§S) provide contract-based protection
  • C#: Guard clauses (ArgumentException, ArgumentNullException) provide equivalent protection

The benchmark accepts both mechanisms as valid contract enforcement. This ensures a fair comparison of actual code quality rather than language-specific features.


Transparency

Reproducibility

All benchmark data is available:

  • Task definitions: tests/Calor.Evaluation/Tasks/task-manifest-correctness.json
  • Runner: tests/Calor.Evaluation/LlmTasks/CorrectnessBenchmarkRunner.cs
  • Calculator: tests/Calor.Evaluation/Metrics/CorrectnessCalculator.cs

Limitations

  • Estimation mode - Without an LLM provider, the benchmark uses heuristic scoring
  • Sample size - 25 tasks across 5 categories
  • LLM variability - Different runs may produce different generated code

What This Benchmark Does NOT Measure

  • Code style or readability
  • Execution performance
  • Token efficiency
  • Contract enforcement quality (see Safety benchmark)

This benchmark only measures: Does the code produce correct results?


Comparison with Safety Benchmark

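| Aspect     | Correctness             | Safety                      |
|------------|-------------------------|-----------------------------|
| Focus      | Correct output          | Bug detection               |
| Scoring    | Pass/fail only          | Error quality weighted      |
| Edge cases | Return correct default  | Throw informative exception |
| C# can win | Yes, with guard clauses | Yes, with good exceptions   |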

Use Correctness to evaluate "does it work?" and Safety to evaluate "does it catch bugs well?"

