Correctness
Category: Code Correctness
Result: 1.0x (tie), both languages achieve 100%
What it measures: Bug prevention through correct edge case handling
Overview
The Correctness Benchmark is a fair, unbiased comparison that measures how well code handles edge cases and prevents bugs. Unlike feature-specific benchmarks that measure mechanisms, this benchmark measures outcomes: did the code produce the correct result?
Key principle: A bug prevented is a bug prevented, regardless of HOW it was prevented.
- Calor contracts catching a bug = pass
- C# guard clauses catching a bug = pass
- Test pass/fail is the only metric
Both languages can achieve 100% on this benchmark through correct code.
What This Benchmark Answers
Primary Question: Which language produces code that handles edge cases correctly?
| Scenario | Calor Approach | C# Approach |
|---|---|---|
| Null input | Precondition §Q (!= x null) | Guard clause x ?? throw |
| Division by zero | Precondition §Q (!= b 0) | Guard clause if (b == 0) |
| Empty array | Precondition §Q (> len 0) | if (arr.Length == 0) |
| Integer overflow | Postcondition on result | checked { } block |
Both approaches are valid. The benchmark measures outcomes, not mechanisms.
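The C# column of the table can be sketched as follows. This is a minimal illustration, not code from the benchmark tasks: the class name `GuardExamples` and its method names are hypothetical, and the exception types are one reasonable choice among several.

```csharp
using System;

// Hypothetical helper class illustrating the C# guard styles from the table.
public static class GuardExamples
{
    // Null input: reject null before touching the value.
    public static int Length(string text)
    {
        _ = text ?? throw new ArgumentNullException(nameof(text));
        return text.Length;
    }

    // Division by zero: check the divisor explicitly.
    public static int Divide(int a, int b)
    {
        if (b == 0)
            throw new ArgumentException("Divisor must be non-zero.", nameof(b));
        return a / b;
    }

    // Empty array: require at least one element.
    public static int First(int[] arr)
    {
        if (arr == null)
            throw new ArgumentNullException(nameof(arr));
        if (arr.Length == 0)
            throw new ArgumentException("Array cannot be empty.", nameof(arr));
        return arr[0];
    }

    // Integer overflow: a checked block throws OverflowException
    // instead of silently wrapping around.
    public static int Add(int a, int b)
    {
        checked
        {
            return a + b;
        }
    }
}
```

Each method fails fast on the edge case from its table row, which is the behavior the Calor preconditions and postconditions provide declaratively.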
Methodology
Pure Pass/Fail Scoring
Unlike other benchmarks that weight multiple factors, Correctness uses simple pass/fail:
Score = Tests Passed / Total Tests
No style points. No partial credit. Either the code produces the correct output or it doesn't.
Test Case Categories
Each task includes two types of tests:
- Normal cases - Standard inputs that both languages should handle
- Edge cases - Boundary conditions that often cause bugs
{
  "id": "correct-001",
  "name": "SafeDivide",
  "testCases": [
    { "input": [10, 2], "expected": 5 },
    { "input": [100, 10], "expected": 10 },
    { "input": [10, 0], "expected": 0, "isEdgeCase": true },
    { "input": [0, 5], "expected": 0, "isEdgeCase": true }
  ]
}
Scoring Breakdown
| Metric | What It Measures |
|---|---|
| Overall Score | All tests passed / All tests |
| Edge Case Score | Edge case tests passed / Edge case tests |
| Normal Case Score | Normal tests passed / Normal tests |
The edge case score is particularly important as it reveals how well each language handles boundary conditions.
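To make the three scores concrete, here is a rough sketch of how they could be computed for the SafeDivide cases shown earlier. It is not the code in `CorrectnessCalculator.cs`. The `TestCase` record, the `CorrectnessScoring` class, and the rule that an empty group scores 0 are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record mirroring one entry of "testCases" in the task JSON.
public record TestCase(int[] Input, int Expected, bool IsEdgeCase = false);

// Hypothetical scorer; the real calculator may differ.
public static class CorrectnessScoring
{
    // Candidate implementation: returns 0 when the divisor is 0,
    // which is what the SafeDivide test cases expect.
    public static int SafeDivide(int a, int b) => b == 0 ? 0 : a / b;

    // Score = tests passed / total tests, with no partial credit.
    public static double Score(IEnumerable<TestCase> cases, Func<int, int, int> impl)
    {
        var list = cases.ToList();
        if (list.Count == 0) return 0.0; // assumption: an empty group scores 0
        int passed = list.Count(tc => Passes(tc, impl));
        return (double)passed / list.Count;
    }

    // Overall, edge case, and normal case scores for one task.
    public static (double Overall, double EdgeCase, double NormalCase) Breakdown(
        IReadOnlyList<TestCase> cases, Func<int, int, int> impl) =>
        (Score(cases, impl),
         Score(cases.Where(c => c.IsEdgeCase), impl),
         Score(cases.Where(c => !c.IsEdgeCase), impl));

    // These cases all expect a value, so in this sketch a thrown
    // exception simply counts as a failed test.
    private static bool Passes(TestCase tc, Func<int, int, int> impl)
    {
        try { return impl(tc.Input[0], tc.Input[1]) == tc.Expected; }
        catch (Exception) { return false; }
    }
}
```

Under these assumptions, `SafeDivide` passes all four cases and scores 100% on every metric. A naive `(a, b) => a / b` throws on the `[10, 0]` case, so it keeps a 100% normal case score but drops to 50% on edge cases and 75% overall. This is why the edge case score is reported separately.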
Task Categories
Null Handling (5 tasks)
Tests proper null checks and default returns:
- Null string processing
- Null array handling
- Optional parameter defaults
Arithmetic Safety (5 tasks)
Tests numeric edge cases:
- Division by zero
- Integer overflow
- Negative number handling
- Boundary arithmetic
String Processing (5 tasks)
Tests string edge cases:
- Empty string handling
- Whitespace-only strings
- Very long strings
- Unicode edge cases
Collection Operations (5 tasks)
Tests collection edge cases:
- Empty array/list operations
- Single-element collections
- Index boundary conditions
Boundary Conditions (5 tasks)
Tests general boundary handling:
- Min/max value inputs
- Zero vs non-zero behavior
- Off-by-one prevention
Example Task
Task: Find Maximum Value
Prompt: Write a function that finds the maximum value in an array of integers.
Test Cases:
| Input | Expected | Type |
|---|---|---|
| [1, 5, 3, 9, 2] | 9 | Normal |
| [42] | 42 | Edge (single element) |
| [] | 0 or exception | Edge (empty array) |
| [-5, -1, -10] | -1 | Edge (all negative) |
Calor Implementation:
§M{m001:Math}
§F{f001:FindMax:pub}
§I{[i32]:arr}
§O{i32}
§Q (!= arr null) // First: null check
§Q (> (len arr) 0) // Then: requires non-empty array
§B{max} §IDX arr 0
§B{i} 1
§WH{wh1} (< i (len arr))
§B{current} §IDX arr i
§IF{if1} (> current max)
§ASSIGN max current
§/I{if1}
§ASSIGN i (+ i 1)
§/WH{wh1}
§R max
§/F{f001}
§/M{m001}
C# Implementation:
public static int FindMax(int[] arr)
{
    if (arr == null)
        throw new ArgumentNullException(nameof(arr));
    if (arr.Length == 0)
        throw new ArgumentException("Array cannot be empty", nameof(arr));
    int max = arr[0];
    for (int i = 1; i < arr.Length; i++)
    {
        if (arr[i] > max)
            max = arr[i];
    }
    return max;
}
Both implementations achieve 100% correctness. The benchmark recognizes both Calor's ContractViolationException and C#'s ArgumentException as valid contract enforcement.
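One way a harness might evaluate the "0 or exception" empty-array case is sketched below. This is an assumption about how such a check could look, not the benchmark's actual runner. `EdgeCaseCheck` is a hypothetical name, and the `ContractViolationException` class here is a local stand-in for the Calor runtime type, whose real namespace and constructor are not shown in this document.

```csharp
using System;

// Hypothetical stand-in for Calor's ContractViolationException.
public class ContractViolationException : Exception
{
    public ContractViolationException(string message) : base(message) { }
}

public static class EdgeCaseCheck
{
    // Returns true when the call either yields the expected default or is
    // rejected by a recognized contract mechanism.
    public static bool PassesEmptyArrayCase(Func<int[], int> findMax, int expectedDefault = 0)
    {
        try
        {
            return findMax(Array.Empty<int>()) == expectedDefault;
        }
        catch (ArgumentException)          // C# guard clause (covers ArgumentNullException too)
        {
            return true;
        }
        catch (ContractViolationException) // Calor contract failure
        {
            return true;
        }
        catch (Exception)                  // e.g. IndexOutOfRangeException: an unhandled bug
        {
            return false;
        }
    }
}
```

The point of the separate catch clauses is that a deliberate rejection passes while an accidental crash fails. An implementation that indexes `arr[0]` without any guard throws IndexOutOfRangeException and does not pass.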
Why This Benchmark Is Fair
No Language Bias
- Same prompts - Both languages receive identical task descriptions
- Same tests - Identical test cases run against both implementations
- Same scoring - Pure pass/fail, no subjective evaluation
Real-World Relevance
Edge cases tested are common sources of production bugs:
- Null pointer exceptions
- Division by zero
- Index out of bounds
- Empty collection errors
- Integer overflow
Developers encounter these daily regardless of language.
Both Languages Can Win
| Scenario | Calor Wins When | C# Wins When |
|---|---|---|
| Null handling | Precondition catches it | Guard clause catches it |
| Empty array | Contract requires non-empty | .Length check prevents error |
| Overflow | Postcondition bounds result | checked block catches it |
The benchmark rewards correctness, not specific mechanisms.
Benchmark Execution
Running Locally
# Run the correctness benchmark
dotnet run --project tests/Calor.Evaluation -- run \
--category Correctness \
--verbose
# View detailed results
cat report.json | jq '.metrics.Correctness'
Using the LLM-Based Runner
For actual code generation comparison (requires API key):
# Run with Claude provider
ANTHROPIC_API_KEY=your-key dotnet run --project tests/Calor.Evaluation -- \
correctness-benchmark \
--manifest tests/Calor.Evaluation/Tasks/task-manifest-correctness.json \
--verbose
# Dry run to estimate costs
dotnet run --project tests/Calor.Evaluation -- correctness-benchmark --dry-run
Current Results
Summary
| Metric | Calor | C# |
|---|---|---|
| Overall Score | 100% | 100% |
| Edge Case Score | 100% | 100% |
| Normal Case Score | 100% | 100% |
| Advantage Ratio | 1.0x | 1.0x |
By Category
| Category | Calor Score | C# Score | Winner |
|---|---|---|---|
| Precondition Enforcement | 100% | 100% | Tie |
| Postcondition Guarantee | 100% | 100% | Tie |
| Combined Contracts | 100% | 100% | Tie |
Key Finding
Both languages achieve equal correctness when benchmarked fairly:
- Calor: Preconditions (§Q) and postconditions (§S) provide contract-based protection
- C#: Guard clauses (ArgumentException, ArgumentNullException) provide equivalent protection
The benchmark accepts both mechanisms as valid contract enforcement. This ensures a fair comparison of actual code quality rather than language-specific features.
Transparency
Reproducibility
All benchmark data is available:
- Task definitions: tests/Calor.Evaluation/Tasks/task-manifest-correctness.json
- Runner: tests/Calor.Evaluation/LlmTasks/CorrectnessBenchmarkRunner.cs
- Calculator: tests/Calor.Evaluation/Metrics/CorrectnessCalculator.cs
Limitations
- Estimation mode - Without an LLM provider, the benchmark uses heuristic scoring
- Sample size - 25 tasks across 5 categories
- LLM variability - Different runs may produce different generated code
What This Benchmark Does NOT Measure
- Code style or readability
- Execution performance
- Token efficiency
- Contract enforcement quality (see Safety benchmark)
This benchmark only measures: Does the code produce correct results?
Comparison with Safety Benchmark
| Aspect | Correctness | Safety |
|---|---|---|
| Focus | Correct output | Bug detection |
| Scoring | Pass/fail only | Error quality weighted |
| Edge cases | Return correct default | Throw informative exception |
| C# can win | Yes, with guard clauses | Yes, with good exceptions |
Use Correctness to evaluate "does it work?" and Safety to evaluate "does it catch bugs well?"
Next
- Safety - Contract enforcement benchmark
- Task Completion - LLM code generation success
- Methodology - Full benchmark methodology