Task Completion

Category: Task Completion
Result: Tie (1.00x)
What it measures: AI agent ability to learn and write correct code in both languages


Overview

The Task Completion benchmark measures how successfully AI agents can complete programming tasks using Calor vs C#. It uses language-neutral prompts: the same functional requirements are given for both languages, without syntax hints, to fairly measure whether Calor is as learnable and usable as an established language like C#.


What This Benchmark Answers

Primary Question: Can an LLM learn Calor and write correct code as easily as C#?

Answer: Yes. With neutral prompts, both languages achieve identical results (1.00x ratio).

This is a significant finding: it shows that, at least on these tasks, Calor is learnable and does not impose a productivity penalty compared to C#.


Methodology

Language-Neutral Prompts

Unlike benchmarks that give language-specific guidance, our prompts describe what to achieve without dictating how:

Plain Text
Write a public function named Factorial that computes the factorial
of an integer n. The function must only accept non-negative values
of n. The result is always at least 1.

The same prompt goes to both Calor and C#. The LLM must apply its knowledge of each language to produce idiomatic code.
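As an illustration of what "translating requirements to idiomatic code" means (not benchmark output, and in neither benchmark language), a Python sketch that honors both stated requirements of the Factorial prompt might look like:

```python
def factorial(n: int) -> int:
    """Compute n! for non-negative n, per the neutral prompt's requirements."""
    if n < 0:  # "must only accept non-negative values of n"
        raise ValueError("n must be non-negative")
    result = 1
    for i in range(2, n + 1):
        result *= i
    assert result >= 1  # "the result is always at least 1"
    return result
```

Note that both requirement sentences map directly onto checks in the code; the benchmark scores whether each language's generated version does this correctly.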

Why Neutral Prompts Matter

| Approach | Problem |
| --- | --- |
| Calor-biased prompts | "Add §Q precondition..." teaches syntax, doesn't test learning |
| C#-biased prompts | Unfair to Calor, doesn't reflect real usage |
| Neutral prompts | Tests whether LLM can translate requirements to idiomatic code |

Task Corpus

The benchmark includes 50 programming tasks across 4 categories:

| Category | Tasks | Examples |
| --- | --- | --- |
| basic-algorithms | 15 | Factorial, Fibonacci, IsPrime, GCD, Power |
| safety | 10 | SafeDivide, Clamp, SafeModulo, NormalizeScore |
| data-structures | 10 | Sum, Max, Min, Average, Median |
| logic | 15 | BoolToInt, LogicalAnd, IsMultipleOf, SameSign |

Scoring Formula

Each task is scored on compilation and test results:

| Factor | Weight | Description |
| --- | --- | --- |
| Compilation | 40% | Does the generated code compile? |
| Test Cases | 60% | What percentage of test cases pass? |

Final score: 0.4 × compile + 0.6 × tests
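As a sketch, the per-task formula can be computed as follows (illustrative Python, not the actual benchmark harness):

```python
def task_score(compiled: bool, tests_passed: int, tests_total: int) -> float:
    """Score one task: 40% weight on compilation, 60% on test pass rate."""
    compile_part = 1.0 if compiled else 0.0
    test_part = tests_passed / tests_total if tests_total else 0.0
    return 0.4 * compile_part + 0.6 * test_part
```

For example, a task that compiles but passes only half its tests scores 0.4 + 0.3 = 0.7, while code that fails to compile and passes no tests scores 0.0.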


Results

Overall

| Metric | Calor | C# | Advantage |
| --- | --- | --- | --- |
| Average Score | 1.00 | 1.00 | 1.00x (Tie) |
| Compilation Rate | 100% | 100% | Tie |
| Test Pass Rate | 100% | 100% | Tie |

By Category

| Category | Calor Score | C# Score | Advantage |
| --- | --- | --- | --- |
| basic-algorithms | 1.00 | 1.00 | 1.00x |
| safety | 1.00 | 1.00 | 1.00x |
| data-structures | 1.00 | 1.00 | 1.00x |
| logic | 1.00 | 1.00 | 1.00x |

What This Tells Us

Calor is Learnable

The 1.00x ratio demonstrates that:

  • LLMs can learn Calor syntax from skills files
  • Calor code compiles at the same rate as C#
  • Calor implementations pass tests at the same rate as C#

Task Completion ≠ Code Value

This benchmark measures task completion - can correct code be written? It does not measure:

  • Whether the code has better safety properties
  • Whether contracts catch bugs that C# misses
  • Whether the code is more maintainable

These are different questions that require different benchmarks.


Example: Generated Code Comparison

Task: SafeDivide

Prompt: "Write a public function named SafeDivide that divides integer a by integer b. Division by zero must not be allowed."

Generated Calor:

Plain Text
§M{m001:SafeDivideModule}
§F{f001:SafeDivide:pub}
  §I{i32:a}
  §I{i32:b}
  §O{i32}
  §Q (!= b 0)
  §R (/ a b)
§/F{f001}
§/M{m001}

Generated C#:

C#
public static class Functions
{
    public static int SafeDivide(int a, int b)
    {
        if (b == 0)
            throw new ArgumentException("Division by zero");
        return a / b;
    }
}

Both compile and pass tests. Note, however, that Calor expresses the constraint as a contract (§Q (!= b 0)) while C# uses a guard clause. This difference doesn't affect task completion scores, but it does affect the properties of the resulting code.
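To make the contract-vs-guard-clause distinction concrete, here is an illustrative Python analogue (not part of the benchmark; the `require` decorator is a hypothetical stand-in for Calor's §Q):

```python
def require(predicate, message):
    """Hypothetical precondition decorator, standing in for Calor's §Q."""
    def wrap(fn):
        def checked(*args):
            if not predicate(*args):
                raise ValueError(message)
            return fn(*args)
        return checked
    return wrap

# Contract style: the constraint is declared separately from the logic.
@require(lambda a, b: b != 0, "b must not be zero")
def safe_divide_contract(a, b):
    return a // b

# Guard-clause style: the constraint is interleaved with the logic.
def safe_divide_guard(a, b):
    if b == 0:
        raise ValueError("b must not be zero")
    return a // b
```

Both reject b = 0; the difference is that the contract form keeps the constraint declarative and machine-readable, which is what later analysis of generated Calor code relies on.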


Task: NormalizeScore

Prompt: "Write a public function named NormalizeScore that normalizes a score to the 0-100 range using the formula (score * 100) / maxScore. The maxScore must be positive, score must be non-negative, and score must not exceed maxScore."

Generated Calor:

Plain Text
§F{f001:NormalizeScore:pub}
  §I{i32:score}
  §I{i32:maxScore}
  §O{i32}
  §Q (>= score 0)
  §Q (> maxScore 0)
  §Q (<= score maxScore)
  §S (>= result 0)
  §S (<= result 100)
  §R (/ (* score 100) maxScore)
§/F{f001}

Generated C#:

C#
public static int NormalizeScore(int score, int maxScore)
{
    if (maxScore <= 0)
        throw new ArgumentException("maxScore must be positive");
    if (score < 0)
        throw new ArgumentException("score must be non-negative");
    if (score > maxScore)
        throw new ArgumentException("score must not exceed maxScore");

    return (score * 100) / maxScore;
}

Key difference: Calor includes postconditions (§S (>= result 0) and §S (<= result 100)) that guarantee output bounds. The C# version has no equivalent check, so a bug that violates those bounds would go undetected at runtime.
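An illustrative Python analogue of the full contract set (plain `assert` statements stand in for Calor's §S postconditions; this is a sketch, not benchmark output):

```python
def normalize_score(score: int, max_score: int) -> int:
    # Preconditions (analogous to the §Q contracts above)
    if max_score <= 0:
        raise ValueError("maxScore must be positive")
    if score < 0:
        raise ValueError("score must be non-negative")
    if score > max_score:
        raise ValueError("score must not exceed maxScore")
    result = (score * 100) // max_score
    # Postconditions (analogous to §S): the output bounds are checked too
    assert 0 <= result <= 100
    return result
```

If the formula were accidentally changed to produce an out-of-range value, the postcondition would fail immediately; the guard-clause-only version would silently return the wrong result.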


Contract Usage Analysis

When examining generated Calor code, the LLM consistently extracts constraints from requirements into contracts:

| Requirement Language | Generated Contract |
| --- | --- |
| "must only accept non-negative" | §Q (>= n 0) |
| "must not be zero" | §Q (!= n 0) |
| "result is never negative" | §S (>= result 0) |
| "result is always at least 1" | §S (>= result 1) |
| "must be between X and Y" | §Q (>= n X) + §Q (<= n Y) |

This demonstrates that the Contract-First Methodology in the skills file is working - the LLM translates requirement language into executable contracts.


Benchmark Execution

Running Locally

Bash
# Run the LLM benchmark with neutral prompts
dotnet run --project tests/Calor.Evaluation -- llm-tasks \
  --manifest tests/Calor.Evaluation/Tasks/task-manifest-neutral.json \
  --verbose

# Refresh cache (re-run all tasks)
dotnet run --project tests/Calor.Evaluation -- llm-tasks \
  --manifest tests/Calor.Evaluation/Tasks/task-manifest-neutral.json \
  --refresh-cache

Cost Controls

  • Results are cached to avoid redundant API calls
  • Budget caps prevent runaway costs
  • Estimated cost: ~$0.76 for 50 tasks

Model Information

| Property | Value |
| --- | --- |
| Provider | Anthropic |
| Model | Claude Sonnet 4 |
| Model ID | claude-sonnet-4-20250514 |

Transparency

Reproducibility

All benchmark data is available:

  • Task definitions: tests/Calor.Evaluation/Tasks/task-manifest-neutral.json
  • Skills file: tests/Calor.Evaluation/Skills/calor-language-skills.md
  • Results: llm-results.json (includes generated code for every task)

Limitations

  • Results may vary with model updates
  • 50 tasks is a limited sample size
  • Simple tasks may not reveal differences in more complex scenarios

Future: Safety Benchmark

Task completion measures "can correct code be written?" A separate Safety Benchmark would measure "does the code catch more bugs?" by:

  1. Running adversarial tests with invalid inputs
  2. Measuring contract enforcement (Calor) vs exception handling (C#)
  3. Comparing error message quality and precision

This would highlight Calor's contract advantages in scenarios where C# code might silently fail or produce incorrect results.
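A minimal sketch of what such an adversarial harness could look like (hypothetical Python; the function names and structure are assumptions, not the actual benchmark):

```python
def rejection_rate(fn, invalid_inputs):
    """Fraction of invalid inputs rejected with an explicit error."""
    rejected = 0
    for args in invalid_inputs:
        try:
            fn(*args)  # a silent success here means the bug went undetected
        except (ValueError, AssertionError, ZeroDivisionError):
            rejected += 1
    return rejected / len(invalid_inputs)

# Example subject: a guarded division function like the C# SafeDivide above.
def safe_divide(a, b):
    if b == 0:
        raise ValueError("b must not be zero")
    return a // b
```

Running `rejection_rate(safe_divide, [(1, 0), (7, 0)])` yields 1.0; an unguarded implementation that returned a sentinel value instead of raising would score 0.0, which is the kind of difference a Safety Benchmark would surface.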

