Task Completion

Category: Task Completion
Result: Tie (1.00x)
What it measures: AI agent ability to learn and write correct code in both languages


Overview

The Task Completion benchmark measures how successfully AI agents can complete programming tasks using Calor vs C#. It uses language-neutral prompts: the same functional requirements are given for both languages, without syntax hints, to fairly measure whether Calor is as learnable and usable as an established language like C#.


What This Benchmark Answers

Primary Question: Can an LLM learn Calor and write correct code as easily as C#?

Answer: Yes. With neutral prompts, both languages achieve identical results (1.00x ratio).

This is a significant finding: it shows that, at least on these tasks, Calor is learnable and does not impose a productivity penalty compared to C#.


Methodology

Language-Neutral Prompts

Unlike benchmarks that give language-specific guidance, our prompts describe what to achieve without dictating how:

Plain Text
Write a public function named Factorial that computes the factorial
of an integer n. The function must only accept non-negative values
of n. The result is always at least 1.

The same prompt goes to both Calor and C#. The LLM must apply its knowledge of each language to produce idiomatic code.
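As an illustration of what "translating requirements to idiomatic code" means (not benchmark output, and in neither benchmark language), a Python sketch that honors both stated requirements of the Factorial prompt might look like:

```python
def factorial(n: int) -> int:
    """Compute n! for non-negative n, per the neutral prompt's requirements."""
    if n < 0:  # "must only accept non-negative values of n"
        raise ValueError("n must be non-negative")
    result = 1
    for i in range(2, n + 1):
        result *= i
    assert result >= 1  # "the result is always at least 1"
    return result
```

Note that both requirement sentences map directly onto checks in the code; the benchmark scores whether each language's generated version does this correctly.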

Why Neutral Prompts Matter

| Approach | Problem |
| --- | --- |
| Calor-biased prompts | "Add §Q precondition..." teaches syntax, doesn't test learning |
| C#-biased prompts | Unfair to Calor, doesn't reflect real usage |
| Neutral prompts | Tests whether LLM can translate requirements to idiomatic code |

Task Corpus

The benchmark includes 50 programming tasks across 4 categories:

| Category | Tasks | Examples |
| --- | --- | --- |
| basic-algorithms | 15 | Factorial, Fibonacci, IsPrime, GCD, Power |
| safety | 10 | SafeDivide, Clamp, SafeModulo, NormalizeScore |
| data-structures | 10 | Sum, Max, Min, Average, Median |
| logic | 15 | BoolToInt, LogicalAnd, IsMultipleOf, SameSign |

Scoring Formula

Each task is scored on compilation and test results:

| Factor | Weight | Description |
| --- | --- | --- |
| Compilation | 40% | Does the generated code compile? |
| Test Cases | 60% | What percentage of test cases pass? |

Final score: 0.4 × compile + 0.6 × tests
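As a sketch, the per-task formula can be computed as follows (illustrative Python, not the actual benchmark harness):

```python
def task_score(compiled: bool, tests_passed: int, tests_total: int) -> float:
    """Score one task: 40% weight on compilation, 60% on test pass rate."""
    compile_part = 1.0 if compiled else 0.0
    test_part = tests_passed / tests_total if tests_total else 0.0
    return 0.4 * compile_part + 0.6 * test_part
```

For example, a task that compiles but passes only half its tests scores 0.4 + 0.3 = 0.7, while code that fails to compile and passes no tests scores 0.0.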


Results

Overall

| Metric | Calor | C# | Advantage |
| --- | --- | --- | --- |
| Average Score | 1.00 | 1.00 | 1.00x (Tie) |
| Compilation Rate | 100% | 100% | Tie |
| Test Pass Rate | 100% | 100% | Tie |

By Category

| Category | Calor Score | C# Score | Advantage |
| --- | --- | --- | --- |
| basic-algorithms | 1.00 | 1.00 | 1.00x |
| safety | 1.00 | 1.00 | 1.00x |
| data-structures | 1.00 | 1.00 | 1.00x |
| logic | 1.00 | 1.00 | 1.00x |

What This Tells Us

Calor is Learnable

The 1.00x ratio demonstrates that:

  • LLMs can learn Calor syntax from skills files
  • Calor code compiles at the same rate as C#
  • Calor implementations pass tests at the same rate as C#

Task Completion ≠ Code Value

This benchmark measures task completion - can correct code be written? It does not measure:

  • Whether the code has better safety properties
  • Whether contracts catch bugs that C# misses
  • Whether the code is more maintainable

These are different questions that require different benchmarks.


Example: Generated Code Comparison

Task: SafeDivide

Prompt: "Write a public function named SafeDivide that divides integer a by integer b. Division by zero must not be allowed."

Generated Calor:

Plain Text
§M{m001:SafeDivideModule}
§F{f001:SafeDivide:pub}
  §I{i32:a}
  §I{i32:b}
  §O{i32}
  §Q (!= b 0)
  §R (/ a b)
§/F{f001}
§/M{m001}

Generated C#:

C#
public static class Functions
{
    public static int SafeDivide(int a, int b)
    {
        if (b == 0)
            throw new ArgumentException("Division by zero");
        return a / b;
    }
}

Both compile and pass tests. Note, however, that Calor expresses the constraint as a contract (§Q (!= b 0)) while C# uses a guard clause. This difference doesn't affect task completion scores, but it does affect the properties of the resulting code.
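To make the contract-vs-guard-clause distinction concrete, here is an illustrative Python analogue (not part of the benchmark; the `require` decorator is a hypothetical stand-in for Calor's §Q):

```python
def require(predicate, message):
    """Hypothetical precondition decorator, standing in for Calor's §Q."""
    def wrap(fn):
        def checked(*args):
            if not predicate(*args):
                raise ValueError(message)
            return fn(*args)
        return checked
    return wrap

# Contract style: the constraint is declared separately from the logic.
@require(lambda a, b: b != 0, "b must not be zero")
def safe_divide_contract(a, b):
    return a // b

# Guard-clause style: the constraint is interleaved with the logic.
def safe_divide_guard(a, b):
    if b == 0:
        raise ValueError("b must not be zero")
    return a // b
```

Both reject b = 0; the difference is that the contract form keeps the constraint declarative and machine-readable, which is what later analysis of generated Calor code relies on.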


Task: NormalizeScore

Prompt: "Write a public function named NormalizeScore that normalizes a score to the 0-100 range using the formula (score * 100) / maxScore. The maxScore must be positive, score must be non-negative, and score must not exceed maxScore."

Generated Calor:

Plain Text
§F{f001:NormalizeScore:pub}
  §I{i32:score}
  §I{i32:maxScore}
  §O{i32}
  §Q (>= score 0)
  §Q (> maxScore 0)
  §Q (<= score maxScore)
  §S (>= result 0)
  §S (<= result 100)
  §R (/ (* score 100) maxScore)
§/F{f001}

Generated C#:

C#
public static int NormalizeScore(int score, int maxScore)
{
    if (maxScore <= 0)
        throw new ArgumentException("maxScore must be positive");
    if (score < 0)
        throw new ArgumentException("score must be non-negative");
    if (score > maxScore)
        throw new ArgumentException("score must not exceed maxScore");

    return (score * 100) / maxScore;
}

Key difference: Calor includes postconditions (§S (>= result 0) and §S (<= result 100)) that guarantee output bounds. The C# version has no equivalent check, so a bug that violates those bounds would go undetected at runtime.
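An illustrative Python analogue of the full contract set (plain `assert` statements stand in for Calor's §S postconditions; this is a sketch, not benchmark output):

```python
def normalize_score(score: int, max_score: int) -> int:
    # Preconditions (analogous to the §Q contracts above)
    if max_score <= 0:
        raise ValueError("maxScore must be positive")
    if score < 0:
        raise ValueError("score must be non-negative")
    if score > max_score:
        raise ValueError("score must not exceed maxScore")
    result = (score * 100) // max_score
    # Postconditions (analogous to §S): the output bounds are checked too
    assert 0 <= result <= 100
    return result
```

If the formula were accidentally changed to produce an out-of-range value, the postcondition would fail immediately; the guard-clause-only version would silently return the wrong result.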


Contract Usage Analysis

When examining generated Calor code, the LLM consistently extracts constraints from requirements into contracts:

| Requirement Language | Generated Contract |
| --- | --- |
| "must only accept non-negative" | §Q (>= n 0) |
| "must not be zero" | §Q (!= n 0) |
| "result is never negative" | §S (>= result 0) |
| "result is always at least 1" | §S (>= result 1) |
| "must be between X and Y" | §Q (>= n X) + §Q (<= n Y) |

This demonstrates that the Contract-First Methodology in the skills file is working - the LLM translates requirement language into executable contracts.


Benchmark Execution

Running Locally

Bash
# Run the LLM benchmark with neutral prompts
dotnet run --project tests/Calor.Evaluation -- llm-tasks \
  --manifest tests/Calor.Evaluation/Tasks/task-manifest-neutral.json \
  --verbose

# Refresh cache (re-run all tasks)
dotnet run --project tests/Calor.Evaluation -- llm-tasks \
  --manifest tests/Calor.Evaluation/Tasks/task-manifest-neutral.json \
  --refresh-cache

Cost Controls

  • Results are cached to avoid redundant API calls
  • Budget caps prevent runaway costs
  • Estimated cost: ~$0.76 for 50 tasks

Model Information

| Property | Value |
| --- | --- |
| Provider | Anthropic |
| Model | Claude Sonnet 4 |
| Model ID | claude-sonnet-4-20250514 |

Transparency

Reproducibility

All benchmark data is available:

  • Task definitions: tests/Calor.Evaluation/Tasks/task-manifest-neutral.json
  • Skills file: tests/Calor.Evaluation/Skills/calor-language-skills.md
  • Results: llm-results.json (includes generated code for every task)

Limitations

  • Results may vary with model updates
  • 50 tasks is a limited sample size
  • Simple tasks may not reveal differences in more complex scenarios

Future: Safety Benchmark

Task completion measures "can correct code be written?" A separate Safety Benchmark would measure "does the code catch more bugs?" by:

  1. Running adversarial tests with invalid inputs
  2. Measuring contract enforcement (Calor) vs exception handling (C#)
  3. Comparing error message quality and precision

This would highlight Calor's contract advantages in scenarios where C# code might silently fail or produce incorrect results.
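A minimal sketch of what such an adversarial harness could look like (hypothetical Python; the function names and structure are assumptions, not the actual benchmark):

```python
def rejection_rate(fn, invalid_inputs):
    """Fraction of invalid inputs rejected with an explicit error."""
    rejected = 0
    for args in invalid_inputs:
        try:
            fn(*args)  # a silent success here means the bug went undetected
        except (ValueError, AssertionError, ZeroDivisionError):
            rejected += 1
    return rejected / len(invalid_inputs)

# Example subject: a guarded division function like the C# SafeDivide above.
def safe_divide(a, b):
    if b == 0:
        raise ValueError("b must not be zero")
    return a // b
```

Running `rejection_rate(safe_divide, [(1, 0), (7, 0)])` yields 1.0; an unguarded implementation that returned a sentinel value instead of raising would score 0.0, which is the kind of difference a Safety Benchmark would surface.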

