Task Completion
Category: Task Completion
Result: Tie (1.00x)
What it measures: AI agent ability to learn and write correct code in both languages
Overview
The Task Completion benchmark measures how successfully AI agents can complete programming tasks in Calor versus C#. It uses language-neutral prompts: the same functional requirements are given for both languages, without syntax hints, to fairly measure whether Calor is as learnable and usable as an established language like C#.
What This Benchmark Answers
Primary Question: Can an LLM learn Calor and write correct code as easily as C#?
Answer: Yes. With neutral prompts, both languages achieve identical results (1.00x ratio).
This is a significant finding: it shows that Calor is learnable and does not impose a productivity penalty compared to C#.
Methodology
Language-Neutral Prompts
Unlike benchmarks that give language-specific guidance, our prompts describe what to achieve without dictating how:
```
Write a public function named Factorial that computes the factorial
of an integer n. The function must only accept non-negative values
of n. The result is always at least 1.
```

The same prompt goes to both Calor and C#. The LLM must apply its knowledge of each language to produce idiomatic code.
Why Neutral Prompts Matter
| Approach | Problem |
|---|---|
| Calor-biased prompts | "Add §Q precondition..." teaches syntax, doesn't test learning |
| C#-biased prompts | Unfair to Calor, doesn't reflect real usage |
| Neutral prompts | Tests whether LLM can translate requirements to idiomatic code |
Task Corpus
The benchmark includes 50 programming tasks across 4 categories:
| Category | Tasks | Examples |
|---|---|---|
| basic-algorithms | 15 | Factorial, Fibonacci, IsPrime, GCD, Power |
| safety | 10 | SafeDivide, Clamp, SafeModulo, NormalizeScore |
| data-structures | 10 | Sum, Max, Min, Average, Median |
| logic | 15 | BoolToInt, LogicalAnd, IsMultipleOf, SameSign |
Scoring Formula
Each task is scored on compilation and test results:
| Factor | Weight | Description |
|---|---|---|
| Compilation | 40% | Does the generated code compile? |
| Test Cases | 60% | What percentage of test cases pass? |
Final score: 0.4 × compile + 0.6 × tests
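The weighting above can be sketched as a small scoring function. This is a hypothetical Python sketch for illustration, not the actual harness code; the function name `task_score` is invented here.

```python
def task_score(compiled: bool, tests_passed: int, tests_total: int) -> float:
    """Score one task: 40% for compiling, 60% weighted by the test pass rate."""
    compile_part = 1.0 if compiled else 0.0
    test_part = tests_passed / tests_total if tests_total else 0.0
    return 0.4 * compile_part + 0.6 * test_part

# A task that compiles and passes every test scores the maximum;
# one that compiles but fails every test earns only the compile weight.
```

Note that a non-compiling submission scores at most 0.6 even if tests could somehow pass, so compilation failures dominate the penalty for small test suites.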
Results
Overall
| Metric | Calor | C# | Advantage |
|---|---|---|---|
| Average Score | 1.00 | 1.00 | 1.00x (Tie) |
| Compilation Rate | 100% | 100% | Tie |
| Test Pass Rate | 100% | 100% | Tie |
By Category
| Category | Calor Score | C# Score | Advantage |
|---|---|---|---|
| basic-algorithms | 1.00 | 1.00 | 1.00x |
| safety | 1.00 | 1.00 | 1.00x |
| data-structures | 1.00 | 1.00 | 1.00x |
| logic | 1.00 | 1.00 | 1.00x |
What This Tells Us
Calor is Learnable
The 1.00x ratio demonstrates that:
- LLMs can learn Calor syntax from skills files
- Calor code compiles at the same rate as C#
- Calor implementations pass tests at the same rate as C#
Task Completion ≠ Code Value
This benchmark measures task completion: can correct code be written? It does not measure:
- Whether the code has better safety properties
- Whether contracts catch bugs that C# misses
- Whether the code is more maintainable
These are different questions that require different benchmarks.
Example: Generated Code Comparison
Task: SafeDivide
Prompt: "Write a public function named SafeDivide that divides integer a by integer b. Division by zero must not be allowed."
Generated Calor:
```
§M{m001:SafeDivideModule}
§F{f001:SafeDivide:pub}
§I{i32:a}
§I{i32:b}
§O{i32}
§Q (!= b 0)
§R (/ a b)
§/F{f001}
§/M{m001}
```
Generated C#:
```csharp
public static class Functions
{
    public static int SafeDivide(int a, int b)
    {
        if (b == 0)
            throw new ArgumentException("Division by zero");
        return a / b;
    }
}
```
Both compile and pass tests. But note that Calor uses a contract (`§Q (!= b 0)`) while C# uses a guard clause. This difference doesn't affect task completion scores, but it does affect code properties.
Task: NormalizeScore
Prompt: "Write a public function named NormalizeScore that normalizes a score to the 0-100 range using the formula (score * 100) / maxScore. The maxScore must be positive, score must be non-negative, and score must not exceed maxScore."
Generated Calor:
```
§F{f001:NormalizeScore:pub}
§I{i32:score}
§I{i32:maxScore}
§O{i32}
§Q (>= score 0)
§Q (> maxScore 0)
§Q (<= score maxScore)
§S (>= result 0)
§S (<= result 100)
§R (/ (* score 100) maxScore)
§/F{f001}
```
Generated C#:
```csharp
public static int NormalizeScore(int score, int maxScore)
{
    if (maxScore <= 0)
        throw new ArgumentException("maxScore must be positive");
    if (score < 0)
        throw new ArgumentException("score must be non-negative");
    if (score > maxScore)
        throw new ArgumentException("score must not exceed maxScore");
    return (score * 100) / maxScore;
}
```
Key difference: Calor includes postconditions (`§S (>= result 0)` and `§S (<= result 100)`) that guarantee output bounds. The generated C# has no equivalent checks, so a bug that produces an out-of-range result would go undetected.
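The value of the postconditions can be illustrated with a hypothetical Python sketch (Calor enforces this at the language level; the asserts below only mimic the contracts). The preconditions mirror the `§Q` contracts and the C# guard clauses; the final assert mirrors the `§S` postconditions that the C# version lacks.

```python
def normalize_score(score: int, max_score: int) -> int:
    # Preconditions: the §Q contracts / the C# guard clauses
    assert max_score > 0, "maxScore must be positive"
    assert score >= 0, "score must be non-negative"
    assert score <= max_score, "score must not exceed maxScore"

    result = (score * 100) // max_score

    # Postcondition: the §S contracts; a buggy formula (say, score * 1000)
    # would be caught here, but silently returned by the C# version.
    assert 0 <= result <= 100, "result out of 0-100 bounds"
    return result
```

For example, `normalize_score(30, 40)` returns 75, while `normalize_score(5, 0)` fails the precondition before any division happens.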
Contract Usage Analysis
When examining generated Calor code, the LLM consistently extracts constraints from requirements into contracts:
| Requirement Language | Generated Contract |
|---|---|
| "must only accept non-negative" | §Q (>= n 0) |
| "must not be zero" | §Q (!= n 0) |
| "result is never negative" | §S (>= result 0) |
| "result is always at least 1" | §S (>= result 1) |
| "must be between X and Y" | §Q (>= n X) + §Q (<= n Y) |
This demonstrates that the Contract-First Methodology in the skills file is working: the LLM translates requirement language into executable contracts.
Benchmark Execution
Running Locally
```bash
# Run the LLM benchmark with neutral prompts
dotnet run --project tests/Calor.Evaluation -- llm-tasks \
  --manifest tests/Calor.Evaluation/Tasks/task-manifest-neutral.json \
  --verbose

# Refresh the cache (re-run all tasks)
dotnet run --project tests/Calor.Evaluation -- llm-tasks \
  --manifest tests/Calor.Evaluation/Tasks/task-manifest-neutral.json \
  --refresh-cache
```
Cost Controls
- Results are cached to avoid redundant API calls
- Budget caps prevent runaway costs
- Estimated cost: ~$0.76 for 50 tasks
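A result cache like the one described above can be sketched as follows. This is a hypothetical Python sketch, not the actual harness: the cache file name and the keying scheme (task id, prompt text, and model id, so any change triggers a re-run) are assumptions.

```python
import hashlib
import json
import os

CACHE_PATH = "llm-cache.json"  # hypothetical cache file name

def cache_key(task_id: str, prompt: str, model_id: str) -> str:
    """Key results by task, prompt, and model so any change forces a re-run."""
    raw = json.dumps([task_id, prompt, model_id])
    return hashlib.sha256(raw.encode()).hexdigest()

def get_or_run(task_id: str, prompt: str, model_id: str, run_fn):
    """Return the cached result for this task, calling the API only on a miss."""
    cache = json.load(open(CACHE_PATH)) if os.path.exists(CACHE_PATH) else {}
    key = cache_key(task_id, prompt, model_id)
    if key not in cache:
        cache[key] = run_fn(prompt)  # expensive API call happens only here
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f)
    return cache[key]
```

A `--refresh-cache` flag would simply delete or bypass the cache file so every task is re-run.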
Model Information
| Property | Value |
|---|---|
| Provider | Anthropic |
| Model | Claude Sonnet 4 |
| Model ID | claude-sonnet-4-20250514 |
Transparency
Reproducibility
All benchmark data is available:
- Task definitions: tests/Calor.Evaluation/Tasks/task-manifest-neutral.json
- Skills file: tests/Calor.Evaluation/Skills/calor-language-skills.md
- Results: llm-results.json (includes generated code for every task)
Limitations
- Results may vary with model updates
- 50 tasks is a limited sample size
- Simple tasks may not reveal differences in more complex scenarios
Future: Safety Benchmark
Task completion measures "can correct code be written?" A separate Safety Benchmark would measure "does the code catch more bugs?" by:
- Running adversarial tests with invalid inputs
- Measuring contract enforcement (Calor) vs exception handling (C#)
- Comparing error message quality and precision
This would highlight Calor's contract advantages in scenarios where C# code might silently fail or produce incorrect results.
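Such a harness could be sketched as follows. This is a hypothetical Python sketch of the adversarial-testing idea; `check_rejection` and `safe_divide` are illustrative names, not part of the existing benchmark.

```python
def check_rejection(fn, invalid_inputs):
    """Fraction of invalid inputs the implementation explicitly rejects."""
    rejected = 0
    for args in invalid_inputs:
        try:
            fn(*args)            # a safe implementation should raise here
        except Exception:
            rejected += 1        # contract violation or guard-clause exception
    return rejected / len(invalid_inputs)

# Example: SafeDivide must reject b == 0.
def safe_divide(a, b):
    if b == 0:
        raise ValueError("division by zero")
    return a // b

print(check_rejection(safe_divide, [(1, 0), (5, 0)]))  # 1.0
```

An implementation that silently returned a sentinel value instead of raising would score 0.0 here, which is exactly the kind of difference task completion scores cannot see.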
Next
- Methodology - Full benchmark methodology
- Token Economics - Token count comparison