Agent Task Benchmark

This benchmark tests Claude's ability to generate correct Calor code from natural language prompts. It validates that Claude can learn Calor syntax from the bundled documentation and produce working code.


Last run: Feb 16, 2026
Commit: 107462e

  • Pass rate: 86.5% (meets the 80% threshold)
  • Tests passed: 77 of 89
  • Tests failed: 12
  • Categories: 17 (10 at 100%)


Results by Category

| Category | Description | Result |
| --- | --- | --- |
| Advanced Contracts | Quantifiers, implications, custom messages | 5/5 (100%) |
| Basic Syntax | Simple functions and types | 4/4 (100%) |
| Calor Idioms | Common Calor patterns | 2/2 (100%) |
| Contract Writing | Preconditions and postconditions | 5/5 (100%) |
| Enums | Enum definitions and extensions | 3/3 (100%) |
| Generics | Generic functions and classes | 3/3 (100%) |
| GitHub Projects | Real-world project tasks | 4/4 (100%) |
| Logic Implementation | Algorithm implementation with contracts | 4/4 (100%) |
| OOP Features | Classes, interfaces, properties | 6/6 (100%) |
| Type System | Option, Result, type inference | 6/6 (100%) |
| Control Flow | Loops, conditionals, iteration | 7/8 (88%) |
| Collections | List, Dictionary, HashSet operations | 5/6 (83%) |
| Effects System | Effect declarations and tracking | 5/6 (83%) |
| Pattern Matching | Switch expressions and patterns | 5/6 (83%) |
| String Operations | String manipulation functions | 4/5 (80%) |
| Refactoring | Code transformation tasks | 6/8 (75%) |
| Async Functions | Async/await patterns | 2/4 (50%) |
| Lambdas & Delegates | Lambda expressions and delegates | 2/4 (50%) |

Perfect Score Categories (10)

Advanced Contracts, Basic Syntax, Calor Idioms, Contract Writing, Enums, Generics, GitHub Projects, Logic Implementation, OOP Features, Type System

Areas for Improvement

  • Async Functions: complex await/ConfigureAwait syntax
  • Lambdas & Delegates: block lambda syntax specifics

Methodology

Each task provides a natural language prompt describing the code to generate. Claude reads the CLAUDE.md syntax reference and generates Calor code; success requires passing the task's custom verification script.

Mode: Single run (no majority voting)
Threshold: 80% pass rate required for success
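In majority-voting mode, the runner executes each task three times and counts it as passed when at least two of the three runs succeed. The decision rule can be sketched as below; the function name and result encoding are illustrative, not the real script's internals:

```shell
# majority_pass: prints PASS when at least 2 of the 3 run
# results given as arguments are "pass". Hypothetical helper;
# the actual run-agent-tests.sh may implement this differently.
majority_pass() {
  wins=0
  for result in "$1" "$2" "$3"; do
    if [ "$result" = "pass" ]; then
      wins=$((wins + 1))
    fi
  done
  if [ "$wins" -ge 2 ]; then
    echo PASS
  else
    echo FAIL
  fi
}

majority_pass pass fail pass   # → PASS (2 of 3 runs succeeded)
```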

How It Works

Each test follows the same flow:

  1. The task provides a natural language prompt describing what code to generate
  2. Claude reads CLAUDE.md with the Calor syntax reference
  3. Claude generates Calor code in a .calr file
  4. A custom verification script checks the generated code

Success requires the generated code to:

  • Compile without errors
  • Match expected syntax patterns
  • Pass any contract verification (when enabled)
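The verification step can be pictured as a small script like the following sketch. The `calor build` compiler invocation, the `solution.calr` filename, and the grepped pattern are all assumptions for illustration; real verify.sh scripts are task-specific.

```shell
# Sketch of the verification logic for one task (illustrative only).
# Assumes a `calor build` CLI entry point -- a hypothetical name,
# not a documented command.
verify_task() {
  file="$1"

  # 1. The generated code must compile without errors.
  calor build "$file" || { echo "compile failed"; return 1; }

  # 2. The code must match expected syntax patterns, e.g. the
  #    prompt asked for a function named `clamp`.
  grep -q "clamp" "$file" || { echo "pattern missing"; return 1; }

  echo "verify: OK"
}
```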

Test Categories

| Category | Focus Area |
| --- | --- |
| Basic Syntax | Functions, parameters, return types |
| Contract Writing | Preconditions, postconditions, invariants |
| Advanced Contracts | Quantifiers, implications, custom messages |
| Control Flow | Loops, conditionals, pattern matching |
| Collections | List, Dictionary, HashSet operations |
| Effects System | Effect declarations, file/network/console |
| OOP Features | Classes, interfaces, properties |
| Async Functions | Async/await patterns |
| Generics | Generic functions and classes |
| Type System | Option, Result, type inference |
| Refactoring | Code transformation tasks |

Running the Benchmark

```bash
# Run tests with single-run mode
./tests/E2E/agent-tasks/run-agent-tests.sh --single-run

# Run with majority voting (3 runs, 2/3 needed)
./tests/E2E/agent-tasks/run-agent-tests.sh

# Generate benchmark JSON for website
./tests/E2E/agent-tasks/generate-benchmark.sh
```

Skill Quality Validation

This benchmark serves as validation for the Calor skill documentation:

  • Tests fail → Skill documentation needs improvement
  • Tests pass → Claude can learn syntax from documentation

When tests fail, the fix is improving CLAUDE.md and skill files, not the prompts.


Pass Rate Threshold

The benchmark requires an 80% pass rate to be considered successful. This threshold balances:

  • High bar for skill quality — Most tasks should succeed
  • Allowance for edge cases — Complex syntax patterns may need iteration
  • Practical reliability — Users can expect Claude to generate working code
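The current run illustrates the arithmetic: 77 of 89 tasks is roughly 86.5%, which clears the 80% bar. A minimal POSIX shell sketch of the same check, using awk for the floating-point math the shell lacks:

```shell
# Compute the pass rate from the run's counts and compare it
# to the 80% threshold.
passed=77
total=89
rate=$(awk "BEGIN { printf \"%.1f\", 100 * $passed / $total }")
echo "pass rate: ${rate}%"   # → pass rate: 86.5%

meets=$(awk "BEGIN { if ($rate >= 80) print \"yes\"; else print \"no\" }")
echo "meets 80% threshold: $meets"   # → meets 80% threshold: yes
```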

Adding New Tests

To add a new agent task test:

  1. Create a directory in tests/E2E/agent-tasks/tasks/{category}/{number}_{name}/
  2. Add task.json with prompt and metadata
  3. Add verify.sh with verification logic
  4. Run tests to validate
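The steps above can be scaffolded as below. The category name, task number, and task.json fields shown are illustrative only; check existing tasks for the real directory naming and schema.

```shell
# Scaffold a hypothetical new task following the layout
# tests/E2E/agent-tasks/tasks/{category}/{number}_{name}/.
# All names and JSON fields here are made up for illustration.
TASK=tests/E2E/agent-tasks/tasks/string-operations/99_reverse_words
mkdir -p "$TASK"

# task.json: prompt and metadata (field names are assumptions).
cat > "$TASK/task.json" <<'EOF'
{
  "prompt": "Write a Calor function that reverses the words in a sentence.",
  "category": "string-operations"
}
EOF

# verify.sh: task-specific verification logic.
cat > "$TASK/verify.sh" <<'EOF'
#!/bin/sh
# Check the generated file for the expected pattern.
grep -q "reverse" solution.calr
EOF
chmod +x "$TASK/verify.sh"
```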

See Adding Benchmarks for full details.