Agent Task Benchmark

This benchmark tests Claude's ability to generate correct Calor code from natural language prompts. It validates that Claude can learn Calor syntax from the bundled documentation and produce working code.


Last run: Feb 16, 2026
Commit: 107462e

  • Pass rate: 86.5% (meets the 80% threshold)
  • Tests passed: 77 of 89
  • Tests failed: 12
  • Categories: 17 (10 at 100%)


Results by Category

| Category | Description | Result |
| --- | --- | --- |
| Advanced Contracts | Quantifiers, implications, custom messages | 5/5 (100%) |
| Basic Syntax | Simple functions and types | 4/4 (100%) |
| Calor Idioms | Common Calor patterns | 2/2 (100%) |
| Contract Writing | Preconditions and postconditions | 5/5 (100%) |
| Enums | Enum definitions and extensions | 3/3 (100%) |
| Generics | Generic functions and classes | 3/3 (100%) |
| GitHub Projects | Real-world project tasks | 4/4 (100%) |
| Logic Implementation | Algorithm implementation with contracts | 4/4 (100%) |
| OOP Features | Classes, interfaces, properties | 6/6 (100%) |
| Type System | Option, Result, type inference | 6/6 (100%) |
| Control Flow | Loops, conditionals, iteration | 7/8 (88%) |
| Collections | List, Dictionary, HashSet operations | 5/6 (83%) |
| Effects System | Effect declarations and tracking | 5/6 (83%) |
| Pattern Matching | Switch expressions and patterns | 5/6 (83%) |
| String Operations | String manipulation functions | 4/5 (80%) |
| Refactoring | Code transformation tasks | 6/8 (75%) |
| Async Functions | Async/await patterns | 2/4 (50%) |
| Lambdas & Delegates | Lambda expressions and delegates | 2/4 (50%) |

Perfect Score Categories (10)

Advanced Contracts, Basic Syntax, Calor Idioms, Contract Writing, Enums, Generics, GitHub Projects, Logic Implementation, OOP Features, Type System

Areas for Improvement

  • Async Functions: complex await/ConfigureAwait syntax
  • Lambdas & Delegates: block lambda syntax specifics

Methodology

Each task provides a natural language prompt describing the code to generate. Claude reads the CLAUDE.md syntax reference and generates Calor code; success requires passing the task's custom verification script.

Mode: Single run (no majority voting)
Threshold: 80% pass rate required for success
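In majority-voting mode, the runner executes each task three times and counts it as passed when at least two of the three runs succeed. The decision rule can be sketched as below; the function name and result encoding are illustrative, not the real script's internals:

```shell
# majority_pass: prints PASS when at least 2 of the 3 run
# results given as arguments are "pass". Hypothetical helper;
# the actual run-agent-tests.sh may implement this differently.
majority_pass() {
  wins=0
  for result in "$1" "$2" "$3"; do
    if [ "$result" = "pass" ]; then
      wins=$((wins + 1))
    fi
  done
  if [ "$wins" -ge 2 ]; then
    echo PASS
  else
    echo FAIL
  fi
}

majority_pass pass fail pass   # → PASS (2 of 3 runs succeeded)
```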

How It Works

Each test follows the same flow:

  1. The task provides a natural language prompt describing what code to generate
  2. Claude reads CLAUDE.md with the Calor syntax reference
  3. Claude generates Calor code in a .calr file
  4. A custom verification script checks the generated code

Success requires the generated code to:

  • Compile without errors
  • Match expected syntax patterns
  • Pass any contract verification (when enabled)
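The verification step can be pictured as a small script like the following sketch. The `calor build` compiler invocation, the `solution.calr` filename, and the grepped pattern are all assumptions for illustration; real verify.sh scripts are task-specific.

```shell
# Sketch of the verification logic for one task (illustrative only).
# Assumes a `calor build` CLI entry point -- a hypothetical name,
# not a documented command.
verify_task() {
  file="$1"

  # 1. The generated code must compile without errors.
  calor build "$file" || { echo "compile failed"; return 1; }

  # 2. The code must match expected syntax patterns, e.g. the
  #    prompt asked for a function named `clamp`.
  grep -q "clamp" "$file" || { echo "pattern missing"; return 1; }

  echo "verify: OK"
}
```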

Test Categories

| Category | Focus Area |
| --- | --- |
| Basic Syntax | Functions, parameters, return types |
| Contract Writing | Preconditions, postconditions, invariants |
| Advanced Contracts | Quantifiers, implications, custom messages |
| Control Flow | Loops, conditionals, pattern matching |
| Collections | List, Dictionary, HashSet operations |
| Effects System | Effect declarations, file/network/console |
| OOP Features | Classes, interfaces, properties |
| Async Functions | Async/await patterns |
| Generics | Generic functions and classes |
| Type System | Option, Result, type inference |
| Refactoring | Code transformation tasks |

Running the Benchmark

```bash
# Run tests with single-run mode
./tests/E2E/agent-tasks/run-agent-tests.sh --single-run

# Run with majority voting (3 runs, 2/3 needed)
./tests/E2E/agent-tasks/run-agent-tests.sh

# Generate benchmark JSON for website
./tests/E2E/agent-tasks/generate-benchmark.sh
```

Skill Quality Validation

This benchmark serves as validation for the Calor skill documentation:

  • Tests fail → Skill documentation needs improvement
  • Tests pass → Claude can learn syntax from documentation

When tests fail, the fix is improving CLAUDE.md and skill files, not the prompts.


Pass Rate Threshold

The benchmark requires an 80% pass rate to be considered successful. This threshold balances:

  • High bar for skill quality — Most tasks should succeed
  • Allowance for edge cases — Complex syntax patterns may need iteration
  • Practical reliability — Users can expect Claude to generate working code
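The current run illustrates the arithmetic: 77 of 89 tasks is roughly 86.5%, which clears the 80% bar. A minimal POSIX shell sketch of the same check, using awk for the floating-point math the shell lacks:

```shell
# Compute the pass rate from the run's counts and compare it
# to the 80% threshold.
passed=77
total=89
rate=$(awk "BEGIN { printf \"%.1f\", 100 * $passed / $total }")
echo "pass rate: ${rate}%"   # → pass rate: 86.5%

meets=$(awk "BEGIN { if ($rate >= 80) print \"yes\"; else print \"no\" }")
echo "meets 80% threshold: $meets"   # → meets 80% threshold: yes
```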

Adding New Tests

To add a new agent task test:

  1. Create a directory in tests/E2E/agent-tasks/tasks/{category}/{number}_{name}/
  2. Add task.json with prompt and metadata
  3. Add verify.sh with verification logic
  4. Run tests to validate
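The steps above can be scaffolded as below. The category name, task number, and task.json fields shown are illustrative only; check existing tasks for the real directory naming and schema.

```shell
# Scaffold a hypothetical new task following the layout
# tests/E2E/agent-tasks/tasks/{category}/{number}_{name}/.
# All names and JSON fields here are made up for illustration.
TASK=tests/E2E/agent-tasks/tasks/string-operations/99_reverse_words
mkdir -p "$TASK"

# task.json: prompt and metadata (field names are assumptions).
cat > "$TASK/task.json" <<'EOF'
{
  "prompt": "Write a Calor function that reverses the words in a sentence.",
  "category": "string-operations"
}
EOF

# verify.sh: task-specific verification logic.
cat > "$TASK/verify.sh" <<'EOF'
#!/bin/sh
# Check the generated file for the expected pattern.
grep -q "reverse" solution.calr
EOF
chmod +x "$TASK/verify.sh"
```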

See Adding Benchmarks for full details.