# Agent Task Benchmark
This benchmark tests Claude's ability to generate correct Calor code from natural language prompts. It validates that Claude can learn Calor syntax from the bundled documentation and produce working code.
## Overall Progress
## Results by Category

### Perfect Score Categories (10)

### Areas for Improvement
## Methodology

Each task provides a natural language prompt describing the code to generate. Claude reads the CLAUDE.md syntax reference and generates Calor code. Success requires passing each task's custom verification script.
### How It Works

Each test:

- Provides a natural language prompt describing what code to generate
- Claude reads CLAUDE.md with the Calor syntax reference
- Claude generates Calor code in a `.calr` file
- Custom verification scripts check the generated code
Success requires the generated code to:
- Compile without errors
- Match expected syntax patterns
- Pass any contract verification (when enabled)
## Test Categories
| Category | Focus Area |
|---|---|
| Basic Syntax | Functions, parameters, return types |
| Contract Writing | Preconditions, postconditions, invariants |
| Advanced Contracts | Quantifiers, implications, custom messages |
| Control Flow | Loops, conditionals, pattern matching |
| Collections | List, Dictionary, HashSet operations |
| Effects System | Effect declarations, file/network/console |
| OOP Features | Classes, interfaces, properties |
| Async Functions | Async/await patterns |
| Generics | Generic functions and classes |
| Type System | Option, Result, type inference |
| Refactoring | Code transformation tasks |
## Running the Benchmark

```bash
# Run tests in single-run mode
./tests/E2E/agent-tasks/run-agent-tests.sh --single-run

# Run with majority voting (3 runs, 2/3 needed)
./tests/E2E/agent-tasks/run-agent-tests.sh
```
```bash
# Generate benchmark JSON for website
./tests/E2E/agent-tasks/generate-benchmark.sh
```

## Skill Quality Validation
This benchmark serves as validation for the Calor skill documentation:
- Tests fail → Skill documentation needs improvement
- Tests pass → Claude can learn syntax from documentation
When tests fail, the fix is improving CLAUDE.md and skill files, not the prompts.
## Pass Rate Threshold
The benchmark requires an 80% pass rate to be considered successful. This threshold balances:
- High bar for skill quality — Most tasks should succeed
- Allowance for edge cases — Complex syntax patterns may need iteration
- Practical reliability — Users can expect Claude to generate working code
## Adding New Tests

To add a new agent task test:

1. Create a directory in `tests/E2E/agent-tasks/tasks/{category}/{number}_{name}/`
2. Add `task.json` with the prompt and metadata
3. Add `verify.sh` with the verification logic
4. Run the tests to validate
See Adding Benchmarks for full details.