Results

Live Benchmark Results

Evaluated across 207 programs with 8 metrics

Updated: Mar 15, 2026, 05:39 PM (3126c21)

Overall Advantage: 134% (Calor vs C#; 100% = equal)
Programs Tested: 207 (207 Calor successes)
Calor Wins: 7 of 8 metrics
C# Wins: 1 of 8 metrics

Agent Refactoring Benchmark

Measures Claude Code agent success rates on real refactoring tasks (rename, extract, inline, move, add contracts, change signature).

Calor: 19/20 tasks (95%)
C#: 19/20 tasks (95%)
Category breakdown (tasks passed / tasks attempted):

Category | Calor | C#
Rename Symbol | 2/3 | 3/3
Extract Method | 4/4 | 4/4
Inline Function | 3/3 | 3/3
Move Method | 3/3 | 3/3
Add Contract | 4/4 | 3/4
Change Signature | 3/3 | 3/3
Majority voting (2/3 runs) with compilation + Z3 verification (580e189)
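
The majority-voting rule can be illustrated with a minimal Python sketch. It assumes each run has already been reduced to a single pass/fail boolean (for example, "compiled and passed Z3 verification"); the function names and the sample data are hypothetical and are not taken from the benchmark harness.

```python
from typing import Sequence


def majority_pass(run_results: Sequence[bool], required: int = 2) -> bool:
    """Return True when at least `required` of the runs succeeded."""
    return sum(run_results) >= required


def success_rate(tasks: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks that pass under majority voting."""
    passed = sum(majority_pass(runs) for runs in tasks)
    return passed / len(tasks)


# Hypothetical data: 20 tasks with 3 runs each, 19 of which pass by majority.
tasks = (
    [[True, True, False]] * 10
    + [[True, True, True]] * 9
    + [[True, False, False]]
)
print(f"{success_rate(tasks):.0%}")  # 95%
```

Under this rule a single flaky run does not fail a task, but two failed runs out of three do.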

Metric Breakdown

Metric | Ratio | Winner | Note
Comprehension | 2.22x | Calor | Explicit structure aids understanding
Error Detection | 1.83x | Calor | Contracts surface invariant violations
Refactoring Stability | 1.52x | Calor | More stable during refactoring
Edit Precision | 1.39x | Calor | Unique IDs enable targeted changes
Correctness | 1.30x | Calor | Contracts help prevent edge case bugs
Information Density | 1.15x | Calor | More semantic content per token
Generation Accuracy | 1.02x | Calor | Better code generation from prompts
Token Economics | 0.79x | C# | Calor's explicit syntax uses more tokens

Ratios above 1.0x favor Calor and ratios below 1.0x favor C#; 1.0x means the two are equal.

Current Status

Calor leads in 7 of 8 metrics, with the largest margins in comprehension (2.22x) and error detection (1.83x), areas where explicitness matters. C# leads only in token economics (0.79x).
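
The page does not state how the 134% headline figure is derived from the eight metric ratios. As a sanity check, the sketch below compares two common aggregates of the published ratios. That the geometric mean matches the headline is an observation about these numbers, not a confirmed description of the evaluation framework.

```python
from statistics import fmean, geometric_mean

# Metric ratios as published in the Metric Breakdown (Calor vs C#).
metric_ratios = {
    "Comprehension": 2.22,
    "Error Detection": 1.83,
    "Refactoring Stability": 1.52,
    "Edit Precision": 1.39,
    "Correctness": 1.30,
    "Information Density": 1.15,
    "Generation Accuracy": 1.02,
    "Token Economics": 0.79,
}

geo = geometric_mean(metric_ratios.values())
arith = fmean(metric_ratios.values())

print(f"geometric mean:  {geo:.2f}")    # 1.34
print(f"arithmetic mean: {arith:.2f}")  # 1.40
```

A geometric mean is a natural choice for averaging ratios because a 2x win and a 0.5x loss cancel out to 1.0x, which an arithmetic mean would not do.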

Per-Program Results

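Program | Lvl | Adv | Tokens | Gen Acc | Comp | Edit | Err Det | Info Den | Refactor | Correct
Abs | 1 | 1.31 | 0.43 | 1.00 | 1.72 | 1.34 | 2.00 | 0.97 | 1.44 | 1.60
AbsoluteContracts | 2 | 1.68 | 0.39 | 1.14 | 4.04 | 1.42 | 2.43 | 0.90 | 1.52 | 1.58
AbstractClass | 2 | 1.13 | 0.83 | 1.00 | 1.54 | 1.36 | 1.14 | 0.71 | 1.49 | 1.00
Adapter | 2 | 1.48 | 0.91 | 1.00 | 1.87 | 1.70 | 2.17 | 1.11 | 1.69 | 1.40
AreaOfCircle | 1 | 1.46 | 0.72 | 1.00 | 2.38 | 1.34 | 2.43 | 1.06 | 1.44 | 1.33
ArrayContracts | 3 | 1.51 | 1.03 | 1.03 | 1.72 | 1.38 | 2.43 | 1.15 | 1.55 | 1.80
ArraySlice | 2 | 1.57 | 0.63 | 1.00 | 3.20 | 1.46 | 2.43 | 0.94 | 1.44 | 1.50
ArraySum | 2 | 1.23 | 0.57 | 1.00 | 1.98 | 1.25 | 1.33 | 1.43 | 1.46 | 0.83
AsyncChain | 3 | 1.19 | 0.83 | 1.00 | 1.44 | 1.46 | 1.33 | 0.98 | 1.49 | 1.00
AsyncEffect | 3 | 1.29 | 0.94 | 1.00 | 1.55 | 1.04 | 1.67 | 1.49 | 1.62 | 1.00
AsyncErrorHandling | 3 | 1.43 | 0.81 | 1.00 | 2.10 | 1.51 | 1.67 | 1.86 | 1.62 | 0.91
AsyncLoop | 3 | 1.43 | 0.93 | 1.00 | 2.49 | 1.46 | 1.33 | 1.79 | 1.46 | 1.00
AsyncPure | 3 | 1.06 | 0.79 | 1.00 | 1.01 | 1.00 | 1.33 | 0.92 | 1.46 | 1.00
AsyncReturn | 3 | 1.17 | 0.70 | 1.00 | 1.58 | 1.34 | 1.33 | 0.99 | 1.44 | 1.00
AutoProps | 2 | 1.27 | 0.53 | 1.00 | 2.17 | 1.36 | 1.33 | 1.25 | 1.49 | 1.00
Average | 2 | 1.42 | 0.66 | 1.00 | 2.32 | 1.34 | 1.86 | 1.73 | 1.44 | 1.00
BankAccount | 3 | 1.69 | 0.72 | 1.00 | 3.31 | 1.68 | 2.43 | 1.37 | 1.52 | 1.50
BasicTryCatch | 2 | 1.14 | 0.55 | 1.00 | 1.41 | 1.34 | 1.33 | 0.88 | 1.49 | 1.08
BclCoverage | 2 | 1.35 | 1.25 | 1.14 | 1.78 | 1.46 | 1.43 | 1.12 | 1.62 | 1.00
BinarySearch | 3 | 1.17 | 0.96 | 1.00 | 1.49 | 1.38 | 1.33 | 0.78 | 1.55 | 0.86
BinaryTree | 4 | 1.35 | 1.90 | 1.00 | 1.39 | 1.40 | 1.14 | 1.00 | 2.04 | 0.92
BitContracts | 3 | 1.51 | 0.51 | 1.14 | 2.97 | 1.42 | 2.17 | 0.72 | 1.52 | 1.60
BitSet | 3 | 1.62 | 0.36 | 1.00 | 3.19 | 1.34 | 2.83 | 0.74 | 1.56 | 1.90
BMICalculator | 2 | 1.51 | 0.50 | 1.00 | 2.41 | 1.34 | 2.83 | 0.66 | 1.44 | 1.90
BreadthFirstSearch | 3 | 1.36 | 1.47 | 1.00 | 1.85 | 1.38 | 1.33 | 1.43 | 1.55 | 0.86
BubbleSort | 2 | 1.53 | 0.81 | 1.00 | 2.94 | 1.29 | 2.17 | 1.30 | 1.52 | 1.17
BuggyContracts | 2 | 1.38 | 0.66 | 1.14 | 1.76 | 1.42 | 2.00 | 1.23 | 1.61 | 1.25
Builder | 3 | 1.26 | 1.12 | 1.00 | 1.38 | 1.40 | 1.14 | 1.22 | 1.65 | 1.20
Calculator | 2 | 1.13 | 0.48 | 1.00 | 1.73 | 1.38 | 1.33 | 0.61 | 1.52 | 1.00
Calendar | 3 | 1.54 | 0.36 | 1.00 | 2.65 | 1.34 | 2.83 | 0.73 | 1.49 | 1.90
CancellableTask | 3 | 1.29 | 0.79 | 1.00 | 2.07 | 1.46 | 1.33 | 1.22 | 1.46 | 1.00
Capitalize | 2 | 1.38 | 0.47 | 1.00 | 2.18 | 1.34 | 2.17 | 0.81 | 1.44 | 1.60
CelsiusToKelvin | 1 | 1.47 | 0.71 | 1.00 | 2.38 | 1.34 | 2.43 | 1.09 | 1.44 | 1.33
ChainOfResponsibility | 3 | 1.28 | 0.95 | 1.00 | 1.57 | 1.40 | 1.86 | 0.78 | 1.58 | 1.08
CircularBuffer | 3 | 1.57 | 0.90 | 1.00 | 2.72 | 1.36 | 2.43 | 1.38 | 1.49 | 1.29
Clamp | 2 | 1.39 | 1.12 | 1.00 | 1.42 | 1.38 | 1.89 | 0.99 | 1.38 | 1.90
CollectionLib | 2 | 1.49 | 0.69 | 1.10 | 2.81 | 1.56 | 2.00 | 0.97 | 1.55 | 1.25
Command | 3 | 1.34 | 1.12 | 1.00 | 2.06 | 1.71 | 1.33 | 1.13 | 1.55 | 0.83
CompactClass | 2 | 1.49 | 0.43 | 1.00 | 2.82 | 1.34 | 2.33 | 0.89 | 1.53 | 1.60
ComposedEffects | 3 | 1.44 | 0.91 | 1.14 | 2.33 | 1.42 | 1.67 | 1.64 | 1.72 | 0.71
Composite | 3 | 1.69 | 1.23 | 1.00 | 2.06 | 1.71 | 2.83 | 1.22 | 1.65 | 1.80
Composition | 3 | 1.48 | 1.06 | 1.00 | 1.85 | 1.40 | 2.17 | 1.23 | 1.72 | 1.40
CompoundInterest | 2 | 1.48 | 0.67 | 1.00 | 2.51 | 1.34 | 2.43 | 0.94 | 1.46 | 1.50
ConstructorInit | 2 | 1.59 | 0.69 | 1.00 | 3.17 | 1.36 | 2.43 | 1.27 | 1.49 | 1.33
Contains | 2 | 1.20 | 0.47 | 1.00 | 1.88 | 1.34 | 1.33 | 0.87 | 1.46 | 1.20
ContractedDivide | 3 | 1.26 | 0.93 | 1.00 | 0.99 | 1.34 | 1.89 | 0.82 | 1.31 | 1.80
CorrectEffects | 2 | 1.36 | 0.76 | 1.14 | 2.13 | 1.42 | 1.67 | 1.04 | 1.75 | 1.00
CountOccurrences | 2 | 1.41 | 0.44 | 1.00 | 2.25 | 1.34 | 2.00 | 1.15 | 1.46 | 1.60
CountVowels | 2 | 1.12 | 0.61 | 1.00 | 1.50 | 1.25 | 1.33 | 0.75 | 1.49 | 1.00
CsvParser | 2 | 1.60 | 0.69 | 1.00 | 2.48 | 1.34 | 2.00 | 2.86 | 1.46 | 1.00
CurrencyConverter | 2 | 1.55 | 0.34 | 1.00 | 3.13 | 1.34 | 2.83 | 0.77 | 1.43 | 1.60
CustomException | 2 | 1.38 | 0.76 | 1.00 | 1.79 | 1.38 | 1.86 | 1.05 | 1.58 | 1.60
DatabaseEffect | 3 | 1.44 | 0.70 | 1.14 | 2.67 | 1.42 | 1.67 | 1.27 | 1.65 | 1.00
DateDiff | 2 | 1.66 | 0.46 | 1.00 | 3.63 | 1.34 | 2.83 | 0.71 | 1.43 | 1.90
Decorator | 3 | 1.64 | 0.82 | 1.00 | 2.40 | 1.75 | 2.83 | 0.94 | 1.58 | 1.80
DelayedResult | 3 | 1.44 | 0.62 | 1.00 | 2.09 | 1.34 | 2.50 | 1.05 | 1.53 | 1.40
DepthFirstSearch | 3 | 1.37 | 1.30 | 1.00 | 2.02 | 1.34 | 1.33 | 1.26 | 1.49 | 1.20
Deque | 3 | 1.61 | 1.06 | 1.00 | 2.76 | 1.36 | 2.43 | 1.33 | 1.46 | 1.50
DictOps | 2 | 1.55 | 0.88 | 1.00 | 2.94 | 1.46 | 2.00 | 1.38 | 1.41 | 1.33
DigitCount | 2 | 1.62 | 0.24 | 1.00 | 3.10 | 1.34 | 2.83 | 0.98 | 1.56 | 1.90
Dijkstra | 4 | 1.80 | 1.39 | 1.00 | 2.96 | 1.37 | 2.83 | 1.61 | 1.46 | 1.80
DisjointSet | 3 | 1.32 | 1.30 | 1.00 | 1.62 | 1.55 | 1.33 | 1.04 | 1.52 | 1.20
DistanceCalculator | 3 | 1.25 | 0.37 | 1.00 | 1.78 | 1.34 | 2.00 | 0.55 | 1.49 | 1.50
DivisionContracts | 3 | 1.58 | 0.65 | 1.14 | 2.78 | 1.42 | 2.43 | 0.82 | 1.61 | 1.80
EditDistance | 3 | 1.29 | 1.31 | 1.00 | 1.80 | 1.25 | 1.33 | 1.16 | 1.49 | 1.00
EmailValidator | 2 | 1.18 | 0.54 | 1.00 | 2.05 | 1.34 | 1.33 | 0.71 | 1.44 | 1.00
Encapsulation | 3 | 1.20 | 1.22 | 1.00 | 1.31 | 1.40 | 1.33 | 0.82 | 1.52 | 1.00
EnumMatch | 2 | 1.36 | 0.62 | 1.00 | 2.07 | 1.34 | 1.33 | 1.88 | 1.44 | 1.20
EnumType | 3 | 1.34 | 0.66 | 1.00 | 2.37 | 1.34 | 1.33 | 1.18 | 1.46 | 1.40
EnumWithMethods | 3 | 1.29 | 0.48 | 1.00 | 2.37 | 1.34 | 1.33 | 0.94 | 1.44 | 1.40
ErrorPropagation | 2 | 1.31 | 0.56 | 1.00 | 2.21 | 1.34 | 1.86 | 0.92 | 1.44 | 1.17
ExceptionChain | 2 | 1.44 | 0.77 | 1.00 | 2.11 | 1.34 | 1.71 | 1.72 | 1.49 | 1.36
ExhaustiveMatch | 2 | 1.37 | 0.68 | 1.00 | 2.41 | 1.34 | 1.14 | 1.75 | 1.46 | 1.17
ExpressionBodied | 1 | 1.26 | 0.64 | 1.00 | 2.31 | 1.34 | 1.33 | 1.02 | 1.41 | 1.00
Factorial | 2 | 1.13 | 0.50 | 1.00 | 1.31 | 1.34 | 1.33 | 0.82 | 1.51 | 1.20
Factory | 3 | 1.57 | 1.00 | 1.00 | 2.38 | 1.38 | 2.43 | 0.97 | 2.10 | 1.33
Fibonacci | 2 | 1.14 | 0.38 | 1.00 | 1.41 | 1.34 | 1.33 | 0.77 | 1.51 | 1.40
FileEffects | 3 | 1.47 | 0.92 | 1.10 | 2.72 | 1.42 | 1.67 | 1.30 | 1.65 | 1.00
Filter | 2 | 1.35 | 0.94 | 1.00 | 2.69 | 1.46 | 1.33 | 0.73 | 1.41 | 1.20
FizzBuzz | 2 | 1.16 | 0.96 | 1.00 | 1.06 | 1.28 | 1.50 | 0.36 | 1.72 | 1.40
Flyweight | 3 | 1.62 | 0.88 | 1.00 | 2.25 | 1.40 | 2.83 | 1.52 | 1.61 | 1.50
GCD | 2 | 1.10 | 0.49 | 1.00 | 1.32 | 1.34 | 1.33 | 0.78 | 1.51 | 1.00
GenericClass | 2 | 1.44 | 1.15 | 1.00 | 1.98 | 1.36 | 1.71 | 1.38 | 1.46 | 1.50
GenericConstraints | 3 | 1.40 | 0.79 | 1.00 | 2.82 | 1.46 | 1.33 | 0.96 | 1.41 | 1.40
GenericFunction | 2 | 1.36 | 0.67 | 1.00 | 2.90 | 1.46 | 1.33 | 1.10 | 1.41 | 1.00
GradeCalculator | 2 | 1.34 | 0.47 | 1.00 | 1.98 | 1.34 | 2.17 | 0.62 | 1.44 | 1.70
Graph | 3 | 1.71 | 0.90 | 1.00 | 3.15 | 1.64 | 2.83 | 1.18 | 1.46 | 1.50
GroupBy | 2 | 1.56 | 1.23 | 1.00 | 2.86 | 1.46 | 2.17 | 0.96 | 1.41 | 1.40
GuardClause | 2 | 1.38 | 0.71 | 1.00 | 2.50 | 1.34 | 1.86 | 0.82 | 1.49 | 1.33
GuardMatch | 2 | 1.35 | 0.80 | 1.00 | 1.97 | 1.34 | 1.33 | 1.53 | 1.44 | 1.40
HashMap | 3 | 1.67 | 1.34 | 1.00 | 2.28 | 1.30 | 2.83 | 1.51 | 1.57 | 1.50
HelloWorld | 1 | 1.37 | 0.72 | 1.00 | 2.61 | 1.33 | 1.50 | 1.28 | 1.53 | 1.00
HiddenNetworkEffect | 3 | 1.56 | 1.29 | 1.14 | 2.35 | 1.46 | 1.43 | 2.08 | 1.75 | 1.00
Hypotenuse | 2 | 1.49 | 0.44 | 1.00 | 3.04 | 1.34 | 2.43 | 0.92 | 1.44 | 1.33
Inheritance | 3 | 1.33 | 0.69 | 1.00 | 1.74 | 1.51 | 1.33 | 1.49 | 1.52 | 1.40
InlineSigs | 2 | 1.22 | 0.46 | 1.00 | 2.30 | 1.34 | 1.33 | 0.71 | 1.41 | 1.20
InsertionSort | 2 | 1.33 | 0.83 | 1.00 | 2.61 | 1.25 | 1.33 | 1.28 | 1.49 | 0.83
InterfaceImpl | 3 | 1.56 | 1.06 | 1.00 | 2.16 | 1.71 | 2.17 | 1.21 | 1.76 | 1.40
InterpolationSearch | 3 | 1.37 | 0.89 | 1.00 | 1.92 | 1.34 | 2.17 | 0.98 | 1.51 | 1.14
Inventory | 3 | 1.61 | 0.87 | 1.00 | 3.21 | 1.36 | 2.43 | 1.08 | 1.46 | 1.50
IsAlpha | 1 | 1.21 | 0.50 | 1.00 | 2.38 | 1.34 | 1.33 | 0.53 | 1.41 | 1.20
IsEven | 1 | 1.13 | 0.60 | 1.00 | 1.45 | 1.34 | 1.33 | 0.68 | 1.44 | 1.20
IsOdd | 1 | 1.10 | 0.60 | 1.00 | 1.45 | 1.34 | 1.33 | 0.68 | 1.44 | 1.00
IsPrime | 3 | 1.19 | 0.65 | 1.00 | 1.18 | 1.38 | 1.44 | 0.70 | 1.44 | 1.70
Iterator | 2 | 1.47 | 0.70 | 1.00 | 2.25 | 1.37 | 2.17 | 1.43 | 1.46 | 1.40
Knapsack | 4 | 1.68 | 0.90 | 1.00 | 3.15 | 1.25 | 2.83 | 1.24 | 1.49 | 1.58
LCS | 3 | 1.31 | 1.02 | 1.00 | 2.03 | 1.25 | 1.33 | 1.21 | 1.44 | 1.17
LeapYear | 2 | 1.15 | 0.71 | 1.00 | 1.36 | 1.38 | 1.33 | 0.49 | 1.55 | 1.40
LinearSearch | 2 | 1.18 | 0.52 | 1.00 | 2.01 | 1.25 | 1.33 | 0.97 | 1.46 | 0.86
LinkedList | 4 | 1.43 | 2.14 | 1.00 | 1.25 | 1.40 | 1.14 | 1.23 | 1.84 | 1.40
LinqPipeline | 2 | 1.27 | 0.83 | 1.00 | 2.13 | 1.34 | 1.33 | 0.99 | 1.41 | 1.17
ListInvariant | 3 | 1.71 | 0.35 | 1.14 | 4.08 | 1.42 | 2.43 | 0.91 | 1.52 | 1.80
ListOps | 2 | 1.44 | 0.58 | 1.00 | 2.61 | 1.34 | 2.00 | 1.22 | 1.41 | 1.33
Map | 2 | 1.49 | 1.04 | 1.00 | 3.69 | 1.46 | 1.33 | 1.01 | 1.41 | 1.00
MathLib | 2 | 1.49 | 0.48 | 1.10 | 3.14 | 1.42 | 2.00 | 0.83 | 1.52 | 1.40
MathOperations | 3 | 1.25 | 1.14 | 1.00 | 1.02 | 1.38 | 1.33 | 1.00 | 1.52 | 1.60
MatrixMultiply | 3 | 1.49 | 0.99 | 1.00 | 2.45 | 1.37 | 1.86 | 1.49 | 1.44 | 1.33
MaxHeap | 3 | 1.53 | 1.59 | 1.00 | 2.14 | 1.40 | 1.86 | 1.50 | 1.58 | 1.17
MaxTwo | 1 | 1.27 | 0.39 | 1.00 | 1.79 | 1.34 | 2.00 | 0.81 | 1.44 | 1.40
MaxValue | 2 | 1.12 | 0.53 | 1.00 | 1.63 | 1.25 | 0.89 | 1.31 | 1.36 | 1.00
Mediator | 3 | 1.31 | 1.45 | 1.00 | 1.78 | 1.40 | 1.14 | 1.36 | 1.52 | 0.83
MergeSort | 3 | 1.53 | 1.90 | 1.00 | 1.82 | 1.38 | 2.17 | 1.15 | 1.52 | 1.33
MethodOverloading | 3 | 1.17 | 0.47 | 1.00 | 2.18 | 1.34 | 1.33 | 0.66 | 1.41 | 1.00
MethodOverriding | 3 | 1.32 | 0.81 | 1.00 | 1.67 | 1.45 | 1.17 | 1.98 | 1.46 | 1.00
MinStack | 3 | 1.33 | 1.25 | 1.00 | 1.76 | 1.64 | 1.14 | 1.21 | 1.46 | 1.17
MinTwo | 1 | 1.27 | 0.39 | 1.00 | 1.79 | 1.34 | 2.00 | 0.81 | 1.44 | 1.40
MissingEffects | 2 | 1.42 | 0.89 | 1.14 | 2.38 | 1.42 | 1.67 | 1.18 | 1.72 | 1.00
MixedContracts | 3 | 1.52 | 0.85 | 1.14 | 1.97 | 1.42 | 2.43 | 1.20 | 1.65 | 1.50
MixedSyntax | 2 | 1.44 | 0.63 | 1.00 | 2.45 | 1.34 | 2.43 | 0.88 | 1.46 | 1.33
ModuloContracts | 3 | 1.76 | 0.29 | 1.14 | 4.19 | 1.42 | 2.83 | 0.74 | 1.54 | 1.90
MultipleCatch | 2 | 1.19 | 0.58 | 1.00 | 1.62 | 1.38 | 1.33 | 0.96 | 1.52 | 1.09
NestedMatch | 2 | 1.35 | 0.91 | 1.00 | 1.97 | 1.34 | 1.33 | 1.65 | 1.44 | 1.20
NetworkEffect | 3 | 1.45 | 0.70 | 1.14 | 2.64 | 1.42 | 1.67 | 1.37 | 1.68 | 1.00
NullCheck | 2 | 1.15 | 0.66 | 1.00 | 2.01 | 1.34 | 1.14 | 0.79 | 1.41 | 0.80
Observer | 3 | 1.30 | 1.10 | 1.00 | 1.79 | 1.40 | 1.33 | 1.07 | 1.55 | 1.20
OptionalIds | 2 | 1.17 | 0.47 | 1.00 | 2.15 | 1.34 | 1.33 | 0.69 | 1.41 | 1.00
OptionType | 2 | 1.58 | 0.69 | 1.10 | 3.05 | 1.44 | 1.86 | 1.41 | 1.58 | 1.50
OverflowSafe | 3 | 1.56 | 1.13 | 1.03 | 1.76 | 1.38 | 2.43 | 1.40 | 1.55 | 1.80
OverflowUnsafe | 3 | 1.69 | 0.98 | 1.03 | 1.96 | 1.34 | 2.83 | 2.20 | 1.38 | 1.80
Palindrome | 2 | 1.21 | 0.54 | 1.00 | 1.83 | 1.34 | 1.33 | 1.14 | 1.49 | 1.00
ParallelTasks | 3 | 1.38 | 0.67 | 1.00 | 2.83 | 1.46 | 1.33 | 1.29 | 1.44 | 1.00
PasswordValidator | 2 | 1.20 | 0.67 | 1.00 | 2.00 | 1.34 | 1.33 | 0.81 | 1.46 | 1.00
PhoneBook | 2 | 1.59 | 0.75 | 1.00 | 2.83 | 1.36 | 2.83 | 1.04 | 1.41 | 1.50
Polymorphism | 3 | 1.31 | 1.28 | 1.00 | 1.31 | 1.71 | 1.33 | 0.88 | 1.58 | 1.40
Power | 2 | 1.21 | 0.82 | 1.00 | 1.13 | 1.38 | 1.44 | 0.94 | 1.41 | 1.60
PriorityQueue | 3 | 1.53 | 1.73 | 1.00 | 2.12 | 1.40 | 1.86 | 1.19 | 1.65 | 1.33
Properties | 2 | 1.44 | 0.72 | 1.10 | 2.22 | 1.51 | 1.86 | 1.22 | 1.55 | 1.33
PropertyAccess | 1 | 1.99 | 0.78 | 1.10 | 2.09 | 1.51 | 2.17 | 5.09 | 1.58 | 1.60
PropertyMatch | 2 | 1.37 | 0.80 | 1.00 | 1.89 | 1.68 | 1.33 | 1.54 | 1.52 | 1.20
ProvableContracts | 2 | 1.55 | 0.59 | 1.14 | 2.52 | 1.42 | 2.43 | 1.05 | 1.65 | 1.58
Proxy | 3 | 1.33 | 0.83 | 1.00 | 1.74 | 1.71 | 1.43 | 1.05 | 1.68 | 1.20
PureComputation | 2 | 1.46 | 0.50 | 1.14 | 2.73 | 1.42 | 2.00 | 0.88 | 1.52 | 1.50
PureFunctions | 1 | 1.44 | 0.46 | 1.14 | 1.98 | 1.46 | 2.17 | 0.83 | 1.75 | 1.70
Queue | 3 | 1.26 | 1.09 | 1.00 | 1.52 | 1.40 | 1.14 | 1.25 | 1.55 | 1.17
QuickSort | 4 | 1.51 | 2.16 | 1.00 | 1.85 | 1.29 | 1.86 | 1.07 | 1.72 | 1.14
RangeContracts | 3 | 1.59 | 0.47 | 1.14 | 2.85 | 1.42 | 2.83 | 0.79 | 1.61 | 1.58
RangeMatch | 2 | 1.56 | 0.60 | 1.00 | 2.32 | 1.34 | 2.17 | 1.95 | 1.44 | 1.70
RecordType | 3 | 1.56 | 0.28 | 1.10 | 3.61 | 1.74 | 2.00 | 0.87 | 1.57 | 1.30
Reduce | 2 | 1.32 | 0.80 | 1.00 | 2.58 | 1.34 | 1.33 | 0.90 | 1.41 | 1.20
ResultType | 2 | 1.28 | 1.10 | 1.00 | 1.85 | 1.40 | 1.33 | 1.15 | 1.52 | 0.92
ReverseArray | 2 | 1.64 | 0.51 | 1.00 | 3.50 | 1.34 | 2.17 | 1.81 | 1.46 | 1.33
ReverseString | 2 | 1.23 | 0.75 | 1.00 | 1.79 | 1.34 | 1.33 | 1.31 | 1.46 | 0.83
ScoreBoard | 2 | 1.65 | 0.81 | 1.00 | 2.92 | 1.36 | 2.83 | 1.10 | 1.41 | 1.80
SealedClass | 2 | 1.28 | 1.30 | 1.00 | 1.70 | 1.36 | 1.33 | 1.03 | 1.49 | 1.00
SearchContracts | 3 | 1.67 | 0.39 | 1.14 | 3.39 | 1.42 | 2.83 | 0.83 | 1.55 | 1.80
SelectionSort | 2 | 1.52 | 0.88 | 1.00 | 2.73 | 1.25 | 2.17 | 1.49 | 1.49 | 1.17
SetOps | 2 | 1.60 | 1.18 | 1.00 | 2.77 | 1.46 | 2.00 | 1.38 | 1.49 | 1.50
ShoppingCart | 4 | 1.64 | 1.77 | 1.00 | 2.25 | 1.30 | 2.43 | 1.09 | 2.13 | 1.19
Sign | 1 | 1.09 | 0.56 | 1.00 | 1.38 | 1.34 | 1.33 | 0.68 | 1.44 | 1.00
SimpleAsync | 3 | 1.34 | 0.78 | 1.00 | 1.96 | 1.33 | 1.50 | 1.59 | 1.53 | 1.00
SimpleClass | 2 | 1.68 | 0.74 | 1.00 | 2.80 | 1.68 | 2.83 | 1.03 | 1.58 | 1.80
SimpleMatch | 2 | 1.42 | 0.60 | 1.00 | 2.15 | 1.34 | 1.33 | 2.09 | 1.44 | 1.40
Singleton | 2 | 1.18 | 0.72 | 1.00 | 1.68 | 1.40 | 1.33 | 1.01 | 1.52 | 0.77
Sort | 2 | 1.24 | 0.98 | 1.00 | 2.10 | 1.25 | 1.33 | 0.99 | 1.44 | 0.83
SortedList | 2 | 1.43 | 0.86 | 1.00 | 2.07 | 1.63 | 2.00 | 1.18 | 1.44 | 1.25
SortingContracts | 3 | 1.69 | 0.43 | 1.14 | 3.69 | 1.42 | 2.83 | 0.66 | 1.52 | 1.80
Stack | 3 | 1.28 | 1.18 | 1.00 | 1.44 | 1.40 | 1.14 | 1.14 | 1.55 | 1.40
State | 3 | 1.55 | 0.59 | 1.00 | 2.60 | 1.40 | 1.33 | 2.54 | 1.52 | 1.40
StateEffect | 3 | 1.47 | 0.65 | 1.14 | 2.30 | 1.44 | 1.67 | 1.88 | 1.65 | 1.00
StaticMembers | 3 | 1.30 | 0.74 | 1.10 | 1.69 | 1.46 | 1.33 | 1.59 | 1.46 | 1.00
Strategy | 3 | 1.27 | 1.14 | 1.00 | 1.74 | 1.53 | 1.33 | 0.91 | 1.55 | 1.00
StringContracts | 3 | 1.43 | 0.89 | 1.03 | 1.76 | 1.34 | 2.43 | 1.22 | 1.46 | 1.29
StringLib | 2 | 1.43 | 0.46 | 1.10 | 2.87 | 1.42 | 2.00 | 0.78 | 1.52 | 1.25
StringUtils | 2 | 1.11 | 0.52 | 1.00 | 1.69 | 1.38 | 1.33 | 0.65 | 1.52 | 0.83
SumDigits | 2 | 1.65 | 0.19 | 1.00 | 3.37 | 1.34 | 2.83 | 1.02 | 1.56 | 1.90
SumRange | 2 | 1.28 | 0.88 | 1.00 | 1.51 | 1.25 | 1.86 | 1.10 | 1.49 | 1.17
SwitchExpression | 2 | 1.64 | 0.72 | 1.00 | 2.26 | 1.34 | 2.17 | 2.45 | 1.49 | 1.70
TaxCalculator | 3 | 1.48 | 0.38 | 1.00 | 2.63 | 1.34 | 2.83 | 0.66 | 1.44 | 1.58
TemperatureConverter | 1 | 1.13 | 0.54 | 1.00 | 1.59 | 1.34 | 1.33 | 0.73 | 1.49 | 1.00
TemplateMethod | 3 | 1.03 | 0.83 | 0.76 | 0.54 | 0.91 | 1.14 | 1.02 | 1.92 | 1.10
TernaryChain | 2 | 1.27 | 0.62 | 1.00 | 1.94 | 1.34 | 1.33 | 1.15 | 1.41 | 1.40
Timer | 2 | 1.62 | 0.66 | 1.00 | 3.03 | 1.40 | 2.43 | 1.37 | 1.55 | 1.50
TimerEffect | 3 | 1.71 | 0.42 | 1.10 | 3.71 | 1.42 | 2.83 | 0.78 | 1.54 | 1.90
TodoList | 2 | 1.61 | 0.92 | 1.00 | 2.51 | 1.63 | 2.43 | 1.44 | 1.46 | 1.50
TowerOfHanoi | 3 | 1.56 | 0.81 | 1.00 | 2.65 | 1.38 | 2.43 | 0.97 | 1.65 | 1.58
Trie | 4 | 1.62 | 1.36 | 1.00 | 2.09 | 1.40 | 2.83 | 1.30 | 1.61 | 1.39
TruncateString | 2 | 1.31 | 0.53 | 1.00 | 2.17 | 1.34 | 1.86 | 1.05 | 1.44 | 1.07
TryFinally | 2 | 1.22 | 0.56 | 1.00 | 1.82 | 1.34 | 1.33 | 1.20 | 1.49 | 1.00
TupleMatch | 2 | 1.31 | 0.61 | 1.00 | 2.02 | 1.34 | 1.33 | 1.31 | 1.44 | 1.40
TypeAlias | 3 | 1.53 | 0.75 | 1.10 | 3.08 | 1.44 | 1.63 | 1.63 | 1.44 | 1.17
TypeMatch | 2 | 1.44 | 0.78 | 1.00 | 1.75 | 1.38 | 1.33 | 2.39 | 1.52 | 1.40
UnitConverter | 2 | 1.42 | 0.38 | 1.00 | 2.94 | 1.34 | 2.17 | 0.75 | 1.41 | 1.40
UrlParser | 2 | 1.25 | 0.64 | 1.00 | 1.97 | 1.34 | 1.33 | 0.91 | 1.41 | 1.40
Visitor | 3 | 1.08 | 0.94 | 0.76 | 0.61 | 0.94 | 1.33 | 0.81 | 2.23 | 1.00
VotingSystem | 2 | 1.47 | 0.64 | 1.00 | 2.50 | 1.36 | 2.17 | 1.09 | 1.44 | 1.60
WildcardMatch | 2 | 1.27 | 0.46 | 1.00 | 2.09 | 1.34 | 1.33 | 1.11 | 1.44 | 1.40
Zip | 2 | 1.39 | 0.94 | 1.00 | 3.14 | 1.34 | 1.33 | 0.96 | 1.41 | 1.00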

Showing 207 of 207 programs. Values above 1.0 favor Calor; values below 1.0 favor C#.
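
In the per-program results, the Adv value appears to be the arithmetic mean of the eight metric ratios for that program, rounded to two decimals. The page does not say this explicitly, so treat it as an inference from the data. The sketch below checks it for three programs, using values copied from the results (order: Tokens, Gen Acc, Comp, Edit, Err Det, Info Den, Refactor, Correct); the helper name is hypothetical.

```python
from statistics import fmean

# (reported Adv, [eight metric ratios]) copied from the per-program results.
rows = {
    "AbsoluteContracts": (1.68, [0.39, 1.14, 4.04, 1.42, 2.43, 0.90, 1.52, 1.58]),
    "Dijkstra": (1.80, [1.39, 1.00, 2.96, 1.37, 2.83, 1.61, 1.46, 1.80]),
    "PropertyAccess": (1.99, [0.78, 1.10, 2.09, 1.51, 2.17, 5.09, 1.58, 1.60]),
}


def matches_reported(adv: float, metrics: list[float], tol: float = 0.005) -> bool:
    """True when the mean of the metrics agrees with Adv to two decimals."""
    return abs(fmean(metrics) - adv) <= tol + 1e-9


for name, (adv, metrics) in rows.items():
    print(f"{name}: reported {adv:.2f}, mean {fmean(metrics):.4f}, "
          f"match={matches_reported(adv, metrics)}")
```

Because the mean weights all metrics equally, a single outlier such as PropertyAccess's 5.09 information-density ratio can lift a program's Adv substantially.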


The Tradeoff

The benchmark results reveal a fundamental tradeoff in language design for AI agents:

Explicitness vs. Efficiency

Calor's design prioritizes explicit semantics—contracts, effect annotations, unique IDs—that enable better reasoning about program invariants. This comes at the cost of token efficiency, as explicit syntax requires more tokens than implicit conventions.

C#'s ecosystem maturity, and the extensive training data LLMs have seen for it, keeps it near parity in generation accuracy (1.02x) and level with Calor on the agent refactoring tasks (19/20 each). Calor's explicit contracts, however, provide a measurable benefit for error detection (1.83x).


When to Use Calor

Based on these results, Calor is most valuable when:

  • Contract verification matters — The error detection advantage validates explicit contracts
  • Edit precision is important — Unique IDs enable targeted modifications
  • Agent comprehension is critical — Explicit structure provides clear signals
  • Token budget is flexible — You can afford the overhead of explicitness

Use C# when:

  • Token efficiency is paramount — Context window is limited
  • Ecosystem libraries are needed — Leverage existing tooling
  • Human readability is a priority — Familiar syntax for human developers

Methodology

All benchmarks are automated and reproducible. Each program is evaluated across 8 metrics comparing Calor and C# implementations.

The evaluation framework:

  1. Generates prompts for each program/metric combination
  2. Records LLM responses for both Calor and C# versions
  3. Evaluates responses against ground truth
  4. Computes ratios (greater than 1.0 favors Calor, less than 1.0 favors C#)
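
Step 4 can be sketched as follows. This is a minimal illustration, assuming each metric yields one numeric score per language and that higher scores are better. The function names, the sample scores, and the inverted ratio for cost-like metrics such as token count are assumptions for illustration, not the framework's actual code.

```python
def ratio(calor_score: float, csharp_score: float) -> float:
    """Ratio for a higher-is-better metric; above 1.0 favors Calor."""
    if csharp_score == 0:
        raise ValueError("C# score must be non-zero to form a ratio")
    return calor_score / csharp_score


def cost_ratio(calor_cost: float, csharp_cost: float) -> float:
    """Ratio for a lower-is-better metric (assumed inverted so that
    values above 1.0 still favor Calor)."""
    if calor_cost == 0:
        raise ValueError("Calor cost must be non-zero to form a ratio")
    return csharp_cost / calor_cost


def verdict(r: float) -> str:
    """Map a ratio to the side it favors."""
    if r > 1.0:
        return "Calor"
    if r < 1.0:
        return "C#"
    return "Tie"


# Hypothetical scores: Calor 0.75 vs C# 0.5 on an accuracy-style metric.
print(ratio(0.75, 0.5), verdict(ratio(0.75, 0.5)))      # 1.5 Calor
# Hypothetical token counts: Calor 100 vs C# 79.
print(cost_ratio(100, 79), verdict(cost_ratio(100, 79)))  # 0.79 C#
```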

See Methodology for full details on the evaluation framework.


Running Benchmarks

To regenerate benchmark results:

```bash
# Run the evaluation framework
dotnet run --project tests/Calor.Evaluation -- run -f website -o website/public/data/benchmark-results.json

# Start the website to view results
cd website && npm run dev
```

Results are automatically loaded from benchmark-results.json and displayed in the dashboard above.