Generation Accuracy

Category: Generation Accuracy
Result: C# wins (0.94x)
What it measures: Compilation success and structural correctness


Overview

The Generation Accuracy metric measures how reliably AI agents generate code that compiles and is structurally complete in each language.


Why It Matters

When an agent generates code, it must:

  • Produce syntactically valid code
  • Create structurally complete programs
  • Avoid compilation errors

Languages with familiar patterns and extensive training data have an advantage.


How It's Measured

Composite Score

Plain Text
Score = (CompilationSuccess × 0.5) + (StructureScore × 0.3) + (NoErrors × 0.2)

| Component           | Weight | Description                           |
|---------------------|--------|---------------------------------------|
| Compilation Success | 50%    | Does the code compile? (1.0 or 0.0)   |
| Structure Score     | 30%    | Are all structural elements present?  |
| No Errors           | 20%    | Zero compilation errors? (1.0 or 0.0) |
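
As a worked example, the weighted sum can be sketched in a few lines of C#. `CompositeScore` is a hypothetical helper written for illustration, not part of the benchmark framework, and the inputs are made up. The last call yields 0.0 for a non-compiling program only under the assumption that the structure score is also 0 when compilation fails.

```csharp
using System;

// Hypothetical helper (not framework code): the weighted sum from the formula above.
double CompositeScore(bool compiled, double structureScore, int errorCount)
{
    double compilationSuccess = compiled ? 1.0 : 0.0;   // 50% weight
    double noErrors = errorCount == 0 ? 1.0 : 0.0;      // 20% weight
    return compilationSuccess * 0.5 + structureScore * 0.3 + noErrors * 0.2;
}

// Compiles cleanly with every structural element present.
Console.WriteLine(CompositeScore(compiled: true, structureScore: 1.0, errorCount: 0));

// Compiles cleanly but is missing some structure (illustrative value).
Console.WriteLine(CompositeScore(compiled: true, structureScore: 0.7, errorCount: 0));

// Fails to compile; assumes the structure score is 0 when nothing parses.
Console.WriteLine(CompositeScore(compiled: false, structureScore: 0.0, errorCount: 1));
```

Because Compilation Success and No Errors are both binary, any compilation failure costs at least 0.7 of the score, which is why structure matters mostly for programs that already compile.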

Structure Score Calculation

Calor Structure Score

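| Element         | Points | Check                       |
|-----------------|--------|-----------------------------|
| Module name     | 0.20   | Module has a name           |
| Has functions   | 0.30   | At least one function       |
| Function name   | 0.10   | Functions have names        |
| Function output | 0.10   | Functions have output types |
| Function body   | 0.10   | Functions have bodies       |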

C# Structure Score

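| Element       | Points | Check                         |
|---------------|--------|-------------------------------|
| Has usings    | 0.10   | Using statements present      |
| Has namespace | 0.30   | Namespace declaration present |
| Has class     | 0.30   | Class declaration present     |
| Has method    | 0.30   | Method declaration present    |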
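
Both tables amount to a sum of points over the elements that were detected. The following is a minimal sketch, assuming the points listed above. The dictionary keys and the `StructureScore` helper are hypothetical names, and the framework's own `CalculateCalorStructureScore` and `CalculateCSharpStructureScore` may work differently. As listed, the Calor points sum to 0.80 while the C# points sum to 1.00. The sketch uses the values as given and does not normalize them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Points per element, copied from the two tables above (keys are illustrative names).
var calorPoints = new Dictionary<string, double>
{
    ["ModuleName"] = 0.20,
    ["HasFunctions"] = 0.30,
    ["FunctionName"] = 0.10,
    ["FunctionOutput"] = 0.10,
    ["FunctionBody"] = 0.10,
};

var csharpPoints = new Dictionary<string, double>
{
    ["HasUsings"] = 0.10,
    ["HasNamespace"] = 0.30,
    ["HasClass"] = 0.30,
    ["HasMethod"] = 0.30,
};

// Sum the points of every element that was detected in the generated code.
double StructureScore(Dictionary<string, double> points, ISet<string> present) =>
    points.Where(p => present.Contains(p.Key)).Sum(p => p.Value);

// A C# file with a class and a method but no usings or namespace.
var found = new HashSet<string> { "HasClass", "HasMethod" };
double csharpPartial = StructureScore(csharpPoints, found);

// Every listed element present, per language.
double csharpComplete = StructureScore(csharpPoints, new HashSet<string>(csharpPoints.Keys));
double calorComplete = StructureScore(calorPoints, new HashSet<string>(calorPoints.Keys));

Console.WriteLine($"C# partial:     {csharpPartial:F2}");
Console.WriteLine($"C# complete:    {csharpComplete:F2}");
Console.WriteLine($"Calor complete: {calorComplete:F2}");
```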

Why C# Wins

1. Training Data Advantage

LLMs have seen vastly more C# code than Calor code. This creates:

  • Better pattern recognition
  • Fewer syntax errors
  • More accurate boilerplate generation

2. Familiar Patterns

C#
// C#: Very common pattern, well-learned
public static int Add(int a, int b)
{
    return a + b;
}
Plain Text
// Calor: Novel syntax, less exposure
§F{f001:Add:pub}
  §I{i32:a}
  §I{i32:b}
  §O{i32}
  §R (+ a b)

3. Error Recovery

C# errors often have clear fixes:

  • Missing semicolon → add semicolon
  • Missing brace → add brace
  • Wrong type → cast or convert

Calor errors may be less familiar:

  • Misindented block body → requires understanding indent-only scope
  • Wrong block nesting → requires indentation-level understanding

Example Comparison

Agent Task: "Write a function that adds two integers"

C# generation (typical):

C#
public static int Add(int a, int b)
{
    return a + b;
}
  • Compiles: Yes
  • Structure: Complete
  • Errors: 0
  • Score: 1.0

Calor generation (typical):

Plain Text
§F{f001:Add:pub}
  §I{i32:a}
  §I{i32:b}
  §O{i32}
  §R (+ a b)
  • Compiles: Yes
  • Structure: Complete
  • Errors: 0
  • Score: 1.0

Calor generation (common error):

Plain Text
§F{f001:Add:pub}
§I{i32:a}
§I{i32:b}
§O{i32}
§R (+ a b)
  • Compiles: No
  • Error: Function body is not indented
  • Score: 0.0

Error Patterns

Common Calor Errors

| Error                  | Cause                                   | Example                                    |
|------------------------|-----------------------------------------|--------------------------------------------|
| Misindented body       | Body not indented under opener          | §F{f001} followed by unindented statements |
| Missing block body     | Forgot statements inside a block        | §L{for1} with no indented body             |
| Wrong nesting          | Dedented or indented at the wrong level | §IF{if1} body aligned with parent          |
| Missing module wrapper | No enclosing §M{} block                 | Function outside module                    |

Common C# Errors

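| Error             | Cause                   | Example                      |
|-------------------|-------------------------|------------------------------|
| Missing semicolon | Forgot line terminator  | return x                     |
| Missing brace     | Unbalanced braces       | if (x) { return y;           |
| Type mismatch     | Wrong return type       | string method returning int  |
| Missing using     | Forgot namespace import | Using Console without System |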

Mitigation Strategies

For Calor

  1. Template-based generation: Start with valid templates
  2. Indentation tracking: Maintain consistent block levels
  3. Validation pass: Check structure before outputting
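
A validation pass for the most common error listed above can be quite small. The sketch below assumes only what this page shows, namely that a `§F{...}` opener must be followed by a more deeply indented body. It is not the Calor compiler's actual check, and `FindUnindentedBodies` and `IndentOf` are hypothetical helpers.

```csharp
using System;
using System.Collections.Generic;

// Number of leading whitespace characters on a line.
int IndentOf(string line) => line.Length - line.TrimStart().Length;

// Hypothetical pre-output check: reports every §F opener whose next
// non-blank line is not indented more deeply than the opener itself.
List<string> FindUnindentedBodies(string source)
{
    var problems = new List<string>();
    string[] lines = source.Split('\n');

    for (int i = 0; i < lines.Length; i++)
    {
        string line = lines[i].TrimEnd('\r');
        if (!line.TrimStart().StartsWith("§F{", StringComparison.Ordinal)) continue;

        // Skip blank lines to find the first line of the body.
        int j = i + 1;
        while (j < lines.Length && string.IsNullOrWhiteSpace(lines[j])) j++;

        if (j >= lines.Length || IndentOf(lines[j]) <= IndentOf(line))
            problems.Add($"Line {i + 1}: function body is not indented");
    }
    return problems;
}

// The two Calor generations from the Example Comparison section.
string good = string.Join("\n",
    "§F{f001:Add:pub}", "  §I{i32:a}", "  §I{i32:b}", "  §O{i32}", "  §R (+ a b)");
string bad = string.Join("\n",
    "§F{f001:Add:pub}", "§I{i32:a}", "§I{i32:b}", "§O{i32}", "§R (+ a b)");

Console.WriteLine(FindUnindentedBodies(good).Count); // prints 0
Console.WriteLine(FindUnindentedBodies(bad).Count);  // prints 1
```

Running a check like this before emitting code lets an agent repair indentation itself instead of spending a compile attempt on it.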

For C#

  1. Standard patterns: Use well-established code patterns
  2. Roslyn validation: Parse generated code before output
  3. Error correction: Attempt to fix common errors

Detailed Metrics

The framework provides detailed breakdowns:

C#
// 1.0 if the generated program compiled, otherwise 0.0 (higher is better).
results.Add(MetricResult.CreateHigherIsBetter(
    Category,
    "CompilationSuccess",
    calorCompilation.Success ? 1.0 : 0.0,
    csharpCompilation.Success ? 1.0 : 0.0));

// Number of compilation errors reported (lower is better).
results.Add(MetricResult.CreateLowerIsBetter(
    Category,
    "ErrorCount",
    calorCompilation.Errors.Count,
    csharpCompilation.Errors.Count));

// Structure score from the per-language point tables (higher is better).
results.Add(MetricResult.CreateHigherIsBetter(
    Category,
    "StructureCompleteness",
    CalculateCalorStructureScore(calorCompilation),
    CalculateCSharpStructureScore(csharpCompilation)));

Interpretation

The 0.94x ratio indicates C# has a small advantage in generation accuracy.

This is expected because:

  • C# has more training data exposure
  • C# patterns are more familiar to LLMs
  • C# tooling provides better error feedback

The gap is relatively small, suggesting Calor's explicit structure helps offset the novelty disadvantage.
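
To make the ratio concrete, the sketch below assumes the reported figure is the Calor value relative to the C# value, oriented so that anything above 1.0 favors Calor. That orientation is inferred from the "C# wins (0.94x)" result and is not confirmed by the framework code shown on this page. `AdvantageRatio` and `Winner` are hypothetical helpers, and the input scores are invented for illustration.

```csharp
using System;

// Hypothetical helper: a ratio above 1.0 favors Calor, below 1.0 favors C#.
double AdvantageRatio(double calor, double csharp, bool higherIsBetter)
{
    // Identical values (including 0 and 0, e.g. no errors on either side) are a tie.
    if (calor == csharp) return 1.0;
    return higherIsBetter ? calor / csharp : csharp / calor;
}

string Winner(double r) => r > 1.0 ? "Calor" : r < 1.0 ? "C#" : "tie";

// Illustrative composite scores, chosen so the ratio matches the reported 0.94x.
double observed = AdvantageRatio(calor: 0.94, csharp: 1.00, higherIsBetter: true);
Console.WriteLine($"{observed:F2}x -> {Winner(observed)} wins");

// For a lower-is-better metric such as ErrorCount, the division is flipped.
double errorRatio = AdvantageRatio(calor: 2, csharp: 1, higherIsBetter: false);
Console.WriteLine($"{errorRatio:F2}x -> {Winner(errorRatio)} wins");
```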

