Generation Accuracy

Category: Generation Accuracy
Result: C# wins (0.94x)
What it measures: Compilation success and structural correctness


Overview

The Generation Accuracy metric measures how reliably AI agents generate code that compiles and is structurally complete in each language.


Why It Matters

When an agent generates code, it must:

  • Produce syntactically valid code
  • Create structurally complete programs
  • Avoid compilation errors

Languages with familiar patterns and extensive training data have an advantage.


How It's Measured

Composite Score

Plain Text
Score = (CompilationSuccess × 0.5) + (StructureScore × 0.3) + (NoErrors × 0.2)

| Component | Weight | Description |
|---|---|---|
| Compilation Success | 50% | Does the code compile? (1.0 or 0.0) |
| Structure Score | 30% | Are all structural elements present? |
| No Errors | 20% | Zero compilation errors? (1.0 or 0.0) |
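
The weighting above can be sketched in a few lines of C#. This is an illustrative sketch, not the framework's implementation: the `GenerationScore` class and its `Composite` method are hypothetical names, and only the three weights are taken from the formula.

```csharp
using System;

// Hypothetical helper; only the weights (0.5, 0.3, 0.2) come from the formula above.
public static class GenerationScore
{
    public static double Composite(bool compiled, double structureScore, int errorCount)
    {
        // The structure score is expected to lie in [0, 1]; this also rejects NaN.
        if (!(structureScore >= 0.0 && structureScore <= 1.0))
            throw new ArgumentOutOfRangeException(nameof(structureScore));
        if (errorCount < 0)
            throw new ArgumentOutOfRangeException(nameof(errorCount));

        double compilationSuccess = compiled ? 1.0 : 0.0; // binary component
        double noErrors = errorCount == 0 ? 1.0 : 0.0;    // binary component

        return compilationSuccess * 0.5 + structureScore * 0.3 + noErrors * 0.2;
    }
}
```

Under these assumptions, a program that compiles cleanly with every structural element present scores 1.0 (up to floating-point rounding), while one that compiles cleanly with half of its structural elements present scores 0.5 + 0.15 + 0.2 = 0.85.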

Structure Score Calculation

Calor Structure Score

| Element | Points | Check |
|---|---|---|
| Module name | 0.20 | Module has a name |
| Has functions | 0.30 | At least one function |
| Function name | 0.10 | Functions have names |
| Function output | 0.10 | Functions have output types |
| Function body | 0.10 | Functions have bodies |

C# Structure Score

| Element | Points | Check |
|---|---|---|
| Has usings | 0.10 | Using statements present |
| Has namespace | 0.30 | Namespace declaration present |
| Has class | 0.30 | Class declaration present |
| Has method | 0.30 | Method declaration present |
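
The C# table above amounts to a weighted checklist, which can be sketched as follows. The `CSharpStructure` record and `StructureScoring` class are hypothetical names introduced for illustration; how the framework actually detects each element is not shown here, so the presence flags are passed in directly. Only the point values come from the table.

```csharp
using System;

// Hypothetical input type: one flag per row of the C# Structure Score table.
public sealed record CSharpStructure(
    bool HasUsings, bool HasNamespace, bool HasClass, bool HasMethod);

public static class StructureScoring
{
    // Adds the points for each element that is present.
    public static double Score(CSharpStructure s)
    {
        double score = 0.0;
        if (s.HasUsings) score += 0.10;
        if (s.HasNamespace) score += 0.30;
        if (s.HasClass) score += 0.30;
        if (s.HasMethod) score += 0.30;

        // Clamp so floating-point rounding cannot push the total above 1.0.
        return Math.Min(score, 1.0);
    }
}
```

With this sketch, a file that declares a namespace, a class, and a method but has no using statements scores 0.90.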

Why C# Wins

1. Training Data Advantage

LLMs have seen vastly more C# code than Calor code during training, which leads to:

  • Better pattern recognition
  • Fewer syntax errors
  • More accurate boilerplate generation

2. Familiar Patterns

C#
// C#: Very common pattern, well-learned
public static int Add(int a, int b)
{
    return a + b;
}

Plain Text
// Calor: Novel syntax, less exposure
§F{f001:Add:pub}
  §I{i32:a}
  §I{i32:b}
  §O{i32}
  §R (+ a b)
§/F{f001}

3. Error Recovery

C# errors often have clear fixes:

  • Missing semicolon → add semicolon
  • Missing brace → add brace
  • Wrong type → cast or convert

Calor errors may be less familiar:

  • Missing §/F{id} → requires understanding closing tag rules
  • Wrong ID in closing tag → requires ID matching understanding

Example Comparison

Agent Task: "Write a function that adds two integers"

C# generation (typical):

C#
public static int Add(int a, int b)
{
    return a + b;
}
  • Compiles: Yes
  • Structure: Complete
  • Errors: 0
  • Score: 1.0

Calor generation (typical):

Plain Text
§F{f001:Add:pub}
  §I{i32:a}
  §I{i32:b}
  §O{i32}
  §R (+ a b)
§/F{f001}
  • Compiles: Yes
  • Structure: Complete
  • Errors: 0
  • Score: 1.0

Calor generation (common error):

Plain Text
§F{f001:Add:pub}
  §I{i32:a}
  §I{i32:b}
  §O{i32}
  §R (+ a b)
§/F{f002}              // Wrong ID!
  • Compiles: No
  • Error: Mismatched closing tag
  • Score: 0.0

Error Patterns

Common Calor Errors

| Error | Cause | Example |
|---|---|---|
| Mismatched IDs | Wrong ID in closing tag | §F{f001}...§/F{f002} |
| Missing closing tag | Forgot to close structure | §L{for1}... (no §/L) |
| Wrong tag type | Confused closing tag | §F{f001}...§/M{f001} |
| Missing module wrapper | No §M{}...§/M{} | Function outside module |

Common C# Errors

| Error | Cause | Example |
|---|---|---|
| Missing semicolon | Forgot line terminator | return x |
| Missing brace | Unbalanced braces | if (x) { return y; |
| Type mismatch | Wrong return type | string method returning int |
| Missing using | Forgot namespace import | Using Console without System |

Mitigation Strategies

For Calor

  1. Template-based generation: Start with valid templates
  2. ID tracking: Maintain list of open IDs, ensure matching closes
  3. Validation pass: Check structure before outputting
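
The ID tracking idea can be sketched with a stack of open tags. This is a rough illustration rather than part of the Calor toolchain: `CalorTagChecker` is a hypothetical name, and the sketch assumes that only §M, §F, and §L are paired block tags and that the ID is the first colon-separated field inside the braces.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sketch of ID tracking: keep a stack of open block tags and check that
// every closing tag matches the most recently opened one.
public static class CalorTagChecker
{
    // Assumption: only these tags have closing counterparts.
    private static readonly HashSet<string> BlockTags =
        new HashSet<string> { "M", "F", "L" };

    // Matches §X{id...} (opening) and §/X{id...} (closing).
    private static readonly Regex TagPattern =
        new Regex(@"§(/?)([A-Za-z]+)\{([^}:]*)[^}]*\}");

    public static List<string> Check(string source)
    {
        var errors = new List<string>();
        var open = new Stack<(string Kind, string Id)>();

        foreach (Match m in TagPattern.Matches(source))
        {
            bool isClosing = m.Groups[1].Value == "/";
            string kind = m.Groups[2].Value;
            string id = m.Groups[3].Value;

            // Tags such as §I and §O are not closed, so they are skipped.
            if (!BlockTags.Contains(kind)) continue;

            if (!isClosing)
            {
                open.Push((kind, id));
                continue;
            }

            if (open.Count == 0)
            {
                errors.Add($"Closing tag §/{kind}{{{id}}} has no matching opening tag");
                continue;
            }

            var (openKind, openId) = open.Pop();
            if (openKind != kind)
                errors.Add($"Wrong tag type: §{openKind}{{{openId}}} closed by §/{kind}{{{id}}}");
            else if (openId != id)
                errors.Add($"Mismatched ID: §{openKind}{{{openId}}} closed by §/{kind}{{{id}}}");
        }

        // Anything still on the stack was never closed.
        foreach (var (kind, id) in open)
            errors.Add($"Missing closing tag for §{kind}{{{id}}}");

        return errors;
    }
}
```

Calling `CalorTagChecker.Check` on generated output before emitting it gives the agent a chance to repair a mismatched or missing closing tag. A real validator would need the full Calor grammar; the regular expression here only covers the tag shapes that appear on this page.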

For C#

  1. Standard patterns: Use well-established code patterns
  2. Roslyn validation: Parse generated code before output
  3. Error correction: Attempt to fix common errors

Detailed Metrics

The framework provides detailed breakdowns:

C#
results.Add(MetricResult.CreateHigherIsBetter(
    Category,
    "CompilationSuccess",
    calorCompilation.Success ? 1.0 : 0.0,
    csharpCompilation.Success ? 1.0 : 0.0));

results.Add(MetricResult.CreateLowerIsBetter(
    Category,
    "ErrorCount",
    calorCompilation.Errors.Count,
    csharpCompilation.Errors.Count));

results.Add(MetricResult.CreateHigherIsBetter(
    Category,
    "StructureCompleteness",
    CalculateCalorStructureScore(calorCompilation),
    CalculateCSharpStructureScore(csharpCompilation)));

Interpretation

The 0.94x ratio is below 1.0, which indicates that C# has a small advantage in generation accuracy.

This is expected because:

  • C# has more training data exposure
  • C# patterns are more familiar to LLMs
  • C# tooling provides better error feedback

The gap is relatively small, suggesting Calor's explicit structure helps offset the novelty disadvantage.

