Generation Accuracy

Category: Generation Accuracy
Result: C# wins (0.94x)
What it measures: Compilation success and structural correctness

Overview

The Generation Accuracy metric measures how often AI agents generate code that compiles and is structurally complete in each language.

Why It Matters

When an agent generates code, it must:

Produce syntactically valid code
Create structurally complete programs
Avoid compilation errors

Languages with familiar patterns and extensive training data have an advantage.

How It's Measured

Composite Score

Score = (CompilationSuccess × 0.5) + (StructureScore × 0.3) + (NoErrors × 0.2)

Component	Weight	Description
Compilation Success	50%	Does the code compile? (1.0 or 0.0)
Structure Score	30%	Are all structural elements present?
No Errors	20%	Zero compilation errors? (1.0 or 0.0)
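
The weighted sum above can be sketched in a few lines. This is a Python illustration, not the framework's own code: the function name `composite_score` and its parameters are hypothetical, and only the three weights come from the formula.

```python
def composite_score(compiles: bool, structure_score: float, error_count: int) -> float:
    """Combine the three components using the documented 0.5 / 0.3 / 0.2 weights."""
    if not 0.0 <= structure_score <= 1.0:
        raise ValueError("structure_score must be between 0.0 and 1.0")
    # CompilationSuccess and NoErrors are binary: 1.0 or 0.0.
    compilation_success = 1.0 if compiles else 0.0
    no_errors = 1.0 if error_count == 0 else 0.0
    return compilation_success * 0.5 + structure_score * 0.3 + no_errors * 0.2


if __name__ == "__main__":
    # A program that compiles cleanly with every structural element present.
    print(round(composite_score(True, 1.0, 0), 2))
    # A program that compiles cleanly but is missing some structure.
    print(round(composite_score(True, 0.7, 0), 2))
```

Because the two binary components carry 70% of the weight between them, a single compilation failure dominates the result regardless of how complete the structure is.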

Structure Score Calculation

Calor Structure Score

Element	Points	Check
Module name	0.20	Module has a name
Has functions	0.30	At least one function
Function name	0.10	Functions have names
Function output	0.10	Functions have output types
Function body	0.10	Functions have bodies

C# Structure Score

Element	Points	Check
Has usings	0.10	Using statements present
Has namespace	0.30	Namespace declaration present
Has class	0.30	Class declaration present
Has method	0.30	Method declaration present
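
As a rough illustration, both tables can be treated as point lookups that are summed over whichever elements were found. The sketch below is Python, not the framework's C# implementation; the dictionary keys and the `structure_score` helper are hypothetical names, and the point values are copied from the tables exactly as listed. Note that the Calor rows as listed total 0.80 rather than 1.00, so either a row is missing from the table or a complete Calor program tops out at 0.80.

```python
# Point values copied from the two tables above. The keys are hypothetical
# labels for the checks, not identifiers taken from the framework.
CALOR_POINTS = {
    "module_name": 0.20,
    "has_functions": 0.30,
    "function_name": 0.10,
    "function_output": 0.10,
    "function_body": 0.10,
}
CSHARP_POINTS = {
    "has_usings": 0.10,
    "has_namespace": 0.30,
    "has_class": 0.30,
    "has_method": 0.30,
}


def structure_score(points: dict, present: set) -> float:
    """Sum the points for every structural element that was found."""
    unknown = present - set(points)
    if unknown:
        raise KeyError(f"unknown structural elements: {sorted(unknown)}")
    total = sum(value for name, value in points.items() if name in present)
    # Round to two decimals to hide floating-point noise in the sum.
    return round(total, 2)


if __name__ == "__main__":
    print(structure_score(CSHARP_POINTS, set(CSHARP_POINTS)))
    print(structure_score(CALOR_POINTS, set(CALOR_POINTS)))
```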

Why C# Wins

1. Training Data Advantage

LLMs have seen vastly more C# code than Calor code. This results in:

Better pattern recognition
Fewer syntax errors
More accurate boilerplate generation

2. Familiar Patterns

// C#: Very common pattern, well-learned
public static int Add(int a, int b)
{
    return a + b;
}

// Calor: Novel syntax, less exposure
§F{f001:Add:pub}
  §I{i32:a}
  §I{i32:b}
  §O{i32}
  §R (+ a b)

3. Error Recovery

C# errors often have clear fixes:

Missing semicolon → add semicolon
Missing brace → add brace
Wrong type → cast or convert

Calor errors may be less familiar:

Misindented block body → requires understanding that scope is defined by indentation alone
Wrong block nesting → requires tracking the indentation level of each enclosing block

Example Comparison

Agent Task: "Write a function that adds two integers"

C# generation (typical):

public static int Add(int a, int b)
{
    return a + b;
}

Compiles: Yes
Structure: Complete
Errors: 0
Score: 1.0

Calor generation (typical):

§F{f001:Add:pub}
  §I{i32:a}
  §I{i32:b}
  §O{i32}
  §R (+ a b)

Compiles: Yes
Structure: Complete
Errors: 0
Score: 1.0

Calor generation (common error):

§F{f001:Add:pub}
§I{i32:a}
§I{i32:b}
§O{i32}
§R (+ a b)

Compiles: No
Error: Function body is not indented
Score: 0.0

Error Patterns

Common Calor Errors

Error	Cause	Example
Misindented body	Body not indented under opener	`§F{f001}` followed by unindented statements
Missing block body	Forgot statements inside a block	`§L{for1}` with no indented body
Wrong nesting	Dedented or indented at the wrong level	`§IF{if1}` body aligned with parent
Missing module wrapper	No enclosing `§M{}` block	Function outside module

Common C# Errors

Error	Cause	Example
Missing semicolon	Forgot line terminator	`return x`
Missing brace	Unbalanced braces	`if (x) { return y;`
Type mismatch	Wrong return type	`string` method returning `int`
Missing using	Forgot namespace import	Using `Console` without `using System;`

Mitigation Strategies

For Calor

Template-based generation: Start with valid templates
Indentation tracking: Maintain consistent block levels
Validation pass: Check structure before outputting
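
A validation pass for the most common Calor error (an unindented body) could look roughly like the following Python sketch. It is an assumption-laden illustration, not the Calor compiler's logic: the opener tags are taken only from the examples in this document, the function name `find_unindented_bodies` is hypothetical, and indentation is assumed to use spaces.

```python
# Assumption: these tags open an indented block. They come from the examples
# in this document, not from a Calor grammar.
BLOCK_OPENERS = ("§F{", "§L{", "§IF{")


def find_unindented_bodies(source: str) -> list:
    """Return 1-based line numbers of block openers whose next non-blank
    line is not indented more deeply than the opener itself."""
    lines = source.splitlines()
    problems = []
    for index, line in enumerate(lines):
        stripped = line.lstrip(" ")
        if not stripped.startswith(BLOCK_OPENERS):
            continue
        opener_indent = len(line) - len(stripped)
        # Look ahead to the first non-blank line after the opener.
        following = next(
            (candidate for candidate in lines[index + 1:] if candidate.strip()),
            None,
        )
        if following is None:
            problems.append(index + 1)  # opener with no body at all
            continue
        body_indent = len(following) - len(following.lstrip(" "))
        if body_indent <= opener_indent:
            problems.append(index + 1)
    return problems


if __name__ == "__main__":
    good = "§F{f001:Add:pub}\n  §I{i32:a}\n  §I{i32:b}\n  §O{i32}\n  §R (+ a b)\n"
    bad = "§F{f001:Add:pub}\n§I{i32:a}\n§I{i32:b}\n§O{i32}\n§R (+ a b)\n"
    print(find_unindented_bodies(good))
    print(find_unindented_bodies(bad))
```

Running a check like this before emitting code lets an agent catch the indentation error locally instead of discovering it through a failed compilation.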

For C#

Standard patterns: Use well-established code patterns
Roslyn validation: Parse generated code before output
Error correction: Attempt to fix common errors

Detailed Metrics

The framework reports each component as a separate metric:

// Binary compilation result: 1.0 if the generated code compiled, otherwise 0.0.
results.Add(MetricResult.CreateHigherIsBetter(
    Category,
    "CompilationSuccess",
    calorCompilation.Success ? 1.0 : 0.0,
    csharpCompilation.Success ? 1.0 : 0.0));

// Raw number of compilation errors; fewer is better.
results.Add(MetricResult.CreateLowerIsBetter(
    Category,
    "ErrorCount",
    calorCompilation.Errors.Count,
    csharpCompilation.Errors.Count));

// Structure score for each language, computed from its own checklist.
results.Add(MetricResult.CreateHigherIsBetter(
    Category,
    "StructureCompleteness",
    CalculateCalorStructureScore(calorCompilation),
    CalculateCSharpStructureScore(csharpCompilation)));

Interpretation

The 0.94x ratio indicates C# has a small advantage in generation accuracy.

This is expected because:

C# has more training data exposure
C# patterns are more familiar to LLMs
C# tooling provides better error feedback

The gap is relatively small, suggesting Calor's explicit structure helps offset the novelty disadvantage.
