Leaderboard & Analysis

How four continual learning methods — One-Shot, Self Feedback, Teacher Feedback, and Skill Creator — perform across six LLM backbones, evaluated at three levels: skill quality, trajectory, and task outcome. SkillLearnBench is open-source and supports testing more continual learning methods and LLMs.

Continual Learning Methods

We evaluate four methods covering diverse learning strategies. All methods receive the same task instruction plus one seed instance as input and produce a skill set as output.

Figure 1. Workflows of the four continual learning methods for skill generation.

One-Shot

The agent generates a skill set in a single pass from the task description. This serves as a baseline for knowledge acquisition without any feedback or recursive optimization.

Self Feedback

A self-evolution loop: the agent generates initial skills, attempts the task, reviews its own trajectory, identifies issues, and refines skills. This cycle repeats K times without external supervision.
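
A minimal sketch of this loop, assuming placeholder interfaces (`generate`, `attempt`, `review`, and `refine` are hypothetical names, not the benchmark's actual API):

```python
# Self-evolution loop: generate skills, then self-review and refine up to K times.
# All four callables are hypothetical stand-ins for the agent's real interfaces.

def self_feedback(generate, attempt, review, refine, task, k_rounds=3):
    """Iteratively refine a skill set from the agent's own trajectory critiques."""
    skills = generate(task)                 # initial skill set from the task description
    for _ in range(k_rounds):
        trajectory = attempt(task, skills)  # try the task with the current skills
        issues = review(trajectory)         # agent critiques its own trajectory
        if not issues:                      # nothing left to fix: stop early
            break
        skills = refine(skills, issues)     # revise skills based on the critique
    return skills
```

No external signal enters the loop, which is what makes the recursive-drift failure mode discussed below possible: `review` can ratify its own mistakes.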

Teacher Feedback

An expert teacher with access to human-authored skills provides directional guidance (without revealing ground-truth) after each failed attempt. The agent updates skills and re-attempts.
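
Sketched under the same caveat (every callable is a hypothetical placeholder; the teacher is an expert model that sees human-authored skills but reveals only hints):

```python
# Teacher-guided loop: re-attempt the task, updating skills from directional
# hints on each failure. Interfaces are illustrative placeholders.

def teacher_feedback(generate, attempt, grade, teacher_hint, update,
                     task, max_rounds=3):
    """Revise skills with a teacher's hints until the task passes or rounds run out."""
    skills = generate(task)
    for _ in range(max_rounds):
        trajectory = attempt(task, skills)
        if grade(trajectory):            # task solved: keep the current skills
            return skills
        hint = teacher_hint(trajectory)  # directional guidance, no ground truth
        skills = update(skills, hint)    # revise skills using the hint
    return skills
```

The key contrast with Self Feedback is that `teacher_hint` injects information the agent does not already have.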

Skill Creator

A structured multi-stage pipeline: analyzing task intent, investigating edge cases, writing a skill specification, and validating it with automated checks.
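
The four stages could be wired together as follows (stage names track the text above; the stage implementations are placeholders, not the benchmark's code):

```python
# Skill Creator pipeline: intent analysis -> edge-case investigation ->
# spec writing -> automated validation. All stage callables are hypothetical.

def skill_creator(analyze_intent, find_edge_cases, write_spec, validate, task):
    """Run the four pipeline stages; only a spec that passes validation survives."""
    intent = analyze_intent(task)             # stage 1: what is the task asking for?
    edge_cases = find_edge_cases(intent)      # stage 2: probe tricky inputs
    spec = write_spec(intent, edge_cases)     # stage 3: draft the skill specification
    return spec if validate(spec) else None   # stage 4: automated checks gate the output
```

Unlike the two feedback loops, this is a single forward pass through fixed stages, with validation as the only quality gate.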

LLM Leaderboard

Ranking of six LLM backbones under each continual learning method, across all three evaluation levels. Scores are averaged across all 20 tasks; rank is by task accuracy.
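
The aggregation itself is simple; a sketch, assuming an illustrative per-task score schema (field names such as `accuracy` are placeholders):

```python
# Average each model's per-task metrics, then rank models by mean accuracy.
# The input schema {model: [per-task metric dicts]} is illustrative.

def rank_models(per_task_scores):
    table = []
    for model, tasks in per_task_scores.items():
        n = len(tasks)
        # mean of every metric across the model's tasks
        avg = {k: sum(t[k] for t in tasks) / n for k in tasks[0]}
        table.append((model, avg))
    # rank by averaged task accuracy, best first
    table.sort(key=lambda row: row[1]["accuracy"], reverse=True)
    return table
```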

LLM Performance by Method

Models are re-ranked under each method. Columns: Level 1 skill quality (Coverage, Executability, Safety); Level 2 trajectory (Alignment, Usage, #Tokens ↓); Level 3 outcome (Accuracy). All values are % averages across 20 tasks; tokens are per-task solving cost.

Main Results

Headline takeaways from evaluating four continual learning methods across six LLMs.

📊

Skills Help, but a Large Gap Remains

Every continual learning method outperforms the no-skill baseline, yet even the best method covers only ~45% of the gap to human-authored performance, leaving substantial room for improvement.

🔄

No Universal Winner

No single method leads across all tasks and LLMs, and scaling to stronger LLMs does not reliably produce better skills. The best strategy depends on both the task type and the generation LLM.

🔁

External Feedback Drives Real Gains

Multiple iterations with external teacher feedback yield genuine improvement, while self-feedback alone induces recursive drift after the first round — coverage stalls and accuracy degrades.

Main Results on SkillLearnBench

All metrics are in % except #Tokens (lower is better). Bold: best among the four continual learning methods for each LLM and column.

Columns: Level 1 skill quality (Coverage, Executability, Safety); Level 2 trajectory (Alignment, Usage, #Tokens); Level 3 outcome (Acc.).

| Method | Coverage | Executability | Safety | Alignment | Usage | #Tokens ↓ | Acc. |
|---|---|---|---|---|---|---|---|
| No Skill | – | – | – | 70.22 | – | 727K | 10.17 |
| Human-authored | 92.77 | 52.96 | 92.01 | 84.47 | 87.67 | 590K | 74.50 |
| **Claude Haiku 4.5** | | | | | | | |
| One-Shot | 41.01 | **51.54** | **94.33** | 76.68 | 58.02 | 537K | 30.33 |
| Self Feedback | 37.42 | 44.40 | 94.25 | 74.39 | 51.31 | **474K** | 26.50 |
| Teacher Feedback | **45.00** | 47.98 | 94.20 | 75.48 | 49.29 | 639K | **34.00** |
| Skill Creator | 41.86 | 48.54 | 93.80 | **76.84** | **82.81** | 551K | 16.67 |
| **Claude Sonnet 4.6** | | | | | | | |
| One-Shot | 50.27 | 46.82 | **94.46** | 72.39 | 78.17 | 321K | **38.83** |
| Self Feedback | **50.86** | 47.92 | 93.79 | **73.29** | 73.86 | **313K** | 31.33 |
| Teacher Feedback | 47.49 | **56.68** | 91.79 | 72.10 | 62.34 | 521K | 34.83 |
| Skill Creator | 43.32 | 50.12 | 93.88 | 70.54 | **82.33** | 323K | 19.50 |
| **Claude Opus 4.6** | | | | | | | |
| One-Shot | 42.21 | 43.89 | 94.56 | 75.52 | 71.19 | 341K | 28.17 |
| Self Feedback | 42.61 | 45.36 | **94.72** | 75.30 | 67.17 | 305K | 31.50 |
| Teacher Feedback | **50.52** | **51.48** | 91.93 | **78.32** | 71.76 | 412K | **34.00** |
| Skill Creator | 45.69 | 49.77 | 94.11 | 74.96 | **81.58** | **291K** | 30.50 |
| **Gemini 3.1 Flash Lite** | | | | | | | |
| One-Shot | 22.10 | 39.69 | **94.47** | **76.41** | 75.33 | **471K** | 19.33 |
| Self Feedback | 20.74 | 37.53 | 93.33 | 75.36 | 67.64 | 490K | **31.50** |
| Teacher Feedback | **34.39** | 38.40 | 92.10 | 74.49 | 67.68 | 549K | 15.17 |
| Skill Creator | 26.71 | **42.84** | 93.45 | 75.22 | **91.42** | 501K | 22.00 |
| **Gemini 3 Flash** | | | | | | | |
| One-Shot | 36.61 | **46.06** | 95.22 | 76.29 | 76.25 | 492K | 31.67 |
| Self Feedback | 34.57 | 38.06 | 94.34 | **77.77** | 66.61 | **399K** | 27.67 |
| Teacher Feedback | 41.50 | 45.59 | 93.88 | 72.30 | 68.75 | 542K | 29.00 |
| Skill Creator | **42.33** | 44.79 | **95.55** | 77.66 | **77.37** | 424K | **38.50** |
| **Gemini 3.1 Pro** | | | | | | | |
| One-Shot | 31.34 | 45.93 | 94.53 | 73.94 | 66.96 | 604K | 34.33 |
| Self Feedback | 33.55 | 48.78 | **95.59** | 73.51 | 74.83 | 360K | **38.00** |
| Teacher Feedback | 21.80 | 36.84 | 91.31 | 74.55 | 41.40 | 504K | 17.83 |
| Skill Creator | **46.85** | **49.29** | 93.01 | **76.60** | **91.33** | **348K** | 36.83 |
| **Average Across All LLMs** | | | | | | | |
| One-Shot | 37.26 | 45.66 | **94.59** | 75.21 | 70.99 | 461K | 30.44 |
| Self Feedback | 36.63 | 43.68 | 94.34 | 74.94 | 66.90 | **390K** | **31.08** |
| Teacher Feedback | 40.12 | 46.16 | 92.53 | 74.54 | 60.20 | 528K | 27.47 |
| Skill Creator | **41.12** | **47.56** | 93.97 | **75.30** | **84.47** | 406K | 27.33 |

Further Analysis

Deeper look at what each continual learning method changes — and when skills actually help.

Learning Effect Across Categories

Continual learning through skills helps most when tasks have reusable structure. The largest gains appear in categories with clear workflows such as Software Engineering and Productivity Tools. Categories that are more open-ended show smaller gains and sometimes regress, suggesting a rigid skill can hurt when the task does not match the learned template.

Figure 2(a). Solving token cost by task category. Token cost varies substantially across categories, while the four methods are similar within the same category.

Skill Reusability

Most generated skills are partially effective: they pass some instances but fail on others, indicating they capture only part of the task knowledge. Accuracy on held-out instances is comparable to the seed instance, ruling out overfitting. The bottleneck is not that skills overfit to the seed, but that they fail to capture the core task logic needed across all instances.

Figure 2(b–c). (b) Proportion of skills that pass all, partial, or no instances. (c) Accuracy on the seed instance versus held-out instances.

Learning Effect on Agent Behavior

Different continual learning mechanisms alter agent behavior in fundamentally different ways. Self Feedback produces more focused skills through execution-based revision, yielding shorter, more direct trajectories. Teacher Feedback expands the skill set, raising coverage at the cost of heavier execution. Skill Creator produces structured skills that are invoked most often, yet accuracy lags when their content misses key task logic.

Figure 3. Method profiles across six dimensions (left) with raw values (right). Different feedback mechanisms reshape agent behavior in distinct ways.

Skill Evolution Across Learning Rounds

External feedback drives genuine improvement, while self-revision without new information leads to drift. With Self Feedback, coverage stays flat and alignment declines — accuracy briefly rises but then falls sharply. With Teacher Feedback, the first round of external feedback substantially restructures skills, and improvement compounds over subsequent rounds.

Figure 4. Skill evolution across learning rounds for Self Feedback and Teacher Feedback on Productivity Tools (Claude Sonnet 4.6).