SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

Shanshan Zhong¹ Yi Lu¹ Jingjie Ning¹ Yibing Wan¹ Lihan Feng¹ Yuyi Ao¹

Leonardo F. R. Ribeiro² Markus Dreyer² Sean Ammirati¹ Chenyan Xiong¹

¹Carnegie Mellon University ·

²Amazon AGI

COLM'26 Paper accepted at the 2026 Conference on Language Modeling. 🎉🎉

20Tasks

15Sub-domains

100Instances

3Eval Levels

Paper Code Browse Tasks Leaderboard

Key Contributions

Three pillars that together enable systematic evaluation of continual skill learning.

🧪

SkillLearnBench

The first benchmark for evaluating continual skill learning, featuring verified skill-dependent tasks with multi-instance testing for reusability, grounded in a real-world community-driven taxonomy.

20 tasks100 instances

📐

Three-Level Framework

Evaluates skill quality, execution trajectory, and task outcome — diagnosing where methods succeed or fail along the full skill-to-outcome chain.

qualitytrajectoryoutcome

🔬

Controlled Comparison

The first controlled comparison of recent continual learning methods — revealing a large gap to human-authored performance and highlighting directions for future improvement.

4 methods6 LLMs

Abstract

The first benchmark for evaluating continual skill learning methods on real-world agent tasks.

Skills have become the de facto way to enable LLM agents to perform complex real-world tasks with customized instructions, workflows, and tools, but how to learn them automatically and effectively remains unclear. We introduce SkillLearnBench, the first benchmark for evaluating continual skill learning methods, comprising 20 verified, skill-dependent tasks across 15 sub-domains derived from a real-world skill taxonomy, evaluated at three levels: skill quality, execution trajectory, and task outcome. Using this benchmark, we evaluate recent continual learning techniques, those leveraging one-shot, self/teacher feedback, and skill creator to generate skills from agent experiences. We find that all continual learning methods improve over the no-skill baseline, yet consistent gains remain elusive: no method leads across all tasks and LLMs, and scaling to stronger LLMs does not reliably help. Continual learning improves tasks with clear, reusable workflows but struggles on open-ended tasks, and using stronger LLM backbones does not consistently produce better skills. Our analysis also revealed that multiple iterations in continual learning facilitate genuine improvement via external feedback, whereas self-feedback alone induces recursive drift. Our data and code are open-source to enable further studies of automatic skill generation and continual learning techniques.

Three-Level Evaluation Framework

Skill generation and usage form a chain: a method produces a skill specification → which guides the agent's execution → which determines the task outcome. We evaluate each stage to diagnose where methods differ or fail.

Skill Quality

Evaluates generated skills as textual artifacts without executing them.

Coverage — fraction of reference key points supported by the generated skill
Executability — completeness, determinism, consistency, and usability
Safety — average score across six risk dimensions

Trajectory Analysis

Assesses the agent's execution behavior when using the generated skills.

Trajectory Alignment — key-point recall, order correctness, completion
Skill Usage Rate — fraction of generated skills actually invoked

Task Outcome

Measures the final task results with deterministic verifiers.

Task Accuracy — binary pass/fail per instance, aggregated across tasks
Solving Efficiency — total token consumption during task solving

Citation

If you find SkillLearnBench useful, please consider citing our COLM'26 paper:

@article{zhong2026skilllearnbench,
  title={SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks},
  author={Zhong, Shanshan and Lu, Yi and Ning, Jingjie and Wan, Yibing and Feng, Lihan and Ao, Yuyi and Ribeiro, Leonardo F. R. and Dreyer, Markus and Ammirati, Sean and Xiong, Chenyan},
  journal={arXiv preprint arXiv:2604.20087},
  year={2026},
  url={https://arxiv.org/abs/2604.20087}
}