In ~7 mins: the third way to adapt a frontier agent (after weights and prompts), the 5 optimizer controls that make text-space training actually behave like training, how to wire the pattern into your Claude Code, Codex, or CLAUDE.md skill file, and a full repo tutorial at the end if you want to run it yourself.
There are two known ways to adapt a frontier agent. Change the weights, or change the prompt / harness.
A new paper from Microsoft argues for a third. The agent’s skill file becomes the trainable artifact, edited by a separate optimizer model under bounded updates and a held-out gate.
The result is best or tied on 52 of 52 evaluated cells. GPT-5.5 lifts +23.5 points in direct chat, +24.8 inside Codex, +19.1 inside Claude Code. The deployed artifact stays under 2,000 tokens.
SkillOpt is a May 2026 paper from Microsoft, Shanghai Jiao Tong University, Tongji University, and Fudan University. Yifan Yang at Microsoft leads the 15-author group. arXiv lists it as 2605.23904, submitted on May 22, with 27 pages, 4 figures, and 6 tables.
Full title: “SkillOpt: Executive Strategy for Self-Evolving Agent Skills.” Hugging Face Papers ranked it #1 paper of the day at capture, with 140 upvotes.
Companion repo microsoft/SkillOpt is MIT-licensed Python 3.10+. The package is at version 0.1.0 with no public releases yet, 66 stars at capture on May 25. It ships configs for six benchmarks, the trainer, scripts/train.py, an eval_only.py entry point, and an optional Gradio dashboard.
Scope is what makes it worth seven minutes. SkillOpt trains one Markdown skill across 6 benchmarks (SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMathematicianBench, ALFWorld), 7 target models from GPT-5.5 down to Qwen3.5-4B, and 3 execution modes (direct chat, Codex, Claude Code). The paper evaluates 52 head-to-head cells drawn from that grid against human-written, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill baselines. SkillOpt wins or ties every single one.
A skill, in the paper’s sense, is a Markdown file prepended to the agent’s context. It holds procedural rules: how to use tools, what to verify, how to format an answer, what failure mode to avoid. Frontier models do not hold this kind of domain procedure in weights.
SkillOpt treats that file as the external state of a frozen target model. The target never moves. A separate optimizer model edits the skill from scored rollouts. The deployed artifact is one best_skill.md between 379 and 1,995 tokens. No optimizer calls at deployment.
The deep-learning analogy is operational, not decorative. Rollouts are the forward pass. Reflection batches over success and failure trajectories are the backward pass. The edit budget plays the role of a learning rate. The held-out selection gate plays the role of validation. The epoch-wise slow update plays the role of momentum.
Every reflection pass ends with a top-N cut. The default edit budget is 4, with cosine decay to a floor of 2. The repo also exposes constant, linear, and autonomous schedules. Without a cap, the loop is ad hoc prompt rewriting.
Removing the budget drops SearchQA / SpreadsheetBench / LiveMath to 84.6 / 75.7 / 57.3, against the default 87.1 / 77.5 / 61.3. The paper calls this control the textual learning rate.
Bounded edits keep adjacent skill versions close enough that the next optimizer call can still learn from the last one. Unbounded rewrites break the optimization history before any of the later controls get a chance to use it.
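A minimal sketch of the edit budget described above, assuming a standard cosine schedule from 4 down to a floor of 2. The function names, the step-based progress measure, and the rounding are illustrative assumptions, not the repo's actual implementation.

```python
import math


def edit_budget(step: int, total_steps: int, start: int = 4, floor: int = 2) -> int:
    """Cosine-decayed edit budget: `start` at step 0, `floor` at the last step.

    Hypothetical helper; the real schedule in the repo may differ in detail.
    """
    if total_steps <= 1:
        return start
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1.0 to 0.0
    return max(floor, round(floor + (start - floor) * cosine))


def clip_edits(ranked_edits: list[str], budget: int) -> list[str]:
    """Top-N cut: keep only the first `budget` edits of a best-first list."""
    return ranked_edits[:budget]


if __name__ == "__main__":
    # Budget per step over a hypothetical 9-step run.
    print([edit_budget(s, 9) for s in range(9)])
```

The cap is applied after ranking, so a shrinking budget drops the lowest-ranked proposals first.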
A candidate skill is accepted only if its held-out selection score is strictly greater than the current best. Ties are rejected. The selection split is used only for accept-or-reject decisions. The test split is reported separately.
That is what keeps reflection from becoming drift. Across 6 benchmarks, only 1 to 4 edits per skill survive into the deployed artifact. The optimizer proposes far more. LiveMathematicianBench’s +29.3-point gain comes from a single accepted edit. OfficeQA’s +39.0-point gain also comes from one accepted edit.
The bulk of the optimizer’s text-space search gets rejected. The deployed skill is the small set of changes that actually moved a held-out number.
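The gate rule is small enough to write out. This is a sketch under the assumption that the selection score is a single float; `GateState` and `gate` are hypothetical names, not identifiers from the repo.

```python
from dataclasses import dataclass


@dataclass
class GateState:
    best_skill: str
    best_score: float


def gate(state: GateState, candidate_skill: str, candidate_score: float) -> bool:
    """Accept only on a strict improvement of the held-out selection score.

    Ties are rejected, so the current best is never replaced by an equal one.
    """
    if candidate_score > state.best_score:
        state.best_skill = candidate_skill
        state.best_score = candidate_score
        return True
    return False


if __name__ == "__main__":
    state = GateState(best_skill="v0", best_score=0.70)
    print(gate(state, "v1", 0.70))  # tie: rejected
    print(gate(state, "v2", 0.71))  # strict gain: accepted
```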
Edits that fail the gate are not discarded. They enter an epoch-local memory the optimizer reads before proposing the next batch, along with the score drop they caused.
That gives the loop negative feedback during training without adding any inference-time model calls. Removing the buffer drops SearchQA / SpreadsheetBench / LiveMath by 1.6 / 4.6 / 2.4 points in the matched ablation row.
The optimizer learns not to repeat a harmful edit, the way a fine-tuned model learns not to repeat a low-reward output. The difference is that this memory is plain text and lives only for the current epoch.
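One way to picture the rejected-edit memory is a small text buffer that is rendered into the optimizer prompt and cleared at each epoch boundary. The class names, fields, and prompt wording below are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class RejectedEdit:
    description: str
    score_drop: float  # current best minus candidate score (0.0 for a tie)


@dataclass
class RejectedEditBuffer:
    """Hypothetical epoch-local memory of edits that failed the gate."""

    entries: list[RejectedEdit] = field(default_factory=list)

    def record(self, description: str, best_score: float, candidate_score: float) -> None:
        drop = round(best_score - candidate_score, 4)
        self.entries.append(RejectedEdit(description, drop))

    def render(self) -> str:
        """Plain-text block to prepend to the next optimizer prompt."""
        if not self.entries:
            return ""
        lines = ["Previously rejected edits (do not repeat):"]
        lines += [f"- {e.description} (score drop: {e.score_drop:.2f})" for e in self.entries]
        return "\n".join(lines)

    def reset(self) -> None:
        """Call at the epoch boundary; the memory does not carry over."""
        self.entries.clear()


if __name__ == "__main__":
    buf = RejectedEditBuffer()
    buf.record("Always answer in uppercase", best_score=0.80, candidate_score=0.76)
    print(buf.render())
```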
At each epoch boundary, the optimizer runs the same sampled training items under the previous-epoch skill and the current-epoch skill. Outcomes group into improved, regressed, persistent failure, and stable success.
A concise longitudinal guidance block then goes into a protected region of the skill file. Step-level edits cannot overwrite that region.
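The four-way grouping can be sketched as a comparison of per-task outcomes under the two skills. This assumes binary pass/fail outcomes keyed by task id, which is a simplification; the function name and label strings are hypothetical.

```python
def group_outcomes(prev: dict[str, bool], curr: dict[str, bool]) -> dict[str, list[str]]:
    """Bucket tasks by (passed under previous skill, passed under current skill)."""
    labels = {
        (False, True): "improved",
        (True, False): "regressed",
        (False, False): "persistent_failure",
        (True, True): "stable_success",
    }
    groups: dict[str, list[str]] = {name: [] for name in labels.values()}
    # Only tasks scored under both skills can be compared.
    for task_id in sorted(prev.keys() & curr.keys()):
        groups[labels[(prev[task_id], curr[task_id])]].append(task_id)
    return groups


if __name__ == "__main__":
    previous = {"t1": False, "t2": True, "t3": False, "t4": True}
    current = {"t1": True, "t2": False, "t3": False, "t4": True}
    for name, tasks in group_outcomes(previous, current).items():
        print(name, tasks)
```

The regressed and persistent-failure buckets are the natural inputs for longitudinal guidance, since they show what the current epoch broke or never fixed.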
The meta skill is separate. It is optimizer-side memory of which edit patterns helped or hurt across epochs, and it is prepended to future optimizer prompts. It does not ship with best_skill.md.
Removing both slow update and meta skill collapses SpreadsheetBench from 77.5 to 55.0. That 22.5-point drop is the largest single ablation in the paper.
One Markdown file deploys across three harnesses. Direct chat, the Codex CLI in a workspace-write sandbox, and the Claude Code CLI all read the same best_skill.md. The adapter contract is small: build batches, inject the skill, run the native execution loop, return scored trajectories.
Transfer numbers carry the claim. A SpreadsheetBench skill trained inside Codex adds +59.7 points when deployed inside Claude Code (22.1 → 81.8), slightly exceeding the in-domain Claude Code SkillOpt reference. The reverse direction adds +43.6 points back inside Codex (27.5 → 71.1).
The trained skill is a portable artifact, not a harness-specific command recipe. Training cost amortizes across deployment surfaces.
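The four-part adapter contract can be sketched as a small interface. Everything here is an assumption for illustration: the `HarnessAdapter` protocol, the `Trajectory` fields, and the toy `DirectChatAdapter` with exact-match scoring are not taken from the repo.

```python
from dataclasses import dataclass
from typing import Callable, Protocol, Sequence


@dataclass
class Trajectory:
    task_id: str
    output: str
    score: float


class HarnessAdapter(Protocol):
    """Hypothetical interface mirroring the four duties named above."""

    def build_batches(self, items: Sequence[dict], batch_size: int) -> list[list[dict]]: ...
    def inject_skill(self, skill_text: str) -> None: ...
    def run(self, batch: list[dict]) -> list[Trajectory]: ...


class DirectChatAdapter:
    """Toy direct-chat adapter; `model` is any prompt -> answer callable."""

    def __init__(self, model: Callable[[str], str]) -> None:
        self.model = model
        self.skill_text = ""

    def build_batches(self, items: Sequence[dict], batch_size: int) -> list[list[dict]]:
        return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]

    def inject_skill(self, skill_text: str) -> None:
        self.skill_text = skill_text

    def run(self, batch: list[dict]) -> list[Trajectory]:
        results = []
        for item in batch:
            prompt = f"{self.skill_text}\n\n{item['question']}"  # skill is prepended
            output = self.model(prompt).strip()
            score = 1.0 if output in item["answers"] else 0.0  # exact match
            results.append(Trajectory(item["id"], output, score))
        return results
```

A Codex or Claude Code adapter would keep the same three methods and swap only the body of `run` for its native execution loop.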
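CLI commands and the install-to-deploy walk are in the appendix at the end.
The loop has eight stages. The first six run every step; the slow update runs at epoch end, and the meta skill feeds every later optimizer call.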
Rollout: the frozen target model runs a batch from the training split with the current skill.
Reflect: the optimizer model splits the batch into failure and success minibatches and returns structured add, delete, and replace edits.
Aggregate: similar edits merge hierarchically, with failure-driven patches prioritized.
Select: the optimizer ranks edits and keeps only the top N allowed by the edit budget.
Update: selected edits apply, producing a candidate skill.
Gate: the candidate runs on the held-out selection split. Strictly greater than current best is the only accept condition.
Slow update: at epoch end, the optimizer compares same-task outcomes under last-epoch and current-epoch skills, then writes longitudinal guidance into a protected region.
Meta skill: optimizer-side memory of accepted and rejected patterns is prepended to future optimizer calls. Never ships with the deployed skill.
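The first six stages can be sketched as one epoch of a plain loop. This is a skeleton under stated assumptions, not the repo's trainer: every callable is a hypothetical stand-in supplied by the caller, edits are plain strings, and the two epoch-end stages (slow update, meta skill) are left out.

```python
from typing import Callable


def train_epoch(
    skill: str,
    best_score: float,
    batches: list[list[dict]],
    rollout: Callable[[str, list[dict]], list[dict]],
    reflect: Callable[[str, list[dict], list[str]], list[str]],
    aggregate: Callable[[list[str]], list[str]],
    apply_edits: Callable[[str, list[str]], str],
    score_on_selection: Callable[[str], float],
    budget: int = 4,
) -> tuple[str, float, list[str]]:
    """Run stages 1 to 6 for each batch and return (skill, best_score, rejected)."""
    rejected: list[str] = []  # epoch-local memory of edits that failed the gate
    for batch in batches:
        trajectories = rollout(skill, batch)                 # 1. rollout
        proposals = reflect(skill, trajectories, rejected)   # 2. reflect
        ranked = aggregate(proposals)                        # 3. aggregate, best first
        selected = ranked[:budget]                           # 4. select (top-N cut)
        candidate = apply_edits(skill, selected)             # 5. update
        score = score_on_selection(candidate)                # 6. gate
        if score > best_score:  # strictly greater; ties are rejected
            skill, best_score = candidate, score
        else:
            rejected.extend(selected)
    return skill, best_score, rejected
```

Keeping the stages as injected callables makes the control flow easy to test with toy functions before any model is wired in.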
Paper Figure 4 reproduces one verbatim rule per benchmark from the final best_skill.md of each case study. Two of them carry the flavor.
SpreadsheetBench, after four accepted edits: “Inspect workbook structure and formulas, then write evaluated static values across the full requested target range instead of relying on Excel recalculation.”
DocVQA, after three accepted edits: “For tables, forms, charts, and legends, first bind the question to the exact visual row/header/field, then copy only the aligned answer span.”
Two patterns stand out. Rules are procedural rather than instance-specific (no question, file, or entity is named). Rules also encode discipline frontier models lack zero-shot: workbook-structure-first reasoning, evidence-to-visual binding, answer-format constraints. A practitioner could write rules like these by hand after a day with the benchmark. SkillOpt produces them automatically and validates each one against a held-out split.
The deployed artifact is one Markdown file. That maps cleanly to wherever your agent already loads procedural state.
Claude Code: drop the trained skill into ~/.claude/skills/ and load it on session start.
Codex / OpenAI Agents: render to a per-task SKILL.md or AGENTS.md. The paper’s Codex adapter already uses this contract.
Generic harnesses: CLAUDE.md, .cursorrules, or the system-prompt slot of any agent.
Hermes-style persistent runtimes: a skill-folder entry. The exported artifact is harness-agnostic Markdown by design.
Running the loop on your own task needs three things:
A task family with measurable success. Exact match, executable check, or a verifier you trust.
Held-out train, selection, and test splits. The repo does not ship datasets, so this is on you.
A target model (the agent you ship) and an optimizer model. The paper defaults both to GPT-5.5, and shows a target-matched optimizer still recovers 56% to 74% of the strong-optimizer gain.
Optimizer runs offline. Deployment uses only the final skill file. No extra model calls at inference.
Full install-to-deploy commands are in the appendix below.
Three open problems sit underneath the headline numbers.
Training cost is real. Cost per absolute test-point gain runs from 0.6M training tokens (SpreadsheetBench) to 46.4M (DocVQA). Total training token spend per benchmark in the case studies ranges from 20.8M (OfficeQA) to 213.8M (SearchQA). A team has to budget for the offline run before assuming it pays off.
The loop needs scored tasks. SkillOpt is an optimizer with a verifier. The gate compares numbers from a held-out split. Open-ended creative work, strategy documents, and design judgment give the gate no number to compare unless you layer in a preference model, which the paper does not provide.
The repo does not ship datasets. Readers bring their own train, selection, and test splits and their own credentials. The package is at version 0.1.0 with no public releases. Fastest setup path is SearchQA, and even that needs a local split and an Azure OpenAI, OpenAI, or Anthropic key. This is research code, not a turnkey product.
What is actually new is the reframe. The procedure an agent follows becomes a trainable, inspectable text artifact. Not weights. Not a static prompt. Something in the middle, with a versioned history, an audit trail of accepted and rejected edits, and a 379-to-1,995-token deployment footprint.
That is the part of the announcement worth carrying into your own stack, even before training one. The paper points to two next moves: skill libraries that share infrastructure across domains, and self-distillation of trained skills back into target-model weights. Both assume the skill itself is the object being optimized, not a byproduct of prompting.
Which part of your current agent stack would you train as text first, your CLAUDE.md, your skill folder, or your AGENTS.md?
A condensed install-to-deploy walk. Assumes Python 3.10+ and one model backend. Azure OpenAI is the paper’s default.
git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
pip install -e .
The ALFWorld benchmark needs an extra step: pip install -e ".[alfworld]" then alfworld-download.
Azure OpenAI (default):
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_KEY="your-key"
Or set AZURE_OPENAI_AUTH_MODE=azure_cli to skip the key and use Azure CLI auth.
Other backends also supported: OPENAI_API_KEY for OpenAI direct, ANTHROPIC_API_KEY for Claude, QWEN_CHAT_BASE_URL + QWEN_CHAT_MODEL for Qwen via local vLLM. The repo’s .env.example lists all four.
SkillOpt expects this directory layout:
data/my_split/
  train/items.json
  val/items.json
  test/items.json
Each items.json is a JSON array of task items. Schema depends on the benchmark. SearchQA wants id, question, context, answers.
Configs ship for six benchmarks: SearchQA, ALFWorld, DocVQA, LiveMathematicianBench, SpreadsheetBench, OfficeQA. The repo does not ship the datasets, so this step is on you. SearchQA is the fastest setup path.
Minimal command (SearchQA, GPT-5.5 as both target and optimizer):
python scripts/train.py \
--config configs/searchqa/default.yaml \
--split_dir /path/to/searchqa_split \
--azure_openai_endpoint https://your-resource.openai.azure.com/ \
--optimizer_model gpt-5.5 \
--target_model gpt-5.5
Defaults from configs/_base_/default.yaml: 4 epochs, batch size 40, reflection minibatch 8, edit budget 4 with cosine decay to floor 2, slow update on with 20 samples per epoch, meta skill on, validation gate strict-greater.
CLI override flags: --num_epochs, --batch_size, --workers, --out_root.
Each run writes to outputs/<run_name>/:
best_skill.md: the deployable skill.
history.json: per-step training log.
skills/skill_vXXXX.md: skill snapshot per step.
steps/step_XXXX/: patches, gate evals, edit-apply reports.
slow_update/epoch_XX/ and meta_skill/epoch_XX/: epoch-end logs.
Re-running the same command auto-resumes from the last completed step.
Score a trained skill on any split without retraining:
python scripts/eval_only.py \
--config configs/searchqa/default.yaml \
--skill outputs/my_run/best_skill.md \
--split valid_unseen \
--split_dir /path/to/searchqa_split \
--azure_openai_endpoint https://your-resource.openai.azure.com/
Valid --split values: valid_unseen (test), valid_seen (val), train, all.
best_skill.md is plain Markdown. No optimizer calls at inference. Drop it where your agent reads procedural state:
Claude Code: ~/.claude/skills/
Generic: CLAUDE.md or the system-prompt slot
Hermes-style persistent runtime: a skill-folder entry
Optional Gradio dashboard:
pip install -e ".[webui]"
python -m skillopt_webui.app
Default port 7860. Add --share for a public Gradio link.