Tutorial: How to Train an Agent
Tutorial on training Agents
Introduction
Training an agent has become straightforward as RL has evolved and placed itself at the forefront of training LLMs. Before RL became intuitive and well understood, LLMs were trained primarily through pretraining, which involved organizing large corpora of data that might not be available to the average developer. RL solves this by relying mostly on synthetic data generated in an RL environment.
RL environments can be pictured like a video game. They are the set of rules of the game: in Chess, the colored squares and the rules for how each piece moves. Also like Chess, there is a way to determine, or verify, that a task was done successfully, i.e. making a correct move (e.g. 1.e4) or outright winning the game. RL environments range from games like Chess or 2048, to a simulation of a computer terminal, to even an email inbox. The important part of an environment is that there is a way to verify whether the task was done badly (0) or well (1). Most of the coding time for training your agent will be spent on the environment and the verifying function, i.e. the rubric that scores the success or failure of the task in binary terms (1 or 0). The difference between an okay agent and a good agent is often how well the environment and verifier model real life.
Setting up an Environment
Following Along
Here is a link to a Jupyter Notebook where all of the code is defined. You can either follow along in the Notebook or follow along here.
Objective
Before we define the environment, we need to identify the objective of our agent. We will be training an agent that uses agentic RAG. Agentic RAG is a method of RAG where searches are done over multiple turns, comparable to how a human searches, so the agent can adapt on the fly and refine its queries based on previous results. An example is searching for a topic generally, then refining the search to find more specific information.
Choosing a Model
For this training run, we will use Qwen2.5-32B-Instruct, an open-source, openly licensed model that is compatible with nearly all of the popular inference stacks, like vLLM.
Scaffolding the Environment with Tools
The environment will be a SQLite database.
The 2 tools used by the agent to interact with the database are:
search_repo: Search a repo for functions by keywords.
read_repo: Read the full details of a function from a repo.

These tools are used to retrieve context, and they use the fts5 SQLite extension to do full-text search. This approach was selected over embedding-similarity search due to its simplicity (no embedding model, and no chunking strategies to worry about) and strong performance. The tools use sensible defaults to manage context pressure (I found that the vLLM server would time out if the context grew too large during rollouts), such as only returning the first 10 functions of a repo.
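To make this concrete, here is a minimal sketch of what an fts5-backed search_repo could look like. The functions table, its columns (repo, name, docstring, code), and the schema details are assumptions for illustration; the notebook's actual implementation may differ.

```python
import sqlite3


def search_repo(db_path: str, repo: str, keywords: str, limit: int = 10) -> list[dict]:
    """Search a repo for functions matching the given keywords.

    Uses the SQLite fts5 extension for full-text search over an assumed
    `functions` virtual table with columns: repo, name, docstring, code.

    Example:
        search_repo("code.db", "requests", "retry timeout")
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            """
            SELECT name, snippet(functions, -1, '[', ']', '...', 16)
            FROM functions
            WHERE functions MATCH ? AND repo = ?
            ORDER BY rank
            LIMIT ?
            """,
            (keywords, repo, limit),  # default limit keeps context pressure manageable
        ).fetchall()
        return [{"name": name, "preview": preview} for name, preview in rows]
    finally:
        conn.close()
```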
The ergonomics of your tools, and the agent's experience using them, are critical. Your tools end up as prompts to the LLM, and just as you would engineer a well-defined prompt, you should do the same with your tools. Similarly, just as software engineers appreciate well-documented, structured, intuitive, clean code, an LLM does too. Think of this as Developer Experience, but for agents. This is low-hanging fruit for agent performance. Some examples of things you can do include:
Write good doc strings: describe the function, inputs, return values, and show an example usage.
Use intuitive names: for both functions and parameters, so the model can infer meaning.
Clean code: keep logic clear and easy to follow, e.g. avoid long chains of if-statements.
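As an illustration of these points, a read_repo tool might look like the sketch below: a descriptive docstring covering arguments, return value, and example usage, intuitive parameter names, and flat logic. The table schema and the example identifiers are assumptions, not the notebook's actual code.

```python
import sqlite3


def read_repo(db_path: str, repo: str, function_name: str) -> str:
    """Read the full details of a function from a repo.

    Args:
        db_path: Path to the SQLite database of indexed repos.
        repo: Name of the repository to read from, e.g. "requests".
        function_name: Fully qualified name of the function, e.g. "sessions.Session.request".

    Returns:
        The function's docstring and source code as a single string, or a short
        message if the function was not found.

    Example:
        read_repo("code.db", "requests", "sessions.Session.request")
    """
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT docstring, code FROM functions WHERE repo = ? AND name = ?",
            (repo, function_name),
        ).fetchone()
    finally:
        conn.close()

    if row is None:
        return f"No function named '{function_name}' found in '{repo}'."
    docstring, code = row
    return f'"""{docstring}"""\n{code}'
```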
The Agentic Loop
The agentic loop is a programmatic loop with an exit condition (e.g. max iterations or an error) that drives the agent toward its goal. The loop executes the following steps:
Initialize the conversation with the system prompt, which includes tool definitions and schemas generated with Pydantic.
Request a completion from the LLM.
Parse the completion, mapping from token space to code space, and extract the tool calls.
Execute the tool call and capture its response.
Append the completion and the tool call result to the conversation.
Repeat for a maximum number of turns or until an answer is found.

The agentic loop used had the following exit conditions:
The agent found an answer and called the return_answer tool.
No tool calls were contained in the LLM response.
Exceptions raised from decoding errors or tool call errors.
10 turns were exhausted.
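Putting the steps and exit conditions together, a minimal sketch of the loop might look like the following. The OpenAI-compatible client, TOOL_SCHEMAS (JSON schemas generated from the Pydantic models), and the database path are assumptions standing in for the notebook's actual code.

```python
import json

MAX_TURNS = 10
DB_PATH = "code.db"  # assumed path to the indexed SQLite database


def execute_tool(name: str, args: dict) -> str:
    # Dispatch to the retrieval tools defined earlier, binding the database path.
    tools = {"search_repo": search_repo, "read_repo": read_repo}
    return str(tools[name](DB_PATH, **args))


def run_agent(client, system_prompt: str, question: str) -> str | None:
    """Drive the agent until it answers, errors, or exhausts its turns."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
    for _ in range(MAX_TURNS):
        completion = client.chat.completions.create(
            model="Qwen/Qwen2.5-32B-Instruct",
            messages=messages,
            tools=TOOL_SCHEMAS,  # assumed: JSON schemas generated from the Pydantic models
        )
        message = completion.choices[0].message
        messages.append(message.model_dump(exclude_none=True))

        if not message.tool_calls:
            return None  # exit: no tool calls in the response

        for tool_call in message.tool_calls:
            try:
                args = json.loads(tool_call.function.arguments)
                if tool_call.function.name == "return_answer":
                    return args["answer"]  # exit: the agent found an answer
                result = execute_tool(tool_call.function.name, args)
            except Exception:
                return None  # exit: decoding or tool call error
            messages.append(
                {"role": "tool", "tool_call_id": tool_call.id, "content": result}
            )
    return None  # exit: 10 turns exhausted
```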
A limitation of this loop is that it doesn't handle different classes of exceptions explicitly. For example, if a completion request fails with a 429 rate-limit error, the agent could retry the request after a delay.
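A rough illustration of that kind of targeted handling (the status-code attribute and the backoff schedule are assumptions, not part of the original loop):

```python
import time


def request_with_retry(client, max_retries: int = 3, **request_kwargs):
    """Retry a completion request, backing off on 429 rate-limit errors."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**request_kwargs)
        except Exception as exc:
            # Assumed: the client exposes the HTTP status code on the exception.
            if getattr(exc, "status_code", None) == 429 and attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # back off exponentially before retrying
                continue
            raise
```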
Data Generation
A synthetic dataset of question/answer pairs derived from the CodeSearchNet dataset was generated using GPT-4.1. The dataset can be found on HuggingFace and contains ~2.3k training samples and 1k test samples. The synthetic samples were intended to reflect realistic, moderately difficult questions asked by a human, which meant each question references at most 3 functions (e.g. describing a workflow using class methods). The limit of 3 functions was selected based on benchmarking, targeting an appropriate difficulty. Each sample also includes the reference functions for grounding and a score for how realistic the question is, which can be used for quality filtering.
This step was an iterative process that involved managing the difficulty and semantics of the questions through system prompt tuning. For example, questions like "How do I read a file?" were too general and easy for the agent, causing it to rely on its prior knowledge rather than using the tools. Appending phrases like "In this library/repo" also helped narrow the scope for the agent. It was also important not to include the reference functions in the question; when they were included, the agent would often just recite the functions in the answer, and the LLM judge would mark the question as answered correctly even though the generated answer differed semantically from the reference answer, which is a form of reward hacking. It's critical to ensure your evals are resilient to reward hacking; here this was done by manually inspecting completions.
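A rough sketch of how such a sample could be generated is shown below. The prompt, field names, and output schema are illustrative assumptions; the real generation prompt was iterated on as described above.

```python
import json

from openai import OpenAI

client = OpenAI()

# Illustrative system prompt; the real prompt was tuned iteratively to control difficulty.
GENERATION_PROMPT = """You write realistic questions a developer might ask about a repository.
Reference at most 3 of the provided functions, but do not name them in the question.
Start the question with a phrase like "In this library/repo".
Return JSON with keys: question, answer, realism_score (1-10)."""


def generate_sample(repo: str, reference_functions: list[str]) -> dict:
    """Generate one synthetic Q/A sample grounded in the given reference functions."""
    response = client.chat.completions.create(
        model="gpt-4.1",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": GENERATION_PROMPT},
            {"role": "user", "content": f"Repo: {repo}\n\nFunctions:\n" + "\n\n".join(reference_functions)},
        ],
    )
    sample = json.loads(response.choices[0].message.content)
    sample["repo"] = repo
    sample["reference_functions"] = reference_functions  # kept for grounding and judging
    return sample
```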
Below is an example Q/A pair:

Training
The E2E training pipeline involved both local and remote development phases and is summarized below:
Local
Benchmark models to find one suitable for fine-tuning. This involved running benchmarks over the Qwen2.5 family, starting at 7B and increasing to 32B. Iteration on the data generation was also done here to adjust difficulty (e.g. reducing the maximum number of functions in a question to 3).
Continue evaluating the performance of the model from step 1. In this phase the tools were refined and the system prompt was tuned to maximize the model's performance before training. You also do not want to be debugging tools during training.
Remote
Begin fine-tuning using ART remotely. A section below will detail this training loop.
Benchmark trained model against testing data.
Repeat until the test results are satisfactory (refining tools, hyperparameters, etc.).
I experienced teething issues in the remote development loop, such as vLLM crashes due to timeouts and unexpected SSH connection closures. I found these tools to be critical for this phase:
rsync for pushing and pulling files between local and remote.
tmux for creating persistent sessions that I could reattach to if the SSH connection closed.
uv for making dependency management a breeze.
Weights and Biases for observability, providing metrics and logging, essential for visibility into the training runs.
Reinforcement Learning Setup
The model was fine-tuned using the GRPO RL algorithm, which is the algorithm ART uses and is a form of PPO. At a high level, GRPO does the following:
Generate a group of completions per prompt (in our case 4) and score them using a reward function (an LLM as a judge was used).
Calculate the group advantage, which measures how well each completion did relative to the others in its group.
Get the per-token advantage, which is computed from the token probabilities obtained in a forward pass.
Increase the probability of generating those tokens which resulted in a higher reward and decrease the probabilities of those that resulted in a lower reward.
But also don't change the model too much, which is enforced through a KL divergence penalty. This is a form of regularization that keeps the model in the same behavioral space, targeting small, surgical changes that improve the model.
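A minimal sketch of the group-relative advantage calculation, following the standard GRPO formulation of normalizing each reward against its group (ART's internal implementation may differ):

```python
import statistics


def group_advantages(rewards: list[float]) -> list[float]:
    """Compute group-relative advantages for one prompt's group of completions.

    Each reward is compared to the group mean, so a completion is only
    credited for doing better than its siblings on the same prompt.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero when all rewards are equal
    return [(reward - mean) / std for reward in rewards]


# Example: 4 completions for one prompt, judged 1 (correct) or 0 (incorrect).
print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```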
The reward function used was an LLM as a judge, since the answers aren't objectively verifiable. Gemini 2.5 Flash was used as the judge LLM; it was selected so that the judge and the synthetic data generation LLM would not come from the same model family, avoiding bias.
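A sketch of what this judge-based reward could look like using the google-genai client; the judge prompt and the binary grading scheme here are assumptions for illustration.

```python
from google import genai

judge = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an agent's answer about a code repository.
Question: {question}
Reference answer: {reference_answer}
Agent answer: {agent_answer}
Reply with exactly one word: CORRECT or INCORRECT."""


def reward(question: str, reference_answer: str, agent_answer: str) -> float:
    """Binary reward from the LLM judge: 1.0 if judged correct, else 0.0."""
    response = judge.models.generate_content(
        model="gemini-2.5-flash",
        contents=JUDGE_PROMPT.format(
            question=question,
            reference_answer=reference_answer,
            agent_answer=agent_answer,
        ),
    )
    return 1.0 if "CORRECT" in response.text.upper().split() else 0.0
```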
Having strong evals formed by reward functions is critical for observability, understanding, and measuring your system. You can then create a feedback loop using these evals for system prompt tuning and tool use, giving you confidence that your agent is improving. This feedback loop was essential during the local phase.
Training Loop
ART is a powerful training harness that manages an Unsloth and a vLLM instance under the hood. Unsloth supports LoRA (Low-Rank Adaptation) adapters, which train a much smaller proportion of the model weights, drastically reducing the resource requirements for training (memory and disk) and allowing for frequent checkpointing. LoRA is a suitable choice here, as this agent is doing one specific, narrow task. If the goal were a much more general, multi-task agent, LoRA might not be the right choice (as LoRA bounds how much the model can learn). vLLM is an efficient serving engine for inference, offering strong throughput, efficient memory management (it uses PagedAttention to manage the KV cache), and batched completions.
The training loop using ART is summarized in the sequence diagram below:

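In rough outline, the loop runs along these lines. The helper names rollout and grpo_step are hypothetical placeholders for what ART's harness does under the hood, with vLLM serving completions and Unsloth applying the LoRA update; they are not ART's actual API.

```python
from itertools import batched  # Python 3.12+


async def training_loop(model, questions, judge_reward, epochs: int = 2, group_size: int = 4):
    """Hypothetical outline of the ART-driven loop: rollout, judge, GRPO update."""
    for _ in range(epochs):
        for batch in batched(questions, 12):  # 12 questions per gradient step
            groups = []
            for question in batch:
                # vLLM serves the current LoRA checkpoint; sample a group of completions (hypothetical helper).
                completions = await rollout(model, question, n=group_size)
                rewards = [judge_reward(question, completion) for completion in completions]
                groups.append((completions, rewards))
            # Unsloth applies the GRPO update to the LoRA adapter and checkpoints it (hypothetical helper).
            await grpo_step(model, groups)
```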
Finally, Results
The model was trained for 2 epochs with a learning rate of 1.2e-5. Rollouts consisted of 4 completions per prompt and a batch size of 12 questions, resulting in an effective batch size of 48 for gradient updates (these hyperparameters were taken from https://openpipe.ai/blog/art-e-mail-agent). The model was trained on a cluster using an H100 for 2 days, which at the time of writing costs roughly ~$90.
The training results are shown in the graph below, where the model achieved an average of 86% of questions answered correctly per batch on the training set at training step 264:

In conclusion, RL is an incredibly powerful tool for training models with complex reward signals that wouldn't be possible through SFT. Its application to fine-tuning smaller models for agentic tasks is very effective, beating the performance of much larger SoTA models like GPT-5 (and you win on cost and latency).