Quickstart
Building AI agents for Sedona Marketplace is not very difficult and does not require deep learning expertise. Follow this walkthrough to learn how crafting environments for RL (Reinforcement Learning) can teach an LLM new capabilities and make it more agentic.
In this example, we will be training a small open-source model released by Alibaba, Qwen3-4B. We will train it to play the hit game Wordle, where the player has 6 chances to guess a five-letter word and receives hints after each guess based on how close it was.
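To make the setup concrete, here is a sketch of the per-guess feedback a Wordle environment might compute. This is illustrative only; score_guess is a hypothetical helper, not code from the example repository.

# Hypothetical Wordle feedback helper (not the repo's actual code).
# "g" = right letter, right spot; "y" = right letter, wrong spot; "x" = letter not in the word.
def score_guess(guess: str, answer: str) -> str:
    feedback = ["x"] * 5
    remaining = []  # answer letters that were not exact matches
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "g"
        else:
            remaining.append(a)
    for i, g in enumerate(guess):
        if feedback[i] == "x" and g in remaining:
            feedback[i] = "y"
            remaining.remove(g)
    return "".join(feedback)

print(score_guess("crane", "cigar"))  # -> "gyyxx"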
To follow along, rent a cloud GPU:
Renting a GPU
The example we will be working through requires 4 x H100 GPUs. At the time of writing, the cost is $7.56 per hour to run the GPUs. The example takes 4-5 hours, so the total cost is roughly $30 - $38. We recommend renting GPUs through Prime Intellect, a cloud GPU aggregator; our favorite provider there is Lambda Labs, though it will run a couple of extra dollars per hour.

Setting Up the Environment
There are multiple frameworks for building and training on RL environments. For this example, we will be using Slime. Slime is built on SGLang, an open-source inference library for LLMs used in production by top AI labs, including xAI for Grok. Slime itself was used in production to post-train Z.ai's agentic coding model GLM-4.5.
To set up our environment, ssh into your 4 x H100 GPU cluster and then:
Clone the example repository:
1) $ git clone https://github.com/nicklandshark/slime-wordle-example.git
Navigate inside it:
2) $ cd slime-wordle-example
Clone Slime and pull the official Docker image:
3) $ git clone https://github.com/THUDM/slime.git
4) $ docker pull slimerl/slime:latest
Now, start the Docker container:
docker run --rm --gpus all \
--ipc=host --shm-size=16g \
-e HF_HOME=/cache/hf \
-e HUGGINGFACE_HUB_CACHE=/cache/hf/hub \
-e TRANSFORMERS_CACHE=/cache/transformers \
-e XDG_CACHE_HOME=/cache/xdg \
-e TORCHINDUCTOR_CACHE_DIR=/cache/torchinductor \
-e TRITON_CACHE_DIR=/cache/triton \
-v "$(pwd)":/workspace -w /workspace \
-v slime_cache:/cache \
-it slimerl/slime:latest bash

How Does Slime Work?
Slime is a powerful backend with a simple interface: the vision is that users only need to implement their own custom logic, leaving them free to be creative. We will look at what that logic looks like shortly, but first, below is a diagram explaining the lifecycle of a rollout in Slime. Rollouts are the primary unit of work in RL. In our case, one rollout is one full try: you give the model a prompt, it writes an answer, you give that answer a score (reward), and that single try is used to tweak the model.
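In code, that lifecycle looks roughly like the sketch below. This is a simplified illustration using our own placeholder names (Rollout, collect_rollouts, and the stand-in functions), not Slime's actual API.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rollout:
    prompt: str
    response: str
    reward: float

def collect_rollouts(generate: Callable[[str], str],
                     score: Callable[[str, str], float],
                     prompts: List[str]) -> List[Rollout]:
    rollouts = []
    for prompt in prompts:
        response = generate(prompt)        # the model plays one full game
        reward = score(prompt, response)   # e.g. 1.0 if it solved the word, 0.0 otherwise
        rollouts.append(Rollout(prompt, response, reward))
    return rollouts                        # the trainer then uses these rewards to update the weights

# Toy usage with stand-in functions; a real run would call the inference engine and the Wordle env.
demo = collect_rollouts(lambda p: "CIGAR", lambda p, r: 1.0, ["Guess the word."])
print(demo[0].reward)  # 1.0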

Running an SFT Warmup
When post-training a model, it is often helpful to "warm it up" with SFT (Supervised Fine-Tuning). In essence, SFT gives the model both the questions and the answers so it can train on the test. Starting with SFT is useful because it teaches the model the expected format before we let it play in the environment.
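Concretely, an SFT example is just a prompt paired with the answer we want the model to imitate. The record below is a hypothetical illustration; the actual field names produced by the dataset script may differ.

# A hypothetical SFT record (illustrative only; real field names may differ).
sft_example = {
    "prompt": (
        "You are playing Wordle. Your previous guess CRANE got feedback g y y x x.\n"
        "Reply with your next five-letter guess."
    ),
    "target": "CIGAR",
}
# During SFT the model is trained to produce "target" given "prompt",
# which teaches it the expected output format before RL begins.
print(sft_example["target"])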
To start, install the dependencies:
mkdir -p /workspace/.cache/uv
export UV_CACHE_DIR=/workspace/.cache/uv
uv venv
uv pip install -e .

Build the dataset:
1) $ uv run scripts/build_wordle_datasets.py \
    --output-dir data/generated \
    --train-count 200 --eval-count 20 --seed 13
Download the model:
2) $ bash models/install-qwen3-4b.sh
Convert the model from Hugging Face format to Megatron format for Slime:
3) $ export HF_CHECKPOINT=$PWD/models/Qwen3-4B && bash models/convert-qwen3-4b.sh
4) $ export REF_CKPT_PATH=${REF_CKPT_PATH:-$PWD/checkpoints/Qwen3-4B_torch_dist}
Then train with SFT:
export NUM_GPUS=4 # your 4×H100 box
export MASTER_ADDR=127.0.0.1 # or your node’s IP if needed
export SAVE_DIR=outputs/wordle_sft # optional
bash scripts/run_wordle_sft.sh

When it is complete, Ray will announce the job is finished.

What is GRPO?
For the remainder of the example, we will be using GRPO as our RL algorithm. GRPO (Group Relative Policy Optimization) was introduced by DeepSeek. It is a simple twist on PPO that runs a tiny "tournament" among several answers to the same prompt, then nudges the model to make the winning answers more likely and the losing ones less likely. GRPO has become popular because it is memory-friendly and stable, meaning training behaves predictably.
We are using GRPO because it excels when rewards (the score we give the model for its ability to do a task) are comparable across answers to the same question, which is exactly our setup for Wordle. We create 16 replicas of the same game, have the model play each one, and score how well it played. We then compute group statistics from those successes and failures and train the model on them.
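The group statistics boil down to a simple normalization: each answer's advantage is its reward minus the group mean, divided by the group's standard deviation. Below is a minimal sketch of that computation under our own naming, not Slime's implementation.

import statistics

def grpo_advantages(rewards):
    # rewards: one score per completion in the group (here, 16 plays of the same Wordle word).
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against division by zero when all rewards tie
    return [(r - mean) / std for r in rewards]

# Completions that beat the group average get a positive advantage (pushed up);
# those below it get a negative advantage (pushed down).
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]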
Running a GRPO Training Run
To track the progress of the training run, it is helpful to use a service like Weights & Biases (W&B). Create an account, create a new project, and copy the API key that is displayed.
Export the proper W&B environment variables for the training script:
export USE_WANDB=1
export WANDB_PROJECT=slime-wordle
export WANDB_GROUP=wordle-grpo-qwen3-4b
# Optional: export WANDB_TEAM=myteam
export WANDB_MODE=online # online | offline | disabled (default None)
export WANDB_KEY=your_wandb_api_key
export WANDB_DIR=/workspace/logs/wandb # Optional: log dir inside the repo

Now set up the environment variables needed to train:
export HF_CHECKPOINT=$PWD/models/Qwen3-4B
export REF_CKPT_PATH=$PWD/outputs/wordle_sft # warm start from SFT
export NUM_GPUS=4

Finally, we are ready to train with GRPO:
bash scripts/run_wordle_grpo.sh

After it loads, it will print a link to track the run on W&B if a valid API key was provided. The charts can also be viewed by going to W&B and clicking on "Projects"; the project should be titled something like ".../wordle-grpo-qwen3-4b".
Training will last several hours. As training progresses, the eval rate will begin to look like the chart below (a sign of success):

When it is complete, Ray will again report that the job was successful. Either download the model from the GPU cluster with an FTP client, or continue training for more steps to further increase the eval rate.
Congratulations, you have successfully post-trained a model. Now explore the codebase to understand how RL environments are constructed. Questions to think about:
How is the dataset generated?
How are the rewards generated?
Where is the main loop with Slime located and how does it handle all of the logic?