TRL: Transformer Reinforcement Learning

by Hugging Face

Advanced · Documentation · Free · Self-paced

The library and docs for post-training LLMs: SFT, DPO, PPO, GRPO, and reward modeling.

Reviewed July 3, 2026

Overview

TRL is a practical toolkit for the entire post-training stack. Its documentation covers supervised fine-tuning (SFT), preference optimization methods such as DPO, classic RLHF with PPO, newer reinforcement-learning approaches for reasoning such as GRPO, and reward-model training. Each method comes with a dedicated trainer class (for example, SFTTrainer) and example scripts. When you're ready to move past basic fine-tuning into alignment and preference tuning, these docs are where you'll spend most of your time.
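Reward-model training and DPO share the same pairwise preference loss. The framework-free sketch below is an illustration, not TRL's implementation. A reward model is trained with the Bradley-Terry loss −log σ(r_chosen − r_rejected), and DPO applies that same loss to an implicit reward β · (log π_policy − log π_ref) computed from the policy and a frozen reference model. The function names and the β = 0.1 default are illustrative assumptions. TRL's DPOTrainer works on batched PyTorch tensors and supports other loss variants.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss used to train a reward model.

    Low when the chosen response scores higher than the rejected one.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Sigmoid-form DPO loss for a single preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how much the policy moved away from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Same pairwise loss as reward modeling, applied to the implicit rewards.
    return reward_model_loss(chosen_reward, rejected_reward)


if __name__ == "__main__":
    # Made-up summed log-probs for one chosen and one rejected response.
    print(dpo_loss(-12.0, -15.0, -13.0, -14.0, beta=0.1))
```

When the policy and reference assign identical log-probabilities, both implicit rewards are 0 and the loss is log 2 (about 0.693). It falls below that as the policy raises the chosen response's likelihood relative to the reference more than the rejected one's.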

At a Glance

Topic: Fine-Tuning
Level: Advanced
Format: Documentation
Cost: Free
Duration: Self-paced
Provider: Hugging Face
Hands-on: Yes (code examples and runnable scripts)
Certificate: None

What You’ll Learn

✓ Supervised fine-tuning (SFT) with the SFTTrainer
✓ Preference tuning with DPO
✓ RLHF with PPO and reasoning with GRPO
✓ Training and using reward models
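GRPO avoids a learned value function by comparing several completions sampled for the same prompt against each other. Here is a minimal sketch of the group-relative advantage. It assumes one scalar reward per completion and normalizes by the group's population standard deviation. The function name and `eps` value are hypothetical, and TRL's GRPOTrainer may scale rewards differently.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """GRPO-style advantages for the completions sampled from ONE prompt.

    Each completion's advantage is its reward minus the group mean,
    divided by the group's population standard deviation (plus eps to
    avoid division by zero when every reward is identical).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Binary correctness rewards for four completions of the same prompt.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Completions that beat their group's average get positive advantages and are reinforced, while below-average ones are pushed down. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.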

Highlights

• Covers the full modern post-training stack
• Trainer classes and runnable scripts

Who It’s For

Best For

✓ Advanced practitioners doing alignment/preference tuning

Prerequisites

• PyTorch
• Fine-tuning fundamentals

FAQ

What is TRL: Transformer Reinforcement Learning?

Official docs for TRL, Hugging Face's library for supervised fine-tuning and preference/RL post-training of language models.

Is TRL: Transformer Reinforcement Learning free?

TRL: Transformer Reinforcement Learning is free to access.

What level is TRL: Transformer Reinforcement Learning for?

TRL: Transformer Reinforcement Learning is aimed at an advanced audience. Recommended background: PyTorch and fine-tuning fundamentals.

How long does TRL: Transformer Reinforcement Learning take?

There is no fixed duration. It is self-paced documentation that you work through at your own pace.

What will I learn from TRL: Transformer Reinforcement Learning?

You'll learn supervised fine-tuning (SFT) with the SFTTrainer, preference tuning with DPO, RLHF with PPO and reasoning with GRPO, and how to train and use reward models.

Topics

TRL · DPO · PPO · GRPO · SFT · alignment