TRL: Transformer Reinforcement Learning

by Hugging Face

Advanced · Documentation · Free · Self-paced

The library and docs for post-training LLMs: SFT, DPO, PPO, GRPO, and reward modeling.

Reviewed July 3, 2026

Overview

TRL is a practical toolkit for the entire post-training stack. Its documentation covers supervised fine-tuning (SFT), preference optimization methods like DPO, classic RLHF with PPO, newer reasoning approaches like GRPO, and reward-model training, each with a trainer class and example scripts. When you're ready to move past basic fine-tuning into alignment and preference tuning, these docs are where you'll live.
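To make "preference optimization" concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It does not use TRL and is not TRL's implementation; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions. The inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    Hypothetical helper for illustration only; names and the default
    beta are assumptions, not part of any library API.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = chosen_gain - rejected_gain

    # -log(sigmoid(z)) equals softplus(-z); use a numerically stable form.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# The policy matches the reference, so the loss sits at log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)

# The policy has shifted probability toward the chosen response, so the loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss only rewards the policy for widening the gap between chosen and rejected responses relative to the reference model, which is why no separate reward model is needed for this method.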

At a Glance

Topic: Fine-Tuning
Level: Advanced
Format: Documentation
Cost: Free
Duration: Self-paced
Provider: Hugging Face
Hands-on: Yes (code/exercises)
Certificate: None

What You’ll Learn

  • Supervised fine-tuning (SFT) with the SFTTrainer
  • Preference tuning with DPO
  • RLHF with PPO and reasoning with GRPO
  • Training and using reward models
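As a rough illustration of the idea behind GRPO from the list above, the sketch below computes group-relative advantages: several completions are sampled for one prompt, each is scored, and every score is normalized against its own group instead of against a learned value function. This is plain Python, not TRL code; the function name, the `eps` value, and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for a single prompt.

    Hypothetical helper for illustration; eps and the choice of population
    standard deviation are assumptions, not a library's documented behavior.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Completions that beat the group average get a positive advantage.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, scored 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every completion earns the same reward, there is no learning signal.
flat = group_relative_advantages([2.0, 2.0, 2.0])
```

Because the baseline is the group mean, a prompt where all samples score the same contributes zero advantage, which is one reason reward design and sampling diversity matter for this style of training.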

Highlights

  • Covers the full modern post-training stack
  • Trainer classes and runnable scripts

Who It’s For

Best For

  • Advanced practitioners doing alignment/preference tuning

Prerequisites

  • PyTorch
  • Fine-tuning fundamentals

FAQ

What is TRL: Transformer Reinforcement Learning?

Official docs for TRL, Hugging Face's library for supervised fine-tuning and preference/RL post-training of language models.

Is TRL: Transformer Reinforcement Learning free?

TRL: Transformer Reinforcement Learning is free to access.

What level is TRL: Transformer Reinforcement Learning for?

TRL: Transformer Reinforcement Learning is aimed at an advanced audience. Recommended background: PyTorch and fine-tuning fundamentals.

How long does TRL: Transformer Reinforcement Learning take?

It is self-paced, so there is no fixed duration. Most learners work through it at their own pace.

What will I learn from TRL: Transformer Reinforcement Learning?

You'll learn supervised fine-tuning (SFT) with the SFTTrainer, preference tuning with DPO, RLHF with PPO and reasoning with GRPO, and how to train and use reward models.

Topics

TRL, DPO, PPO, GRPO, SFT, alignment