The Maturing Art of Fine-Tuning Large Language Models

Post-training techniques for large language models (LLMs) have changed dramatically in recent years, moving from relatively simple pipelines to more complex, specialized ones. The shift reflects both a better understanding of how to shape model behavior and the practical challenges of scaling these processes.

From Basic Pipelines to Modular Architectures

The early days of LLM fine-tuning followed a standard pattern: supervised fine-tuning (SFT) on demonstration data, then reinforcement learning from human feedback (RLHF) using reward models. This approach, exemplified by InstructGPT and Llama 2, proved effective but had limitations in terms of scalability and specialization.
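To make the reward-modeling step concrete, here is a minimal sketch of the pairwise preference loss commonly used to train RLHF reward models. It assumes the reward model has already produced a scalar score for a preferred ("chosen") and a dispreferred ("rejected") response; the function name and the toy scores are illustrative, not taken from any specific codebase.

```python
import math


def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Return -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when it ranks them
    the wrong way round.
    """
    margin = r_chosen - r_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical scores: a correct ranking gives a lower loss than a wrong one.
    print(pairwise_reward_loss(2.0, 0.0))
    print(pairwise_reward_loss(0.0, 2.0))
```

In a real pipeline this loss would be averaged over a batch of human comparisons and backpropagated through the reward model; the RL stage then optimizes the policy against the trained scorer.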

More recent approaches have moved towards modular architectures in which multiple specialized models are trained independently before being combined, a pattern seen in DeepSeek V4 and MiMo Flash v2. These systems often leverage techniques like Multi-teacher On-Policy Distillation (MOPD), which trains domain-specific teacher models that then guide a general student model.
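The core idea of multi-teacher on-policy distillation can be sketched in a few lines. This is a toy illustration under stated assumptions, not the implementation used by any of the systems named above: it assumes each prompt carries a domain label, that the student has already sampled a response, and that both student and teacher expose a next-token probability distribution at each sampled position. The names `reverse_kl` and `mopd_loss` are hypothetical.

```python
import math
from typing import Dict, List

Dist = List[float]  # probability distribution over a toy vocabulary


def reverse_kl(student: Dist, teacher: Dist, eps: float = 1e-12) -> float:
    """KL(student || teacher) at a single token position."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student, teacher)
        if s > 0.0
    )


def mopd_loss(
    domain: str,
    student_dists: List[Dist],
    teacher_dists_by_domain: Dict[str, List[Dist]],
) -> float:
    """Average per-token reverse KL against the teacher for this domain.

    Assumption: every distribution was evaluated on a sequence the
    student itself sampled, which is what makes the signal on-policy.
    """
    teacher_dists = teacher_dists_by_domain[domain]
    per_token = [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    student = [[0.5, 0.5]]
    teachers = {"math": [[0.9, 0.1]], "code": [[0.5, 0.5]]}
    print(mopd_loss("math", student, teachers))  # positive: student disagrees
    print(mopd_loss("code", student, teachers))  # zero: student already matches
```

Routing each prompt to a single domain teacher is one simple design choice; a real system would compute these distributions from model logits and might weight or blend teachers differently.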

Key Recipes Along the Way:

* InstructGPT (March 2022): The foundational three-step process of SFT → reward modeling → RLHF
* Llama 2 (July 2023): Expanded on InstructGPT with multi-stage RLHF and separate reward models for helpfulness and safety
* Tulu 3 (November 2024): Streamlined the recipe to curated prompts followed by three training stages: SFT → DPO → RLVR (reinforcement learning with verifiable rewards)
* DeepSeek R1 (January 2025): Made reasoning-focused RL the central component; its R1-Zero variant was trained with pure RL, without an initial SFT stage
* MiMo Flash v2: Introduced MOPD, where multiple specialist models are distilled into a single unified model
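The RLVR idea that appears in the Tulu 3 entry replaces a learned reward model with a programmatic check. A minimal sketch, assuming completions end with a line of the form "Answer: <value>" (a hypothetical convention chosen for this example, not one prescribed by any of the recipes above):

```python
from typing import Optional


def extract_answer(completion: str, marker: str = "Answer:") -> Optional[str]:
    """Return the text after the last occurrence of the marker, or None."""
    idx = completion.rfind(marker)
    if idx == -1:
        return None
    return completion[idx + len(marker):].strip()


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 only if the extracted answer matches the reference."""
    answer = extract_answer(completion)
    if answer is not None and answer == reference.strip():
        return 1.0
    return 0.0


if __name__ == "__main__":
    print(verifiable_reward("2 + 2 is four.\nAnswer: 4", "4"))
    print(verifiable_reward("I think it is five.\nAnswer: 5", "4"))
    print(verifiable_reward("No final answer given.", "4"))
```

Because the reward is computed by a rule rather than a learned model, it cannot be gamed in the way a reward model can, but it only applies to tasks with checkable outputs such as math answers or unit-tested code.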

The Rise of Specialization and Distillation

As LLMs have grown in size and capability, it’s become clear that general-purpose models often benefit from specialized training. DeepSeek V3 evolved through several iterations, culminating in V4, which uses six specialist RL agents before distillation and illustrates the trend towards modularity.

The shift towards specialization is driven by both technical considerations (RL becomes more complex with mixed objectives) and organizational benefits (allowing teams to focus on specific domains).

What’s your experience been with post-training techniques? Let me know in the comments!