The Evolution of Reinforcement Fine-Tuning in AI

Travis Addair on Model Adaptation, SFT vs. RFT, Reward Functions, Data, & Future Trends.

Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.

Travis Addair is Co-Founder & CTO at Predibase. In this episode, the discussion centers on transforming pre-trained foundation models into domain-specific assets through advanced customization techniques. The conversation covers contrasting fine-tuning approaches—Supervised (SFT) versus Reinforcement (RFT)—and delves into the design of reward functions, challenges of data quality and scarcity, and real-world applications ranging from code generation to entity extraction. The episode also highlights the importance of UX and tooling in streamlining model deployment and explores future trends in post-training optimization.

Transcript

Below is a heavily edited excerpt, in Question & Answer format.

Q: What is Reinforcement Fine-Tuning (RFT) and how does it compare to Supervised Fine-Tuning (SFT)?

A: Reinforcement Fine-Tuning represents a merger of reinforcement learning methods with supervised fine-tuning tasks. While traditional reinforcement learning in LLMs has focused on subjective preference alignment (RLHF), RFT applies reinforcement learning techniques to objective tasks where there are clear right and wrong answers.

SFT treats model training as a memorization problem – the model is penalized for any deviation from reference examples. RFT, however, allows more flexibility by rewarding the model for directionally correct outputs, even if they don’t exactly match reference solutions. This enables models to discover creative approaches and sophisticated reasoning strategies that might be superior to what we could explicitly specify through SFT alone.

Q: What specific problems is RFT best suited for?

A: RFT excels at tasks with objective success criteria, particularly when partial credit can be given. Code generation tasks like natural language to SQL or code translation between frameworks are perfect examples. The model can be rewarded for getting function names correct or having proper syntax, even if the final output isn’t perfect.

RFT is especially valuable for reasoning tasks. As demonstrated by DeepSeek’s work, reinforcement learning can help models develop sophisticated reasoning strategies through self-discovery that would be impossible to achieve through supervised fine-tuning alone, which requires explicitly teaching every reasoning step.

Q: What are the key limitations of traditional supervised fine-tuning?

A: The biggest limitation of SFT is the high data requirement. Supervised fine-tuning typically needs thousands or tens of thousands of labeled examples, which can be difficult to obtain, especially for domain-specific tasks requiring expert knowledge.

SFT also struggles with generalization from small datasets. With limited examples, SFT models tend to overfit and just memorize the training data rather than learning underlying patterns. Additionally, SFT can’t effectively teach a model to discover novel reasoning approaches or solutions that differ from provided examples but might be equally or more effective.

Q: How does the workflow for RFT differ from supervised fine-tuning?

A: In SFT, the workflow is centered around providing demonstrations – showing the model examples of correct inputs and outputs. With RFT, you’re instead focused on providing critique of what the model is doing.

Think of yourself as a teacher grading papers. Rather than just marking answers right or wrong, you’re providing feedback on what aspects of the solution work and what needs improvement. This might involve writing reward functions that assess different components of the model’s output and assign appropriate scores.

The process is more interactive than SFT. You observe what the model is generating, identify where it’s getting stuck or making mistakes, then update your reward functions to help it overcome those obstacles.

Q: How many examples are typically needed for RFT versus SFT?

A: One of the major advantages of RFT is its sample efficiency. Where SFT might require thousands or tens of thousands of examples, RFT can show meaningful improvements with as few as 10 examples.

In our testing, even with just 10 examples, RFT models showed significant improvement over the baseline, while SFT models using the same 10 examples actually performed worse because they overfitted. There’s research supporting this trend, suggesting that “SFT memorizes, RL generalizes.”

The exact number needed depends on the complexity of your task, but starting with just 10-100 examples is often sufficient for RFT to begin learning effectively, while SFT typically requires at least 1,000 examples for meaningful gains.

Q: How do reward functions work in RFT?

A: Reward functions are essentially automated graders that evaluate the model’s outputs. They define what “good” looks like for your specific task, allowing you to score the model’s responses on a scale rather than just marking them right or wrong.

For example, in a code generation task, your reward function might check if the code compiles, if it produces the expected output, and if it follows best practices. For data extraction tasks, you might verify that the model correctly identified all required fields.

You don’t manually grade each output – instead, you write functions that automatically assess outputs against your criteria. This might involve comparing to ground truth when available, executing generated code to verify functionality, or applying other validation rules specific to your domain.
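
To make that concrete, here is a minimal Python sketch of an automated grader for a code-generation task. The function name, the partial-credit value, and the timeout are illustrative choices rather than anything from Predibase’s platform; the point is simply that the grader executes the output and returns a score instead of a binary judgment.

```python
import os
import subprocess
import tempfile

def code_reward(generated_code: str, expected_output: str) -> float:
    """Toy grader: run the generated Python code and compare its stdout to a
    known expected output, giving partial credit for running cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code)
        path = f.name
    try:
        result = subprocess.run(
            ["python", path], capture_output=True, text=True, timeout=10
        )
    except subprocess.TimeoutExpired:
        return 0.0                      # hung or too slow: no reward
    finally:
        os.unlink(path)                 # clean up the temp file
    if result.returncode != 0:
        return 0.0                      # crashed: no reward
    if result.stdout.strip() == expected_output.strip():
        return 1.0                      # correct output: full reward
    return 0.3                          # ran cleanly but wrong answer: partial credit
```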

Reward functions can get quite sophisticated, detecting when models are “reward hacking” (finding loopholes that maximize rewards without achieving the actual goal) and evolving to provide more nuanced guidance as training progresses.

Q: Can you provide examples of how RFT works in practical applications?

A: For code translation (like PyTorch to Triton), we can use execution-based rewards. We execute both the original code and the translated code with test inputs, then verify they produce the same outputs. This approach requires minimal labeled data – just source code and test cases.
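
Below is a minimal sketch of what such an execution-based reward might look like, written with plain Python functions for readability; a real PyTorch-to-Triton check would compare tensors with something like torch.allclose rather than ==.

```python
def equivalence_reward(reference_fn, candidate_fn, test_inputs) -> float:
    """Score the candidate by the fraction of test inputs on which it
    reproduces the reference implementation's output."""
    passed = 0
    for args in test_inputs:
        try:
            if candidate_fn(*args) == reference_fn(*args):
                passed += 1
        except Exception:
            pass                        # crashes earn no credit for that case
    return passed / len(test_inputs)

# Usage: compare a reference implementation against model-generated code.
reference = lambda x: sorted(x)
candidate = lambda x: sorted(x, reverse=False)   # stand-in for generated code
print(equivalence_reward(reference, candidate, [([3, 1, 2],), ([5, 4],)]))  # 1.0
```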

For entity extraction tasks, we can write reward functions that evaluate each field independently. If the model correctly extracts a company name but misses the address, we might give it partial credit (0.2 out of 1.0). This granular feedback helps the model learn more efficiently than binary right/wrong signals.
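
As an illustration, a field-level reward along these lines might give each ground-truth field an equal share of the score; the equal weighting is an assumption for the sketch, not the exact scheme described above.

```python
def extraction_reward(predicted: dict, ground_truth: dict) -> float:
    """Award an equal share of the score for each correctly extracted field,
    so partially correct extractions still earn partial credit."""
    if not ground_truth:
        return 0.0
    correct = sum(
        1 for field, value in ground_truth.items()
        if predicted.get(field) == value
    )
    return correct / len(ground_truth)

# Example: company name correct, address missed -> 0.5 instead of 0.0.
truth = {"company": "Acme Corp", "address": "123 Main St"}
pred = {"company": "Acme Corp", "address": None}
print(extraction_reward(pred, truth))  # 0.5
```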

For reasoning tasks, we can reward intermediate steps toward a solution. This is what enabled DeepSeek to induce their “aha moment” where models discovered sophisticated reasoning strategies through self-exploration.

One important consideration in practical applications is preventing reward hacking. In our PyTorch to Triton work, the model initially tried to “cheat” by solving the problem in PyTorch and then writing a Triton function that just returned that result. We had to modify our reward function to detect and penalize this behavior.
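
The exact fix isn’t spelled out in the episode, but the general shape of such a guard is simple: inspect the generated source and withhold reward when the supposed Triton kernel delegates its computation back to PyTorch. The regex and the list of operations below are placeholders for illustration only.

```python
import re

# Ops that suggest the "kernel" is really doing its math in PyTorch.
PYTORCH_COMPUTE = re.compile(r"torch\.(matmul|mm|softmax|conv\w*|einsum)\(")

def guard_against_wrapper_hack(triton_source: str, execution_reward: float) -> float:
    """Zero out the execution-based reward if the generated Triton kernel
    appears to delegate its computation back to PyTorch."""
    if PYTORCH_COMPUTE.search(triton_source):
        return 0.0
    return execution_reward
```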

Q: Should teams use RFT instead of SFT, or can they be complementary?

A: SFT and RFT are complementary approaches that work well together. As demonstrated in DeepSeek’s work comparing R1-Zero and R1, using SFT to establish baseline reasoning abilities before applying RFT can address the “cold start” problem where a model isn’t sufficiently tuned for your specific task.

A common workflow is to use SFT with available labeled data to give the model a foundation in your domain, then use RFT to discover more advanced capabilities or optimize for specific metrics. This combination often yields better results than either approach alone.

For tasks where you have plenty of high-quality labeled data, SFT might be sufficient. For tasks with limited data or where you want to push beyond the limitations of your training examples, RFT becomes valuable.

Q: What timeframe should teams expect for RFT compared to SFT?

A: RFT is generally more computationally intensive per step than SFT, since each iteration involves generating multiple examples, scoring them, computing losses for both the original and new model, and then backpropagating.
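
The episode doesn’t name a specific algorithm, but DeepSeek-style recipes such as GRPO sample a group of completions per prompt, score each one with the reward functions, and normalize the scores within the group before updating the policy. The toy sketch below shows only that normalization step, to illustrate why a single RFT iteration does more work than a single SFT step.

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize reward scores within one group of completions for the same
    prompt; completions above the group mean get a positive advantage."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Four sampled completions for one prompt, scored by the reward functions.
print(group_advantages([0.2, 0.9, 0.2, 0.5]))
```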

However, because RFT requires fewer examples, the total training time can be comparable. You should budget a few hours for the RFT process, though more complex tasks may take longer. Platforms like Predibase handle the infrastructure aspects, so you can make adjustments to reward functions during training without worrying about crashes or failures.

As the technology matures, we expect RFT to become faster and more accessible, just as SFT has evolved from a complex research technique to a commodity service.

Q: How is the user experience for RFT evolving?

A: The goal is to make RFT accessible to domain experts without requiring deep technical expertise. While current implementations often involve writing reward functions in code, we’re working on natural language interfaces where experts can specify criteria in plain English.

The ideal workflow might involve showing a domain expert examples of the model’s current outputs, allowing them to identify what’s wrong with each, then using that feedback to automatically construct reward functions. This would enable a more intuitive, no-code approach to model customization.

Future platforms will likely combine SFT and RFT in unified workflows, where users don’t need to worry about the underlying techniques but can simply follow a guided process that uses the most appropriate method for their data and requirements.

As foundation models continue to evolve, including reasoning-enhanced and multimodal models, the tools for customizing them will need to evolve as well. The companies that succeed will be those that can efficiently adapt these powerful models to their specific domains and use cases.