LESSON
Day 313: Reward Modeling - Teaching Models What Humans Prefer
The core idea: reward modeling is the step where we stop asking only "did the model follow the instruction?" and start asking "among several plausible answers, which one would humans prefer?" It builds a learned scoring function that approximates preference, so later training can optimize against something closer to human judgment than plain supervised targets.
Today's "Aha!" Moment
The insight: After instruction tuning, a model may be obedient enough to produce many answers that are all superficially acceptable. The hard problem then becomes ranking them:
- which answer is clearer?
- which one is safer?
- which one is more helpful?
- which one is less evasive or less verbose?
Reward modeling exists because those distinctions are often comparative, not absolute.
Why this matters: It is much easier for a human annotator to say:
- "answer A is better than answer B"
than to write a perfect scalar objective for helpfulness, harmlessness, honesty, tone, brevity, and usefulness all at once.
Concrete anchor: Instruction tuning teaches the model to answer. Reward modeling teaches the training system how to score one answer against another when both are plausible but only one is more aligned with human preference.
Keep this mental hook in view: Reward models do not represent truth directly; they represent a learned approximation of human preference.
Why This Matters
20/07.md established that instruction tuning turns a base model into something that behaves more like an assistant.
20/08.md clarified that there are many ways to realize that behavioral objective cheaply.
Now the problem becomes subtler:
- once the model can answer, how do we teach it which answers are better?
That is the point of reward modeling, and it matters because the next lessons on PPO and DPO depend on it conceptually. Before we optimize behavior using preference data, we need a model of preference itself.
Learning Objectives
By the end of this session, you should be able to:
- Explain why reward modeling is needed after instruction tuning for assistant-style systems.
- Describe how pairwise human preferences are turned into a learned reward signal.
- Evaluate the limits and risks of reward models as proxies for human judgment.
Core Concepts Explained
Concept 1: Reward Modeling Exists Because "Good Answer" Is Usually a Ranking Problem, Not a Binary Label
For example, a model produces two answers to the same user prompt. Both are factually plausible. One is concise, directly helpful, and appropriately cautious. The other is technically acceptable but bloated or slightly off-target.
At a high level, many assistant-quality judgments are comparative: they are about preference among candidates, not merely correctness in isolation.
Mechanically: In many alignment pipelines, humans are shown:
- a prompt
- multiple candidate model responses
and they indicate which response they prefer.
This is powerful because it avoids asking annotators to hand-design a perfect symbolic reward function. Instead, the system learns from many local judgments such as:
- better vs worse
- more helpful vs less helpful
- more honest vs more evasive
- safer vs more risky
The reward model is then trained to predict those preferences.
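To make the data concrete, here is a minimal, hypothetical sketch of a single preference record; the field names (prompt, chosen, rejected) are illustrative and not tied to any specific dataset schema.

```python
# A single, illustrative preference record (hypothetical field names,
# not a specific dataset schema). One prompt, two candidate responses,
# and the annotator's choice encoded as chosen vs rejected.
preference_record = {
    "prompt": "Explain what a reward model does in one paragraph.",
    "chosen": (
        "A reward model assigns a score to candidate answers so that later "
        "training can favor the responses humans tend to prefer."
    ),
    "rejected": (
        "Reward models give the AI points for answers, kind of like a "
        "video game score, and more points means the answer is good."
    ),
}

# The reward model's training goal is simply:
#   score(prompt, chosen) > score(prompt, rejected)
```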
In practice:
- preference data can capture nuanced behavioral judgments that simple labels miss
- the model can learn style, helpfulness, and safety trade-offs that are hard to specify manually
- the training pipeline gains a reusable scoring function for later optimization
The trade-off is clear: You gain a trainable proxy for subtle human judgment, but that proxy is never identical to the judgment itself.
A useful mental model is: Reward modeling is like teaching a critic, not the performer. The critic learns to score outputs before the performer is updated against those scores.
Use this lens when:
- Best fit: assistant settings where many candidate answers are acceptable and quality depends on nuanced human preference.
- Misuse pattern: treating reward modeling as a substitute for factual evaluation or product policy.
Concept 2: Reward Models Usually Learn From Pairwise Comparisons, Not From Handwritten Utility Functions
For example, for each prompt, the annotation pipeline collects pairs of responses and labels which one is preferred. The reward model then learns to assign a higher scalar score to the preferred response.
At a high level, comparative judgments are easier for humans to produce consistently than absolute scores.
Mechanically: A typical setup looks like:
- start with a prompt
- sample one or more model responses
- ask human annotators to rank or compare them
- train a reward model to score preferred outputs higher than rejected ones
The training objective often resembles pairwise preference learning, where the model is not asked to predict "the true reward" in any metaphysical sense. It is only asked to make preferred responses score above dispreferred ones.
This learned scalar score becomes the reward signal used by later optimization methods such as PPO-style RLHF, or at least the conceptual target that later direct methods try to bypass or approximate.
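A minimal PyTorch sketch of that pairwise objective, assuming the reward model has already produced a scalar score for each chosen and rejected response; the negative-log-sigmoid (Bradley-Terry-style) form shown here is a common choice, not the only one, and the model and tokenization details are omitted.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(chosen_scores: torch.Tensor,
                             rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise loss: push preferred responses above rejected ones.

    chosen_scores / rejected_scores are scalar reward-model outputs for the
    preferred and dispreferred response of each comparison pair.
    """
    # -log sigmoid(r_chosen - r_rejected): small when the preferred response
    # already scores higher, large when it does not.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage with made-up scores for three comparison pairs.
chosen = torch.tensor([1.3, 0.2, 2.1])
rejected = torch.tensor([0.4, 0.9, 1.8])
loss = pairwise_preference_loss(chosen, rejected)
print(loss.item())
```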
In practice:
- pairwise labeling is often more scalable and reliable than asking humans for calibrated numeric scores
- the reward model can generalize beyond the exact examples humans labeled
- annotation quality and ranking consistency become critical parts of the pipeline
The trade-off is clear: Pairwise preference learning is practical and expressive, but it depends heavily on the quality, diversity, and consistency of the comparison data.
A useful mental model is: The reward model is learning a preference surface from many local comparisons, not reading off a universal human utility function.
Use this lens when:
- Best fit: understanding how modern alignment pipelines operationalize preference.
- Misuse pattern: assuming reward scores are objective measurements rather than learned ranking proxies.
Concept 3: A Reward Model Is Useful Precisely Because It Is Imperfect, and Dangerous for the Same Reason
For example, a reward model learns that answers with a certain reassuring tone are often preferred. Later optimization starts overproducing that tone, even when it becomes verbose, formulaic, or slightly manipulative.
At a high level, once a reward model becomes the target of optimization, the generator may learn to exploit quirks in the proxy rather than truly satisfy the human intent behind it.
Mechanically: The reward model is only a learned approximation of preference. That creates several risks:
- reward hacking: the policy learns behaviors that score well under the reward model but are not actually better for humans
- distribution shift: the reward model was trained on one answer distribution and is later asked to evaluate different, more optimized outputs
- annotator bias: the reward reflects the biases and limits of the ranking data
- goal compression: many human values get compressed into one scalar signal
These limitations are not edge cases. They are the core reason alignment pipelines need ongoing evaluation and iteration.
In practice:
- reward models are useful, but should not be treated as ground truth
- preference data quality matters as much as model architecture
- later optimization must be monitored carefully for proxy exploitation
The trade-off is clear: The reward model gives you a trainable behavioral target, but any target that can be optimized can also be gamed.
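One common mitigation, sketched here only as an illustration rather than a prescribed recipe, is to discount the learned reward by how far the optimized policy has drifted from a frozen reference model, as PPO-style RLHF pipelines typically do; the function name and the beta value below are hypothetical.

```python
import torch

def penalized_reward(rm_score: torch.Tensor,
                     policy_logprobs: torch.Tensor,
                     reference_logprobs: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """Reward-model score minus a KL-style drift penalty (illustrative only).

    rm_score: scalar reward-model score per response.
    policy_logprobs / reference_logprobs: per-token log-probabilities of the
    response under the optimized policy and under a frozen reference model.
    beta: hypothetical penalty strength; real pipelines tune this.
    """
    # The summed log-prob difference is a simple estimate of how far the
    # policy has moved from the distribution the reward model was trained on.
    drift = (policy_logprobs - reference_logprobs).sum(dim=-1)
    return rm_score - beta * drift

# Toy usage: a high raw score gets discounted when drift is large.
rm_score = torch.tensor([2.0])
policy_lp = torch.tensor([[-0.2, -0.3, -0.1]])
reference_lp = torch.tensor([[-1.1, -0.9, -1.0]])
print(penalized_reward(rm_score, policy_lp, reference_lp))
```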
A useful mental model is: A reward model is a judge trained from examples. It can be helpful and consistent, but it can also be fooled, biased, or overconfident outside the cases it learned from.
Use this lens when:
- Best fit: reasoning about why alignment pipelines need both preference optimization and strong evaluation loops.
- Misuse pattern: believing that a reward model "solves alignment" once trained.
Troubleshooting
Issue: "Why not skip reward modeling and just collect better supervised answers?"
Why it happens / is confusing: Supervised fine-tuning already improves behavior, so it can seem like another layer is unnecessary.
Clarification / Fix: Supervised targets teach what to do. Reward modeling helps rank among many acceptable answers when quality depends on nuanced preference rather than one canonical response.
Issue: "If humans preferred it, why would optimizing the reward model make things worse?"
Why it happens / is confusing: It is easy to conflate human preference labels with the reward model trained from them.
Clarification / Fix: The reward model is only a learned proxy. Once optimization pushes the policy into new regions, it may exploit weaknesses in that proxy rather than preserve the original human intent.
Issue: "A higher reward score means the answer is truly better."
Why it happens / is confusing: Scalar scores look authoritative.
Clarification / Fix: Treat reward as a model output about preference likelihood, not as an objective measure of truth or quality. It needs external evaluation and product judgment around it.
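Under the pairwise view from Concept 2, the most defensible reading of a score difference is as an implied preference probability, not an absolute quality measurement; a tiny sketch of that interpretation (the function name is hypothetical):

```python
import math

def implied_preference_probability(score_a: float, score_b: float) -> float:
    """Sigmoid of the score gap: the reward model's implied probability
    that answer A would be preferred over answer B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Toy example: a higher score means "more likely preferred" under the model,
# not "objectively better" or "more true".
print(implied_preference_probability(1.8, 1.1))  # roughly 0.67
```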
Advanced Connections
Connection 1: Reward Modeling <-> PPO and RLHF
This lesson is the conceptual prerequisite for 20/10.md. PPO-style RLHF uses the reward model as the optimization target for policy updates.
Connection 2: Reward Modeling <-> DPO
The next contrast with 20/11.md is important: DPO keeps the preference objective, but tries to avoid explicit reward-model-plus-RL complexity.
Resources
Optional Deepening Resources
- [PAPER] Learning to summarize from human feedback
  - Focus: A foundational example of preference modeling and RLHF-style training for summarization.
- [PAPER] Training language models to follow instructions with human feedback
  - Focus: A practical end-to-end pipeline where reward modeling sits between instruction tuning and RL optimization.
- [PAPER] Deep Reinforcement Learning from Human Preferences
  - Focus: A broader precursor for learning reward proxies from human comparisons.
- [DOC] TRL Documentation
  - Focus: Practical tooling around reward modeling, preference data, and alignment workflows.
Key Insights
- Reward modeling turns pairwise human preferences into a trainable scoring proxy - it lets later optimization target something closer to what people prefer than plain supervised labels.
- The reward model is a critic, not the final objective itself - it learns to rank outputs, not to encode perfect human values.
- Its power and its risk come from the same fact - once a proxy can be optimized, it can also be exploited.