Day 313: Reward Modeling - Teaching Models What Humans Prefer

LLM Training, Alignment, and Serving · Lesson 009 · 30 min · Intermediate

The core idea: reward modeling is the step where we stop asking only "did the model follow the instruction?" and start asking "among several plausible answers, which one would humans prefer?" It builds a learned scoring function that approximates preference, so later training can optimize against something closer to human judgment than plain supervised targets.


Today's "Aha!" Moment

The insight: After instruction tuning, a model may be obedient enough to produce many answers that are all superficially acceptable. The hard problem then becomes ranking them: which answer is more helpful, more honest, better in tone, closer to what the user actually wanted? Reward modeling exists because those distinctions are often comparative, not absolute.

Why this matters: It is much easier for a human annotator to say "response A is better than response B" than to write a perfect scalar objective for helpfulness, harmlessness, honesty, tone, brevity, and usefulness all at once.

Concrete anchor: Instruction tuning teaches the model to answer. Reward modeling teaches the training system how to score one answer against another when both are plausible but only one is more aligned with human preference.

Keep this mental hook in view: Reward models do not represent truth directly; they represent a learned approximation of human preference.


Why This Matters

20/07.md established that instruction tuning turns a base model into something that behaves more like an assistant.

20/08.md clarified that there are many ways to realize that behavioral objective cheaply.

Now the problem becomes subtler: when an instruction-tuned model can produce several acceptable answers to the same prompt, which one do people actually prefer, and how do we train toward that preference?

That is the point of reward modeling, and it matters because the next lessons on PPO and DPO depend on it conceptually. Before we optimize behavior using preference data, we need a model of preference itself.


Learning Objectives

By the end of this session, you should be able to:

  1. Explain why reward modeling is needed after instruction tuning for assistant-style systems.
  2. Describe how pairwise human preferences are turned into a learned reward signal.
  3. Evaluate the limits and risks of reward models as proxies for human judgment.

Core Concepts Explained

Concept 1: Reward Modeling Exists Because "Good Answer" Is Usually a Ranking Problem, Not a Binary Label

For example, a model produces two answers to the same user prompt. Both are factually plausible. One is concise, directly helpful, and appropriately cautious. The other is technically acceptable but bloated or slightly off-target.

At a high level: Many assistant-quality judgments are comparative. They are about preference among candidates, not merely correctness in isolation.

Mechanically: In many alignment pipelines, humans are shown a prompt together with two or more candidate responses to it, and they indicate which response they prefer.

This is powerful because it avoids asking annotators to hand-design a perfect symbolic reward function. Instead, the system learns from many local judgments such as "for this prompt, response A is better than response B."

The reward model is then trained to predict those preferences.
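
To make that concrete, here is a purely illustrative sketch of what one such comparison might look like as a data record; the field names are hypothetical, not a specific dataset's schema.

```python
# One pairwise comparison as it might be stored for reward-model training.
# Field names are illustrative, not any particular dataset's schema.
preference_example = {
    "prompt": "How should I reply to a frustrated customer email?",
    "response_a": "Acknowledge the frustration, apologize briefly, and offer one concrete next step.",
    "response_b": "Customers get frustrated for many reasons; there are several schools of thought on email tone...",
    "preferred": "response_a",  # the annotator's comparative judgment, not an absolute score
}
```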

In practice: this is why assistant pipelines collect large numbers of human comparisons instead of trying to write down a single formula for what makes an answer good.

The trade-off is clear: You gain a trainable proxy for subtle human judgment, but that proxy is never identical to the judgment itself.

A useful mental model is: Reward modeling is like teaching a critic, not the performer. The critic learns to score outputs before the performer is updated against those scores.

Use this lens when: you are deciding whether supervised examples alone can capture what "better" means for a task, or whether you need preference data on top of them.

Concept 2: Reward Models Usually Learn From Pairwise Comparisons, Not From Handwritten Utility Functions

For example, for each prompt, the annotation pipeline collects pairs of responses and labels which one is preferred. The reward model then learns to assign a higher scalar score to the preferred response.

At a high level: Comparative judgments are easier for humans to produce consistently than absolute scores.

Mechanically: A typical setup looks like:

  1. start with a prompt
  2. sample one or more model responses
  3. ask human annotators to rank or compare them
  4. train a reward model to score preferred outputs higher than rejected ones

The training objective often resembles pairwise preference learning, where the model is not asked to predict "the true reward" in any metaphysical sense. It is only asked to make preferred responses score above dispreferred ones.
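
To make that objective concrete, here is a minimal PyTorch sketch of pairwise preference training. The `TinyRewardModel` and its random feature inputs are stand-ins (a real reward model would reuse a pretrained transformer over the prompt and response text); the part that matters is the loss, which only pushes the preferred response's score above the rejected one's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Toy stand-in for "LM backbone + scalar head". A real reward model
    would encode the (prompt, response) text with a pretrained transformer;
    here random feature vectors stand in for those encodings."""
    def __init__(self, feature_dim: int = 16):
        super().__init__()
        self.backbone = nn.Linear(feature_dim, 32)  # stand-in for the language model
        self.score_head = nn.Linear(32, 1)          # maps hidden state to one scalar score

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score_head(torch.tanh(self.backbone(features))).squeeze(-1)

def pairwise_preference_loss(score_chosen: torch.Tensor,
                             score_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry-style objective: only the score *difference* matters,
    # so the model is never asked to predict a "true" reward on an absolute scale.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# One illustrative update on dummy features for a batch of 8 comparisons.
model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

chosen_feats = torch.randn(8, 16)    # features of the preferred responses
rejected_feats = torch.randn(8, 16)  # features of the rejected responses

optimizer.zero_grad()
loss = pairwise_preference_loss(model(chosen_feats), model(rejected_feats))
loss.backward()
optimizer.step()
print(float(loss))
```

Because the loss depends only on the score difference, the absolute scale of the reward is arbitrary, which is one more reason reward scores should not be read as objective quality measurements.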

This learned scalar score becomes the reward signal used by later optimization methods such as PPO-style RLHF, or at least the conceptual target that later direct methods try to bypass or approximate.
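
As a simple usage sketch, the same scalar can also rank several sampled responses and keep the highest-scoring one, a best-of-n style use of the critic. The `FakeRewardModel` and its `score` method below are hypothetical stand-ins, not a real library interface.

```python
class FakeRewardModel:
    """Hypothetical stand-in: a real reward model would run an LM with a
    scalar head over the full (prompt, response) text."""
    def score(self, prompt: str, response: str) -> float:
        # Toy heuristic purely for illustration: word overlap with the prompt.
        return float(len(set(prompt.lower().split()) & set(response.lower().split())))

def pick_best_response(reward_model, prompt, candidate_responses):
    # Score every sampled candidate with the critic and keep the top one.
    scored = [(reward_model.score(prompt, r), r) for r in candidate_responses]
    best_score, best_response = max(scored, key=lambda pair: pair[0])
    return best_response, best_score

# Illustrative best-of-3 selection on toy candidates.
prompt = "How do I reset a forgotten router password?"
candidates = [
    "Hold the reset button for ten seconds, then log in with the default password and change it.",
    "Routers are networking devices used in homes and offices.",
    "Try turning it off and on.",
]
print(pick_best_response(FakeRewardModel(), prompt, candidates))
```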

In practice: the reward model is only as good as the comparisons it is trained on, so annotation guidelines, rater agreement, and prompt coverage matter as much as the modeling choices.

The trade-off is clear: Pairwise preference learning is practical and expressive, but it depends heavily on the quality, diversity, and consistency of the comparison data.

A useful mental model is: The reward model is learning a preference surface from many local comparisons, not reading off a universal human utility function.

Use this lens when: you want to know where the "reward" in RLHF actually comes from, or when you are tempted to ask annotators for absolute quality scores instead of comparisons.

Concept 3: A Reward Model Is Useful Precisely Because It Is Imperfect, and Dangerous for the Same Reason

For example, a reward model learns that answers with a certain reassuring tone are often preferred. Later optimization starts overproducing that tone, even when it becomes verbose, formulaic, or slightly manipulative.

At a high level: Once a reward model becomes the target of optimization, the generator may learn to exploit quirks in the proxy rather than truly satisfy the human intent behind it.

Mechanically: The reward model is only a learned approximation of preference. That creates several risks:

  1. the policy being optimized can find outputs that score well by exploiting quirks of the proxy rather than genuinely serving the user
  2. biases in the comparison data, such as a preference for reassuring or verbose phrasing, get baked into the score
  3. the reward model can be confidently wrong on prompts and styles far from the comparisons it was trained on

These limitations are not edge cases. They are the core reason alignment pipelines need ongoing evaluation and iteration.
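
One lightweight diagnostic, sketched below with illustrative helper names and toy data, is to check how strongly reward scores track surface features such as response length: a very high correlation suggests the critic may be rewarding verbosity rather than substance.

```python
from statistics import correlation  # Python 3.10+

def length_bias_check(scores: list[float], responses: list[str]) -> float:
    """Rough diagnostic: if reward scores correlate strongly with response
    length, the critic may be rewarding verbosity rather than quality."""
    word_counts = [len(r.split()) for r in responses]
    return correlation(scores, word_counts)  # Pearson correlation

# Toy illustration: scores that happen to rise with length are a warning sign.
scores = [0.2, 0.9, 1.4, 2.1]
responses = [
    "Short direct answer.",
    "A somewhat longer answer with a little extra reassurance.",
    "An even longer answer that restates the question and adds filler.",
    "A very long answer " + "with repeated reassuring filler " * 5,
]
print(length_bias_check(scores, responses))  # close to 1.0 -> investigate
```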

In practice: teams keep re-checking optimized outputs with fresh human evaluation rather than trusting rising reward scores alone.

The trade-off is clear: The reward model gives you a trainable behavioral target, but any target that can be optimized can also be gamed.

A useful mental model is: A reward model is a judge trained from examples. It can be helpful and consistent, but it can also be fooled, biased, or overconfident outside the cases it learned from.

Use this lens when: reward scores keep climbing during optimization but human evaluations or real users do not seem any happier.


Troubleshooting

Issue: "Why not skip reward modeling and just collect better supervised answers?"

Why it happens / is confusing: Supervised fine-tuning already improves behavior, so it can seem like another layer is unnecessary.

Clarification / Fix: Supervised targets teach what to do. Reward modeling helps rank among many acceptable answers when quality depends on nuanced preference rather than one canonical response.

Issue: "If humans preferred it, why would optimizing the reward model make things worse?"

Why it happens / is confusing: It is easy to conflate human preference labels with the reward model trained from them.

Clarification / Fix: The reward model is only a learned proxy. Once optimization pushes the policy into new regions, it may exploit weaknesses in that proxy rather than preserve the original human intent.

Issue: "A higher reward score means the answer is truly better."

Why it happens / is confusing: Scalar scores look authoritative.

Clarification / Fix: Treat reward as a model output about preference likelihood, not as an objective measure of truth or quality. It needs external evaluation and product judgment around it.


Advanced Connections

Connection 1: Reward Modeling <-> PPO and RLHF

This lesson is the conceptual prerequisite for 20/10.md. PPO-style RLHF uses the reward model as the optimization target for policy updates.

Connection 2: Reward Modeling <-> DPO

The next contrast with 20/11.md is important: DPO keeps the preference objective, but tries to avoid explicit reward-model-plus-RL complexity.


Key Insights

  1. Reward modeling turns pairwise human preferences into a trainable scoring proxy - it lets later optimization target something closer to what people prefer than plain supervised labels.
  2. The reward model is a critic, not the final objective itself - it learns to rank outputs, not to encode perfect human values.
  3. Its power and its risk come from the same fact - once a proxy can be optimized, it can also be exploited.
