Day 101: Logistic Regression Fundamentals

Logistic regression matters because many ML problems are not really "predict a number" problems. They are "how likely is this example to belong to the positive class?" problems.


Today's "Aha!" Moment

Linear regression predicted a number. Logistic regression keeps the same core idea of a weighted score, but changes what that score means.

Keep one example throughout the lesson. The learning platform wants to predict whether a student is at high risk of dropping out of a cohort. Inputs might include attendance, assignment completion, quiz trend, and recent inactivity. We do not just want a hard yes/no label immediately. What we really want first is a risk estimate: how likely is dropout for this student?

That is the aha. Logistic regression does not jump straight from features to a class label. It first builds a linear score from the features, then converts that score into a probability between 0 and 1. Only after that does the system apply a threshold to turn probability into an action.

Once you see the model in that order, the name becomes less confusing. It is called "regression" because it still learns weights for a linear score, but its job is classification. That makes it a very clean bridge from the regression lessons to the world of classifiers.


Why This Matters

The problem: Many systems need to classify uncertain cases, but a hard label alone hides the most useful part of the model output: how confident the model is.

Before: The system emits only a hard yes/no label, so a barely-risky student and an almost-certain dropout look identical downstream.

After: The system emits a probability first, so it can rank cases by risk, route uncertain ones to humans, and apply a threshold chosen for the real cost of mistakes.

Real-world impact: Logistic regression remains a strong production baseline for spam filtering, churn prediction, medical screening, moderation, fraud scoring, and triage because it is simple, fast, and interpretable enough to support real decisions.


Learning Objectives

By the end of this session, you will be able to:

  1. Explain how logistic regression produces a class probability - Connect a linear score, the sigmoid function, and a probability output.
  2. Explain why training optimizes probabilities rather than just labels - Understand the role of log loss at a high level.
  3. Reason about thresholds as operating policy - See why the best threshold depends on the cost of mistakes.

Core Concepts Explained

Concept 1: Logistic Regression Converts a Linear Score into a Probability

The model starts exactly where recent lessons would lead you to expect: with a weighted sum of features.

For the dropout-risk example:

score =
    bias
  + w1 * attendance
  + w2 * quiz_trend
  + w3 * assignment_completion
  + ...
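As a concrete sketch of that weighted sum, here is the score computed in Python. The weights and feature values are purely hypothetical; in practice the model learns the weights from data:

```python
# Hypothetical learned parameters for the dropout-risk example.
# Negative weights mean higher feature values lower the dropout score.
bias = -2.0
weights = {"attendance": -1.5, "quiz_trend": -0.8, "assignment_completion": -1.2}

# One student's (hypothetical) feature values, scaled to comparable ranges.
student = {"attendance": 0.4, "quiz_trend": -0.2, "assignment_completion": 0.5}

score = bias + sum(weights[name] * student[name] for name in weights)
```

A high-attendance, high-completion student ends up with a strongly negative score, which (after the sigmoid below) corresponds to a low dropout probability.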

But unlike linear regression, that raw score is not the final prediction. It is passed through the sigmoid function, which squashes any real number into the range 0 to 1.

import math

def sigmoid(z):
    # Squash any real-valued score into the interval (0, 1).
    return 1 / (1 + math.exp(-z))

score = 1.8
dropout_probability = sigmoid(score)  # about 0.86

This changes the interpretation completely: the raw score is unbounded evidence, while the sigmoid output is a bounded value you can read as the model's estimated probability of the positive class.

The useful mental model is simple: the linear score gathers evidence, and the sigmoid turns that evidence into a probability-like confidence for the positive class.

The trade-off is simplicity versus expressive power. You get a clean interpretable classifier with a linear decision boundary, but it may miss highly nonlinear class structure unless features are engineered well.

Concept 2: Training Cares About Confidence, Not Just Being on the Correct Side of the Boundary

Suppose two students really do drop out:

Student A receives a predicted dropout probability of 0.55. Student B receives a predicted dropout probability of 0.95.

If the threshold is 0.5, both are technically classified correctly. But those predictions are not equally good. Student B was recognized with much stronger confidence. Logistic regression training reflects that difference.

This is why the model is usually trained with log loss. At a high level, log loss rewards high probability on the correct class and penalizes confident mistakes strongly.

good:
  true positive with high probability

bad:
  true positive with very low probability
  false positive with very high probability
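These cases can be made concrete with per-example log loss. A minimal sketch, with illustrative probabilities:

```python
import math

def log_loss(y_true, p):
    # Per-example log loss: -log(p) for a positive label (y=1),
    # -log(1 - p) for a negative label (y=0).
    return -math.log(p) if y_true == 1 else -math.log(1 - p)

loss_confident_hit = log_loss(1, 0.95)   # small penalty: correct and confident
loss_weak_hit = log_loss(1, 0.55)        # larger: correct side, weak confidence
loss_confident_miss = log_loss(0, 0.95)  # largest: a confident mistake
```

The ordering of the penalties is the point: a confident mistake costs far more than a weakly confident correct prediction, which is what pushes training toward honest probabilities.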

That means the training process is not only trying to put examples on the correct side of a decision boundary. It is also trying to produce useful probability estimates.

This matters in practice because many real systems do more than emit labels. They rank cases by risk, send uncertain cases to humans, or apply different workflows depending on model confidence.

The trade-off is slightly more conceptual complexity versus much more useful output. A probability is often more actionable than a bare label because it preserves uncertainty.

Concept 3: The Threshold Is Not Part of Nature, It Is Part of the Operating Policy

Once the model outputs a probability, the product still needs a rule for turning that number into action.

For dropout risk:

probability >= threshold -> predict positive class
probability < threshold  -> predict negative class
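That rule is one line of code; the threshold value is policy, not math. The probabilities below are illustrative:

```python
def classify(probability, threshold=0.5):
    # Operating policy: meet or exceed the threshold -> predict positive class.
    return 1 if probability >= threshold else 0

# The same model output leads to different actions under different policies.
flagged_default = classify(0.62, threshold=0.5)  # flagged as high risk
flagged_strict = classify(0.62, threshold=0.7)   # not flagged under stricter policy
```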

This is the key decision: where to place the threshold, because that choice determines which kind of mistake the system makes more often.

That is why 0.5 is not automatically correct. It is only one possible operating point.

For example: a school that wants to catch nearly every at-risk student might lower the threshold to 0.3 and accept extra false alarms, while a system where each intervention is expensive might raise it to 0.7 and act only on strong evidence.

The trade-off is recall versus precision, or more generally, catching more positives versus avoiding more false alarms. The model gives the risk estimate. The threshold expresses the business or operational policy.

Troubleshooting

Issue: Thinking logistic regression should predict a continuous numeric target because the name says "regression."

Why it happens / is confusing: The terminology points backward to the weighted linear score, not to the final task type.

Clarification / Fix: Focus on the output. Logistic regression is used for classification because the score is converted into a class probability.

Issue: Treating a predicted probability as certainty.

Why it happens / is confusing: Values like 0.92 sound definitive.

Clarification / Fix: Read the output as model confidence under the learned pattern, not as proof. Good evaluation still matters.

Issue: Assuming 0.5 is always the right threshold.

Why it happens / is confusing: It is the default threshold in many examples, so it looks like a built-in truth.

Clarification / Fix: Choose the threshold based on the cost of false positives, false negatives, and class imbalance in the real task.


Advanced Connections

Connection 1: Logistic Regression ↔ Neural Classifiers

The parallel: Logistic regression is closely related to the final sigmoid output layer used in many neural binary classifiers.

Real-world case: Once you understand logistic regression, later binary classifiers feel more like richer feature extractors feeding a familiar probability mechanism.

Connection 2: Logistic Regression ↔ Risk Ranking

The parallel: Many systems care less about one fixed label and more about ordering cases by estimated risk.

Real-world case: Fraud, churn, moderation, and triage workflows often use the probability score directly to prioritize review or intervention.
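A minimal sketch of that pattern, using hypothetical case names and risk probabilities: instead of thresholding, the raw probabilities order a review queue.

```python
# Hypothetical fraud-risk probabilities from a logistic regression model.
case_risk = {"case_a": 0.91, "case_b": 0.15, "case_c": 0.67}

# Review the highest-risk cases first, without committing to a single threshold.
review_queue = sorted(case_risk, key=case_risk.get, reverse=True)
```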




Key Insights

  1. Logistic regression predicts probability before it predicts a class - The class label is a thresholded action on top of the probability.
  2. Training cares about confidence, not just correctness - Log loss rewards sensible probabilities and penalizes confident mistakes heavily.
  3. Thresholds are policy decisions - The same model can support different operating points depending on the real cost of errors.

Knowledge Check (Test Questions)

  1. What is the main job of the sigmoid function in logistic regression?

    • A) Convert a linear score into a value between 0 and 1 that can be interpreted as class probability.
    • B) Remove the need for training data.
    • C) Guarantee perfect calibration automatically.
  2. Why is logistic regression usually trained with log loss?

    • A) Because it rewards good probability estimates and penalizes confident wrong predictions strongly.
    • B) Because it removes the need for thresholds.
    • C) Because it only works when classes are perfectly balanced.
  3. When might a threshold above 0.5 be a better choice?

    • A) When false positives are especially costly and the system should be more conservative before predicting the positive class.
    • B) When you want to maximize recall at any cost.
    • C) When the model should stop outputting probabilities.

Answers

1. A: The sigmoid turns an unrestricted linear score into a bounded probability-like output for the positive class.

2. A: Log loss pushes the model toward probabilities that reflect real confidence, not just the correct side of the boundary.

3. A: A higher threshold demands stronger evidence before taking the positive action, which can reduce costly false alarms.


