Day 020: Stream Processing and Continuous Data Flows
Stream processing stops looking mysterious once you realize the system is not waiting for a dataset to finish; it is continuously computing over events that keep arriving.
Today's "Aha!" Moment
Stay with the order and payment platform from the previous lessons. Orders are placed all day, payments are authorized continuously, refunds happen, carts are abandoned, and fraud signals appear in bursts. If you wait for a nightly batch job, you can eventually learn what happened. But you cannot react while the pattern is forming.
That is the core shift in stream processing. The input is not a closed table you can scan once and be done with. The input is an event flow that keeps moving whether your computation is ready or not. The system has to update results continuously: counts, alerts, rolling metrics, recent-window aggregates, session state, and derived views that are always in the process of becoming more current.
This is why stream processing is not just "queues but faster" and not just "batch but smaller." A queue often models work to be consumed. A stream models a durable flow of events that many consumers may interpret in different ways over time. Once you think in flow, time becomes part of correctness. Questions like "last five minutes," "current session," or "events that happened before payment confirmation" are not reporting details. They define the computation.
Signals that stream processing is the real topic:
- data keeps arriving without a natural end
- freshness changes the usefulness of the result
- several consumers want to derive different views from the same event flow
- late or out-of-order events are normal, not bugs
The common mistake is to picture streaming as a batch job that runs more often. That misses the deeper change: the system is doing continuous, stateful computation on unbounded input.
Why This Matters
Modern systems often care about reaction time as much as historical analysis. Fraud detection, operational alerting, live dashboards, feature personalization, logistics monitoring, and session analytics all lose value when the result arrives too late. It is not enough to store events durably. The system also needs to compute on them while the stream is still moving.
This matters because stream processing changes what engineers have to design for. Time semantics become explicit. State lives inside the computation. Backpressure and replay become correctness concerns, not only performance concerns. A result can be "wrong for now" because a late event has not arrived yet, or because the chosen time model does not match the question the business actually cares about.
That is why stream processing belongs naturally after event sourcing, CQRS, and sagas. Those patterns preserve facts and coordinate workflows. Streaming turns those same facts into ongoing computation: projections that are alive, not just periodically rebuilt.
Learning Objectives
By the end of this session, you will be able to:
- Explain what makes streaming different - Describe how continuous event flow changes the programming model compared with batch or queue-only thinking.
- Reason about windows and time semantics - Explain why event time, processing time, and late data can change the result itself.
- Recognize the operational cost of stateful streaming - Identify why state, replay, checkpointing, and backpressure are part of the design, not just implementation details.
Core Concepts Explained
Concept 1: A Stream Is an Ongoing Flow, Not a Finished Dataset
Use a simple payment event flow:
OrderPlaced
PaymentAuthorized
PaymentFailed
RefundIssued
OrderCancelled
Those events do not arrive as one neat file with a final row that signals "computation complete." They keep coming. A stream processor therefore computes incrementally.
Instead of saying:
run one query over the whole dataset
produce one final answer
you are saying:
for each new event:
update the current answer
That changes the mental model completely. A batch job is about finite input and eventual completion. A stream processor is about maintaining an evolving result over unbounded input.
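A minimal sketch makes the per-event model concrete. The event shapes here are illustrative, not from any specific framework: each event is just a dict with a type, and the "current answer" is a running count per event type that updates as each event arrives.

```python
from collections import Counter

# Hypothetical event stream: in a real system these would arrive
# continuously from a log or broker, not from a finite list.
events = [
    {"type": "OrderPlaced"},
    {"type": "PaymentAuthorized"},
    {"type": "PaymentFailed"},
    {"type": "PaymentFailed"},
]

counts = Counter()  # the evolving result: never "final", always current

def on_event(event):
    """Update the current answer for each new event."""
    counts[event["type"]] += 1
    return dict(counts)  # snapshot of the result so far

for e in events:
    snapshot = on_event(e)

print(snapshot)  # {'OrderPlaced': 1, 'PaymentAuthorized': 1, 'PaymentFailed': 2}
```

Notice there is no "end" to the loop in principle; the list simply stands in for an unbounded source, and the result is readable at any point, not only at completion.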
This is also why streams are not identical to queues. In a queue, the emphasis is usually on distributing work items to consumers. In a stream, the same ordered event flow can be consumed by many processors, each deriving different state or different outputs from the same history.
The trade-off is that continuous computation gives freshness and responsiveness, but it also means the system never really reaches "done." It must stay correct while input keeps moving.
Concept 2: Time Semantics and Windows Define the Meaning of the Result
Most interesting streaming questions are not about all events across all time. They are about time-bounded behavior:
- failed payments in the last 10 minutes
- orders per customer in the current session
- delivery events seen before a timeout window closes
That is why windows exist. They carve an unbounded stream into meaningful slices.
But the deeper subtlety is which clock the computation follows.
event happens at 10:02
event arrives at 10:05
event is processed at 10:05
If your question is about when the business event actually happened, you care about event time. If your system groups by when the processor happened to see the record, you are using processing time. Those can produce different answers when events arrive late.
A tiny example makes this concrete:
Event A: event_time=10:01, arrives=10:01
Event B: event_time=10:02, arrives=10:05
Question: "How many failures happened between 10:00 and 10:03?"
If you group by processing time, Event B falls outside the intended window entirely. If you group by event time and allow late arrivals, the result can be corrected later.
That is why time semantics are part of correctness in stream processing. The business question and the time model have to match.
The trade-off is that better time fidelity produces more meaningful results, but it also forces the system to handle late data, window closure policy, and result updates more carefully.
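The tiny example above can be run directly. This sketch uses hypothetical (event_time, arrival_time) pairs mirroring Events A and B, and shows how the same window question gives different answers depending on which clock the grouping follows.

```python
from datetime import datetime

# Hypothetical failure events as (event_time, arrival_time) pairs,
# mirroring the example where Event B arrives three minutes late.
events = [
    (datetime(2024, 1, 1, 10, 1), datetime(2024, 1, 1, 10, 1)),  # Event A
    (datetime(2024, 1, 1, 10, 2), datetime(2024, 1, 1, 10, 5)),  # Event B, late
]

window_start = datetime(2024, 1, 1, 10, 0)
window_end = datetime(2024, 1, 1, 10, 3)

def count_in_window(events, clock):
    """Count events whose chosen timestamp falls in [window_start, window_end)."""
    return sum(1 for ev in events if window_start <= clock(ev) < window_end)

# Processing time: group by when the processor happened to see the record.
by_processing_time = count_in_window(events, clock=lambda ev: ev[1])
# Event time: group by when the business event actually happened.
by_event_time = count_in_window(events, clock=lambda ev: ev[0])

print(by_processing_time)  # 1 -> Event B lands outside the window
print(by_event_time)       # 2 -> the late event is still counted
```

The only difference between the two answers is which timestamp the `clock` function selects, which is exactly why the time model has to match the business question.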
Concept 3: Stateful Streaming Is Where the Real Power and the Real Pain Live
Useful stream processors are usually stateful.
A fraud detector may keep recent failure counts per card. A session tracker may keep the last activity timestamp per user. A rolling KPI pipeline may maintain counters per region and per minute. None of that works as pure stateless pass-through.
An ASCII sketch helps:
event stream
-> partition by card_id
-> keep recent failures in state
-> emit alert when threshold is crossed
That state is what makes the computation valuable, but it also creates the operational challenges:
- how the state survives failure
- how replay rebuilds the same result
- how partitioning keeps related events together
- how backpressure stops overload from turning into silent lag
In other words, the stream processor is not only reading events. It is maintaining live memory about the flow, and that memory must be recoverable and consistent enough to trust.
This is where many explanations become too casual. The per-event function may look small, but the real system must checkpoint state, recover from crashes, replay history, and avoid collapsing under bursts. Stateful streaming is powerful precisely because it is a controlled form of continuous memory.
The trade-off is that stateful streaming enables rich, low-latency computation, but it demands careful design around recovery, replay, partitioning, and pressure management.
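The fraud-detection sketch above can be written as a few lines of stateful per-event logic. This is an illustrative in-memory version with made-up threshold and window values; a real system would partition this state by card_id across workers and checkpoint it so it survives restarts.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Hypothetical detector: keep recent PaymentFailed timestamps per card
# and emit an alert when a threshold is crossed inside a rolling window.
WINDOW = timedelta(minutes=10)
THRESHOLD = 3

recent_failures = defaultdict(deque)  # card_id -> timestamps of recent failures
alerts = []

def on_failure(card_id, ts):
    """Per-event update: append, evict old entries, check the threshold."""
    window = recent_failures[card_id]
    window.append(ts)
    # Evict failures that have fallen out of the rolling window.
    while window and ts - window[0] > WINDOW:
        window.popleft()
    if len(window) >= THRESHOLD:
        alerts.append((card_id, ts))

t0 = datetime(2024, 1, 1, 10, 0)
for minutes in [0, 2, 4]:          # three failures for one card within 4 minutes
    on_failure("card-42", t0 + timedelta(minutes=minutes))
on_failure("card-77", t0)          # a single failure elsewhere: no alert

print(alerts)  # one alert for card-42, none for card-77
```

The per-event function really is small. Everything the section warns about lives outside it: what happens to `recent_failures` when the process crashes, how replay rebuilds it, and how bursts are absorbed without silently falling behind.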
Troubleshooting
Issue: "Streaming is just batch processing with shorter intervals."
Why it happens / is confusing: Both styles transform data, so the difference can look like latency only.
Clarification / Fix: Streaming changes the computation model itself. The input is unbounded, results evolve continuously, and time semantics are part of the meaning of the output.
Issue: "If events arrive late, the system should just ignore them."
Why it happens / is confusing: Processing-time thinking feels operationally simpler.
Clarification / Fix: If the question is defined over event time, ignoring late data may make the answer wrong. Decide explicitly what lateness your product can tolerate and how updates should be handled.
Issue: "The logic is simple, so the streaming system should be simple too."
Why it happens / is confusing: Per-event code often looks tiny in examples.
Clarification / Fix: The hard part is not the map or the counter increment. The hard part is state recovery, replay, partitioning, and backpressure under failure and bursty load.
Advanced Connections
Connection 1: Event Logs <-> Continuous Projections
The parallel: Event sourcing preserves a durable history, and stream processing turns that history into continuously updated derived views.
Real-world case: The same order and payment events can drive customer timelines, fraud alerts, and live business metrics without waiting for a nightly rebuild.
Connection 2: Streaming <-> CQRS Read Models
The parallel: A CQRS read model can be maintained not only by periodic rebuilds but by an ongoing stream processor that updates the view as events arrive.
Real-world case: Customer order history, support dashboards, and operational counters are often projections maintained continuously from the same event flow.
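A continuously maintained read model can be sketched as a per-event projection function. The event shapes and statuses here are illustrative, not from any particular CQRS framework: each arriving event updates a customer order history view in place, instead of the view being rebuilt on a schedule.

```python
# Hypothetical read model: customer_id -> list of [order_id, status]
order_history = {}

def apply(event):
    """Project one event into the read model as it arrives."""
    history = order_history.setdefault(event["customer_id"], [])
    if event["type"] == "OrderPlaced":
        history.append([event["order_id"], "placed"])
    elif event["type"] == "PaymentAuthorized":
        for entry in history:
            if entry[0] == event["order_id"]:
                entry[1] = "paid"

for event in [
    {"type": "OrderPlaced", "customer_id": "c1", "order_id": "o1"},
    {"type": "OrderPlaced", "customer_id": "c1", "order_id": "o2"},
    {"type": "PaymentAuthorized", "customer_id": "c1", "order_id": "o1"},
]:
    apply(event)

print(order_history)  # {'c1': [['o1', 'paid'], ['o2', 'placed']]}
```

The view is queryable at any moment and is only ever as stale as the last event applied, which is the sense in which such projections are "alive" rather than periodically rebuilt.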
Resources
Optional Deepening Resources
- [ARTICLE] The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction
- Link: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
- Focus: Read it to connect logs, streams, and derived data as one broader systems abstraction.
- [BOOK] Designing Data-Intensive Applications
- Link: https://dataintensive.net/
- Focus: Revisit the sections on logs, stream processing, and materialized views as the conceptual backbone of modern streaming systems.
- [DOC] Apache Kafka Streams Documentation
- Link: https://kafka.apache.org/documentation/streams/
- Focus: Use it to see stateful operations, windowing, and stream-table style processing in a concrete framework.
Key Insights
- Streaming is continuous computation over unbounded input - The system keeps updating results as events arrive rather than waiting for a finite dataset to finish.
- Time semantics define correctness - Event time, processing time, windows, and late data are part of the result's meaning, not only implementation detail.
- State is both the value and the difficulty - The most useful streaming systems keep live state, which means replay, recovery, partitioning, and backpressure must be designed carefully.
Knowledge Check (Test Questions)
1. What most clearly distinguishes stream processing from batch processing?
- A) Stream processing only works for tiny inputs.
- B) Stream processing treats input as ongoing event flow and updates results continuously.
- C) Batch systems cannot use logs at all.
2. Why do windows matter in a streaming system?
- A) Because they let the system define meaningful computations over unbounded flow, such as "last 10 minutes" or "current session."
- B) Because they remove the need to think about time semantics.
- C) Because they guarantee perfect event ordering.
3. Why do stateful stream processors need extra operational care?
- A) Because they never need recovery or replay.
- B) Because useful streaming state must survive failure, partitioning, replay, and backpressure safely.
- C) Because stateful logic is always simpler than stateless forwarding.
Answers
1. B: Streaming changes the model from finite computation to continuously updated results over unbounded input.
2. A: Windows make an infinite stream computationally meaningful by defining the time-bounded slice the query cares about.
3. B: Stateful processors are powerful because they remember context, but that memory must be recoverable and manageable under real operational pressure.