Day 008: Time and Causality in Distributed Systems

Time doesn't exist globally in distributed systems—only causality matters


Topic: Logical clocks and causality

Today's "Aha!" Moment

The insight: Time doesn't exist globally in distributed systems. Physical clocks lie. The only truth is causality—what happened before what—captured by logical clocks.

Why this matters: This is when distributed systems go from engineering to philosophy. Einstein's relativity says "simultaneity is relative." Lamport's logical clocks say the same thing for computers. You can't say two events on different machines happened "at the same time"—only that one caused the other, or they're concurrent. This isn't academic; it's why distributed databases choose eventual consistency.

The pattern: Causality matters, wall-clock time doesn't (in distributed systems)

How to recognize it:

  • You're tempted to order cross-machine events by comparing their wall-clock timestamps
  • A bug appears or disappears depending on which machine's clock you trust
  • Two replicas each claim their write came "first"

Common misconceptions before the Aha!:

  • "NTP keeps clocks close enough to order events" (it doesn't; sync error routinely exceeds the gap between messages)
  • "Timestamps define a global order of events" (they order clock readings, not causes and effects)

Real-world examples:

  1. Git commits: Hash-based causality, not timestamps (you can commit "in the past")
  2. Cassandra/DynamoDB: Vector clocks or LWW (last-write-wins) with causal ordering
  3. Spanner (Google): TrueTime API with atomic clocks because physical time is hard
  4. Blockchain: Block height = logical clock (block N causally depends on block N-1)
  5. Collaborative editing (Google Docs): CRDTs use logical timestamps to merge edits

What changes after this realization:

  • You stop trusting wall-clock timestamps for ordering and start tracking causality explicitly
  • "Simultaneous" becomes "concurrent": events with no causal path between them have no meaningful order
  • Conflict resolution becomes a design decision (merge, last-write-wins, CRDTs), not an afterthought

Meta-insight: Physics and computer science converge at distributed systems. Einstein: "Simultaneity is relative to the observer." Lamport: "Concurrent events have no global order." Both say the same thing: the universe is a distributed system without a global clock. Our intuition about time (Newton's absolute time) is wrong at cosmic scales AND network scales. Nature and networks both force us to think in terms of causality, not timestamps.

Why This Matters

Today you're tackling one of the most fundamental problems in distributed systems: how to establish the order of events when there is no global clock. These ideas power Apache Cassandra (Netflix, Apple, Instagram), version control systems (Git's conflict resolution), and many other critical systems.

Daily Objective

Leslie Lamport's papers you'll read today are among the most influential in computer science history. His work on logical clocks and Paxos won him the Turing Award (the Nobel Prize of computing). You're about to learn concepts that changed how we build distributed systems forever!

Bridge local synchronization concepts with distributed consensus algorithms and explore real-world distributed coordination challenges that power systems you use every day.

Learning Objectives

By the end of this session, you will be able to:

  1. Explain the happens-before relationship - Understand causality in distributed systems and why physical clocks are unreliable
  2. Implement vector clocks - Build a working vector clock system that tracks causality across distributed processes
  3. Apply distributed mutual exclusion - Implement the Ricart-Agrawala algorithm for coordinating access to shared resources
  4. Understand Paxos consensus - Grasp the fundamental concepts behind distributed consensus algorithms
  5. Trace causal relationships - Analyze message flows and identify concurrent vs causally-ordered events
  6. Connect to real systems - Relate logical clocks to production systems like Cassandra, Git, and DynamoDB

Core Concepts Explained

1. The Problem: Time in Distributed Systems


In a single-machine system, you can trust the system clock to order events. Event A happened before Event B if timestamp_A < timestamp_B. Simple.

In a distributed system, this breaks down: each machine has its own clock, and no two clocks agree exactly.

Why physical clocks fail:

  • Drift: quartz oscillators run at slightly different rates (commonly tens of parts per million, i.e., milliseconds per day)
  • Skew: at any instant, two machines read different times
  • Adjustments: NTP corrections can step a clock backward, so even a single machine's timestamps aren't monotonic

Physical clocks can disagree even with NTP: synchronization error over a WAN is typically on the order of milliseconds, which is often larger than the gap between related messages, so timestamp order can contradict causal order.

Real-world consequences:

  • Last-write-wins stores silently drop whichever of two concurrent writes has the "earlier" clock
  • Logs from different machines interleave in the wrong order, making debugging misleading
  • Cache expiry, lease, and timeout logic misfires when clocks step

2. Lamport Timestamps: Logical Time

The insight: If we can't trust wall-clock time, create a logical notion of time based on causality.

Lamport's happens-before relation (→) is the smallest relation satisfying:

  1. If a and b are events in the same process and a comes first, then a → b
  2. If a is the sending of a message and b is its receipt, then a → b
  3. If a → b and b → c, then a → c (transitivity)

If neither a → b nor b → a, the events are concurrent (written a || b).

Lamport clock algorithm:

Each process maintains a counter (starts at 0):

  1. Before each event, increment counter
  2. When sending message, include current counter value
  3. When receiving message, set counter = max(local_counter, message_counter) + 1

Example:

Process P1:  [0] increment [1] send(1)
Process P2:  [0] receive(1) max(0,1)+1=[2] increment [3]
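
A minimal Lamport clock in Python (a sketch; the class and method names are illustrative):

class LamportClock:
    def __init__(self):
        self.counter = 0

    def tick(self):
        # Rule 1: increment before each local event
        self.counter += 1
        return self.counter

    def send(self):
        # Rule 2: stamp outgoing messages with the incremented counter
        return self.tick()

    def receive(self, msg_counter):
        # Rule 3: jump past the sender's clock, counting the receive
        self.counter = max(self.counter, msg_counter) + 1
        return self.counter

p1, p2 = LamportClock(), LamportClock()
t = p1.send()          # P1: counter = 1
print(p2.receive(t))   # P2: max(0, 1) + 1 = 2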

Properties:

  • If a → b, then C(a) < C(b)
  • The converse does not hold: C(a) < C(b) does not imply a → b (the events may be concurrent)
  • Breaking ties with process IDs yields a total order consistent with causality

3. Vector Clocks: Full Causality Tracking

Limitation of Lamport clocks: Can't detect concurrency

Vector clocks: Each process maintains a vector of counters (one per process)

Algorithm:

  1. Local event: increment your own entry, V[i] += 1
  2. Send: increment your own entry, then attach a copy of the whole vector
  3. Receive: merge elementwise, V[k] = max(V[k], msg[k]) for all k, then increment your own entry

Comparing vector clocks:

  • V1 ≤ V2 if V1[k] ≤ V2[k] for every k
  • A happened before B if V_A ≤ V_B and V_A ≠ V_B
  • A and B are concurrent if neither V_A ≤ V_B nor V_B ≤ V_A

Example:

P1: [1,0,0] send [2,0,0]
P2: [0,1,0] receive [2,2,0] send [2,3,0]
P3: [0,0,1] receive [2,3,2]

Now we can detect: P1's first event || P2's first event (concurrent), but P3's event happened-after both.
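
The same comparison as code (a small helper; it assumes both vectors have the same length):

def compare(v1, v2):
    # Classify the causal relationship between two vector clocks
    less = any(a < b for a, b in zip(v1, v2))
    greater = any(a > b for a, b in zip(v1, v2))
    if less and not greater:
        return "before"        # v1's event happened before v2's
    if greater and not less:
        return "after"         # v2's event happened before v1's
    return "concurrent" if less else "equal"

print(compare([1, 0, 0], [0, 1, 0]))   # concurrent
print(compare([1, 0, 0], [2, 3, 2]))   # before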

Real-world usage:

  • Amazon's original Dynamo paper used vector clocks to track divergent replica versions
  • Riak exposes version vectors (dotted version vectors) for sibling resolution
  • CRDTs and causal-consistency systems build on the same version-vector idea

4. Distributed Mutual Exclusion

Problem: How do processes coordinate access to a shared resource without a central coordinator?

Ricart-Agrawala algorithm:

Uses Lamport timestamps + message passing:

  1. Process wants critical section → broadcasts REQUEST(timestamp, process_id) to all
  2. Other processes:
    • If not in CS and not wanting it → reply OK immediately
    • If in CS → defer reply (queue the request)
    • If wanting CS → compare (timestamp, process_id) pairs: reply OK if the incoming request is smaller (higher priority), otherwise defer (see the sketch after this list)
  3. Enter CS when received OK from all processes
  4. On exit → send deferred OK replies
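
The deferral decision in step 2 reduces to a tuple comparison, since (timestamp, process_id) pairs totally order all requests. A minimal sketch (the function name is illustrative):

def should_defer(my_state, my_request, incoming_request):
    # my_request and incoming_request are (lamport_timestamp, process_id)
    # tuples; the smaller tuple has priority
    if my_state == "HELD":
        return True                           # answer after leaving the CS
    if my_state == "WANTED":
        return my_request < incoming_request  # our own request wins
    return False                              # RELEASED: reply OK immediately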

Properties:

  • Guarantees mutual exclusion: at most one process holds the critical section
  • Fair and starvation-free: requests are granted in (timestamp, process_id) order
  • No central coordinator, but every process must reply, so a single crash blocks everyone

Cost: 2(N-1) messages per CS entry (N-1 requests + N-1 replies)

5. Consensus: Paxos Introduction

The problem: Multiple processes need to agree on a single value, despite failures

Why it's hard:

  • Processes crash, and messages can be lost, delayed, or reordered
  • You can't distinguish a crashed process from a slow one
  • The FLP result proves no deterministic algorithm guarantees consensus in a fully asynchronous system with even one faulty process; practical algorithms work around this with timeouts and retries

Paxos intuition (simplified):

Two phases: Prepare and Accept

Phase 1 - Prepare:

  • A proposer picks a unique proposal number n and sends PREPARE(n) to the acceptors
  • An acceptor that hasn't promised a number ≥ n replies with a promise to ignore proposals below n, reporting the highest-numbered proposal it has already accepted (if any)

Phase 2 - Accept:

  • On promises from a majority, the proposer sends ACCEPT(n, v), where v is the value of the highest-numbered accepted proposal it heard about, or its own value if none
  • An acceptor accepts (n, v) unless it has since promised a higher number; the value is chosen once a majority accepts

Key insight: Unique, totally ordered proposal numbers prevent conflicts. Acceptors always defer to the highest number they have promised, and later proposers must adopt any value that may already have been chosen.

Real implementations:

  • Google Chubby: lock service built on Paxos
  • Apache ZooKeeper: uses Zab, a Paxos-inspired atomic broadcast protocol
  • etcd and Consul: use Raft, designed as a more understandable Paxos alternative
  • Google Spanner: Paxos groups for replication (combined with TrueTime)

Resources

Core Resources (Use Today)

Bonus Resources (If Extra Time)

Guided Practice

Activity 1: Vector Clock Implementation (35 min)

Understanding causality in distributed systems

Step 1: Basic vector clock (15 min)

class VectorClock:
    def __init__(self, process_id, num_processes):
        self.process_id = process_id
        self.clock = [0] * num_processes

    def tick(self):
        # Local event: increment own counter
        self.clock[self.process_id] += 1

    def send_message(self):
        # Prepare timestamp for outgoing message
        self.tick()
        return self.clock.copy()

    def receive_message(self, msg_timestamp):
        # Merge: take the elementwise max of both clocks, then count
        # the receive itself as a local event on this process
        for i in range(len(self.clock)):
            self.clock[i] = max(self.clock[i], msg_timestamp[i])
        self.clock[self.process_id] += 1

Step 2: Causality detection (10 min)

Implement happens-before relation and test with concurrent vs causal events:

def happens_before(self, other_timestamp):
    # True if this clock is <= the other in every component and strictly
    # smaller in at least one: this event causally precedes the other
    return (all(a <= b for a, b in zip(self.clock, other_timestamp))
            and any(a < b for a, b in zip(self.clock, other_timestamp)))

def concurrent_with(self, other_timestamp):
    # Concurrent: neither clock dominates the other, so there is
    # no causal path between the two events in either direction
    less = any(a < b for a, b in zip(self.clock, other_timestamp))
    greater = any(a > b for a, b in zip(self.clock, other_timestamp))
    return less and greater

Step 3: Multi-process simulation (10 min)
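
One way to run the simulation, assuming the VectorClock class from Step 1 (message delivery is simulated by direct method calls); it replays the three-process example from the concepts section:

p1, p2, p3 = VectorClock(0, 3), VectorClock(1, 3), VectorClock(2, 3)

p1.tick(); e1 = p1.clock.copy()   # P1's first event: [1, 0, 0]
p2.tick(); e2 = p2.clock.copy()   # P2's first event: [0, 1, 0]
p3.tick()                         # P3's first event: [0, 0, 1]

p2.receive_message(p1.send_message())   # P2 merges P1's [2,0,0] -> [2, 2, 0]
p3.receive_message(p2.send_message())   # P3 merges P2's [2,3,0] -> [2, 3, 2]

print(p3.clock)   # [2, 3, 2]: happened-after both e1 and e2
print(any(a < b for a, b in zip(e1, e2)) and
      any(a > b for a, b in zip(e1, e2)))   # True: e1 and e2 are concurrent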

Activity 2: Distributed Mutual Exclusion (30 min)

Ricart-Agrawala algorithm implementation

Step 1: Algorithm basics (10 min)

Step 2: Implementation (15 min)

class DistributedMutex:
    def __init__(self, process_id, all_processes):
        self.process_id = process_id
        self.all_processes = all_processes
        self.logical_clock = 0
        self.state = "RELEASED"  # RELEASED, WANTED, HELD
        self.request_timestamp = None
        self.request_queue = []  # deferred requests, answered on release
        self.replies_received = 0

    def request_critical_section(self):
        self.state = "WANTED"
        self.logical_clock += 1
        self.request_timestamp = self.logical_clock
        self.replies_received = 0
        # Broadcast REQUEST(timestamp, process_id) to all other processes;
        # send_request is an assumed transport hook, not implemented here
        for process in self.all_processes:
            if process != self.process_id:
                self.send_request(process, self.request_timestamp)

    def handle_request(self, sender_id, timestamp):
        # Lamport rule: advance our clock past the incoming timestamp
        self.logical_clock = max(self.logical_clock, timestamp) + 1
        # Defer if we hold the CS, or if we want it and our own request
        # wins the (timestamp, process_id) tie-break; otherwise reply OK
        mine_wins = (self.state == "WANTED" and
                     (self.request_timestamp, self.process_id) <
                     (timestamp, sender_id))
        if self.state == "HELD" or mine_wins:
            self.request_queue.append(sender_id)
        else:
            self.send_ok(sender_id)  # send_ok is an assumed transport hook

Step 3: Priority resolution (5 min)

Verify the tie-break: when two requests carry the same Lamport timestamp, the lower process ID wins, so every process orders the pair (timestamp, process_id) identically and exactly one request is granted first.

Activity 3: Simplified Paxos Simulation (25 min)

Basic consensus scenario

Step 1: Setup (5 min)

Step 2: Phase 1 - Prepare (10 min)

Step 3: Phase 2 - Accept (10 min)
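
A minimal single-decree Paxos you can use for this activity (a sketch assuming synchronous in-process calls and no crashes or message loss; class and function names are illustrative):

class Acceptor:
    def __init__(self):
        self.promised_n = -1       # highest PREPARE number promised
        self.accepted_n = -1       # highest proposal number accepted
        self.accepted_value = None

    def prepare(self, n):
        # Phase 1b: promise to ignore proposals numbered below n
        if n > self.promised_n:
            self.promised_n = n
            return True, self.accepted_n, self.accepted_value
        return False, None, None

    def accept(self, n, value):
        # Phase 2b: accept unless a higher-numbered PREPARE was promised
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False

def propose(acceptors, n, value):
    # Phase 1a: broadcast PREPARE(n) and collect promises
    promises = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in promises if ok]
    if len(granted) <= len(acceptors) // 2:
        return None                # no majority; retry with a higher n
    # Adopt the value of the highest-numbered accepted proposal, if any
    accepted = [(an, av) for an, av in granted if an >= 0]
    if accepted:
        value = max(accepted)[1]
    # Phase 2a: broadcast ACCEPT(n, value); chosen once a majority accepts
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "A"))  # 'A' is chosen
print(propose(acceptors, 2, "B"))  # still 'A': a later proposal must adopt it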

Creativity - Ink Drawing

Time: 25 minutes. Focus: abstract concepts and flow visualization

Today's Challenge: Network Diagrams

  1. Distributed system visualization (15 min)

    • 5 nodes in a network topology
    • Message flows between nodes
    • Different line weights for different message types
    • Clock symbols showing time progression
  2. Abstract representation (10 min)

    • Causality chains as flowing curves
    • Consensus as converging lines
    • Time as layered/stepped progression

Advanced Techniques

Key Insights

  1. Time is relative in distributed systems - There's no global "now"; only causality (happens-before) matters
  2. Lamport timestamps capture partial order - If A→B then TS(A)<TS(B), but not vice versa
  3. Vector clocks detect concurrency - Can determine if events are causally related or concurrent
  4. Logical clocks prevent bugs - Don't use system timestamps for ordering in distributed systems
  5. Consensus is hard but solvable - Paxos/Raft enable agreement despite failures
  6. Real systems use these primitives - Cassandra, Git, Spanner all build on these concepts

Reflection Questions

  1. Why can't we rely on synchronized physical clocks (even with NTP) for ordering events in distributed systems?
  2. What's the difference between "A happened before B" and "A's timestamp is less than B's timestamp" in distributed systems?
  3. How do vector clocks improve upon Lamport timestamps? When would you choose one over the other?
  4. In what situations would concurrent events (detected by vector clocks) cause problems in a distributed database?
  5. Why does the Ricart-Agrawala algorithm use Lamport timestamps instead of just process IDs for priority?
  6. How does Paxos handle the situation where two proposers send conflicting proposals simultaneously?

Daily Deliverables

Advanced Synthesis

Today's core question: "How do distributed systems achieve coordination without shared memory or global clocks?"

Key insights to capture:

Conceptual Bridges

Connect today's concepts:

  1. Vector clocks → Lamport timestamps → Database transaction ordering
  2. Distributed mutex → Leader election → Database primary selection
  3. Paxos consensus → Raft consensus → Database replication protocols

Total Estimated Time (OPTIMIZED)

Note: Focus on understanding causality and logical time. Simplified examples are perfect.

Common Challenges

Preparation for Tomorrow

Tomorrow's bridge topics:

Success Criteria

Minimum (Basic Understanding):

Target (Solid Grasp):

Excellent (Deep Mastery):

Bonus Research

Optional evening reading:


