Data Models: Relational, Document, and Graph

Data Architecture and Platforms · Day 457 · Lesson 009 · 30 min · Advanced

The core idea: A data model is an operational contract. It determines which facts can be updated together, where invariants can be enforced cheaply, and what kinds of questions stay tractable once the system is under real load.

Today's "Aha!" Moment

In Reliability, Scalability, and Maintainability Trade-offs, PayLedger learned that architecture is really a choice about where to spend coordination, latency, and operational complexity. This lesson makes that choice more concrete. The same payroll approval can be represented as a row in a relational table, an embedded object inside a payroll-run document, or a set of nodes and edges in a graph. Those are not cosmetic differences. Each representation changes which guarantees are natural, which queries are cheap, and which failures are likely.

That is the main insight: data models should be chosen by update semantics and access paths, not by the shape of the JSON returned to the frontend. If a fact must satisfy uniqueness, foreign-key integrity, and transactional coupling with settlement, the relational model starts with an advantage because those guarantees are built into its core machinery. If a record is naturally read and written as one aggregate with lots of optional nested structure, a document model may fit better because locality is the default. If the critical product question is "how are these entities connected across several hops?", a graph model may turn a painful series of joins into the native operation.

PayLedger makes this visible. The company stores payroll runs, approvals, settlement batches, country-specific tax settings, delegated approver chains, and legal-entity ownership relationships. Trying to force all of that into one mental model creates friction somewhere. The canonical approval path wants strong invariants. Employer configuration snapshots want nested locality and flexible evolution. Compliance investigations want fast traversal across relationships that are awkward to flatten.

The misconception worth dropping is that relational, document, and graph systems are arranged on a maturity ladder. They are not "old," "modern," and "specialized" versions of the same thing. They are different answers to one production question: what structure lets us preserve truth while answering the questions the product actually asks? The next lesson, Declarative Queries and Execution Thinking, will build directly on this by showing how those model choices shape the query planner and execution engine.

Why This Matters

PayLedger runs payroll approval and settlement workflows for multinational employers. A payroll manager submits a run, one or more approvers sign off, downstream services calculate taxes and benefits, and a settlement pipeline schedules money movement. Meanwhile, support engineers need to explain why a run is blocked, compliance analysts need to trace delegated authority across subsidiaries, and the product team wants flexible employer-specific configuration for imports, cutoffs, and local rules.

If the team stores everything in a relational core without thinking about aggregate boundaries, some read paths may become join-heavy and schema changes for configurable employer settings may feel slower than they need to be. If the team stores everything as documents because the API returns nested payloads, it may discover too late that duplicate approvals, dangling references, and cross-record consistency checks are now application bugs instead of database-enforced constraints. If the team reaches for a graph model everywhere because relationships exist in the business domain, it may end up making routine ledger-style writes and reporting queries harder than necessary.

Production systems fail at the seams between write semantics and query semantics. A model that looks elegant during local development can become expensive once quarter-end approvals arrive concurrently, an auditor asks for a historical explanation, or a recovery drill has to prove which data is authoritative. The point of this lesson is not to crown a winner. It is to make the fit criteria explicit enough that PayLedger can place the right facts in the right structure and explain why.

Learning Objectives

By the end of this session, you will be able to:

  1. Explain how each model represents truth - Describe how relational, document, and graph systems organize records, relationships, and update boundaries.
  2. Compare model fit using mechanism instead of fashion - Evaluate each model by invariants, query shape, schema evolution, and operational cost.
  3. Choose a concrete model for a production scenario - Defend where PayLedger should use relational structure, where document structure helps, and where graph traversal earns its complexity.

Core Concepts Explained

Concept 1: Relational models make invariants first-class

For the canonical payroll approval path, PayLedger needs a source of truth that can say simple but high-stakes things with precision: one approver should not approve the same run twice, every approval must reference an existing payroll run, settlement cannot start until the approval state is valid, and audit history must survive retries and crashes. The relational model is strong here because tables, keys, constraints, and transactions are designed to express exactly those relationships.

In a relational representation, the business entities are decomposed into normalized tables such as payroll_runs, approvals, settlement_batches, and approver_assignments. Rows carry identifiers. Foreign keys bind one fact to another. Unique constraints prevent illegal duplication. Transactions group writes that must succeed or fail together. If PayLedger commits an approval row and an outbox record in the same transaction, the system has a crisp answer to "was this approval accepted?" because the database enforces the boundary.

The mechanism matters more than the syntax. When a user clicks Approve, the database can acquire row-level locks, verify that the payroll run is still in an approvable state, insert the approval, update aggregate status, and persist an event for downstream work before acknowledging success. That is why relational systems remain the default for money-moving systems and authoritative ledgers: correctness rules live near the data, not only in application code.

CREATE TABLE payroll_runs (
    run_id UUID PRIMARY KEY,
    employer_id UUID NOT NULL,
    status TEXT NOT NULL CHECK (status IN ('draft', 'pending_approval', 'approved', 'settled'))
);

CREATE TABLE approvals (
    run_id UUID NOT NULL REFERENCES payroll_runs(run_id),
    approver_id UUID NOT NULL,
    approved_at TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (run_id, approver_id)
);

The trade-off is that relational systems make you pay attention to structure early. Highly variable nested data often has to be broken into child tables or stored in carefully bounded JSON columns. Query performance depends on indexes and execution plans rather than naive object navigation. That cost is usually worth it when the business fact is authoritative and financially sensitive. For PayLedger, the approval ledger, settlement eligibility, and replayable audit trail belong here because violating those invariants is far more expensive than writing a few joins.
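To make the mechanism concrete, here is a minimal sketch in Python using an in-memory SQLite database. It is illustrative only: the table shapes mirror the schema above (simplified to TEXT ids), and the `approve` function, `outbox` table, and sample ids are invented for the example. The point is that the duplicate-approval and dangling-reference checks live in the database, and the approval plus its outbox event commit or roll back together.

```python
import sqlite3

# In-memory SQLite stand-in for the relational core. Table and column
# names mirror the schema sketch above but are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce REFERENCES clauses
conn.executescript("""
CREATE TABLE payroll_runs (
    run_id TEXT PRIMARY KEY,
    status TEXT NOT NULL CHECK (status IN
        ('draft', 'pending_approval', 'approved', 'settled'))
);
CREATE TABLE approvals (
    run_id TEXT NOT NULL REFERENCES payroll_runs(run_id),
    approver_id TEXT NOT NULL,
    PRIMARY KEY (run_id, approver_id)
);
CREATE TABLE outbox (
    event_id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL
);
""")
conn.execute("INSERT INTO payroll_runs VALUES ('run-1', 'pending_approval')")
conn.commit()

def approve(run_id: str, approver_id: str) -> bool:
    """Insert the approval and its outbox event in one transaction."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute(
                "INSERT INTO approvals (run_id, approver_id) VALUES (?, ?)",
                (run_id, approver_id),
            )
            conn.execute(
                "INSERT INTO outbox (payload) VALUES (?)",
                (f"approved:{run_id}:{approver_id}",),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate approval or unknown run: rejected by the constraints,
        # and the outbox write is rolled back with it.
        return False

print(approve("run-1", "alice"))   # True: first approval accepted
print(approve("run-1", "alice"))   # False: primary key blocks the duplicate
print(approve("run-999", "bob"))   # False: foreign key blocks a dangling reference
```

Note that the application never re-checks for duplicates itself; a crash or retry between the two inserts cannot leave a half-recorded approval, because the boundary is the transaction, not the code path.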

Concept 2: Document models optimize for aggregate locality and flexible shape

Not every important fact in PayLedger behaves like a ledger entry. Employer-specific payroll configuration is a good counterexample. One employer may use a simple salaried workflow, another may have nested overtime rules by country, multiple import mappings, optional bonus policies, local cutoff calendars, and region-specific notification settings. Those settings are often loaded together, updated together, and displayed together. Splitting them across many relational tables can be correct, but it may add friction without buying much.

The document model treats a record as an aggregate that owns its nested structure. A single employer_payroll_config document can contain country rules, import mappings, approval thresholds, and feature flags. Reads stay local because the application usually fetches the entire configuration by employer. Schema evolution is easier because optional fields and nested sections can appear without a global migration every time the product adds another integration switch or a new regional policy block.

{
  "employer_id": "acme-eu",
  "default_currency": "EUR",
  "country_rules": {
    "ES": {"cutoff_day": 25, "requires_union_report": true},
    "DE": {"cutoff_day": 26, "tax_engine": "v3"}
  },
  "approval_policy": {
    "min_approvers": 2,
    "delegation_window_days": 14
  },
  "import_mappings": [
    {"source": "workday", "employee_id_field": "workerId"},
    {"source": "csv", "employee_id_field": "emp_code"}
  ]
}

The production advantage is locality. If one request almost always needs the whole aggregate, the document model avoids reconstructing it from many joins. Developers can evolve the aggregate shape with less ceremony, and indexing can still support selective lookups on commonly filtered fields. This is especially useful for configuration snapshots, product content, or profile-like records where nesting reflects a genuine ownership boundary.

The trade-off appears when teams confuse "nested on screen" with "should live in one document." If PayLedger embeds the full approval history for a payroll run inside a single payroll-run document, concurrent approvers now contend on the same large record, uniqueness constraints become harder to enforce, document growth becomes operationally relevant, and cross-document reporting gets more expensive. Documents are best when the aggregate really is the unit of change. They are weaker when many independent actors update small parts concurrently or when the business truth spans many records with hard integrity rules.
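The aggregate-as-unit-of-change idea can be sketched in a few lines of Python. This is a hypothetical in-memory stand-in for a document collection (the `configs` dict, `load_config`, and `save_config` are invented for the example); a real document store persists JSON like the sample above, but the access pattern looks the same: one lookup returns the whole aggregate, and new optional sections appear without a global migration.

```python
import copy

# Hypothetical in-memory "collection" keyed by employer_id. A real document
# store would persist JSON like the example above; access looks similar.
configs = {
    "acme-eu": {
        "employer_id": "acme-eu",
        "default_currency": "EUR",
        "country_rules": {"ES": {"cutoff_day": 25}, "DE": {"cutoff_day": 26}},
    }
}

def load_config(employer_id: str) -> dict:
    """One lookup returns the whole aggregate: no joins to reassemble it."""
    return copy.deepcopy(configs[employer_id])

def save_config(doc: dict) -> None:
    """The aggregate is replaced as a unit, matching its update boundary."""
    configs[doc["employer_id"]] = doc

# Schema evolution without a migration: new optional sections simply
# appear on the employers that need them.
doc = load_config("acme-eu")
doc["country_rules"]["FR"] = {"cutoff_day": 24}
doc["notification_policy"] = {"channel": "email"}  # brand-new optional block
save_config(doc)

print(load_config("acme-eu")["country_rules"]["FR"]["cutoff_day"])  # 24
```

The same sketch also shows the limit: because `save_config` replaces the whole document, two concurrent writers to different nested sections would race, which is exactly why a multi-writer approval history does not belong inside one aggregate.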

Concept 3: Graph models make relationship traversal the native operation

Some of PayLedger's hardest questions are not about one record or one aggregate. They are about paths. Which managers can approve payroll for a subsidiary because authority was delegated through a parent company? Which legal entities share a tax processor, and which runs are affected if that processor is suspended? Which support tickets, payroll runs, and approvers are connected to a compliance incident that started in one region but propagated through shared ownership? These questions are relationship-first, not row-first.

Graph models represent entities as nodes and relationships as edges with their own labels and properties. Instead of precomputing every join path or flattening relationships into lookup tables, the database can traverse from one node to another across several hops. In a graph, PayLedger can model EMPLOYER, LEGAL_ENTITY, APPROVER, and PAYROLL_RUN as nodes, then connect them with edges such as OWNS, DELEGATED_TO, APPROVES, and USES_PROVIDER.

(ParentCo)-[:OWNS]->(Subsidiary)
   |
   +-[:DELEGATED_TO {expires_on: 2026-05-01}]->(RegionalFinanceLead)
                                                |
                                                +-[:APPROVES]->(PayrollRun-2026-04-EU)

This becomes powerful when the question is exploratory or multi-hop by nature. A graph query can ask for all currently valid delegation paths from a parent company to a payroll run, filter by expiration date, and return the shortest explanation chain. In a relational system, the same question may still be possible, but recursive joins or closure tables become the central complexity. The graph model reduces that impedance because pathfinding is the abstraction, not an awkward extension.
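The traversal itself can be sketched as a breadth-first search over an adjacency list. This Python sketch invents its own tiny graph (the `edges` dict and `delegation_paths` function are illustrative, with node and edge names taken from the diagram above); a graph database does the equivalent expansion natively, with indexes on labels and properties.

```python
from collections import deque
from datetime import date

# Illustrative adjacency list: each edge carries a label and optional
# properties. Node and edge names follow the diagram above.
edges = {
    "ParentCo": [
        ("OWNS", "Subsidiary", {}),
        ("DELEGATED_TO", "RegionalFinanceLead", {"expires_on": date(2026, 5, 1)}),
    ],
    "RegionalFinanceLead": [("APPROVES", "PayrollRun-2026-04-EU", {})],
    "Subsidiary": [],
}

def delegation_paths(start: str, target: str, today: date) -> list:
    """Breadth-first traversal returning label-annotated paths to the target,
    skipping delegation edges that have expired."""
    queue = deque([(start, [])])
    results = []
    while queue:
        node, path = queue.popleft()
        if node == target:
            results.append(path)
            continue
        for label, dest, props in edges.get(node, []):
            expires = props.get("expires_on")
            if expires is not None and expires < today:
                continue  # delegation no longer valid on this date
            if dest not in [step[1] for step in path]:  # avoid cycles
                queue.append((dest, path + [(label, dest)]))
    return results

# One valid chain as of 2026-04-10:
# DELEGATED_TO -> RegionalFinanceLead, APPROVES -> PayrollRun-2026-04-EU
print(delegation_paths("ParentCo", "PayrollRun-2026-04-EU", date(2026, 4, 10)))
```

Asking the same question after the delegation expires returns no paths, which is precisely the "currently valid explanation chain" semantics a compliance analyst wants.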

The trade-off is that graph systems are usually a poor home for high-volume transactional ledgers. Bulk updates, tabular analytics, and strict multi-record invariants are not where they shine. For PayLedger, the graph is compelling for authorization lineage, ownership analysis, and incident investigation, but not as the canonical home of approvals and settlements. Graph models earn their place when traversals are the product, the investigation workflow, or the operational control plane, not when teams merely want a more interesting way to store ordinary rows.

Troubleshooting

Issue: The team wants a document database because the API returns nested JSON payloads.

Why it happens / is confusing: API shape is visible, so it feels like the "natural" source of truth. But response formatting can be derived. Invariant enforcement cannot.

Clarification / Fix: Start from the write boundary. If the data must prevent duplicate approvals, preserve referential integrity, and coordinate with settlement, keep the canonical record in a relational structure even if the read API later assembles nested JSON.
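The "derive the nested response at read time" point can be sketched in Python. The flat rows and the `run_payload` function below are invented for illustration; the idea is simply that the API's nested JSON is assembled from canonical relational rows, while uniqueness and referential integrity stay enforced where the rows live.

```python
# Illustrative flat rows, standing in for the relational source of truth.
runs = [{"run_id": "run-1", "employer_id": "acme-eu", "status": "approved"}]
approvals = [
    {"run_id": "run-1", "approver_id": "alice"},
    {"run_id": "run-1", "approver_id": "bob"},
]

def run_payload(run_id: str) -> dict:
    """Assemble the nested API response from flat rows at read time.
    The constraints that prevent duplicate approvals stayed in the database."""
    run = next(r for r in runs if r["run_id"] == run_id)
    return {
        **run,
        "approvals": [a for a in approvals if a["run_id"] == run_id],
    }

print(run_payload("run-1"))
```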

Issue: A relational schema has turned into many ad hoc JSON columns because employer settings vary across countries.

Why it happens / is confusing: The team correctly senses that some parts of the domain are too heterogeneous for rigid normalization, but it solves the pressure locally instead of naming separate aggregate types.

Clarification / Fix: Separate invariant-bearing facts from configuration aggregates. Keep approvals and settlement state relational. Move employer configuration snapshots into a bounded document-style representation, or at least design them as explicit aggregates instead of uncontrolled overflow blobs.

Issue: Engineers propose a graph database for the whole platform after struggling with recursive joins for approval delegation.

Why it happens / is confusing: One painful traversal can make the entire relational model look inadequate, especially when the relationship query is real and urgent.

Clarification / Fix: Scope the graph to the questions that are actually relationship-centric. Use it when multi-hop lineage, delegation, or dependency analysis is the core job. Do not move the ledger just because one part of the domain wants traversal semantics.

Advanced Connections

Connection 1: Data models ↔ query execution

This lesson sets up Declarative Queries and Execution Thinking. The same business question compiles into different execution machinery depending on the model: a relational planner chooses scans, joins, and indexes; a document engine may use aggregate pipelines and multikey indexes; a graph engine expands traversal frontiers along edges. Choosing a model is partly choosing what the execution engine will consider cheap.

Connection 2: Data models ↔ systems of record

The model that stores authoritative truth should match the strongest invariants in the workflow. Derived systems can then reshape that truth for other use cases. PayLedger might keep approvals relational, publish employer configuration snapshots as documents for fast product reads, and maintain a graph projection for compliance investigations. That is not premature polyglot persistence if each representation has a clearly bounded authority and replay path.

Key Insights

  1. Relational models are strongest where invariants are expensive to violate - Keys, constraints, and transactions make them a natural home for authoritative business facts such as payroll approvals and settlement eligibility.
  2. Document models work best when one aggregate owns a nested shape - They reduce friction for records that are read and updated as a unit, but they become awkward when many writers or cross-record rules dominate.
  3. Graph models earn their keep when paths are the product question - They simplify lineage, delegation, and dependency queries that are otherwise artificial in row-oriented storage.