Day 038: Block, File, and Object Storage
These are not three brands of storage. They are three different contracts about how applications are allowed to name, mutate, and coordinate durable data.
Today's "Aha!" Moment
When teams compare storage options, they often talk as if they were shopping among equivalent products: should we use a volume, a shared file system, or object storage? That framing hides the real question: what kind of relationship does the application want with its data? Does it want raw mutable space and full control over layout? Does it want a shared hierarchical namespace with file operations? Or does it want durable blobs addressed by key, usually read or replaced as whole units?
The learning platform gives a clean example. Its relational database wants low-level mutable storage for pages, indexes, and WAL files. Shared report exports and some tooling want path-based access with directories and file semantics. Lesson videos, thumbnails, and backups mostly want durable blobs addressed by key, often written once and read many times. Those workloads are all "storage," but they are asking for different promises.
That is the key idea: block, file, and object storage differ less in "speed" than in semantics. They differ in what the client is allowed to do, what metadata model is exposed, who owns naming, and how much coordination the system is trying to provide. Once you see that, the choice becomes much clearer. You stop asking which storage type is modern and start asking which contract matches the shape of the data.
This matters because wrong storage choices create friction everywhere else. A database forced onto the wrong abstraction loses control it depends on. Shared application code becomes awkward if all it gets are opaque objects. Massive media archives become harder to operate if they are treated like mutable shared files when they mostly behave like immutable blobs. Storage choice is therefore really interface choice.
Why This Matters
The problem: Storage conversations often collapse into vendor defaults or product popularity, which hides the fact that different workloads need fundamentally different storage contracts.
Before:
- Durable data is treated as one generic category.
- Teams compare products without first naming mutability, naming model, or sharing needs.
- Operational pain appears later because the application is fighting the storage semantics.
After:
- Storage is chosen according to the interface and coordination model the workload actually needs.
- The team can explain why a database, shared directory, and media archive should not all sit on the same abstraction.
- Trade-offs around mutation, namespace, durability, and scale are made explicit early.
Real-world impact: Better storage choices reduce operational complexity, improve performance for the right access patterns, and avoid forcing applications to simulate missing semantics badly in user space.
Learning Objectives
By the end of this session, you will be able to:
- Distinguish the three storage contracts clearly - Explain block, file, and object storage by the semantics they expose, not just by example products.
- Match workloads to the right abstraction - Decide which contract fits a database volume, shared file tree, media blob store, or backup archive.
- Reason about trade-offs explicitly - Evaluate control, naming, mutation, coordination, and scale without reducing the discussion to vendor comparison.
Core Concepts Explained
Concept 1: Block Storage Gives Raw Mutable Space and Pushes Structure Upward
Suppose the learning platform's database needs durable space for its pages, indexes, and write-ahead log. It does not want the storage layer deciding what a "file" or "directory" means. It wants addressable storage it can organize itself.
That is what block storage provides. The client sees a sequence of addressable blocks or sectors and decides what those bytes mean. A file system can sit on top of block storage. A database can sit on top of a file system or directly on a block device and still manage its own page layout, caching, and durability logic. The point is that the higher-level meaning lives above the block layer.
application / database
-> block device
-> numbered offsets
-> durable bytes
This is powerful because it gives the application or file system tight control over mutation and layout. It is also demanding because naming, sharing, and coordination are not built in. If multiple clients need the same logical namespace, block storage alone does not solve that problem for you.
The trade-off is control versus convenience. You gain freedom to implement your own structure and write patterns, but you give up higher-level semantics like shared naming, directory hierarchy, and simpler multi-client coordination.
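A minimal sketch of that contract in Python, using an ordinary file as a stand-in for a block device (the filename, block size, and "page" contents here are illustrative):

```python
import os

BLOCK_SIZE = 4096  # a common page size; real devices vary

# A regular file stands in for a block device here. The interface is the
# same idea: numbered blocks addressed by offset, with no built-in names.
fd = os.open("demo.img", os.O_RDWR | os.O_CREAT, 0o600)
os.ftruncate(fd, BLOCK_SIZE * 8)  # pretend the "device" has 8 blocks

def write_block(block_no, data):
    # The caller decides what the bytes mean; the block layer does not.
    assert len(data) <= BLOCK_SIZE
    os.pwrite(fd, data.ljust(BLOCK_SIZE, b"\0"), block_no * BLOCK_SIZE)

def read_block(block_no):
    return os.pread(fd, BLOCK_SIZE, block_no * BLOCK_SIZE)

# A toy database "page" whose layout only the application understands.
write_block(3, b"page-header|row1|row2")
recovered = read_block(3).rstrip(b"\0")

os.close(fd)
os.remove("demo.img")
```

Notice that everything above the `os.pwrite` and `os.pread` calls, including what a "page" is, is the application's invention. That is exactly the structure being pushed upward.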
Concept 2: File Storage Adds a Shared Namespace and File-System Semantics
Now consider shared report exports and tooling artifacts. Multiple programs may want to open /exports/monthly/summary.csv, move files between directories, manage permissions, or append to logs in ways that fit a hierarchical namespace. Here the main value is not raw blocks. It is the file-system contract.
File storage adds exactly that. It gives the client names, paths, directories, permissions, and file operations on top of underlying storage. Somewhere underneath, the file system still maps metadata to blocks or extents, but the application no longer has to think at that level.
path
-> metadata
-> underlying blocks/extents
-> bytes
This is why file storage is often the right abstraction when humans or applications genuinely think in terms of files and folders, or when existing software expects POSIX-like semantics. But those semantics come at a cost. Shared namespaces, locks, metadata coordination, and path operations are work. They are useful only when the workload benefits from them.
The trade-off is usability versus coordination complexity. You gain familiar file operations and shared hierarchy, but you also inherit metadata and consistency overhead that simpler blob-style workloads may not need.
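The same report-export scenario can be sketched against the file contract, with a scratch directory standing in for a shared mount (the paths and CSV contents are illustrative):

```python
import tempfile
from pathlib import Path

# A scratch directory stands in for a shared file-system mount.
root = Path(tempfile.mkdtemp())

# The contract is names, paths, directories, and permissions, not raw blocks.
report = root / "exports" / "monthly" / "summary.csv"
report.parent.mkdir(parents=True)
report.write_text("course,completions\nsql-basics,120\n")

# Append and rename are first-class operations under this contract.
with report.open("a") as f:
    f.write("python-intro,85\n")
archived = report.rename(report.with_name("summary-archived.csv"))
```

None of these operations mention blocks or extents; the file system handles that mapping underneath, which is precisely the convenience being paid for.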
Concept 3: Object Storage Simplifies the Contract by Treating Data as Named Blobs
Now shift to lesson videos, thumbnails, archives, and backups. These artifacts are often written once, versioned or replaced as a whole, and read by key or URL rather than edited in place by many clients. For this kind of workload, a full shared file-system contract is often unnecessary baggage.
Object storage simplifies the model. Clients PUT and GET whole objects identified by keys, along with associated metadata. The system typically does not promise traditional file locking, in-place mutation, or a rich hierarchical namespace in the file-system sense. In return, it often scales more naturally for large blobs, replication, and broad distribution.
object key
-> object metadata
-> whole-object read/write
-> durable replicated blob
That does not make object storage "better." It makes it better aligned with workloads that behave like content, media, backups, logs, or generated artifacts. If the workload needs frequent fine-grained mutation or file-style coordination, forcing it onto object storage usually means rebuilding missing semantics awkwardly in the application.
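A toy in-memory model makes the narrower contract visible: whole-object PUT and GET by key, plus metadata. The class and keys here are illustrative, not a real object-store API:

```python
class ToyObjectStore:
    def __init__(self):
        self._objects = {}  # key -> (bytes, metadata dict)

    def put(self, key, data, metadata=None):
        # A write replaces the whole object; there is no partial update.
        self._objects[key] = (bytes(data), dict(metadata or {}))

    def get(self, key):
        data, _metadata = self._objects[key]
        return data

store = ToyObjectStore()
# Slashes in keys only look like directories; the key is just a flat name.
store.put("lessons/038/intro.mp4", b"<video bytes>", {"content-type": "video/mp4"})
clip = store.get("lessons/038/intro.mp4")
```

Note what is missing: no open-for-append, no seek, no rename-with-locking. For content-shaped workloads that absence is a feature, not a gap.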
One simple decision sketch is:
def choose_storage(workload):
    if workload["needs_low_level_mutation"]:
        return "block"
    if workload["needs_shared_paths_and_file_ops"]:
        return "file"
    return "object"
The code is intentionally rough, but it captures the main point: choose the contract that matches the workload's semantics, not the product that happens to be fashionable.
The trade-off is simplicity and scale versus richer mutation semantics. Object storage reduces coordination needs for the right workloads, but it is a poor fit if the application really wants file-system behavior.
Troubleshooting
Issue: The storage discussion starts with products instead of workload semantics.
Why it happens / is confusing: Product names are concrete and familiar, while semantics like mutation pattern and coordination needs feel more abstract.
Clarification / Fix: Start with four questions: how is the data named, how is it mutated, who shares it, and what coordination semantics are required? Only then compare concrete products.
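One lightweight way to enforce that ordering is to write the four answers down before any product name appears. A sketch; the field names and example answers are our own, not a standard schema:

```python
# The four questions from the fix, answered for a media-archive workload
# before product comparison begins. All values here are illustrative.
workload_profile = {
    "naming": "addressed by key or URL",
    "mutation": "written once, replaced as a whole",
    "sharing": "many readers, few writers",
    "coordination": "none beyond atomic whole-object replace",
}

# Only once every question has an answer does product comparison start.
ready_to_compare = all(workload_profile.values())
```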
Issue: Object storage is treated as the default "modern" answer.
Why it happens / is confusing: Object stores are operationally attractive and widely used, so their missing semantics can disappear from the discussion.
Clarification / Fix: Check whether the application truly works with whole-object reads and writes. If it needs path-based sharing or fine-grained mutation, object storage may be the wrong contract even if the platform is easy to operate.
Advanced Connections
Connection 1: Databases ↔ Block Storage
The parallel: Database engines often want lower-level control because they already implement their own page layout, caching, WAL, and recovery semantics above the storage substrate.
Real-world case: A database may live on files backed by block devices, but the database still treats the underlying storage as a mutable byte-addressable substrate, not as a hierarchy of meaningful business objects.
Connection 2: Media Pipelines ↔ Object Storage
The parallel: Media, backups, and archives often behave like whole immutable or replaceable blobs, which lines up naturally with object semantics and broad distribution.
Real-world case: Video libraries are much easier to operate when stored as named objects behind CDN and lifecycle policies than when treated as mutable shared files.
Resources
Optional Deepening Resources
- These resources are optional and are not required for the core 30-minute path.
- [DOC] Amazon S3 FAQs
- Link: https://aws.amazon.com/s3/faqs/
- Focus: Notice which semantics S3 emphasizes and which traditional file-system behaviors it does not try to provide.
- [DOC] Ceph Documentation
- Link: https://docs.ceph.com/en/latest/
- Focus: Explore one platform that exposes block, file, and object interfaces from a common distributed substrate.
- [BOOK] Designing Data-Intensive Applications
- Link: https://dataintensive.net/
- Focus: Revisit how storage interfaces and system semantics shape application architecture.
Key Insights
- These are three storage contracts, not three equivalent products - The main difference is the semantics they expose around naming, mutation, and coordination.
- The workload should choose the abstraction - Databases, shared file trees, and media blobs usually want different storage contracts for good reasons.
- Operational simplicity comes from semantic fit - Systems are easier to run when the application is not fighting the storage model underneath it.
Knowledge Check (Test Questions)
1. What does block storage primarily provide?
- A) A shared directory hierarchy with filenames and permissions.
- B) Raw addressable durable space whose higher-level meaning is defined by layers above it.
- C) Whole-object APIs with keys and blob metadata.
2. Why is file storage often the right fit for shared exports and path-oriented tools?
- A) Because it provides a hierarchical namespace and file operations that the applications already expect.
- B) Because it removes all metadata overhead.
- C) Because it behaves exactly like object storage with prettier names.
3. Why is object storage often attractive for media and backup archives?
- A) Because those workloads usually tolerate whole-object reads/writes and benefit from simpler blob-oriented semantics.
- B) Because object storage is the only way to get durability.
- C) Because object storage offers the best fit for in-place page updates.
Answers
1. B: Block storage gives low-level mutable space. Naming and richer semantics are provided by systems above it.
2. A: File storage is valuable when applications and users genuinely need paths, directories, and file-style sharing semantics.
3. A: Media and backups often behave like durable blobs rather than collaboratively edited files, so object semantics usually fit them well.