Software Engineer · Java · Spring Boot · Microservices
Designing a File Storage System at Scale: S3-Style Architecture, Chunking & Geo-Replication
Amazon S3 stores trillions of objects and delivers billions of requests per day with eleven nines of durability. How do you architect something at that scale? This system design deep-dive walks through every layer of a production-grade object storage system: requirements analysis, data model, chunking and multipart upload, storage node architecture with erasure coding, distributed metadata management, cross-region geo-replication, and CDN-integrated pre-signed URL generation. Whether you're preparing for a system design interview or building internal storage infrastructure, this guide covers the decisions that matter.
1. Requirements and Scale Estimation
Every great system design begins with a thorough requirements discussion. For a distributed file storage system, we must separately articulate functional requirements (what the system does) and non-functional requirements (the quality attributes that constrain the design).
Functional Requirements:
- Upload files of any type, from a few bytes up to 5 TB per object.
- Download files by bucket name and object key, with optional byte-range requests for partial content.
- Delete objects and entire buckets.
- List objects within a bucket with prefix filtering and delimiter-based hierarchical navigation.
- Object versioning: retain previous versions of an object rather than overwriting on update.
- Access control: bucket-level policies and per-object ACLs. Pre-signed URLs for time-limited, unauthenticated access.
- Metadata support: custom user-defined key-value pairs per object (e.g., Content-Type, x-amz-meta-author).
Non-Functional Requirements:
- Durability: 99.999999999% (11 nines). Once data is written and acknowledged, it must never be lost.
- Availability: 99.99% (52 minutes downtime per year). Reads must be highly available even during regional degradation.
- Consistency: Strong read-after-write consistency for new object uploads. Eventual consistency for cross-region replication.
- Latency: p99 < 200ms for small (<1MB) object downloads via CDN. Throughput-optimized for large objects.
- Scalability: Horizontally scalable to exabytes of storage and millions of QPS.
Scale Estimation: Assume 1 billion registered users, 10 million daily active users (DAU). Each DAU performs an average of 10 file uploads per day with an average file size of 1 MB (mix of small profile images, documents, and occasional large media files).
- Write QPS: 10M users × 10 uploads / 86,400 seconds ≈ 1,157 average write QPS (≈3,500 QPS assuming a 3× peak factor for traffic spikes).
- Read QPS: Reads are typically 10× writes = ~11,570 read QPS (~35,000 QPS peak).
- Daily ingress: 10M DAU × 10 uploads × 1 MB = 100 TB/day.
- 10-year storage: 100 TB/day × 365 days × 10 years = 365 PB of raw data. With 3× replication or equivalent erasure coding overhead: ~1.1 EB total physical storage.
- Metadata: 10M DAU × 10 uploads/day over 10 years ≈ 365 billion objects. At roughly 500 bytes of metadata per object, that is on the order of 180 TB of metadata: far too large for a single database, so it must live in a fast, indexed, horizontally sharded store.
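The estimates above are simple enough to sanity-check in a few lines, using exactly the assumptions stated (10M DAU, 10 uploads/day, 1 MB average object):

```python
# Back-of-the-envelope scale estimation from the assumptions above
DAU = 10_000_000           # daily active users
UPLOADS_PER_DAY = 10       # uploads per user per day
AVG_SIZE_MB = 1            # average object size
SECONDS_PER_DAY = 86_400

write_qps = DAU * UPLOADS_PER_DAY / SECONDS_PER_DAY            # average writes/sec
peak_write_qps = 3 * write_qps                                 # 3x spike factor
read_qps = 10 * write_qps                                      # 10:1 read/write ratio
daily_ingress_tb = DAU * UPLOADS_PER_DAY * AVG_SIZE_MB / 1_000_000  # MB -> TB
ten_year_pb = daily_ingress_tb * 365 * 10 / 1_000              # TB -> PB

print(f"{write_qps:.0f} avg write QPS, {read_qps:.0f} avg read QPS")
print(f"{daily_ingress_tb:.0f} TB/day ingress, {ten_year_pb:.0f} PB over 10 years")
```

Running this reproduces the numbers in the bullets: ~1,157 write QPS, ~100 TB/day, 365 PB raw over 10 years before replication overhead.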
2. High-Level Architecture
A file storage system separates cleanly into a control plane (metadata, identity, policy, billing) and a data plane (actual byte storage and retrieval). This separation mirrors AWS's internal S3 architecture and allows each plane to scale independently.
Control Plane Components:
- API Gateway / Load Balancer: Terminates TLS, routes requests, enforces rate limits. Routes PUT/POST to upload handlers and GET to download handlers, with separate endpoints for multipart upload lifecycle management.
- Auth & IAM Service: Validates access keys, session tokens, and pre-signed URL signatures. Evaluates bucket policies and ACLs. Issues short-lived credentials for cross-service access.
- Metadata Service: Stores and indexes object metadata (key, bucket, size, ETag, content-type, version ID, timestamps, ACL). Backed by PostgreSQL for ACID consistency and Cassandra for high-throughput key lookups at scale.
- Upload Coordinator: Manages multipart upload state — tracks which parts have been uploaded, their checksums, and the final assembly instruction. Backed by Redis for fast part tracking and a persistent store for durability.
Data Plane Components:
- Storage Nodes: Commodity servers with large HDDs/SSDs that store raw object chunks. Organized into placement groups using consistent hashing. Each node exposes a simple internal HTTP API for chunk put/get/delete.
- Replication Manager: Ensures each chunk achieves the required replication factor (or erasure coding parity). Asynchronously monitors and repairs under-replicated chunks due to node failures.
- Message Queue (Kafka): Decouples async operations — replication events, deletion propagation, cross-region sync events, audit log writes — from the synchronous request path.
Request Flow for a PUT (upload): Client → API Gateway → Auth Service (validates credentials, checks bucket ACL) → Upload Handler → splits large files into chunks → writes chunks to primary storage nodes → persists chunk location map to Metadata Service → publishes replication event to Kafka → Replication Manager asynchronously propagates chunks to secondary nodes → returns 200 OK with ETag once primary write is durable.
3. Object Storage Data Model
Object storage organizes data into buckets (named namespaces owned by an account) and objects (the actual files, identified by a unique key within a bucket). The key is a flat string — there are no true directories. The appearance of hierarchical navigation (folders) is achieved by using / as a conventional delimiter in key names and filtering by prefix, which the API maps to a tree-like listing.
Each object has an associated ETag — an MD5 hash (or for multipart uploads, an MD5 of the concatenated part MD5s, plus a -N suffix indicating N parts). The ETag acts as a content fingerprint for cache validation (If-None-Match conditional requests) and data integrity verification. Custom user metadata is stored as key-value pairs with the x-amz-meta- prefix convention.
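The multipart ETag convention described above is easy to reproduce: hash the concatenation of each part's *binary* MD5 digest, then append the part count. A minimal sketch:

```python
import hashlib

def multipart_etag(parts: list[bytes]) -> str:
    """ETag for a multipart upload: MD5 over the concatenated binary
    MD5 digests of each part, suffixed with "-<part count>"."""
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"

def single_etag(data: bytes) -> str:
    """ETag for a single-PUT object: plain MD5 of the content."""
    return hashlib.md5(data).hexdigest()
```

Note the consequence: the same bytes uploaded as one PUT vs. as a multipart upload carry different ETags, so the ETag fingerprints the upload layout as well as the content.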
# REST API surface — object operations
PUT /buckets/{bucket}/objects/{key} # upload single object (≤5GB)
GET /buckets/{bucket}/objects/{key} # download object (supports Range header)
HEAD /buckets/{bucket}/objects/{key} # fetch metadata without body
DELETE /buckets/{bucket}/objects/{key} # soft-delete (version marker)
# Query parameters
GET /buckets/{bucket}/objects/{key}?versionId=abc123 # specific version
GET /buckets/{bucket}/objects?prefix=images/&delimiter=/&maxKeys=100 # list
# Multipart upload lifecycle
POST /buckets/{bucket}/objects/{key}?uploads # initiate: returns uploadId
PUT /buckets/{bucket}/objects/{key}?partNumber=N&uploadId=X # upload part N
GET /buckets/{bucket}/objects/{key}?uploadId=X # list uploaded parts
POST /buckets/{bucket}/objects/{key}?uploadId=X # complete multipart
DELETE /buckets/{bucket}/objects/{key}?uploadId=X # abort multipart
# Example: flat namespace with "/" delimiter creating virtual hierarchy
# Keys: "images/2026/01/photo.jpg", "images/2026/02/video.mp4", "docs/readme.txt"
# GET /buckets/my-bucket/objects?prefix=images/&delimiter=/
# Returns: CommonPrefixes: ["images/2026/"], Objects: []
# → simulates a "folder" listing without actual directory inodes
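The folder-listing behaviour in the example above falls out of a simple grouping rule over the sorted flat key space. A minimal sketch (a real system would paginate through an index rather than scan every key):

```python
def list_objects(keys: list[str], prefix: str = "", delimiter: str = ""):
    """Group a flat key namespace into Objects and CommonPrefixes,
    mimicking the prefix/delimiter listing API."""
    objects, common_prefixes = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Everything up to (and including) the first delimiter after
            # the prefix collapses into a single "folder" entry.
            folder = prefix + rest.split(delimiter, 1)[0] + delimiter
            if folder not in common_prefixes:
                common_prefixes.append(folder)
        else:
            objects.append(key)
    return objects, common_prefixes
```

With the three keys from the example, `list_objects(keys, prefix="images/", delimiter="/")` returns no objects and the single common prefix "images/2026/", exactly the "folder" listing shown above.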
Object versioning, when enabled on a bucket, preserves every version of an object under a system-generated version ID. A DELETE without a version ID creates a delete marker — a zero-byte object that appears as the current version, making the object invisible to ordinary GET requests without physically removing the underlying data. This enables soft-delete, point-in-time recovery, and protection against accidental overwrites.
4. Chunking and Multipart Upload
A 5 TB file cannot be uploaded as a single HTTP request. Network interruptions are inevitable over long transfers, and a single-byte corruption or TCP reset requires restarting the entire upload from scratch. Chunking — splitting a large file into smaller parts and uploading them independently — solves three problems simultaneously: resumability (only failed parts need to be retried), parallelism (multiple parts upload concurrently, saturating available bandwidth), and deduplication (chunks with identical checksums can be stored once across multiple objects, enabling content-addressable storage).
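The deduplication benefit follows directly from addressing chunks by their content hash. A toy sketch with an in-memory store (real systems add reference counting so shared chunks can be safely garbage-collected):

```python
import hashlib

class ChunkStore:
    """Toy content-addressable chunk store: identical chunks, even
    across different objects, are physically stored exactly once."""
    def __init__(self):
        self.chunks: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        chunk_id = hashlib.sha256(data).hexdigest()
        self.chunks.setdefault(chunk_id, data)  # no-op if already present
        return chunk_id

def split(data: bytes, chunk_size: int) -> list[bytes]:
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

store = ChunkStore()
# 32 bytes split into four 8-byte chunks, three of which are identical
manifest = [store.put(c) for c in split(b"A" * 24 + b"B" * 8, chunk_size=8)]
```

The object's manifest still lists four chunk references in order, but the store holds only two distinct chunks.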
The multipart upload protocol has three phases: Initiate (get an upload session ID), Upload Parts (PUT each part independently, receive an ETag per part), and Complete (submit the ordered list of part ETags to assemble the final object). Parts must be between 5 MB (except the last part) and 5 GB, and there can be up to 10,000 parts, giving a theoretical maximum object size of 10,000 × 5 GB ≈ 50 TB, though S3 additionally enforces a 5 TB per-object limit. The optimal chunk size depends on the file size, available bandwidth, and desired parallelism level.
// Java multipart upload: split file, compute checksums, upload parts in parallel
public class MultipartUploader {
    private static final int PART_SIZE_BYTES = 8 * 1024 * 1024; // 8 MB parts
    private final StorageClient storageClient;
    private final ExecutorService executor = Executors.newFixedThreadPool(8);

    public String uploadLargeFile(String bucket, String key, Path filePath) throws Exception {
        long fileSize = Files.size(filePath);
        // Phase 1: Initiate multipart upload
        String uploadId = storageClient.initiateMultipartUpload(bucket, key);
        List<CompletedPart> completedParts = new CopyOnWriteArrayList<>();
        List<Future<?>> futures = new ArrayList<>();
        int partNumber = 1;
        long offset = 0;
        // Phase 2: Upload parts in parallel
        while (offset < fileSize) {
            final long partOffset = offset;
            final long partLength = Math.min(PART_SIZE_BYTES, fileSize - offset);
            final int part = partNumber;
            futures.add(executor.submit(() -> {
                try {
                    byte[] data = readChunk(filePath, partOffset, partLength);
                    String checksum = computeChecksum(data); // SHA-256 per part
                    String md5 = computeMd5(data);           // MD5 for ETag
                    String etag = storageClient.uploadPart(
                            bucket, key, uploadId, part, data, md5
                    );
                    completedParts.add(new CompletedPart(part, etag, checksum));
                } catch (Exception e) { // IOException, NoSuchAlgorithmException
                    throw new RuntimeException("Part " + part + " upload failed", e);
                }
            }));
            offset += partLength;
            partNumber++;
        }
        // Wait for all parts to complete (propagates the first failure)
        for (Future<?> f : futures) f.get();
        // Phase 3: Complete multipart upload — server assembles the object
        completedParts.sort(Comparator.comparingInt(CompletedPart::getPartNumber));
        return storageClient.completeMultipartUpload(bucket, key, uploadId, completedParts);
    }

    private byte[] readChunk(Path path, long offset, long length) throws IOException {
        try (SeekableByteChannel channel = Files.newByteChannel(path, StandardOpenOption.READ)) {
            channel.position(offset);
            ByteBuffer buffer = ByteBuffer.allocate((int) length);
            while (buffer.hasRemaining()) {
                if (channel.read(buffer) == -1) break; // EOF guard against short reads
            }
            return buffer.array();
        }
    }

    private String computeChecksum(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        return Base64.getEncoder().encodeToString(sha256.digest(data));
    }

    private String computeMd5(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        return HexFormat.of().formatHex(md5.digest(data));
    }
}
On the server side, the upload coordinator stores each part reference (part number, storage node locations, chunk checksums) in Redis for fast access during the active upload window. Once the Complete request arrives with the ordered part ETag list, the coordinator validates checksums, writes a logical object manifest to the metadata store (mapping the object key to its ordered list of chunk locations), and atomically commits the object as visible. Incomplete multipart uploads — parts uploaded but never completed — are cleaned up by a background lifecycle management job after a configurable expiry period to prevent orphan chunk accumulation.
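The Complete step reduces to a validation-and-assembly function over the part state the coordinator has tracked. A sketch with hypothetical field names (a real implementation would commit the manifest and the visibility flip in a single metadata transaction):

```python
def complete_multipart(uploaded_parts: dict[int, dict],
                       client_parts: list[tuple[int, str]]) -> list[dict]:
    """Validate the client's ordered (part_number, etag) list against the
    parts the coordinator tracked, and emit the object's chunk manifest."""
    manifest = []
    expected = 1
    for part_number, etag in client_parts:
        if part_number != expected:
            raise ValueError(f"parts out of order or missing at {expected}")
        part = uploaded_parts.get(part_number)
        if part is None or part["etag"] != etag:
            raise ValueError(f"part {part_number}: unknown or ETag mismatch")
        manifest.append({"chunk_hash": part["chunk_hash"],
                         "node_ids": part["node_ids"],
                         "size": part["size"]})
        expected += 1
    return manifest
```

Any gap, reordering, or ETag mismatch aborts the Complete with an error, so a corrupted or partially retried upload can never be committed as a visible object.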
5. Storage Node Architecture
Storage nodes are the workhorses of the data plane — commodity servers equipped with large-capacity HDDs (for throughput-optimized cold storage) or NVMe SSDs (for latency-sensitive hot storage). Each node runs a simple chunk storage daemon that accepts PUT, GET, and DELETE operations for individual chunks identified by their content hash. The daemon manages local disk allocation using a log-structured storage format to minimize seek overhead on spinning disks.
Fault tolerance: the system must survive the simultaneous failure of multiple storage nodes without data loss. Two approaches are used at different cost-durability trade-offs:
3-way replication writes each chunk to three independent storage nodes in different failure domains (different racks, different availability zones). This provides 3× storage overhead but trivial read availability (any one of three nodes can serve the read) and fast reconstruction (copy from a surviving replica).
Erasure coding — specifically Reed-Solomon (RS) codes — is used for cost-efficient large-scale storage. In an RS(6,3) configuration, each chunk is split into 6 data shards and 3 parity shards, totaling 9 shards stored across 9 nodes. The system can reconstruct the original chunk from any 6 of the 9 shards, surviving 3 simultaneous node failures. The storage overhead is only 1.5× (9 shards for 6 data shards' worth of content) compared to 3× for replication — a 50% storage cost saving at petabyte scale.
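Full Reed-Solomon requires Galois-field arithmetic, but the core recover-from-parity idea can be demonstrated with the simplest erasure code: a single XOR parity shard (RAID-5 style, tolerating one lost shard). RS(6,3) generalizes this to three parity shards and any-3-failures tolerance:

```python
def xor_parity(shards: list[bytes]) -> bytes:
    """Parity shard = byte-wise XOR of all input shards."""
    parity = bytearray(len(shards[0]))
    for shard in shards:
        for i, b in enumerate(shard):
            parity[i] ^= b
    return bytes(parity)

def reconstruct(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover one missing data shard: XOR the parity with the survivors."""
    return xor_parity(surviving + [parity])

data_shards = [b"chunk-01", b"chunk-02", b"chunk-03"]
parity = xor_parity(data_shards)
# Simulate losing the middle shard, then rebuild it from the rest:
recovered = reconstruct([data_shards[0], data_shards[2]], parity)
```

Note the rebuild already shows the trade-off discussed below: reconstruction reads every surviving shard, whereas replication repair copies a single replica.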
# Storage node placement via consistent hashing ring
# Virtual nodes (vnodes) ensure even distribution across physical nodes
import bisect
import hashlib

class StoragePlacementService:
    """
    Consistent hashing ring with virtual nodes.
    Maps chunk_id -> list of storage_node_ids for placement.
    """
    REPLICATION_FACTOR = 3
    VNODES_PER_NODE = 150  # virtual nodes per physical server

    def __init__(self, node_metadata: dict[str, dict]):
        # node_metadata: node_id -> {'rack': ..., 'az': ...}
        self.node_metadata = node_metadata
        self._ring = sorted(
            (self._hash(f"{node}#{v}"), node)
            for node in node_metadata
            for v in range(self.VNODES_PER_NODE)
        )
        self._positions = [pos for pos, _ in self._ring]

    def _hash(self, key: str) -> int:
        # Stable 64-bit position on the ring
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def _walk_ring_clockwise(self, position: int):
        """Yield distinct physical nodes clockwise from a ring position."""
        start = bisect.bisect(self._positions, position)
        seen = set()
        for i in range(len(self._ring)):
            _, node = self._ring[(start + i) % len(self._ring)]
            if node not in seen:
                seen.add(node)
                yield node

    def get_placement_nodes(self, chunk_id: str) -> list[str]:
        """
        Returns REPLICATION_FACTOR distinct physical nodes for a chunk.
        Nodes are selected from distinct racks/AZs for fault isolation.
        """
        ring_position = self._hash(chunk_id)
        candidates = self._walk_ring_clockwise(ring_position)
        return self._select_fault_isolated(candidates, self.REPLICATION_FACTOR)

    def _select_fault_isolated(self, candidates, n):
        """Ensure selected nodes span different racks/AZs."""
        selected = []
        seen_racks = set()
        for node in candidates:
            rack = self.node_metadata[node]['rack']
            if rack not in seen_racks:
                selected.append(node)
                seen_racks.add(rack)
            if len(selected) == n:
                break
        return selected
# Hot/Cold storage tiering
# Hot tier: NVMe SSDs, 3-way replication, <10ms p99 read latency
# Warm tier: SATA SSDs, RS(6,3) erasure coding, <50ms p99 read latency
# Cold tier: HDD + tape, RS(10,4) erasure coding, hours retrieval time
# Objects move between tiers based on last-access time and access frequency.
The trade-off is reconstruction cost: rebuilding a failed erasure-coded chunk requires reading 6 surviving shards across the network, vs. reading a single replica in the replication case. This makes erasure coding CPU and network intensive during node recovery, but for cold storage where reconstruction is rare and storage cost dominates, the trade-off is clearly worthwhile. Facebook's f4 warm-blob storage system used RS(10,4) erasure coding (10 data + 4 parity shards) to cut the effective replication factor from Haystack's 3.6× down to 2.1×.
6. Metadata Management
Metadata is the brain of the object storage system. For every object, we must store and efficiently query: bucket name, object key, version ID, ETag (content hash), content type, content length (bytes), storage class, owner account ID, ACL settings, replication status, custom user metadata, and timestamps (created-at, last-modified, last-accessed). This amounts to roughly 300–500 bytes per object: small individually, but at the hundreds of billions of objects our 10-year scale estimate implies, the metadata alone runs to hundreds of terabytes.
The metadata layer uses a dual-store architecture:
PostgreSQL handles bucket-level and object-level operations that require ACID transactions: creating buckets (must be globally unique), enabling versioning (atomic flag set), deleting objects (must atomically create a delete marker and decrement storage accounting), and listing objects with prefix/delimiter filtering (leverages B-tree indexes on the key column). PostgreSQL is sharded by bucket ID across a cluster of 20–50 nodes using Citus or application-level sharding.
Apache Cassandra handles the ultra-high-throughput GET metadata path — resolving (bucket, key, version) to chunk location map — where eventual consistency is acceptable and write throughput requirements exceed PostgreSQL's practical limits. The Cassandra partition key is (bucket_id, key_hash), distributing hot keys across the ring.
-- PostgreSQL schema for object metadata (sharded by bucket_id)
CREATE TABLE buckets (
bucket_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
account_id UUID NOT NULL,
bucket_name VARCHAR(63) UNIQUE NOT NULL,
region VARCHAR(32) NOT NULL,
versioning VARCHAR(16) DEFAULT 'Suspended', -- 'Enabled', 'Suspended'
created_at TIMESTAMPTZ DEFAULT now(),
storage_bytes BIGINT DEFAULT 0,
object_count BIGINT DEFAULT 0
);
CREATE TABLE objects (
object_id UUID DEFAULT gen_random_uuid(),
bucket_id UUID NOT NULL REFERENCES buckets(bucket_id),
object_key TEXT NOT NULL,
version_id VARCHAR(32) NOT NULL,
etag VARCHAR(64) NOT NULL,
content_type VARCHAR(128),
content_length BIGINT NOT NULL,
storage_class VARCHAR(16) DEFAULT 'STANDARD',
owner_id UUID NOT NULL,
is_delete_marker BOOLEAN DEFAULT FALSE,
chunk_manifest JSONB, -- ordered list of {chunk_hash, node_ids, size}
user_metadata JSONB, -- x-amz-meta-* headers
created_at TIMESTAMPTZ DEFAULT now(),
last_modified TIMESTAMPTZ DEFAULT now(),
PRIMARY KEY (bucket_id, object_key, version_id)
);
-- Index for prefix listing: lists all keys starting with "images/2026/"
CREATE INDEX idx_objects_prefix
ON objects (bucket_id, object_key text_pattern_ops)
WHERE is_delete_marker = FALSE;
-- Index for finding the latest version (current object)
CREATE INDEX idx_objects_latest
ON objects (bucket_id, object_key, last_modified DESC)
WHERE is_delete_marker = FALSE;
The chunk_manifest JSONB column stores the ordered list of chunk references — each entry containing the chunk hash (used as the content-addressed storage key), the list of storage node IDs holding that chunk, and the chunk size. For a 1 GB file split into 128 × 8 MB chunks, this manifest is approximately 10–20 KB per object. For objects below 8 MB (single-chunk objects), the manifest contains exactly one entry. The metadata service caches frequently accessed manifests in Redis with a 60-second TTL to avoid hot key contention in PostgreSQL.
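The read-through caching behaviour described above can be sketched in a few lines. Here a plain dict stands in for Redis, and the loader callback stands in for the PostgreSQL manifest lookup; the 60-second TTL matches the figure above:

```python
import time

class ManifestCache:
    """Read-through manifest cache. An in-process dict stands in for
    Redis in this sketch; entries expire after ttl_seconds."""
    def __init__(self, loader, ttl_seconds: float = 60.0, clock=time.monotonic):
        self.loader = loader      # fetches the manifest from the metadata store
        self.ttl = ttl_seconds
        self.clock = clock        # injectable for testing
        self._cache: dict[tuple, tuple[float, object]] = {}

    def get(self, bucket: str, key: str, version_id: str):
        cache_key = (bucket, key, version_id)
        hit = self._cache.get(cache_key)
        if hit and self.clock() - hit[0] < self.ttl:
            return hit[1]                                     # fresh cache hit
        manifest = self.loader(bucket, key, version_id)       # miss or stale
        self._cache[cache_key] = (self.clock(), manifest)
        return manifest
```

Because objects are immutable per version, a stale manifest read is harmless; the TTL exists purely to bound memory, not to enforce freshness.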
7. Geo-Replication and Cross-Region Replication
Storing data in a single region exposes it to region-level failures: natural disasters, power grid outages, or major fiber cuts. Cross-region replication (CRR) protects against this by asynchronously copying objects to one or more geographically distant regions. The key word is asynchronous — synchronous replication would double upload latency by requiring the remote region to acknowledge the write before returning to the client.
Same-region replication (within a single region across multiple availability zones) is synchronous for the write acknowledgment: the primary storage node replicates to at least one secondary node in a different AZ before confirming the upload. This ensures that a single AZ failure does not cause data loss.
Cross-region replication is event-driven: when an object is successfully written to the primary region, a replication event is published to a Kafka topic. A dedicated CRR worker service consumes these events and copies the object (chunk by chunk) to the target region's storage nodes, then updates the target region's metadata store. The replication lag, the time between the primary write and the replica becoming available in the target region, is a critical SLA metric: a 15-minute p99 target is typical (for comparison, S3's Replication Time Control commits to replicating 99.99% of objects within 15 minutes, while standard replication carries no firm SLA).
// Replication lag monitoring — Prometheus metrics exported by CRR worker
// Alerts when replication lag exceeds SLA thresholds
@Component
public class ReplicationLagMonitor {
    private final MeterRegistry meterRegistry;
    // Histogram: time from primary write to replica confirmation (seconds)
    private final Timer replicationLagTimer;
    // Gauge: number of objects waiting to be replicated
    private final AtomicLong replicationQueueDepth;

    public ReplicationLagMonitor(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        this.replicationLagTimer = Timer.builder("storage.replication.lag")
                .description("Time from primary write to cross-region replica confirmation")
                .publishPercentiles(0.5, 0.95, 0.99)
                .publishPercentileHistogram() // export buckets so histogram_quantile() works server-side
                .register(meterRegistry);
        this.replicationQueueDepth = meterRegistry.gauge(
                "storage.replication.queue.depth",
                new AtomicLong(0)
        );
    }

    public void recordReplicationCompleted(Instant primaryWriteTime) {
        Duration lag = Duration.between(primaryWriteTime, Instant.now());
        replicationLagTimer.record(lag);
        replicationQueueDepth.decrementAndGet();
    }

    // Prometheus alerting rule (in prometheus rules YAML):
    // - alert: ReplicationLagHigh
    //   expr: histogram_quantile(0.99, rate(storage_replication_lag_seconds_bucket[5m])) > 900
    //   for: 5m
    //   labels:
    //     severity: warning
    //   annotations:
    //     message: "CRR p99 lag > 15 minutes in {{ $labels.region }}"
}
Conflict resolution in geo-replication is relatively straightforward for object storage because objects are immutable once written. A new upload to the same key creates a new version (if versioning is enabled) or overwrites the current version. In the rare case of a simultaneous write to the same key in two regions (during a split-brain scenario), the system uses the version ID timestamp as the conflict resolution tiebreaker — last-write-wins at the version level, with both versions preserved in version history for operator review. This is the same approach used by S3's cross-region replication.
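The last-write-wins rule can be captured as a pure function over the version history. One subtlety worth encoding (a design choice in this sketch, beyond what the text requires): ties on timestamp must break deterministically, here by version ID, so that both regions independently converge on the same winner:

```python
def resolve_current(versions: list[dict]) -> dict:
    """Pick the 'current' version under last-write-wins.
    Timestamp ties break deterministically by version_id so every
    region converges on the same winner; losers stay in history."""
    return max(versions, key=lambda v: (v["timestamp"], v["version_id"]))

# Split-brain scenario: the same key written in two regions at once
history = [
    {"version_id": "v-east", "timestamp": 1700000100, "region": "us-east"},
    {"version_id": "v-west", "timestamp": 1700000100, "region": "eu-west"},
]
current = resolve_current(history)  # deterministic winner; both versions retained
```

Nothing is deleted by resolution: the losing version remains addressable by its version ID for operator review.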
Replication vs Erasure Coding: Storage Trade-offs
| Aspect | 3-Way Replication | Erasure Coding RS(6+3) |
|---|---|---|
| Storage overhead | 3× (200% extra) | 1.5× (50% extra) |
| Read performance | Excellent — read from any replica | Good — read 6 shards, reassemble |
| Write performance | Good — write 3 copies in parallel | CPU cost for encoding parity shards |
| Rebuild cost | Low — copy from surviving replica | High — read 6 shards, decode, re-encode |
| Fault tolerance | Up to 2 node failures | Up to 3 node failures (any 6 of 9 shards) |
| Use case | Hot storage, latency-sensitive reads | Warm/cold storage, cost-optimized |
| CPU overhead | Negligible | Moderate (encoding/decoding GF operations) |
8. CDN Integration and Pre-Signed URLs
Serving files directly from storage nodes to global users is expensive and slow. A Content Delivery Network (CDN) solves both problems by caching objects at edge locations near users, reducing origin load by 90%+ for cacheable content and delivering sub-50ms response times from edge caches.
The integration pattern is straightforward: the CDN origin is configured to point at the storage system's API endpoint. On a cache miss, the CDN fetches from origin and caches the response with the Cache-Control and ETag headers provided by the storage API. Subsequent requests are served directly from the CDN edge. Cache invalidation is triggered via the CDN's purge API when an object is overwritten or deleted — typically done asynchronously via a Kafka consumer that listens to object modification events and issues purge requests to the CDN API.
Pre-signed URLs solve the challenge of granting temporary, unauthenticated access to private objects — for example, allowing a user to download their private document for the next 15 minutes without exposing your API credentials. The URL encodes the access parameters (bucket, key, expiry timestamp, allowed operations) and a cryptographic signature computed with HMAC-SHA256 using the account's secret key. The storage API verifies the signature on receipt, checks the expiry, and serves the content if valid — without requiring any authentication header from the client.
// Java: Generate a pre-signed URL for temporary authenticated access
public class PreSignedUrlGenerator {
    private static final String ALGORITHM = "HmacSHA256";

    public String generatePresignedGetUrl(
            String accessKeyId,
            String secretKey,
            String bucket,
            String objectKey,
            Duration validity) throws Exception {
        Instant expiry = Instant.now().plus(validity);
        long expiresAt = expiry.getEpochSecond();
        // Canonical string to sign — must match server-side verification exactly
        // Format: METHOD\nBUCKET\nKEY\nEXPIRY\nACCESS_KEY
        String stringToSign = String.join("\n",
                "GET",
                bucket,
                objectKey,
                String.valueOf(expiresAt),
                accessKeyId
        );
        // HMAC-SHA256 signature using the account secret key
        Mac mac = Mac.getInstance(ALGORITHM);
        mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), ALGORITHM));
        byte[] signatureBytes = mac.doFinal(
                stringToSign.getBytes(StandardCharsets.UTF_8)
        );
        String signature = Base64.getUrlEncoder().withoutPadding()
                .encodeToString(signatureBytes);
        // Assemble the pre-signed URL
        return String.format(
                "https://storage.example.com/%s/%s?X-Access-Key-Id=%s&X-Expires=%d&X-Signature=%s",
                URLEncoder.encode(bucket, StandardCharsets.UTF_8),
                URLEncoder.encode(objectKey, StandardCharsets.UTF_8),
                URLEncoder.encode(accessKeyId, StandardCharsets.UTF_8),
                expiresAt,
                URLEncoder.encode(signature, StandardCharsets.UTF_8)
        );
    }
}
// Server-side verification (in the storage API handler):
// 1. Extract X-Access-Key-Id, X-Expires, X-Signature from query params
// 2. Check X-Expires > now() — reject expired URLs with 403
// 3. Reconstruct the same canonical string using current request parameters
// 4. Compute HMAC-SHA256 using the account's secret key (looked up by access key ID)
// 5. Constant-time compare computed signature vs X-Signature — reject mismatch with 403
// 6. Serve the object content with appropriate Cache-Control headers
Pre-signed URLs can also be generated for upload operations — allowing a client to upload directly to the storage system (bypassing your API servers entirely) with time-limited credentials. This dramatically reduces infrastructure load for large file uploads, as the upload traffic goes directly to storage nodes without routing through application servers. The pattern is used extensively in mobile apps and browser-based file upload UIs.
"The hard part of object storage is not storing bytes — it's the metadata at scale, the durability guarantees, and the operational procedures for node failure recovery without data loss. The bytes are easy; the correctness is hard."
— Inspired by Amazon S3 design principles
Key Takeaways
- A file storage system at scale handles 365 PB of data over 10 years for our 1B user scenario. This requires horizontal scaling of both the data plane (storage nodes via consistent hashing) and the control plane (metadata sharded across PostgreSQL/Cassandra).
- Separate the control plane (IAM, metadata, billing) from the data plane (byte storage) to enable independent scaling. The data plane can scale to exabytes by adding commodity storage nodes without touching control plane capacity.
- Chunking and multipart upload are fundamental requirements for files >5 MB. They enable resumable uploads, parallel throughput optimization, and content-addressable deduplication across object versions.
- Erasure coding (RS 6+3) reduces storage overhead from 3× to 1.5× compared to full replication, cutting physical storage costs by 50% — a critical optimization at petabyte scale. Use 3-way replication only for hot storage where rebuild latency matters.
- Metadata is the critical path — every read and write requires a metadata lookup. A dual-store PostgreSQL (ACID, listings) + Cassandra (high-throughput key resolution) architecture balances correctness and throughput.
- Cross-region replication must be asynchronous to avoid doubling write latency. Monitor replication lag as a first-class SLO metric; alert when p99 lag exceeds your recovery time objective.
- Pre-signed URLs provide time-limited, credential-free access to private objects — use HMAC-SHA256 signatures with short expiry windows (<1 hour for user-facing links). Never embed long-lived credentials in client applications.
- CDN integration reduces origin load by 90%+ and delivers sub-50ms latency globally. Cache invalidation must be event-driven (Kafka consumer → CDN purge API) to keep edge caches consistent with object updates.
Conclusion
Designing a file storage system at the scale of S3 or Google Cloud Storage reveals a fundamental truth about distributed systems: the easy part is storing bytes, and the hard part is everything else. Metadata consistency, durability across hardware failures, geo-replication with controlled lag, cost-optimized erasure coding, and secure access delegation are each PhD-worthy problems in their own right.
The architecture we've covered — flat object namespace with virtual directory illusion, multipart upload with SHA-256 chunk verification, consistent hashing for storage node placement, erasure coding for cold storage, dual-store metadata with PostgreSQL for listings and Cassandra for lookups, async cross-region replication with Kafka, and HMAC-signed pre-signed URLs — represents the battle-tested patterns that power the largest object storage systems in production today.
When preparing for a system design interview covering file storage, lead with requirements and scale estimation to anchor the design decisions. Walk through the control plane / data plane separation. Explain why chunking is necessary and how multipart upload achieves resumability and parallelism. Differentiate replication from erasure coding with the concrete cost numbers. Finish with the metadata schema and the pre-signed URL security model. This structured approach, grounded in real production trade-offs, will distinguish your answer as senior-level design thinking.