Infrastructure for AI.
Models get the headlines; infrastructure decides whether they train at all. This is a field guide to the layer underneath — GPU nodes, the fabrics that join them, the storage that feeds them, the schedulers that share them, and the power that limits everything — written by an enterprise storage architect crossing into AI systems, for engineers making the same crossing. Numbers are stated with their sources and hedges; the calculators turn the architecture into arithmetic you can defend in a design review.
Why AI infrastructure is its own discipline
Classic enterprise infrastructure optimizes for isolation: thousands of independent workloads, each protected from its neighbors. AI training is the opposite — one job, thousands of accelerators, moving in lockstep. Every optimizer step, every GPU exchanges gradients with the others; the whole machine advances at the pace of its slowest participant. That single fact drives everything else on this page.
| Consequence | Why | What it demands |
|---|---|---|
| Tail latency is everything | A synchronous all-reduce waits for the last GPU. One slow link or hot chip stalls thousands. | Non-blocking fabrics, topology-aware placement, straggler detection |
| Failure is routine, not exceptional | At cluster scale, component failures arrive weekly or daily. The job must outlive them. | Fast checkpointing, quick restart, spare capacity |
| Utilization is the business model | Accelerators are the dominant capex. An idle GPU burns money at a rate storage people aren't used to. | Everything else — network, storage, scheduling — exists to keep GPUs fed |
| Power is the ceiling | Racks moved from ~10 kW to well past 100 kW in one hardware generation. | Liquid cooling, facility co-design, power-aware scheduling |
Anatomy of a GPU node
A modern training node is a small datacenter in a box: eight accelerators, a private interconnect between them, hundreds of gigabytes of HBM, multiple RDMA NICs, and local NVMe. The hierarchy of bandwidths is the whole story: every hop away from the GPU die costs bandwidth, and by the time traffic leaves the node it has fallen by nearly two orders of magnitude (~3 TB/s of HBM against ~50 GB/s per NIC):
| Link | Ballpark bandwidth | Notes |
|---|---|---|
| HBM ↔ GPU | ~2–3.35 TB/s per GPU (A100 80GB ≈ 2 TB/s, H100 SXM ≈ 3.35 TB/s) | The fastest memory in the system — and the scarcest. Most kernels are bound by it, not by FLOPs. |
| GPU ↔ GPU (NVLink, in-node) | A100/NVLink3 ≈ 600 GB/s · H100/NVLink4 ≈ 900 GB/s per GPU | NVSwitch makes it all-to-all inside the node; tensor parallelism lives here. |
| GPU ↔ CPU/PCIe | PCIe Gen5 x16 ≈ 64 GB/s per direction | An order of magnitude below NVLink, which is the reason "just stage through the CPU" fails. |
| Node ↔ node (RDMA NIC) | 400 Gb/s (≈ 50 GB/s) per NIC, typically one per GPU | InfiniBand NDR or 400G RoCEv2; data-parallel gradient traffic lives here. |
| Node ↔ storage | whatever you engineered — often the neglected link | This is the layer this guide's author lives in. See section 04. |
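To make the hierarchy concrete, the sketch below computes the ideal time to move a payload across each link, using the ballpark figures from the table. It assumes full line rate with no latency or protocol overhead, so real transfers will be slower. The names `LINK_GB_PER_S`, `transfer_seconds`, and `hierarchy_report` are illustrative and do not come from any library.

```python
# Ballpark bandwidths in GB/s, copied from the table above (H100-generation figures).
LINK_GB_PER_S = {
    "HBM (H100 SXM)": 3350.0,
    "NVLink4 (in-node)": 900.0,
    "PCIe Gen5 x16": 64.0,
    "400G RDMA NIC": 50.0,
}


def transfer_seconds(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time: payload divided by line rate, nothing else modeled."""
    return payload_gb / bandwidth_gb_per_s


def hierarchy_report(payload_gb: float) -> dict[str, float]:
    """Seconds needed to move the same payload over each link in the table."""
    return {name: transfer_seconds(payload_gb, bw) for name, bw in LINK_GB_PER_S.items()}


for name, seconds in hierarchy_report(10.0).items():
    print(f"{name:20s} {seconds * 1000:8.1f} ms for 10 GB")
```

The same 10 GB that crosses HBM in a few milliseconds needs a fifth of a second on a single NIC, which is why placement of traffic onto the right link matters more than raw FLOPs.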
The three networks (never mix them)
A serious training cluster runs three physically or logically separate networks — the same discipline as a SAN shop that never let backup traffic touch the production fabric:
| Fabric | Carries | Technology |
|---|---|---|
| Compute fabric | Gradient exchange, collective operations, pipeline traffic between nodes | InfiniBand (NDR 400 Gb/s) or RoCEv2 on 400G Ethernet; rail-optimized fat-tree topologies; GPUDirect RDMA so NIC↔GPU transfers bypass the CPU |
| Storage fabric | Training-data reads, checkpoint writes | Ethernet or IB to the parallel filesystem / object tier; sometimes converged with compute — a decision you'll regret at checkpoint time |
| Frontend / management | SSH, scheduling, telemetry, image pulls | Ordinary Ethernet — keep it boring |
The vocabulary that matters: collectives
Distributed training speaks NCCL. Three verbs cover most of it: all-reduce (every GPU ends with the sum of everyone's gradients — the heartbeat of data parallelism), all-gather (everyone ends with everyone's shard — how ZeRO/FSDP reassembles weights), and reduce-scatter (the sum, but sharded — the other half of FSDP). Ring and tree algorithms trade latency against bandwidth; the fabric's job is to make either work at full line rate on every rail simultaneously.
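The relationship between the three verbs is easiest to see in a toy simulation: a ring all-reduce is a reduce-scatter followed by an all-gather. The sketch below is pure Python with one number per chunk, written for illustration only. It is not how NCCL is implemented, and the function names are hypothetical.

```python
def ring_all_reduce(rank_data: list[list[float]]) -> list[list[float]]:
    """Simulate a ring all-reduce over n ranks, each holding n chunks."""
    n = len(rank_data)
    data = [list(chunks) for chunks in rank_data]  # work on a copy
    assert all(len(chunks) == n for chunks in data), "need one chunk per rank"

    # Phase 1, reduce-scatter: at step s, rank r sends chunk (r - s) % n to its
    # right neighbor, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, data[r][(r - s) % n]) for r in range(n)]
        for r, idx, value in sends:  # all sends in a step happen "at once"
            data[(r + 1) % n][idx] += value
    # Rank r now holds the complete sum for chunk (r + 1) % n.

    # Phase 2, all-gather: at step s, rank r forwards chunk (r + 1 - s) % n,
    # and the receiver overwrites its stale copy with the finished sum.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][(r + 1 - s) % n]) for r in range(n)]
        for r, idx, value in sends:
            data[(r + 1) % n][idx] = value
    return data


def ring_bytes_sent_per_rank(message_bytes: float, n_ranks: int) -> float:
    """Each rank sends (n-1)/n of the message in each of the two phases."""
    return 2 * (n_ranks - 1) / n_ranks * message_bytes


gradients = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]]
print(ring_all_reduce(gradients))
```

Because each rank sends close to twice the message size regardless of ring length, the ring is bandwidth-efficient but pays one latency hop per step, which is the trade against tree algorithms mentioned above.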
InfiniBand vs RoCE — the decision, not the religion
| Factor | InfiniBand | RoCEv2 (400G Ethernet) |
|---|---|---|
| Out-of-box behavior | Lossless by design; credit-based flow control; SHARP in-network reduction on supported switches | Lossless only if you make it so — PFC + ECN must be engineered and verified per queue |
| Operations | Separate skill set and toolchain (subnet manager, ibdiagnet) | Your existing Ethernet team, monitoring, and vendors |
| Ecosystem lock | Effectively one vendor | Multi-vendor; the hyperscalers run frontier training on it, which settled the "is it viable" argument |
| Where each wins | Turn-key trainers, sites without deep Ethernet ops, SHARP-heavy collective profiles | Very large buildouts, Ethernet-standardized shops, cost-sensitive scale |
RoCE tuning checklist — the shortlist that separates working clusters from mystery stalls: enable and verify PFC on the RDMA traffic class end-to-end (one unmarked hop poisons the path); configure ECN/DCQCN thresholds per vendor guidance and confirm marking actually occurs under load; isolate RDMA in its own queue; watch for PFC storms and pause propagation — the RoCE failure mode that looks exactly like a hung filesystem; and validate ECMP hashing spreads flows across rails rather than pinning elephants to one link. Then run nccl-tests (all_reduce_perf) at every scale step and keep the numbers — the baseline is your future 2 a.m. evidence.
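Keeping the baseline only helps if later runs are compared against it. The sketch below converts a measured all-reduce time into bus bandwidth using the 2(n-1)/n correction that the nccl-tests performance notes describe for all-reduce, then flags a regression. The 10% tolerance and the function names are assumptions chosen for illustration, not values from nccl-tests.

```python
def all_reduce_bus_bandwidth(message_bytes: float, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s for an all-reduce of message_bytes across n_ranks."""
    algorithm_bw = message_bytes / seconds / 1e9  # GB/s as seen by the caller
    return algorithm_bw * 2 * (n_ranks - 1) / n_ranks


def regressed(baseline_gb_per_s: float, measured_gb_per_s: float, tolerance: float = 0.10) -> bool:
    """True when the measurement falls more than `tolerance` below the baseline."""
    return measured_gb_per_s < baseline_gb_per_s * (1 - tolerance)


baseline = all_reduce_bus_bandwidth(8e9, 0.50, n_ranks=8)  # recorded at acceptance
tonight = all_reduce_bus_bandwidth(8e9, 0.62, n_ranks=8)   # measured during the incident
print(f"baseline {baseline:.1f} GB/s, now {tonight:.1f} GB/s, regressed={regressed(baseline, tonight)}")
```

Bus bandwidth is the useful number to store because it stays comparable across rank counts, so a baseline taken at 8 ranks still means something at 64.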
Storage for AI — the layer that starves clusters quietly
Training storage has two personalities, and sizing for the wrong one is the classic mistake. The read side streams the dataset: high aggregate sequential bandwidth, but with the small-file trap — vision and multimodal datasets can be millions of tiny files, which murders filesystems that were benchmarked on large sequential reads. The write side is checkpointing: near-idle for an hour, then every node writing gigabytes at once in a burst that must finish fast, because (unless checkpoints are asynchronous) the GPUs wait.
| Tier | Role | Players |
|---|---|---|
| Parallel filesystem | The hot tier: dataset reads + checkpoint writes at cluster bandwidth | Lustre, IBM Storage Scale (GPFS), WEKA, VAST, DDN EXAScaler, BeeGFS |
| Object storage | The lake: raw corpora, processed shards, checkpoint archive | S3 and compatibles; shard formats (WebDataset-style tars, Parquet) exist precisely to turn small files into big objects |
| Node-local NVMe | Cache and scratch: staged shards, spill, temporary checkpoint landing before async upload | Multiple TB per node; often the cheapest bandwidth in the whole design |
GPUDirect Storage (GDS) extends the RDMA idea to disk: NVMe/NVMe-oF reads DMA straight into GPU memory, skipping the CPU bounce buffer. Where it applies, it lifts throughput and frees CPU cycles for the data loader — which is often the actual bottleneck, spending its life decoding and augmenting samples.
Streaming datasets without starving GPUs (or the budget)
| Pattern | How it works | Watch for |
|---|---|---|
| Sharded sequential (WebDataset-style tars) | Thousands of small samples packed into large shard files; loaders stream shards sequentially and shuffle within a buffer | Shuffle quality depends on shard count × buffer size; resume-at-step needs deterministic shard ordering |
| Streaming from object storage | Shards pulled straight from S3-compatible stores, often through a node-local NVMe cache tier | Egress fees when compute and data live in different clouds — the line item that ambushes multi-cloud training; cache hit rate is the number to graph |
| Framework streaming loaders | HF datasets streaming / Mosaic-style StreamingDataset add resumability and deterministic shuffling on top of object shards | Loader CPU cost still applies; benchmark decode/augment before blaming any storage tier |
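The shuffle-within-a-buffer pattern from the first row can be sketched in a few lines. This is a generic illustration, not the code of WebDataset or any streaming loader, and the names `buffered_shuffle` and `shard_order` are hypothetical. Seeding both functions is what makes resume-at-step reproducible.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def buffered_shuffle(samples: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Stream samples through a fixed-size buffer, emitting a random resident each time."""
    rng = random.Random(seed)
    buffer: list[T] = []
    for sample in samples:
        if len(buffer) < buffer_size:
            buffer.append(sample)  # still filling the buffer
            continue
        slot = rng.randrange(buffer_size)
        yield buffer[slot]         # emit a random resident...
        buffer[slot] = sample      # ...and put the newcomer in its slot
    rng.shuffle(buffer)            # drain whatever is left at end of stream
    yield from buffer


def shard_order(shard_names: list[str], epoch: int, seed: int) -> list[str]:
    """Deterministic per-epoch shard order, independent of listing order."""
    order = sorted(shard_names)
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    return order


print(list(buffered_shuffle(range(20), buffer_size=5, seed=0)))
```

A sample can only move as far as the buffer allows, so with few shards and a small buffer the stream stays close to on-disk order. That is the shard count × buffer size dependency noted in the table.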
Checkpoint lifecycle mirrors backup tiering exactly: hot — latest N checkpoints on the parallel filesystem or node-local NVMe for instant restart; warm — recent history in object storage for rollback and evaluation forks; cold — milestone checkpoints archived (and the rest deleted — a 70B run checkpointing hourly writes ~24 TB/day, and someone has to say what survives). Retention policy is a training-infra decision, not an afterthought; you have written this exact policy before, it was called backup rotation.
Checkpointing — DR discipline at training speed
A checkpoint is a point-in-time copy of the training state: weights, optimizer state, and bookkeeping. If that sounds like replication and RPO thinking, it is: the same math wearing a lab coat. The state is large. In mixed-precision Adam training, weights + gradients + optimizer state commonly come to about 16 bytes per parameter of live memory, and the persisted checkpoint (weights + optimizer) to about 12–14 bytes per parameter before any sharded-format overhead. A 70B-parameter run checkpoints roughly a terabyte. Every interval.
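The bytes-per-parameter figures above turn directly into arithmetic. The sketch below only multiplies them out; the function names are illustrative, and the defaults are the rules of thumb from the paragraph, not measurements of any particular framework.

```python
def checkpoint_bytes(params: float, bytes_per_param: float = 14.0) -> float:
    """Persisted checkpoint size: weights plus optimizer state (12-14 B/param)."""
    return params * bytes_per_param


def live_training_state_bytes(params: float, bytes_per_param: float = 16.0) -> float:
    """Live memory for weights, gradients, and optimizer state (about 16 B/param)."""
    return params * bytes_per_param


params_70b = 70e9
low = checkpoint_bytes(params_70b, 12.0) / 1e9
high = checkpoint_bytes(params_70b, 14.0) / 1e9
live = live_training_state_bytes(params_70b) / 1e9
print(f"70B checkpoint: {low:.0f}-{high:.0f} GB, live training state: {live:.0f} GB")
```

Note that the live-state figure excludes activations, which depend on batch size and sequence length and are not covered by the per-parameter rule.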
| Checkpoint backend | Strength | Failure mode |
|---|---|---|
| Parallel FS (Lustre / GPFS / WEKA) | Aggregate bandwidth scales with the cluster; sharded writes land in parallel | Metadata storms when every rank creates files simultaneously — serialize through per-rank directories or consolidated formats |
| Plain NFS | Simple, everywhere | Single-server bandwidth wall; fine for small runs, a stall generator at scale |
| Direct to object (S3) | No PFS to run; durability built in | Per-object latency and request costs; multipart tuning required; restart pull time becomes your real RTO |
| Local NVMe → async upload | Fastest visible checkpoint (GPUs resume in seconds); upload happens off the critical path | A node loss during the async window loses that checkpoint — size the window against your failure model, not your optimism |
Two levers change the math entirely: asynchronous checkpointing (snapshot to host/local memory, persist in the background — the GPU stall shrinks from the full write time to the snapshot time) and differential/sharded formats (each data-parallel rank persists only its shard; distributed checkpoint formats reassemble on restore). Profile before optimizing: time-to-safe vs time-to-stall-end, per-rank write throughput, and metadata ops/sec tell you which of the three walls — bandwidth, metadata, or serialization — you're actually hitting.
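The asynchronous lever can be sketched with nothing but the standard library. In this toy version, `copy.deepcopy` stands in for the device-to-host snapshot and JSON stands in for a real checkpoint format; both are assumptions made to keep the example self-contained, and `async_checkpoint` is a hypothetical name rather than any framework's API.

```python
import copy
import json
import os
import tempfile
import threading
import time


def async_checkpoint(state: dict, path: str) -> tuple[float, threading.Thread]:
    """Snapshot state in memory (the only stall), then persist in the background."""
    started = time.perf_counter()
    snapshot = copy.deepcopy(state)  # stand-in for a GPU-to-host copy
    stall_seconds = time.perf_counter() - started

    def persist() -> None:
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(snapshot, f)
            f.flush()
            os.fsync(f.fileno())       # data is on disk before the rename
        os.replace(tmp_path, path)     # atomic rename: no partial checkpoint is visible

    writer = threading.Thread(target=persist)
    writer.start()
    return stall_seconds, writer


state = {"step": 100, "weights": [0.5] * 1000}
ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
stall, writer = async_checkpoint(state, ckpt_path)
state["step"] = 101                    # training continues while the write runs
writer.join()                          # "time-to-safe" ends here
print(f"stall {stall * 1000:.2f} ms, checkpoint holds step {json.load(open(ckpt_path))['step']}")
```

The gap between the snapshot returning and `writer.join()` completing is the async window: the checkpoint is not safe until the join, which is the exposure the local-NVMe row in the table warns about.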
Design questions, straight from DR practice: how much progress can you afford to lose (checkpoint interval = RPO)? How fast must the write finish (synchronous checkpoint stall = RTO cousin)? Where do copies land (local NVMe → async to PFS → archive to object = tiered replication)? The calculator below turns those into required bandwidth.
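Those design questions reduce to three small formulas. The sketch below is one way to write them down, assuming a synchronous checkpoint and failures that arrive uniformly within an interval; the function names and the example numbers (980 GB, a 60-second stall budget, hourly interval) are illustrative assumptions.

```python
def required_write_bandwidth_gb_per_s(checkpoint_gb: float, stall_budget_s: float) -> float:
    """Aggregate write bandwidth needed to finish the checkpoint inside the stall budget."""
    return checkpoint_gb / stall_budget_s


def expected_lost_work_s(interval_s: float) -> float:
    """Mean progress lost per failure, if failures land uniformly within an interval."""
    return interval_s / 2


def checkpoint_overhead_fraction(stall_s: float, interval_s: float) -> float:
    """Share of wall-clock time the GPUs spend waiting on synchronous checkpoints."""
    return stall_s / (interval_s + stall_s)


bandwidth = required_write_bandwidth_gb_per_s(checkpoint_gb=980, stall_budget_s=60)
overhead = checkpoint_overhead_fraction(stall_s=60, interval_s=3600)
print(f"need {bandwidth:.1f} GB/s aggregate; overhead {overhead:.2%}; "
      f"mean loss per failure {expected_lost_work_s(3600) / 60:.0f} min")
```

Shortening the interval cuts the expected loss but raises the overhead fraction, so the two functions are best read together when picking an RPO.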
Slurm vs Kubernetes — two cultures, one cluster
| | Slurm | Kubernetes |
|---|---|---|
| Native shape | Batch jobs with a start and an end; gang scheduling built in — all N nodes or nothing | Long-running services; gang semantics bolted on (Kueue, Volcano) |
| Where it wins | Training: topology-aware placement, backfill, fair-share, decades of HPC muscle | Inference and platforms: autoscaling, rollouts, service discovery, the entire cloud-native toolchain |
| GPU sharing | Whole-GPU allocations, MIG partitions as resources | Device plugins, MIG, time-slicing for small workloads |
| Common reality | Both: Slurm (or a Slurm-like) for the training partition, Kubernetes for serving — with a growing middle layer trying to unify them | |
The Kubernetes GPU story just changed: DRA
Dynamic Resource Allocation graduated to GA in Kubernetes 1.34 (August 2025) and replaces the count-only device-plugin model ("nvidia.com/gpu: 1") with attribute-aware claims: request "a GPU with ≥80 GB and NVLink" and let the scheduler find it. OpenShift 4.21 ships the same GA framework, which matters if your platform team lives there. Two operational realities survive the upgrade. The first is MIG fragmentation: carving GPUs into slices strands capacity when slice-shaped holes don't match slice-shaped requests, exactly like thin-pool fragmentation on an array. The second is gang scheduling, which vanilla K8s still delegates to Kueue/Volcano while Slurm has it natively: an all-or-nothing job that gets 7 of 8 nodes deadlocks politely forever.
Slurm in one line: a partition ≈ a pool/queue; sbatch submits, srun launches, squeue shows state, sinfo shows nodes. If you can read a zoning config, you can read a Slurm batch script by lunch.
Inference infrastructure — a memory business
Training is a compute problem; serving is a memory-bandwidth and memory-capacity problem. Generating each token requires streaming the model weights through the GPU, plus reading and appending the KV cache: the attention keys/values for every token already in the context. The KV cache is why long contexts and big batches collide: it grows linearly with tokens × layers × KV heads × head dimension, eating the HBM you wanted for batch size.
| Concept | What it means operationally |
|---|---|
| TTFT / TPOT | Time-to-first-token (prefill, compute-heavy) vs time-per-output-token (decode, bandwidth-heavy). Different bottlenecks, tune separately. |
| Continuous batching | Requests join and leave the batch mid-flight — the single biggest throughput lever in modern servers (vLLM et al.). |
| Paged KV cache | Virtual-memory-style paging for the cache — kills fragmentation, raises effective batch size. |
| Quantization | INT8/FP8/INT4 weights: 2–4× less memory traffic per token, with accuracy trade-offs you must evaluate, not assume. |
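| GQA / MQA | Fewer KV heads than attention heads: architectural KV-cache compression. It is the reason Llama-3 8B's cache is 4× smaller than a full multi-head reading suggests (8 KV heads against 32 query heads); the 70B model, with 64 query heads, saves 8×. |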
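The KV-cache arithmetic behind the GQA row can be sketched as follows. The formula counts one key and one value vector per layer per KV head; the model shape used (32 layers, 8 KV heads, 32 query heads, head dimension 128, 2-byte values) is an assumption based on the published Llama-3 8B configuration, and the function names are illustrative.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added per token: key + value, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(tokens_in_flight: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_value: int = 2) -> float:
    """Total cache for all tokens held across the batch, in GiB."""
    per_token = kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)
    return tokens_in_flight * per_token / 2**30


# Assumed Llama-3-8B-like shape; swap in your model's config values.
gqa = kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
mha = kv_cache_bytes_per_token(layers=32, kv_heads=32, head_dim=128)  # if every head kept its own KV
batch_gib = kv_cache_gib(tokens_in_flight=16 * 8192, layers=32, kv_heads=8, head_dim=128)
print(f"{gqa // 1024} KiB/token with GQA vs {mha // 1024} KiB/token without; "
      f"16 x 8k-token requests hold {batch_gib:.0f} GiB")
```

Tokens in flight is batch size times context length, which is why either one alone can exhaust HBM and why paged allocation and quantized caches both show up as capacity levers.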
Choosing a serving engine (mid-2026 map)
| Engine | Status & character | Reach for it when |
|---|---|---|
| vLLM | The default: broadest model support, paged KV cache, continuous batching, OpenAI-compatible API, fastest new-architecture uptake | General-purpose serving; the safe first answer |
| SGLang | RadixAttention prefix caching; strong on multi-turn, agentic, and heavy shared-prefix workloads | Chat/RAG/agents where contexts share long prefixes |
| TensorRT-LLM | NVIDIA's compiled path: peak per-GPU performance, more build/ops effort, NVIDIA-only | Squeezing a fixed model on fixed NVIDIA hardware at scale |
| Triton Inference Server | Multi-framework, multi-model orchestration layer (can front TensorRT-LLM and vLLM backends) | Mixed model zoo, enterprise serving platform needs |
| TGI | Maintenance mode since Dec 11, 2025; repo archived read-only Mar 21, 2026. Hugging Face's own docs point migrations at vLLM or SGLang | Existing deployments only — plan the migration, don't start new ones |
Power & cooling — the constraint that outranks the rest
The numbers moved fast: an H100 SXM module is rated 700 W; an 8-GPU DGX H100 system is ~10.2 kW; a GB200 NVL72 rack is on the order of ~120 kW and is liquid-cooled by design. Traditional datacenter rows were built for 5–15 kW racks on air. That gap — not chip supply, not budget — is what actually limits many AI buildouts.
| Approach | Where it lands |
|---|---|
| Air (hot/cold aisle) | Fine to roughly 20–30 kW/rack with containment; beyond that, physics objects |
| Direct-to-chip liquid | Cold plates on GPUs/CPUs; the mainstream answer for dense trainers; needs CDUs and facility water |
| Rear-door heat exchangers | Retrofit path: liquid-cooled door soaks the exhaust; buys headroom without re-plumbing the servers |
| Immersion | The tank: highest density, biggest operational change; still niche |
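The gap between legacy rack budgets and modern node draw is simple division. The sketch below uses the ~10.2 kW DGX H100 figure quoted above and counts how many whole nodes fit under a rack's power budget; the 128-node cluster size, the rack budgets, and the function names are illustrative assumptions, and cooling, networking, and redundancy overheads are not modeled.

```python
import math

DGX_H100_KW = 10.2  # per 8-GPU system, from the figure quoted above


def nodes_per_rack(rack_kw: float, node_kw: float = DGX_H100_KW) -> int:
    """Whole nodes that fit under a rack's power budget."""
    return int(rack_kw // node_kw)


def racks_needed(total_nodes: int, rack_kw: float, node_kw: float = DGX_H100_KW) -> int:
    """Racks required to house the cluster at a given per-rack budget."""
    per_rack = nodes_per_rack(rack_kw, node_kw)
    if per_rack == 0:
        raise ValueError("rack budget is below the draw of a single node")
    return math.ceil(total_nodes / per_rack)


cluster_nodes = 128  # 1,024 GPUs
for rack_kw in (15, 30, 45):
    print(f"{rack_kw} kW racks: {nodes_per_rack(rack_kw)} node(s) each, "
          f"{racks_needed(cluster_nodes, rack_kw)} racks for {cluster_nodes} nodes")
```

At a 15 kW budget every rack holds a single node and strands a third of its power, which is the floor-space and cabling cost hiding behind the air-cooling row of the table.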
Architecture as arithmetic
Four calculators, all client-side, each encoding a rule of thumb the field actually uses — with its assumptions printed, because a number you can't defend is worse than no number.
Checkpoint size & bandwidth
GPU memory for training
Training time estimator
KV-cache sizing
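For the training time estimator listed above, the page's own formula is not reproduced in this text. A widely used approximation, offered here as an assumption rather than as the site's method, is that training a dense transformer costs about 6 FLOPs per parameter per token, divided by the FLOPs the cluster actually sustains. The peak throughput and utilization values in the example are illustrative round numbers, not vendor specifications.

```python
def training_days(params: float, tokens: float, n_gpus: int,
                  peak_flops_per_gpu: float, mfu: float) -> float:
    """Estimated wall-clock days using the ~6 * params * tokens FLOPs approximation."""
    total_flops = 6 * params * tokens                     # forward + backward, dense model
    sustained_flops = n_gpus * peak_flops_per_gpu * mfu   # mfu = model FLOPs utilization
    return total_flops / sustained_flops / 86_400


days = training_days(
    params=70e9,
    tokens=1.4e12,
    n_gpus=1024,
    peak_flops_per_gpu=1e15,  # assumed round number; use your accelerator's dense peak
    mfu=0.40,                 # assumed; measure it, since stalls and stragglers lower it
)
print(f"about {days:.1f} days")
```

Utilization is the term infrastructure controls: checkpoint stalls, data-loader starvation, and slow collectives all appear as a lower `mfu` and a proportionally longer run.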
For storage engineers making this move
The premise of this whole site: your instincts transfer better than the job boards suggest. The mapping, concretely: