The machinery under the models

Infrastructure for AI.

Models get the headlines; infrastructure decides whether they train at all. This is a field guide to the layer underneath — GPU nodes, the fabrics that join them, the storage that feeds them, the schedulers that share them, and the power that limits everything — written by an enterprise storage architect crossing into AI systems, for engineers making the same crossing. Numbers are stated with their sources and hedges; the calculators turn the architecture into arithmetic you can defend in a design review.

By Mahmoud Khalifa · 22 years in enterprise storage & infrastructure. Vendor peak specs are datasheet values — always derate for the real world.

01 · Foundations

Why AI infrastructure is its own discipline

Classic enterprise infrastructure optimizes for isolation: thousands of independent workloads, each protected from its neighbors. AI training is the opposite — one job, thousands of accelerators, moving in lockstep. Every optimizer step, every GPU exchanges gradients with the others; the whole machine advances at the pace of its slowest participant. That single fact drives everything else on this page.

Consequence	Why	What it demands
Tail latency is everything	A synchronous all-reduce waits for the last GPU. One slow link or hot chip stalls thousands.	Non-blocking fabrics, topology-aware placement, straggler detection
Failure is routine, not exceptional	At cluster scale, component failures arrive weekly or daily. The job must outlive them.	Fast checkpointing, quick restart, spare capacity
Utilization is the business model	Accelerators are the dominant capex. An idle GPU burns money at a rate storage people aren't used to.	Everything else — network, storage, scheduling — exists to keep GPUs fed
Power is the ceiling	Racks moved from ~10 kW to well past 100 kW in one hardware generation.	Liquid cooling, facility co-design, power-aware scheduling

The one-sentence version: AI infrastructure is the art of keeping very expensive compute saturated — and every subsystem on this page is judged by exactly one metric: does it keep the GPUs busy?

02 · The Node

Anatomy of a GPU node

A modern training node is a small datacenter in a box: eight accelerators, a private interconnect between them, hundreds of gigabytes of HBM, multiple RDMA NICs, and local NVMe. The hierarchy of bandwidths is the whole story. Every hop away from the GPU die costs bandwidth: roughly 3–4× from HBM to NVLink, then about an order of magnitude more from NVLink to PCIe or the NIC:

Link	Ballpark bandwidth	Notes
HBM ↔ GPU	~2–3.35 TB/s per GPU (A100 80GB ≈ 2 TB/s, H100 SXM ≈ 3.35 TB/s)	The fastest memory in the system — and the scarcest. Most kernels are bound by it, not by FLOPs.
GPU ↔ GPU (NVLink, in-node)	A100/NVLink3 ≈ 600 GB/s · H100/NVLink4 ≈ 900 GB/s per GPU	NVSwitch makes it all-to-all inside the node; tensor parallelism lives here.
GPU ↔ CPU/PCIe	PCIe Gen5 x16 ≈ 64 GB/s	An order of magnitude below NVLink — the reason "just stage through the CPU" fails.
Node ↔ node (RDMA NIC)	400 Gb/s (≈ 50 GB/s) per NIC, typically one per GPU	InfiniBand NDR or 400G RoCEv2; data-parallel gradient traffic lives here.
Node ↔ storage	whatever you engineered — often the neglected link	This is the layer this guide's author lives in. See section 04.

MEMORY, NOT FLOPS: The advertised petaflops assume the data is already in registers. Real workloads spend their lives moving bytes across the table above, which is why "memory-bandwidth bound" is the default condition and why arithmetic intensity (FLOPs per byte moved) is the number performance engineers actually optimize.
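
The memory-bound condition can be put in numbers with a roofline-style estimate. The sketch below is illustrative, not a profiler: the HBM bandwidth is the H100 SXM figure from the table above, the ~989 TFLOPS dense BF16 peak is an assumed datasheet value that this guide does not state, and the function names are hypothetical.

```python
def ridge_point(peak_flops: float, mem_bw_bytes: float) -> float:
    """Arithmetic intensity (FLOPs/byte) above which a kernel becomes compute-bound."""
    return peak_flops / mem_bw_bytes


def attainable_flops(intensity: float, peak_flops: float, mem_bw_bytes: float) -> float:
    """Roofline model: throughput is capped by compute or by bytes moved, whichever is lower."""
    return min(peak_flops, intensity * mem_bw_bytes)


# ASSUMPTION: ~989 TFLOPS dense BF16 peak (datasheet-style value, not from this guide).
PEAK = 989e12
# H100 SXM HBM bandwidth from the table above (~3.35 TB/s).
HBM_BW = 3.35e12

if __name__ == "__main__":
    print(f"ridge point: {ridge_point(PEAK, HBM_BW):.0f} FLOPs/byte")
    # A unary elementwise op on bf16 does about 1 FLOP per 4 bytes moved (read 2, write 2).
    low = attainable_flops(0.25, PEAK, HBM_BW)
    print(f"elementwise op: {low / PEAK:.2%} of peak")
```

Under these assumptions a kernel needs on the order of hundreds of FLOPs per byte before the arithmetic units, rather than HBM, become the limit.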

DATASHEET vs DELIVERED: Vendor peaks are best-case, dense-math, boost-clock figures. Sustained real-model throughput is expressed as MFU (model FLOPs utilization), and 35–45% MFU on large LLM training is considered good. Plan capacity on delivered numbers, exactly the way you derate a storage array's spec-sheet IOPS.

03 · Fabrics

The three networks (never mix them)

A serious training cluster runs three physically or logically separate networks, the same discipline as a SAN shop that never lets backup traffic touch the production fabric:

Fabric	Carries	Technology
Compute fabric	Gradient exchange, collective operations, pipeline traffic between nodes	InfiniBand (NDR 400 Gb/s) or RoCEv2 on 400G Ethernet; rail-optimized fat-tree topologies; GPUDirect RDMA so NIC↔GPU transfers bypass the CPU
Storage fabric	Training-data reads, checkpoint writes	Ethernet or IB to the parallel filesystem / object tier; sometimes converged with compute — a decision you'll regret at checkpoint time
Frontend / management	SSH, scheduling, telemetry, image pulls	Ordinary Ethernet — keep it boring

The vocabulary that matters: collectives

Distributed training speaks NCCL. Three verbs cover most of it: all-reduce (every GPU ends with the sum of everyone's gradients — the heartbeat of data parallelism), all-gather (everyone ends with everyone's shard — how ZeRO/FSDP reassembles weights), and reduce-scatter (the sum, but sharded — the other half of FSDP). Ring and tree algorithms trade latency against bandwidth; the fabric's job is to make either work at full line rate on every rail simultaneously.
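
The three verbs are easier to hold onto once their semantics are written out. The sketch below is a single-process toy in plain Python, not NCCL's API: each inner list stands in for one rank's gradient vector, nothing is actually communicated, and the function names are hypothetical. It also shows the identity that ring all-reduce relies on: an all-reduce is a reduce-scatter followed by an all-gather.

```python
from typing import List

Vector = List[float]


def reduce_scatter(per_rank: List[Vector]) -> List[Vector]:
    """Each rank ends with one shard of the elementwise sum."""
    n = len(per_rank)
    length = len(per_rank[0])
    assert length % n == 0, "toy version: vector length must divide evenly across ranks"
    shard = length // n
    summed = [sum(vec[i] for vec in per_rank) for i in range(length)]
    return [summed[r * shard:(r + 1) * shard] for r in range(n)]


def all_gather(shards: List[Vector]) -> List[Vector]:
    """Every rank ends with the concatenation of everyone's shard."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def all_reduce(per_rank: List[Vector]) -> List[Vector]:
    """Every rank ends with the full sum: reduce-scatter, then all-gather."""
    return all_gather(reduce_scatter(per_rank))


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two ranks, four gradient elements each
    print(reduce_scatter(grads))  # [[11, 22], [33, 44]]
    print(all_reduce(grads))      # [[11, 22, 33, 44], [11, 22, 33, 44]]
```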

InfiniBand vs RoCE — the decision, not the religion

Factor	InfiniBand	RoCEv2 (400G Ethernet)
Out-of-box behavior	Lossless by design; credit-based flow control; SHARP in-network reduction on supported switches	Lossless only if you make it so — PFC + ECN must be engineered and verified per queue
Operations	Separate skill set and toolchain (subnet manager, ibdiagnet)	Your existing Ethernet team, monitoring, and vendors
Ecosystem lock	Effectively one vendor	Multi-vendor; the hyperscalers run frontier training on it, which settled the "is it viable" argument
Where each wins	Turn-key trainers, sites without deep Ethernet ops, SHARP-heavy collective profiles	Very large buildouts, Ethernet-standardized shops, cost-sensitive scale

RoCE tuning checklist — the shortlist that separates working clusters from mystery stalls: enable and verify PFC on the RDMA traffic class end-to-end (one unmarked hop poisons the path); configure ECN/DCQCN thresholds per vendor guidance and confirm marking actually occurs under load; isolate RDMA in its own queue; watch for PFC storms and pause propagation — the RoCE failure mode that looks exactly like a hung filesystem; and validate ECMP hashing spreads flows across rails rather than pinning elephants to one link. Then run nccl-tests (all_reduce_perf) at every scale step and keep the numbers — the baseline is your future 2 a.m. evidence.
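
Keeping the baseline means keeping comparable numbers. A minimal sketch, assuming the bus-bandwidth convention that nccl-tests documents for all-reduce (algorithm bandwidth scaled by 2(n−1)/n); the 10% regression tolerance and the function names are illustrative assumptions, not part of any tool.

```python
def allreduce_bus_bandwidth(size_bytes: float, seconds: float, ranks: int) -> float:
    """Bus bandwidth in bytes/s for an all-reduce of size_bytes across ranks."""
    algbw = size_bytes / seconds          # algorithm bandwidth: payload over elapsed time
    return algbw * 2 * (ranks - 1) / ranks  # all-reduce correction factor


def regressed(current_busbw: float, baseline_busbw: float, tolerance: float = 0.10) -> bool:
    """True when the current run falls more than `tolerance` below the stored baseline."""
    return current_busbw < baseline_busbw * (1.0 - tolerance)


if __name__ == "__main__":
    # Hypothetical measurement: 1 GB all-reduce across 8 ranks in 50 ms.
    busbw = allreduce_bus_bandwidth(1e9, 0.05, 8)
    print(f"busbw: {busbw / 1e9:.1f} GB/s")
    print("regressed vs 45 GB/s baseline:", regressed(busbw, 45e9))
```

Bus bandwidth is the figure to compare against link speed, because it stays roughly constant as the rank count changes while algorithm bandwidth does not.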

SAN INSTINCTS TRANSFER: Oversubscription ratios, fan-in congestion, head-of-line blocking, "the fabric is fine, the host is misconfigured": every diagnostic instinct from Fibre Channel maps onto IB/RoCE almost one-to-one. Different headers, same physics, same 2 a.m. arguments.

04 · The Data Side

Storage for AI — the layer that starves clusters quietly

Training storage has two personalities, and sizing for the wrong one is the classic mistake. The read side streams the dataset: high aggregate sequential bandwidth, but with the small-file trap — vision and multimodal datasets can be millions of tiny files, which murders filesystems that were benchmarked on large sequential reads. The write side is checkpointing: near-idle for an hour, then every node writing gigabytes at once in a burst that must finish fast, because (unless checkpoints are asynchronous) the GPUs wait.

Tier	Role	Players
Parallel filesystem	The hot tier: dataset reads + checkpoint writes at cluster bandwidth	Lustre, IBM Storage Scale (GPFS), WEKA, VAST, DDN EXAScaler, BeeGFS
Object storage	The lake: raw corpora, processed shards, checkpoint archive	S3 and compatibles; shard formats (WebDataset-style tars, Parquet) exist precisely to turn small files into big objects
Node-local NVMe	Cache and scratch: staged shards, spill, temporary checkpoint landing before async upload	Multiple TB per node; often the cheapest bandwidth in the whole design

GPUDirect Storage (GDS) extends the RDMA idea to disk: NVMe/NVMe-oF reads DMA straight into GPU memory, skipping the CPU bounce buffer. Where it applies, it lifts throughput and frees CPU cycles for the data loader — which is often the actual bottleneck, spending its life decoding and augmenting samples.

Streaming datasets without starving GPUs (or the budget)

Pattern	How it works	Watch for
Sharded sequential (WebDataset-style tars)	Thousands of small samples packed into large shard files; loaders stream shards sequentially and shuffle within a buffer	Shuffle quality depends on shard count × buffer size; resume-at-step needs deterministic shard ordering
Streaming from object storage	Shards pulled straight from S3-compatible stores, often through a node-local NVMe cache tier	Egress fees when compute and data live in different clouds — the line item that ambushes multi-cloud training; cache hit rate is the number to graph
Framework streaming loaders	HF datasets streaming / Mosaic-style StreamingDataset add resumability and deterministic shuffling on top of object shards	Loader CPU cost still applies; benchmark decode/augment before blaming any storage tier
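
The two "watch for" items in the first row, shuffle quality and deterministic resume, can be sketched in a few lines. This is a minimal illustration in plain Python, not the WebDataset or StreamingDataset implementation; the function names and the seed-mixing constant are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shard_order(shards: List[str], seed: int, epoch: int) -> List[str]:
    """Deterministic per-epoch shard permutation, so a resumed job replays the same order."""
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(shards)
    rng.shuffle(order)
    return order


def buffered_shuffle(samples: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximate shuffle: hold buffer_size samples, emit a random one as each new one arrives."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for sample in samples:
        if len(buffer) < buffer_size:
            buffer.append(sample)
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered sample...
        buffer[idx] = sample     # ...and put the new arrival in its slot
    rng.shuffle(buffer)          # drain whatever is left at end of stream
    yield from buffer


if __name__ == "__main__":
    shards = [f"shard-{i:04d}.tar" for i in range(8)]
    print(shard_order(shards, seed=0, epoch=0))
    print(list(buffered_shuffle(range(20), buffer_size=5, seed=0)))
```

A sample can only be emitted after it has entered the buffer, so a small buffer leaves neighbors from the same shard close together. That is why shard count and buffer size set shuffle quality jointly.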

Checkpoint lifecycle mirrors backup tiering exactly: hot — latest N checkpoints on the parallel filesystem or node-local NVMe for instant restart; warm — recent history in object storage for rollback and evaluation forks; cold — milestone checkpoints archived (and the rest deleted — a 70B run checkpointing hourly writes ~24 TB/day, and someone has to say what survives). Retention policy is a training-infra decision, not an afterthought; you have written this exact policy before, it was called backup rotation.

THE LOADER LIES: When GPUs idle, everyone blames the storage array. Measure the pipeline first: decode/augment CPU cost, dataloader worker count, and host↔device copies are the culprit as often as the filesystem. The array graphs being flat is your first clue it was never the array.

CAPACITY ≠ BANDWIDTH: A petabyte that delivers 20 GB/s cannot feed a large cluster. AI storage is bought in GB/s first and PB second, so invert your storage-procurement instincts.

Checkpointing — DR discipline at training speed

A checkpoint is a point-in-time copy of the training state: weights, optimizer state, and bookkeeping. If that sounds like replication and RPO thinking, it is: the same math wearing a lab coat. The state is large: in mixed-precision Adam training, weights + gradients + optimizer state commonly land around 16 bytes per parameter of live memory, and the persisted checkpoint (weights + optimizer) around 12–14 bytes per parameter before any sharded-format overhead. At 70B parameters that is 70 × 10⁹ × 12–14 bytes ≈ 0.84–0.98 TB, so a 70B-parameter run checkpoints roughly a terabyte. Every interval.

Checkpoint backend	Strength	Failure mode
Parallel FS (Lustre / GPFS / WEKA)	Aggregate bandwidth scales with the cluster; sharded writes land in parallel	Metadata storms when every rank creates files simultaneously — serialize through per-rank directories or consolidated formats
Plain NFS	Simple, everywhere	Single-server bandwidth wall; fine for small runs, a stall generator at scale
Direct to object (S3)	No PFS to run; durability built in	Per-object latency and request costs; multipart tuning required; restart pull time becomes your real RTO
Local NVMe → async upload	Fastest visible checkpoint (GPUs resume in seconds); upload happens off the critical path	A node loss during the async window loses that checkpoint — size the window against your failure model, not your optimism

Two levers change the math entirely: asynchronous checkpointing (snapshot to host/local memory, persist in the background — the GPU stall shrinks from the full write time to the snapshot time) and differential/sharded formats (each data-parallel rank persists only its shard; distributed checkpoint formats reassemble on restore). Profile before optimizing: time-to-safe vs time-to-stall-end, per-rank write throughput, and metadata ops/sec tell you which of the three walls — bandwidth, metadata, or serialization — you're actually hitting.
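
The first lever, asynchronous checkpointing, can be sketched as follows. This is a minimal single-process illustration and not any framework's checkpoint API: `copy.deepcopy` stands in for the device-to-host snapshot, `pickle` for the real serialization format, and the `AsyncCheckpointer` class is hypothetical.

```python
import copy
import pickle
import tempfile
import threading
import time
from pathlib import Path
from typing import Optional


class AsyncCheckpointer:
    """Snapshot state on the caller's thread, persist it on a background thread."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self._writer: Optional[threading.Thread] = None

    def save(self, step: int, state: dict) -> float:
        """Return the stall in seconds: waiting out any previous write, plus the snapshot."""
        start = time.perf_counter()
        self.wait()                      # never let two writes overlap
        snapshot = copy.deepcopy(state)  # stand-in for a device-to-host copy
        stall = time.perf_counter() - start
        self._writer = threading.Thread(target=self._persist, args=(step, snapshot))
        self._writer.start()             # the write happens off the critical path
        return stall

    def _persist(self, step: int, snapshot: dict) -> None:
        tmp = self.directory / f"step_{step}.tmp"
        final = self.directory / f"step_{step}.pkl"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        tmp.replace(final)  # rename last, so a partial file is never mistaken for a checkpoint

    def wait(self) -> None:
        """Block until the in-flight write, if any, is safely on disk (time-to-safe)."""
        if self._writer is not None:
            self._writer.join()
            self._writer = None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = AsyncCheckpointer(Path(d))
        stall = ckpt.save(100, {"weights": [0.0] * 100_000, "step": 100})
        ckpt.wait()
        print(f"stall: {stall * 1e3:.2f} ms")
```

The value returned by `save` corresponds to time-to-stall-end, and the moment `wait` returns corresponds to time-to-safe. The gap between the two is the window in which a node loss costs the checkpoint.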

Design questions, straight from DR practice: how much progress can you afford to lose (checkpoint interval = RPO)? How fast must the write finish (synchronous checkpoint stall = RTO cousin)? Where do copies land (local NVMe → async to PFS → archive to object = tiered replication)? The calculator below turns those into required bandwidth.

05 · Running It

Slurm vs Kubernetes — two cultures, one cluster

	Slurm	Kubernetes
Native shape	Batch jobs with a start and an end; gang scheduling built in — all N nodes or nothing	Long-running services; gang semantics bolted on (Kueue, Volcano)
Where it wins	Training: topology-aware placement, backfill, fair-share, decades of HPC muscle	Inference and platforms: autoscaling, rollouts, service discovery, the entire cloud-native toolchain
GPU sharing	Whole-GPU allocations, MIG partitions as resources	Device plugins, MIG, time-slicing for small workloads
Common reality	Both: Slurm (or a Slurm-like) for the training partition, Kubernetes for serving — with a growing middle layer trying to unify them

The Kubernetes GPU story just changed: DRA

Dynamic Resource Allocation graduated to GA in Kubernetes 1.34 (released August 2025) and replaces the count-only device-plugin model ("nvidia.com/gpu: 1") with attribute-aware claims: request "a GPU with ≥80 GB and NVLink" and let the scheduler find it. OpenShift 4.21 ships the same GA framework, which matters if your platform team lives there. Two operational realities survive the upgrade. The first is MIG fragmentation: carving GPUs into slices strands capacity when slice-shaped holes don't match slice-shaped requests, exactly like thin-pool fragmentation on an array. The second is gang scheduling, which vanilla K8s still delegates to Kueue/Volcano while Slurm has it natively: an all-or-nothing job that gets 7 of 8 nodes deadlocks politely forever.

Vocabulary bridge: a Slurm partition ≈ a pool/queue; sbatch submits, srun launches, squeue shows state, sinfo shows nodes. If you can read a zoning config, you can read a Slurm batch script by lunch.

06 · Serving

Inference infrastructure — a memory business

Training is a compute problem; serving is a memory-bandwidth and memory-capacity problem. Generating each token requires streaming the model weights through the GPU, plus reading and appending the KV cache: the attention keys/values for every token already in the context. The KV cache is why long contexts and big batches collide: it grows linearly with tokens × layers × KV heads × head dimension for every sequence in the batch, eating the HBM you wanted for batch size.
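
The bandwidth claim gives a simple floor on decode latency: every decode step has to read the weights and the live KV cache from HBM at least once. The sketch below is a back-of-envelope bound, not a performance model. It ignores compute, kernel overhead, and multi-GPU sharding; the 8B-parameter model in the example is hypothetical and the function names are illustrative.

```python
def min_step_seconds(params_billion: float, bytes_per_weight: float,
                     kv_cache_bytes: float, hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step: bytes that must be streamed, divided by HBM bandwidth."""
    weight_bytes = params_billion * 1e9 * bytes_per_weight
    return (weight_bytes + kv_cache_bytes) / hbm_bytes_per_s


def max_tokens_per_second(batch_size: int, step_seconds: float) -> float:
    """Each decode step yields one token per sequence in the batch."""
    return batch_size / step_seconds


if __name__ == "__main__":
    HBM_BW = 3.35e12  # H100 SXM figure quoted earlier in this guide
    for label, width in [("bf16", 2.0), ("fp8", 1.0), ("int4", 0.5)]:
        step = min_step_seconds(8, width, kv_cache_bytes=0.0, hbm_bytes_per_s=HBM_BW)
        print(f"{label}: >= {step * 1e3:.2f} ms/step, "
              f"<= {max_tokens_per_second(1, step):.0f} tokens/s single-stream")
```

Batching amortizes the weight read across sequences, which is why throughput rises with batch size until the KV cache term, which grows with every sequence and every token, takes over.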

Concept	What it means operationally
TTFT / TPOT	Time-to-first-token (prefill, compute-heavy) vs time-per-output-token (decode, bandwidth-heavy). Different bottlenecks, tune separately.
Continuous batching	Requests join and leave the batch mid-flight — the single biggest throughput lever in modern servers (vLLM et al.).
Paged KV cache	Virtual-memory-style paging for the cache — kills fragmentation, raises effective batch size.
Quantization	INT8/FP8/INT4 weights: 2–4× less memory traffic per token, with accuracy trade-offs you must evaluate, not assume.
GQA / MQA	Fewer KV heads than attention heads: architectural KV-cache compression. Llama-3 8B keeps 8 KV heads for its 32 attention heads, which is why its cache is 4× smaller than a naive reading suggests.

Choosing a serving engine (mid-2026 map)

Engine	Status & character	Reach for it when
vLLM	The default: broadest model support, paged KV cache, continuous batching, OpenAI-compatible API, fastest new-architecture uptake	General-purpose serving; the safe first answer
SGLang	RadixAttention prefix caching; strong on multi-turn, agentic, and heavy shared-prefix workloads	Chat/RAG/agents where contexts share long prefixes
TensorRT-LLM	NVIDIA's compiled path: peak per-GPU performance, more build/ops effort, NVIDIA-only	Squeezing a fixed model on fixed NVIDIA hardware at scale
Triton Inference Server	Multi-framework, multi-model orchestration layer (can front TensorRT-LLM and vLLM backends)	Mixed model zoo, enterprise serving platform needs
TGI	Maintenance mode since Dec 11, 2025; repo archived read-only Mar 21, 2026. Hugging Face's own docs point migrations at vLLM or SGLang	Existing deployments only — plan the migration, don't start new ones

BENCHMARK HYGIENE: Every engine's published numbers beat every other engine's, because each vendor benchmarks its best case. Trust only measurements on your model, your GPUs, your traffic shape (prefill-heavy vs decode-heavy changes the winner). And instrument tail latency against KV-cache pressure: p99 spikes when the cache approaches capacity and eviction/preemption starts, an alertable signal most APM stacks never see.

07 · The Ceiling

Power & cooling — the constraint that outranks the rest

The numbers moved fast: an H100 SXM module is rated 700 W; an 8-GPU DGX H100 system is ~10.2 kW; a GB200 NVL72 rack is on the order of 120 kW and is liquid-cooled by design. Traditional datacenter rows were built for 5–15 kW racks on air. That gap, not chip supply and not budget, is what actually limits many AI buildouts.

Approach	Where it lands
Air (hot/cold aisle)	Fine to roughly 20–30 kW/rack with containment; beyond that, physics objects
Direct-to-chip liquid	Cold plates on GPUs/CPUs; the mainstream answer for dense trainers; needs CDUs and facility water
Rear-door heat exchangers	Retrofit path: liquid-cooled door soaks the exhaust; buys headroom without re-plumbing the servers
Immersion	The tank: highest density, biggest operational change; still niche
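
The gap between legacy rack budgets and GPU systems is easy to quantify. A rough sketch, assuming the ~10.2 kW DGX H100 figure quoted above and counting power only (not floor space, cooling capacity, or network gear); the 10% headroom default and the function names are illustrative assumptions.

```python
import math


def systems_per_rack(rack_kw: float, system_kw: float, headroom: float = 0.10) -> int:
    """Whole systems that fit in a rack's power budget after reserving headroom."""
    usable_kw = rack_kw * (1.0 - headroom)
    return math.floor(usable_kw / system_kw)


def racks_needed(gpus: int, rack_kw: float, system_kw: float = 10.2,
                 gpus_per_system: int = 8, headroom: float = 0.10) -> int:
    """Racks required to host the GPUs, limited by power alone."""
    per_rack = systems_per_rack(rack_kw, system_kw, headroom)
    if per_rack == 0:
        raise ValueError("rack power budget cannot host even one system")
    systems = math.ceil(gpus / gpus_per_system)
    return math.ceil(systems / per_rack)


if __name__ == "__main__":
    for rack_kw in (15, 30, 60):
        print(f"{rack_kw} kW racks: {racks_needed(1024, rack_kw)} racks for 1024 GPUs")
```

At a 15 kW budget a rack holds a single 8-GPU system, so a 1024-GPU cluster spreads across 128 racks, and every extra rack lengthens the fabric cabling the earlier sections depend on.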

THE STORAGE ANGLE: Power planning now belongs in storage design reviews too. A parallel-filesystem tier fast enough to feed a big cluster is itself a multi-rack, high-density system, and every kilowatt it takes competes with a GPU that wanted it.

08 · Calculators

Architecture as arithmetic

Four calculators, all client-side, each encoding a rule of thumb the field actually uses — with its assumptions printed, because a number you can't defend is worse than no number.

Checkpoint size & bandwidth

checkpoint ≈ params × (weight bytes + optimizer bytes) · required BW = size ÷ write window

Inputs: parameters (billions) · weight precision · optimizer state · acceptable write window (seconds) · checkpoint interval (minutes)
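
The checkpoint formula above can be reproduced offline. A minimal sketch of the same arithmetic: the defaults (2 bytes for bf16 weights, 12 bytes for an fp32 master copy plus two Adam moments) are assumptions chosen to match the upper end of the 12–14 bytes-per-parameter range quoted earlier, and the function names are hypothetical.

```python
def checkpoint_bytes(params_billion: float, weight_bytes: float = 2.0,
                     optimizer_bytes: float = 12.0) -> float:
    """checkpoint ≈ params × (weight bytes + optimizer bytes)."""
    return params_billion * 1e9 * (weight_bytes + optimizer_bytes)


def required_bandwidth(size_bytes: float, write_window_s: float) -> float:
    """required BW = size ÷ write window, in bytes per second."""
    return size_bytes / write_window_s


def daily_volume_bytes(size_bytes: float, interval_min: float) -> float:
    """Bytes written per day if every checkpoint is kept."""
    return size_bytes * (24 * 60 / interval_min)


if __name__ == "__main__":
    size = checkpoint_bytes(70)  # 70B parameters
    print(f"checkpoint size: {size / 1e12:.2f} TB")
    print(f"bandwidth for a 60 s window: {required_bandwidth(size, 60) / 1e9:.1f} GB/s")
    print(f"daily volume at a 60 min interval: {daily_volume_bytes(size, 60) / 1e12:.1f} TB")
```

With these defaults a 70B model comes out at about 0.98 TB per checkpoint, about 16 GB/s for a 60-second synchronous window, and about 23.5 TB/day when checkpointing hourly, in line with the figures quoted in section 04.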

GPU memory for training

model states ≈ params × 16 B (bf16 weights+grads + fp32 Adam) ÷ sharding degree, + activations

Inputs: parameters (billions) · state sharding (ZeRO-3/FSDP degree) · activations headroom (GB, batch/seq dependent) · GPU HBM (GB)
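
The same rule of thumb in code, as a sketch under the formula's own assumptions (16 bytes per parameter, model states divided evenly across the sharding degree, activations supplied as a flat headroom figure). Decimal gigabytes are used throughout, and the function names are hypothetical.

```python
def model_state_gb(params_billion: float, sharding_degree: int,
                   bytes_per_param: float = 16.0) -> float:
    """Per-GPU model states in GB: bf16 weights + grads and fp32 Adam state, evenly sharded."""
    # 1e9 params × bytes per param ÷ 1e9 bytes per GB cancels to params_billion × bytes.
    return params_billion * bytes_per_param / sharding_degree


def per_gpu_gb(params_billion: float, sharding_degree: int, activations_gb: float,
               bytes_per_param: float = 16.0) -> float:
    """Model states plus the activation headroom the caller budgets."""
    return model_state_gb(params_billion, sharding_degree, bytes_per_param) + activations_gb


def fits(params_billion: float, sharding_degree: int, activations_gb: float,
         hbm_gb: float) -> bool:
    """True when the estimate fits in one GPU's HBM."""
    return per_gpu_gb(params_billion, sharding_degree, activations_gb) <= hbm_gb


if __name__ == "__main__":
    for degree in (8, 64):
        need = per_gpu_gb(70, degree, activations_gb=20)
        print(f"70B, sharding {degree}: {need:.1f} GB per GPU, fits in 80 GB: "
              f"{fits(70, degree, 20, 80)}")
```

A 70B model sharded across a single 8-GPU node needs about 160 GB per GPU under these assumptions, which is why full sharding across many nodes, not just one, is the normal case at this size.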

Training time estimator

FLOPs ≈ 6 × params × tokens (the standard dense-transformer rule) · time = FLOPs ÷ (GPUs × peak × MFU)

Inputs: parameters (billions) · training tokens (trillions) · GPUs · per-GPU peak (dense BF16 TFLOPS) · MFU (0–1) · $ per GPU-hour (optional)
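
A sketch of the estimator's arithmetic. The 6 × params × tokens rule and the MFU derating come from the formula above; the example inputs (70B parameters, 2T tokens, 1024 GPUs, ~989 TFLOPS peak, 40% MFU, $2 per GPU-hour) are illustrative assumptions, and the estimate ignores restarts, evaluation, and queue time.

```python
def training_flops(params_billion: float, tokens_trillion: float) -> float:
    """FLOPs ≈ 6 × params × tokens (dense-transformer rule of thumb)."""
    return 6.0 * (params_billion * 1e9) * (tokens_trillion * 1e12)


def training_seconds(params_billion: float, tokens_trillion: float, gpus: int,
                     peak_tflops: float, mfu: float) -> float:
    """time = FLOPs ÷ (GPUs × peak × MFU)."""
    delivered = gpus * peak_tflops * 1e12 * mfu  # sustained FLOP/s across the cluster
    return training_flops(params_billion, tokens_trillion) / delivered


def training_cost(seconds: float, gpus: int, dollars_per_gpu_hour: float) -> float:
    """GPU-hours consumed, times the hourly rate."""
    return gpus * (seconds / 3600.0) * dollars_per_gpu_hour


if __name__ == "__main__":
    secs = training_seconds(70, 2.0, gpus=1024, peak_tflops=989, mfu=0.40)
    print(f"wall clock: {secs / 86400:.1f} days")
    print(f"cost at $2/GPU-hour: ${training_cost(secs, 1024, 2.0):,.0f}")
```

Halving MFU doubles both numbers, which is the arithmetic behind the earlier advice to plan on delivered throughput rather than datasheet peak.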

KV-cache sizing

bytes/token = 2 (K+V) × layers × kv_heads × head_dim × precision bytes

Inputs: preset · layers · KV heads · head dim · precision · context length (tokens) · concurrent sequences
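
The KV-cache formula in code. The example shape (32 layers, 8 KV heads, head dimension 128, 2-byte precision) is an assumed preset in the style of a Llama-3 8B class model, included for illustration only; the function names are hypothetical.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       precision_bytes: float) -> float:
    """bytes/token = 2 (K+V) × layers × kv_heads × head_dim × precision bytes."""
    return 2 * layers * kv_heads * head_dim * precision_bytes


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, precision_bytes: float,
                   context_tokens: int, sequences: int) -> float:
    """Total cache for `sequences` concurrent requests, each holding `context_tokens`."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, precision_bytes)
    return per_token * context_tokens * sequences


if __name__ == "__main__":
    # ASSUMED preset: 32 layers, 8 KV heads (GQA), head_dim 128, fp16/bf16 cache.
    per_token = kv_bytes_per_token(32, 8, 128, 2)
    print(f"per token: {per_token / 1024:.0f} KiB")
    total = kv_cache_bytes(32, 8, 128, 2, context_tokens=8192, sequences=32)
    print(f"32 sequences × 8192 tokens: {total / 2**30:.0f} GiB")
    # The same model with one KV head per attention head (32) would need 4× as much.
    naive = kv_cache_bytes(32, 32, 128, 2, context_tokens=8192, sequences=32)
    print(f"without GQA: {naive / 2**30:.0f} GiB")
```

For this shape the cache costs 128 KiB per token, so 32 concurrent 8K-token sequences take 32 GiB with GQA and 128 GiB without it.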

09 · The Crossing

For storage engineers making this move

The premise of this whole site: your instincts transfer better than the job boards suggest. The mapping, concretely:

SAN FABRICS → COMPUTE FABRICS: Zoning discipline, oversubscription math, congestion triage: Fibre Channel intuition lands directly on InfiniBand and RoCE. New encapsulation, identical failure modes.

REPLICATION/DR → CHECKPOINTING: RPO/RTO thinking maps one-to-one onto checkpoint interval and restart cost. You've already done this math for banks; now the "site" is a training job.

CAPACITY PLANNING → MEMORY MATH: HBM budgets, KV-cache growth, and bytes-per-parameter are capacity planning with smaller units and worse consequences for guessing.

ARRAY PERF TRIAGE → PIPELINE TRIAGE: "Is it the array, the fabric, or the host?" becomes "storage, loader, or interconnect?" It is the same isolation methodology, with the same graphs-first honesty.

MIGRATION RUNBOOKS → JOB OPERATIONS: Change windows, rollback plans, and cutover choreography reappear as checkpoint-restore drills and cluster maintenance under running jobs.

WHAT'S GENUINELY NEW: Collectives (NCCL), CUDA-level performance thinking, and scheduler internals. Learnable, and the study path lives at ai.m-khalifa.com.

10 · Resources

The shortlist

Resource	Why read it	Source
Making Deep Learning Go Brrrr	The canonical mental model: compute-bound vs memory-bound vs overhead-bound. Read this first.	horace.io
The Ultra-Scale Playbook	Training on GPU clusters end-to-end: parallelism strategies, memory math, real profiles.	huggingface.co/nanotron
GPU MODE lectures	The working engineer's CUDA/performance curriculum, lecture by lecture.	github.com/gpu-mode
CUDA C++ Programming Guide	The primary source for the execution and memory model everything above sits on.	docs.nvidia.com
PyTorch DDP tutorial	Data parallelism from the framework's mouth: what all-reduce is actually doing.	docs.pytorch.org
PyTorch FSDP tutorial	Sharded training: the all-gather / reduce-scatter dance that redefines your storage traffic.	docs.pytorch.org
FlashAttention (paper)	The case study in memory-hierarchy thinking beating raw FLOPs.	arxiv.org
Slurm quickstart	The scheduler's own introduction: partitions, sbatch, srun in an afternoon.	slurm.schedmd.com
LLNL MPI tutorial	Where collective operations came from; the vocabulary NCCL inherited.	hpc-tutorials.llnl.gov
llm.c	GPT-2 training in pure C/CUDA: the whole stack, small enough to actually read.	github.com/karpathy
Kubernetes v1.34: DRA goes GA	The official announcement of attribute-aware GPU scheduling replacing the device-plugin era.	kubernetes.io
TGI docs (maintenance notice)	Primary source for the TGI status and Hugging Face's recommended migration targets.	huggingface.co
Zero to AI Engineer	The author's structured study path from math foundations to distributed training.	ai.m-khalifa.com
The Storage Field Guide	The other half of this site's author: 21-platform storage APIs, replication, SAN tooling.	storage.m-khalifa.com