GPU for AI Explained: VRAM, CUDA Cores, Tensor Cores, and Everything In Between
You've heard it countless times: "You need a GPU to train AI models." But why? What is a GPU actually doing that a CPU can't? What are CUDA Cores, Tensor Cores, and VRAM — and why do AI engineers obsess over these numbers?
This guide starts from scratch and builds a complete mental model of GPU hardware for AI. By the end, you'll understand exactly what's happening inside the chip when your model trains — and how to pick the right hardware for the job.
Part 1: Why AI Runs on GPUs, Not CPUs
The Core Problem: AI Is Matrix Math at Scale
At its heart, every neural network layer performs one operation over and over:
output = activation(weights × input + bias)
That weights × input is a matrix multiplication — a dense grid of multiply-add operations. Training a single transformer layer on a single batch involves billions of these operations.
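To make "billions of these operations" concrete, here is a minimal sketch that counts the floating-point operations in one dense projection. The layer sizes are hypothetical and chosen only for illustration.

```python
def dense_layer_flops(batch: int, d_in: int, d_out: int) -> int:
    """FLOPs for a (batch x d_in) @ (d_in x d_out) matrix multiplication.

    Each output element needs d_in multiply-adds, and one multiply-add
    counts as 2 FLOPs (one multiply plus one add).
    """
    multiply_adds = batch * d_in * d_out
    return 2 * multiply_adds


# Hypothetical projection: 2,048 tokens, 4,096 input features, 16,384 output features.
flops = dense_layer_flops(batch=2048, d_in=4096, d_out=16384)
print(f"{flops / 1e9:.1f} GFLOPs for one projection")
```

A transformer layer contains several such projections, and the backward pass roughly doubles the work again.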
A CPU is designed for serial, branching work — running your operating system, handling database transactions, executing business logic. It has 16–128 high-frequency cores, each optimized to run one complex task very fast.
A GPU is designed for parallel, uniform work — doing the same simple math across thousands of data points simultaneously.
CPU: 128 powerful cores
→ great at complex, sequential tasks
→ on the order of 128 independent instruction streams at once
GPU: 16,384 simpler cores
→ great at simple, repetitive tasks
→ up to 16,384 multiply-adds per clock cycle
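For matrix math, the GPU wins by one to two orders of magnitude in peak throughput. The exact ratio depends on the chips and the numeric precision being compared: FP32 on CUDA Cores is typically tens of times faster than a server CPU, and low-precision Tensor Core math widens the gap further.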
The Analogy
CPU = a team of 16 expert surgeons
Each can do complex, multi-step procedures.
Sequential, specialized.
GPU = a factory floor with 16,000 workers
Each does one simple repetitive action.
Parallel, coordinated.
Neural network training is a factory job, not surgery.
Part 2: Inside the GPU, the Architecture from Top to Bottom
Before diving into CUDA Cores and Tensor Cores, you need to understand the structural unit they live in: the Streaming Multiprocessor (SM).
The Streaming Multiprocessor (SM)
The SM is the fundamental building block of an NVIDIA GPU. Think of it as a mini-processor that contains multiple types of compute units, shared memory, and a scheduler.
┌─────────────────────────────────────────────────────┐
│ Streaming Multiprocessor (SM) │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ CUDA │ │ Tensor │ │ Special Function│ │
│ │ Cores │ │ Cores │ │ Units (SFU) │ │
│ │ (FP32) │ │ (MMA) │ │ (sin, cos, etc) │ │
│ └──────────┘ └──────────┘ └──────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Shared Memory / L1 Cache (256 KB on H100) │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ Warp │ │ Register│ │ Load/Store │ │
│ │ Scheduler│ │ File │ │ Units (LSU) │ │
│ └──────────┘ └──────────┘ └──────────────────┘ │
└─────────────────────────────────────────────────────┘
A full GPU is a grid of SMs connected to a shared L2 cache and VRAM via a high-bandwidth memory bus:
GPU Die:
┌─────────────────────────────────────────────┐
│ SM SM SM SM SM SM SM SM SM SM │
│ SM SM SM SM SM SM SM SM SM SM │
│ ... (132 SMs on H100) │
│ │
│ ████████████ L2 Cache ████████████████ │
│ │
│ ════════════ Memory Bus (5120-bit) ════ │
│ │
│ HBM3 HBM3 HBM3 HBM3 HBM3 HBM3 (VRAM) │
└─────────────────────────────────────────────┘
Key GPU Specs and What They Mean
| Spec | What It Measures | Why It Matters |
|---|---|---|
| CUDA Core count | FP32 parallelism | General compute throughput |
| Tensor Core count | Matrix multiply speed | AI training/inference throughput |
| VRAM capacity (GB) | How much data fits in GPU memory | Model size limit |
| Memory bandwidth (TB/s) | How fast data moves to cores | Speed of memory-bound ops |
| TFLOPs | Trillion floating-point ops/sec | Peak compute performance |
| TDP (Watts) | Power draw | Cooling and infrastructure cost |
Part 3: CUDA Cores, the GPU's General-Purpose Workhorses
What Is a CUDA Core?
A CUDA Core is a single floating-point arithmetic unit — specifically, an FP32 (32-bit float) multiply-add unit. Each CUDA Core can perform one FP32 multiply-add per clock cycle.
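An H100 SXM GPU has 16,896 CUDA Cores. At a clock of about 1.8 GHz, peak FP32 throughput is 16,896 cores × 2 FLOPs (one multiply plus one add) × 1.8 GHz ≈ 60.8 TFLOPS. NVIDIA's datasheet figure of 67 TFLOPS assumes the higher boost clock.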
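As a sanity check on the core-count arithmetic, this small sketch computes peak throughput from cores and clock speed. The helper name and the 1.8 GHz clock are taken as assumptions from the text above, not from a datasheet.

```python
def peak_tflops(cores: int, clock_ghz: float, flops_per_core_per_clock: int = 2) -> float:
    """Peak throughput in TFLOPS: cores x FLOPs per clock x clock rate.

    A fused multiply-add counts as 2 FLOPs, hence the default of 2.
    """
    return cores * flops_per_core_per_clock * clock_ghz * 1e9 / 1e12


h100_fp32 = peak_tflops(cores=16_896, clock_ghz=1.8)
print(f"H100 FP32 at 1.8 GHz: {h100_fp32:.1f} TFLOPS")
```

This is a ceiling: real kernels reach it only when every core issues a multiply-add on every cycle.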
How CUDA Cores Execute Work: Warps and SIMT
CUDA Cores don't execute threads one by one — they use a model called SIMT (Single Instruction, Multiple Threads). The scheduler groups 32 threads into a warp and executes them all with the same instruction simultaneously.
Warp (32 threads):
Thread 0: output[0] = weight[0] × input[0]
Thread 1: output[1] = weight[1] × input[1]
Thread 2: output[2] = weight[2] × input[2]
...
Thread 31: output[31] = weight[31] × input[31]
All 32 threads run the SAME instruction,
on DIFFERENT data, at the SAME time.
This is why branching code is bad on GPUs: if half the threads in a warp take an if branch and half take else, the GPU must execute both paths sequentially for the whole warp — called warp divergence.
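The cost of divergence can be illustrated with a deliberately simplified model in plain Python. It only counts how many serialized passes a warp needs for one if/else and ignores real hardware details such as independent thread scheduling.

```python
WARP_SIZE = 32


def warp_passes(branch_taken: list[bool]) -> int:
    """Serialized passes a warp needs to execute a single if/else.

    If all 32 threads agree, one pass is enough. If they disagree, the warp
    runs the 'if' path with some lanes masked off, then the 'else' path with
    the remaining lanes, so it needs two passes.
    """
    assert len(branch_taken) == WARP_SIZE
    return len(set(branch_taken))


uniform = [True] * WARP_SIZE                        # every thread takes the same branch
diverged = [i % 2 == 0 for i in range(WARP_SIZE)]   # even lanes take 'if', odd lanes 'else'
print(warp_passes(uniform), warp_passes(diverged))  # 1 2
```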
Occupancy: Keeping the Cores Busy
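Occupancy is the ratio of active warps to the maximum number of warps an SM can hold. With low occupancy, the scheduler has too few ready warps to hide memory latency, so cores sit idle while loads are in flight. Raising occupancy is a common CUDA optimization, although the highest occupancy is not always the fastest configuration.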
# PyTorch example: checking the GPU and profiling a kernel
import torch
# Check CUDA availability
print(torch.cuda.is_available())      # True on a machine with a CUDA GPU
print(torch.cuda.get_device_name(0))  # e.g. NVIDIA H100 80GB HBM3
# Profile GPU time for a matmul. This table reports kernel time, not occupancy;
# per-kernel occupancy comes from tools such as Nsight Compute.
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    x = torch.randn(4096, 4096, device='cuda')
    y = torch.randn(4096, 4096, device='cuda')
    z = torch.matmul(x, y)
    torch.cuda.synchronize()  # wait for the kernel to finish inside the profiled region
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
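To show how the occupancy ratio defined above responds to kernel resource usage, here is a rough estimator. The per-SM limits are assumptions for a Hopper-class GPU, and the model considers only registers and warp slots, ignoring shared memory, block-count limits, and allocation granularity.

```python
MAX_WARPS_PER_SM = 64       # assumption: Hopper-class limit on resident warps
REGISTERS_PER_SM = 65_536   # assumption: 64K 32-bit registers per SM
WARP_SIZE = 32


def occupancy(threads_per_block: int, regs_per_thread: int) -> float:
    """Estimated occupancy: active warps divided by the SM's warp capacity."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    regs_per_block = warps_per_block * WARP_SIZE * regs_per_thread
    blocks_by_regs = REGISTERS_PER_SM // regs_per_block   # limit from the register file
    blocks_by_warps = MAX_WARPS_PER_SM // warps_per_block  # limit from warp slots
    active_warps = min(blocks_by_regs, blocks_by_warps) * warps_per_block
    return active_warps / MAX_WARPS_PER_SM


print(occupancy(threads_per_block=256, regs_per_thread=32))   # 1.0
print(occupancy(threads_per_block=256, regs_per_thread=128))  # 0.25
```

In this model, quadrupling register use per thread cuts occupancy to a quarter, which is why register pressure is a common thing to check when a kernel underperforms.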
Part 4: Tensor Cores, the AI Accelerator Inside the GPU
What Is a Tensor Core?
A Tensor Core is a specialized matrix multiply-accumulate (MMA) unit built specifically for AI workloads. While a CUDA Core handles one FP32 multiply-add per clock cycle, a Tensor Core multiplies small matrix tiles and accumulates the result, completing hundreds of multiply-adds per clock.
The operation a Tensor Core performs:
D = A × B + C
Where:
A = 16×16 matrix (input activations)
B = 16×16 matrix (weight matrix)
C = 16×16 matrix (accumulator / bias)
D = 16×16 matrix (output)
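That's 16×16×16 = 4,096 multiply-adds in ONE warp-level instruction.
A CUDA Core does 1 per clock. An H100 Tensor Core does about 512 per clock in FP16/BF16, so the whole tile completes in a handful of cycles.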
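The tile operation D = A × B + C can be written out in plain Python to confirm the multiply-add count. This is a reference sketch of the arithmetic only; it says nothing about how the hardware schedules it.

```python
def mma_tile(a, b, c):
    """Compute D = A x B + C for small tiles and count the multiply-adds."""
    n, k, m = len(a), len(b), len(b[0])
    d = [row[:] for row in c]  # start from the accumulator tile C
    fmas = 0
    for i in range(n):
        for j in range(m):
            for p in range(k):
                d[i][j] += a[i][p] * b[p][j]  # one fused multiply-add
                fmas += 1
    return d, fmas


T = 16
a = [[1.0] * T for _ in range(T)]
b = [[2.0] * T for _ in range(T)]
c = [[0.5] * T for _ in range(T)]
d, fmas = mma_tile(a, b, c)
print(fmas, d[0][0])  # 4096 32.5
```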
CUDA Core vs Tensor Core: The Real Difference
CUDA Core:
1 multiply-add per clock
Works on scalars
Supports FP32, FP64, INT32
Tensor Core:
~512 multiply-adds per clock (4th gen, FP16/BF16)
Works on matrix tiles
Supports FP16, BF16, TF32, FP8, INT8, INT4 (varies by generation)
→ ~16× throughput vs CUDA Cores on matrix workloads
This is why Tensor Core TFLOPS are the number that matters for AI, not CUDA Core TFLOPS. The Tensor Core figures below are dense (no structured sparsity); vendor spec sheets often quote the sparse number, which is 2× higher:
| GPU | CUDA Core TFLOPS (FP32) | Tensor Core TFLOPS (BF16, dense) |
|---|---|---|
| RTX 4090 | 82.6 | 165 |
| A100 80GB | 19.5 | 312 |
| H100 SXM | 60 | 989 |
| H200 SXM | 60 | 989 |
| B200 SXM | ~90 | ~2,250 |
Tensor Core Precision Formats
Each generation of Tensor Cores adds new precision formats. Lower precision = more operations per second = faster training/inference:
FP32 (32-bit float): Full precision. Used for optimizer states.
TF32 (19-bit float): NVIDIA's "free" speedup: same range as FP32,
less mantissa. Opt-in for matmul in PyTorch >= 1.12.
BF16 (16-bit float): Same exponent range as FP32, less precision.
Training standard in 2026.
FP16 (16-bit float): Smaller range than BF16. Needs loss scaling.
FP8 (8-bit float): Used for inference and forward pass in 2026.
INT8 (8-bit int): Quantized inference. ~2× speedup over FP16.
INT4 (4-bit int): Ultra-compressed inference (GPTQ, AWQ).
Precision impact on throughput (H100, relative to FP32):
FP32 → 1× (baseline)
TF32 → ~8×
BF16 → 16×
FP8 → 32×
INT8 → 32×
Why BF16 Beat FP16 for LLM Training
BF16 and FP16 both use 16 bits, but they split them differently:
FP32: 1 sign | 8 exponent | 23 mantissa
BF16: 1 sign | 8 exponent | 7 mantissa ← same exponent range as FP32
FP16: 1 sign | 5 exponent | 10 mantissa ← smaller range → overflow/underflow
BF16 keeps the wide dynamic range of FP32 (crucial for gradient stability)
while halving the memory footprint.
LLMs train in BF16 by default in 2026.
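The range and precision trade-off follows directly from the bit layouts above. This sketch derives the largest finite value and the spacing of representable numbers near 1.0 from the exponent and mantissa widths, assuming standard IEEE-style encoding with the top exponent reserved for infinity and NaN.

```python
def float_format_stats(exp_bits: int, mantissa_bits: int) -> tuple[float, float]:
    """Return (largest finite value, spacing between 1.0 and the next value)."""
    bias = 2 ** (exp_bits - 1) - 1
    max_finite = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    epsilon = 2.0 ** -mantissa_bits
    return max_finite, epsilon


for name, e, m in [("FP32", 8, 23), ("BF16", 8, 7), ("FP16", 5, 10)]:
    max_finite, eps = float_format_stats(e, m)
    print(f"{name}: max ≈ {max_finite:.3e}, epsilon = {eps:.2e}")
```

FP16 tops out at 65,504, so a large gradient or activation can overflow, while BF16 reaches roughly 3.4e38 like FP32 at the cost of coarser precision.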
Using Tensor Cores in PyTorch
import torch
# TF32 for matmul is opt-in since PyTorch 1.12 (it was on by default in 1.7 to 1.11)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True   # already the default for convolutions
# BF16 mixed-precision training: autocast runs matmuls on Tensor Cores
model = MyTransformer().cuda()           # placeholder: your own nn.Module
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# BF16 has FP32's exponent range, so no GradScaler is needed
for batch in dataloader:                 # placeholder: your own DataLoader
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # Tensor Cores kick in here
        loss = model(batch)
    loss.backward()
    optimizer.step()
# Alignment check: cuBLAS uses Tensor Cores most efficiently when dimensions
# are multiples of 16 bytes, i.e. 8 elements for BF16/FP16 and 16 for FP8/INT8.
print(f"Input shape divisible by 16: {4096 % 16 == 0}")  # True
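The key rule: Tensor Cores run most efficiently when matrix dimensions are multiples of 16 bytes, which means 8 elements for BF16/FP16 and 16 elements for FP8/INT8. Recent cuBLAS versions still use Tensor Cores for other shapes, but at reduced efficiency, so pad your dimensions accordingly (multiples of 64 or more are better still).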
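Padding a dimension is a one-line calculation. The helper below is a hypothetical utility, and the 50,257-entry vocabulary is just a familiar example of an awkward size.

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return -(-n // multiple) * multiple  # ceiling division, then scale back up


vocab_size = 50257  # example of a dimension that is not a multiple of 8
print(pad_to_multiple(vocab_size, 8), pad_to_multiple(vocab_size, 64))  # 50264 50304
```

The extra rows are never looked up; they exist only so the embedding and output projection have Tensor Core friendly shapes.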
Part 5: VRAM, the Memory That Makes or Breaks Your Model
What Is VRAM?
VRAM (Video RAM) is the GPU's dedicated memory. It sits on the graphics card next to the GPU (GDDR) or inside the same package beside the die (HBM), not on the die itself. Unlike system RAM, which the GPU can only reach across the PCIe bus, VRAM is directly accessible at full memory bandwidth.
Everything the GPU works on must fit in VRAM:
- Model weights
- Activations (intermediate layer outputs)
- Gradients (during training)
- Optimizer states (Adam's m and v vectors)
- KV cache (during LLM inference)
- Input batch data
VRAM is the GPU's workspace.
If your model + data don't fit, training fails with:
RuntimeError: CUDA out of memory.
What Eats Your VRAM? (The Full Accounting)
For a model with P parameters, training with AdamW in BF16:
Weights: P × 2 bytes (BF16)
Gradients: P × 2 bytes (BF16)
Optimizer states: P × 8 bytes (Adam m + v in FP32 = 2 × 4 bytes each)
─────────────────────────────────────────────────────
Total before activations: P × 12 bytes (P × 16 if an FP32 master copy of the weights is kept)
Plus activations (batch-size and sequence-length dependent). As a rough lower bound, counting one saved tensor per layer:
Activations ≈ batch_size × seq_len × hidden_dim × num_layers × 2 bytes
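The per-parameter accounting above turns into a small estimator. This is a sketch under the same assumptions as the text (BF16 weights and gradients, FP32 Adam states, activations excluded); the function name and defaults are illustrative.

```python
def training_vram_gb(
    params_billions: float,
    weight_bytes: float = 2,   # BF16 weights
    grad_bytes: float = 2,     # BF16 gradients
    optim_bytes: float = 8,    # Adam m and v in FP32
) -> float:
    """VRAM in decimal GB for weights, gradients, and optimizer states only."""
    bytes_per_param = weight_bytes + grad_bytes + optim_bytes
    return params_billions * bytes_per_param  # (1e9 params x bytes) / 1e9 bytes per GB


for size in (3, 8, 70):
    print(f"{size}B parameters: ~{training_vram_gb(size):.0f} GB before activations")
```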
Practical examples:
| Model | Parameters | Min VRAM (Inference BF16, weights only) | Min VRAM (Training BF16, excl. activations) |
|---|---|---|---|
| Llama 3.2 3B | 3B | ~6 GB | ~36 GB |
| Llama 3.1 8B | 8B | ~16 GB | ~96 GB |
| Llama 3.1 70B | 70B | ~140 GB | ~840 GB |
| GPT-4 class | ~1.8T (rumored) | ~3.6 TB | Not feasible single-node |
# Estimate VRAM needed for the weights alone at inference time
def vram_estimate_inference(params_billions, dtype_bytes=2):
    """BF16 = 2 bytes, FP32 = 4 bytes, INT8 = 1 byte, INT4 = 0.5 byte"""
    params = params_billions * 1e9
    return (params * dtype_bytes) / 1e9  # decimal GB, matching the table above
print(f"7B BF16: {vram_estimate_inference(7):.1f} GB") # 14.0 GB
print(f"13B BF16: {vram_estimate_inference(13):.1f} GB") # 26.0 GB
print(f"70B INT4: {vram_estimate_inference(70, 0.5):.1f} GB") # 35.0 GB
VRAM Types: GDDR6X vs HBM3
Not all VRAM is equal. Consumer GPUs use GDDR6X; data center GPUs use HBM (High Bandwidth Memory):
GDDR6X (RTX 4090):
Capacity: 24 GB
Bandwidth: 1,008 GB/s
Location: Separate chips on PCB, connected via 384-bit bus
HBM3 (H100 SXM):
Capacity: 80 GB
Bandwidth: 3,350 GB/s
Location: Memory stacks placed beside the GPU die on a silicon interposer (2.5D packaging)
HBM is stacked in layers like a skyscraper, with thousands of tiny connections to the GPU die. This is why H100 bandwidth is about 3.3× that of the RTX 4090, even though the two chips have a similar number of CUDA Cores (16,896 vs 16,384).
Memory Bandwidth: The Hidden Bottleneck
For many AI workloads — especially inference and attention — the bottleneck isn't compute (Tensor Cores) but memory bandwidth: how fast you can move model weights from VRAM to the compute units.
Roofline model:
Arithmetic Intensity (AI) = FLOPs ÷ bytes_accessed
If your operation's AI is BELOW the hardware ridge point:
→ Memory-bound: bandwidth is the limit
If your operation's AI is ABOVE the hardware ridge point:
→ Compute-bound: Tensor Cores are the limit
LLM inference (small batch): Memory-bound
LLM training (large batch): Compute-bound
Attention with long sequences: Memory-bound
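The roofline test can be worked through numerically. The sketch below assumes the H100 SXM figures used in this guide (989 TFLOPS BF16, 3.35 TB/s) and a simple model in which each matrix is read or written exactly once; real kernels move more data than that.

```python
def ridge_point(peak_tflops: float, bandwidth_tb_s: float) -> float:
    """Arithmetic intensity (FLOPs per byte) where compute and bandwidth limits meet."""
    return peak_tflops / bandwidth_tb_s


def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) matmul, touching each matrix once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


h100_ridge = ridge_point(peak_tflops=989, bandwidth_tb_s=3.35)
decode = matmul_intensity(m=1, n=4096, k=4096)       # one token at a time
training = matmul_intensity(m=4096, n=4096, k=4096)  # large batch of tokens

print(f"H100 ridge point: {h100_ridge:.0f} FLOPs/byte")
print(f"Batch-1 decode:   {decode:.1f} FLOPs/byte (memory-bound)")
print(f"Large-batch GEMM: {training:.0f} FLOPs/byte (compute-bound)")
```

Batch-1 decoding does about one FLOP per byte moved, far below the ridge point, which is why it is limited by bandwidth rather than by Tensor Core throughput.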
This is why FlashAttention exists: it tiles and fuses the attention computation so the N×N score matrix never has to be written to VRAM, which cuts memory traffic and moves attention much closer to compute-bound.
# FlashAttention in PyTorch 2.x (SDPA dispatches to a Flash kernel when it can)
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel  # PyTorch >= 2.3
q = torch.randn(1, 32, 2048, 128, device='cuda', dtype=torch.bfloat16)
k = torch.randn(1, 32, 2048, 128, device='cuda', dtype=torch.bfloat16)
v = torch.randn(1, 32, 2048, 128, device='cuda', dtype=torch.bfloat16)
# SDPA picks a backend automatically; this context manager restricts it to
# FlashAttention (it replaces the deprecated torch.backends.cuda.sdp_kernel)
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    output = F.scaled_dot_product_attention(q, k, v)
# Result: same output up to rounding, typically faster, and far less VRAM
# because the N×N attention matrix is never materialized
Part 6: GPU Generations and the 2026 Landscape
NVIDIA Architecture Timeline
Ampere (2020) → A100: 312 TFLOPS BF16, 80GB HBM2e, 2TB/s bandwidth
The workhorse of 2021–2024 cloud AI
Hopper (2022) → H100: 989 TFLOPS BF16, 80GB HBM3, 3.35TB/s bandwidth
NVLink 4.0, Transformer Engine (FP8 dynamic scaling)
→ H200: 989 TFLOPS BF16, 141GB HBM3e, 4.8TB/s bandwidth
Same compute as H100, ~1.8× the memory and ~1.4× the bandwidth
Blackwell(2024)→ B100: 1,800 TFLOPS BF16, 192GB HBM3e, 8TB/s bandwidth
→ B200: ~2,250 TFLOPS BF16 dense (~4,500 with sparsity), 192GB HBM3e, 8TB/s bandwidth
→ GB200: 2× Blackwell GPUs + 1× Grace ARM CPU per superchip; 36 superchips form an NVL72 rack
The Transformer Engine (H100 and Later)
Starting with Hopper, NVIDIA added the Transformer Engine: FP8 Tensor Core hardware paired with a software library that chooses FP8 or BF16 per layer during training and manages the FP8 scaling factors for you:
Transformer Engine:
→ Tracks per-tensor maximum values (amax) across recent iterations
→ Derives scaling factors so tensors fit FP8's narrow range
→ Keeps precision-sensitive operations in BF16 or FP32
→ Up to ~2× throughput vs BF16-only training, with little to no accuracy loss reported
Not enabled automatically in PyTorch. You opt in by:
- Running on an H100 or later GPU
- Using the transformer-engine library (or another FP8 path built on dtypes such as torch.float8_e4m3fn)
Consumer vs Data Center: What's the Difference?
| Feature | RTX 4090 (Consumer) | H100 SXM (Data Center) |
|---|---|---|
| Price | ~$1,600 | ~$30,000 |
| VRAM | 24 GB GDDR6X | 80 GB HBM3 |
| Bandwidth | 1.0 TB/s | 3.35 TB/s |
| TFLOPS (BF16, dense) | 165 | 989 |
| FP64 (for science) | ~1.3 TFLOPS | 34 TFLOPS |
| NVLink | ❌ | ✅ (900 GB/s) |
| ECC memory | ❌ | ✅ |
| Form factor | PCIe card | SXM5 module |
| Best for | Hobbyist training, inference | Production LLM training |
NVLink and Multi-GPU Scaling
When one GPU isn't enough, you need multiple GPUs to communicate. The interconnect matters enormously:
PCIe 4.0 (×16): ~32 GB/s per direction (~64 GB/s bidirectional)
Consumer GPUs connected to the CPU and each other
via the CPU's PCIe lanes
NVLink 4.0: 900 GB/s bidirectional (total per GPU)
Direct GPU-to-GPU high-speed link
~14× faster than PCIe 4.0 ×16
NVSwitch (DGX H100): All 8 GPUs connected all-to-all
Every GPU sees every other GPU at full NVLink speed
= 3.6 TB/s bisection bandwidth across the 8 GPUs
The practical impact on distributed training:
8× A100 via PCIe: all-reduce bottleneck = ~32 GB/s per direction
→ data parallelism scales poorly
8× H100 via NVLink: all-reduce bottleneck = 450 GB/s per direction (900 GB/s bidirectional)
→ near-linear scaling in data parallelism
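To see what the interconnect gap means per training step, here is a back-of-the-envelope estimate using the bandwidth term of a ring all-reduce. The gradient size and the per-direction link speeds are assumptions for illustration, and latency and overlap with compute are ignored.

```python
def ring_allreduce_seconds(payload_gb: float, n_gpus: int, link_gb_per_s: float) -> float:
    """Bandwidth term of ring all-reduce.

    Each GPU sends (and receives) 2 * (N - 1) / N times the payload over its link.
    """
    return 2 * (n_gpus - 1) / n_gpus * payload_gb / link_gb_per_s


grads_gb = 16.0  # hypothetical: BF16 gradients of an 8B-parameter model
pcie = ring_allreduce_seconds(grads_gb, n_gpus=8, link_gb_per_s=32.0)
nvlink = ring_allreduce_seconds(grads_gb, n_gpus=8, link_gb_per_s=450.0)

print(f"PCIe 4.0: {pcie * 1000:.0f} ms per all-reduce")
print(f"NVLink:   {nvlink * 1000:.0f} ms per all-reduce")
```

If a training step computes for a few hundred milliseconds, close to a second of gradient exchange dominates the step, whereas tens of milliseconds can largely be hidden behind the backward pass.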
Part 7: Practical Guide to Choosing the Right GPU
For Inference: Maximize Bandwidth-per-Dollar
Inference is typically memory-bound. You want the highest memory bandwidth you can afford at the VRAM capacity you need.
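One way to see why bandwidth matters: when decoding one token at a time, every weight has to be read from VRAM once per token, so bandwidth divided by model size gives a ceiling on tokens per second. The sketch below uses the bandwidth figures quoted in this guide and ignores KV-cache reads, batching, and kernel overheads.

```python
def max_decode_tokens_per_s(
    params_billions: float, bytes_per_param: float, bandwidth_gb_s: float
) -> float:
    """Upper bound on batch-1 decode speed for a memory-bound model."""
    weight_gb = params_billions * bytes_per_param  # decimal GB read per token
    return bandwidth_gb_s / weight_gb


rtx4090 = max_decode_tokens_per_s(8, 2, bandwidth_gb_s=1008)
h100 = max_decode_tokens_per_s(8, 2, bandwidth_gb_s=3350)
print(f"8B BF16 on RTX 4090: at most ~{rtx4090:.0f} tokens/s")
print(f"8B BF16 on H100 SXM: at most ~{h100:.0f} tokens/s")
```

Quantizing the weights raises this ceiling for the same reason: fewer bytes per parameter means fewer bytes to stream per token.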
Rule of thumb (inference, BF16 weights):
Params (in billions) × 2 = minimum VRAM in GB, before KV cache and activations
Llama 3.1 8B → 16 GB min → a 24 GB card (RTX 4090) leaves room for the KV cache
Llama 3.1 70B → 140 GB min → 2× H100 80GB (160GB total)
Llama 3.1 70B → ~40 GB INT4 → 2× A100 40GB with GPTQ/AWQ
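The rule of thumb covers weights only; the KV cache grows with context length on top of that. Here is a sketch of the usual size formula. The layer and head counts are assumed values for a Llama 3.1 8B style model with grouped-query attention.

```python
def kv_cache_gb(
    layers: int,
    kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch: int = 1,
    bytes_per_elem: int = 2,  # BF16
) -> float:
    """Decimal GB needed to cache keys and values for every layer and token."""
    elems = 2 * layers * kv_heads * head_dim * seq_len * batch  # 2 = keys + values
    return elems * bytes_per_elem / 1e9


# Assumed configuration: 32 layers, 8 KV heads, head dimension 128
short_ctx = kv_cache_gb(32, 8, 128, seq_len=8_192)
long_ctx = kv_cache_gb(32, 8, 128, seq_len=131_072)
print(f"8K context:   {short_ctx:.2f} GB per sequence")
print(f"128K context: {long_ctx:.2f} GB per sequence")
```

At long context lengths or large batch sizes the cache can rival the weights themselves, so size the GPU for both.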
# Run quantized inference to fit larger models
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
# 4-bit NF4 quantization via bitsandbytes: fits 70B in ~40GB VRAM
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16, # Tensor Cores for compute
bnb_4bit_quant_type="nf4", # NormalFloat4: best accuracy
bnb_4bit_use_double_quant=True, # Extra compression
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-70B-Instruct",
quantization_config=quant_config,
device_map="auto",
)
# 70B model now fits in ~40GB VRAM, typically with only a small accuracy drop
For Training: Maximize TFLOPS and VRAM Capacity
Training is often compute-bound for large batches. Use BF16, enable Tensor Cores, and scale with gradient checkpointing to trade compute for memory.
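import torch
# Placeholder: a Hugging Face-style model. gradient_checkpointing_enable() is a
# transformers.PreTrainedModel method; for a plain nn.Module, wrap blocks with
# torch.utils.checkpoint.checkpoint instead.
model = MyLargeTransformer(num_layers=48).cuda()
# Gradient checkpointing: recompute activations on the backward pass
# Trades roughly a third more compute for a large cut in activation memory
model.gradient_checkpointing_enable()
# Combine with BF16 mixed precision
optimizer = torch.optim.AdamW(model.parameters())
for batch in dataloader:  # placeholder: your own DataLoader yielding dicts
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss  # Hugging Face models return an output object
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()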
For Fine-Tuning: Use LoRA to Slash VRAM
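Full fine-tuning of a 7B model requires ~84GB VRAM for weights, gradients, and optimizer states (7B × 12 bytes). LoRA (Low-Rank Adaptation) freezes the base weights and trains small adapter matrices, which brings this down to roughly 16GB plus activations: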
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM
import torch
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
torch_dtype=torch.bfloat16, # Half precision weights
device_map="auto",
)
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16, # Rank: low-rank matrices
lora_alpha=32, # Scaling factor
target_modules=["q_proj", "v_proj"], # Which layers to adapt
lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 6,815,744 || all params: 8,037,384,192
# trainable%: 0.08% ← only 0.08% of weights are trained!
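The trainable-parameter count printed above can be reproduced by hand. Each adapted projection gains two low-rank matrices, A (r × d_in) and B (d_out × r). The shapes below are assumptions for Llama 3.1 8B: hidden size 4096, 8 KV heads of dimension 128, and 32 layers.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


hidden = 4096        # assumed model width
kv_dim = 8 * 128     # assumed v_proj output width (grouped-query attention)
layers = 32          # assumed number of transformer layers
r = 16               # LoRA rank from the config above

per_layer = lora_params(hidden, hidden, r) + lora_params(hidden, kv_dim, r)  # q_proj + v_proj
total = per_layer * layers
print(f"{total:,} trainable parameters")  # 6,815,744 trainable parameters
```

Only these adapter weights need gradients and optimizer states, which is where the VRAM saving comes from.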
GPU Decision Matrix for 2026
| Use Case | VRAM Needed | Recommended GPU | Approx Cost |
|---|---|---|---|
| Learning / experiments | 8–16 GB | RTX 4070 Ti SUPER (16GB) | $800 |
| 7B–13B inference | 16–24 GB | RTX 4090 (24GB) | $1,600 |
| 7B fine-tuning (LoRA) | 24 GB | RTX 4090 | $1,600 |
| 70B inference (quantized) | 48 GB | 2× RTX 4090 | $3,200 |
| 7B full fine-tuning | 80 GB+ | H100 (cloud), with memory-saving tricks such as offloading | ~$3/hr |
| 70B full fine-tuning | 840 GB+ | 2× 8× H100 DGX nodes, or one node with offloading | ~$25/hr per node |
| LLM pre-training | TB-scale | H100/B200 cluster | Enterprise |
Part 8: GPU Memory Management in Practice
Monitoring VRAM in Real Time
import torch
def vram_status():
    if not torch.cuda.is_available():
        return "No CUDA GPU available"
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    total = torch.cuda.get_device_properties(0).total_memory / 1024**3
    return (
        f"Allocated: {allocated:.2f} GB\n"
        f"Reserved: {reserved:.2f} GB\n"
        f"Total: {total:.2f} GB\n"
        f"Free: {total - reserved:.2f} GB"
    )
print(vram_status())
# After a CUDA OOM, clear cache before retrying
torch.cuda.empty_cache()
Common CUDA OOM Patterns and Fixes
# ❌ Problem 1: Accumulating computation graph across batches
losses = []
for batch in dataloader:
    loss = model(batch)
    losses.append(loss)  # Keeps the entire graph in VRAM!
total = sum(losses)
# ✅ Fix: Store the Python scalar instead of the tensor
losses = []
for batch in dataloader:
    loss = model(batch)
    losses.append(loss.item())  # .item() copies the value out, so the graph can be freed
# ❌ Problem 2: Not freeing VRAM between evaluation and training
with torch.no_grad():
    val_output = model(val_batch)
# val_output still lives in VRAM
# ✅ Fix: Delete explicitly or scope tightly
del val_output
torch.cuda.empty_cache()
# ❌ Problem 3: Full precision where half is fine
embeddings = model.embed(text)  # FP32 by default
# ✅ Fix: Move model to BF16 upfront
model = model.to(torch.bfloat16)
Part 9: The 2026 AI Hardware Landscape at a Glance
NVIDIA's Blackwell Generation
The B200 and GB200 NVL72 are the dominant training platforms in 2026:
B200 SXM (single GPU):
Tensor Core TFLOPS (dense): ~2,250 BF16 / ~4,500 FP8 (2× with structured sparsity)
VRAM: 192 GB HBM3e
Bandwidth: 8 TB/s
NVLink 5.0: 1.8 TB/s bidirectional
GB200 NVL72 (full rack):
72× B200 GPUs + 36× Grace CPUs
Total VRAM: 13.8 TB
Aggregate NVLink bandwidth: 130 TB/s
Used for: GPT-5 class pre-training
AMD MI300X: The Challenger
AMD's MI300X has made significant inroads in 2026:
MI300X:
VRAM: 192 GB HBM3 ← most of any single-chip GPU (tied with B200)
Bandwidth: 5.3 TB/s
TFLOPs: 1,307 BF16
Best at: inference of very large models due to massive VRAM
Software: ROCm 6.x, PyTorch 2.x full support
Cloud GPU Availability (2026)
| Provider | Available GPUs | Best For |
|---|---|---|
| AWS | H100, H200, A100 | Training at scale |
| Google Cloud | H100, A100, TPU v5 | JAX/TensorFlow workloads |
| Azure | H100, ND B200 (2026) | Enterprise, OpenAI partnership |
| Lambda Labs | H100, A100 | Cost-effective training |
| RunPod | H100, A100, RTX 4090 | Budget inference / fine-tuning |
Summary
GPUs dominate AI because neural networks are matrix math at massive scale, and GPUs are hardware-optimized for exactly that.
CUDA Cores are the general-purpose FP32 arithmetic units — thousands of them executing the same instruction across thousands of data points simultaneously via the SIMT model and warp execution. They handle all non-matrix operations and are the fallback when Tensor Cores can't be used.
Tensor Cores are the real AI accelerators inside the GPU: matrix multiply-accumulate units that each complete hundreds of multiply-adds per clock cycle instead of one. They are the reason an H100 achieves 989 TFLOPS BF16 rather than 60. They are used automatically for BF16, FP16, TF32, or FP8 matrix math, and they run most efficiently when matrix dimensions are multiples of 16, which is why matching these dimensions in your model matters so much.
VRAM is the workspace every GPU operation draws from. Capacity determines what model fits; bandwidth determines how fast computation flows. HBM (used in data center GPUs) achieves 3–8× the bandwidth of GDDR6X (consumer) by physically stacking memory beside the die. The single most important GPU metric for LLM inference is memory bandwidth; for training at large batch sizes it is Tensor Core TFLOPs.
In 2026, the stack has matured: BF16 is the training default, FP8 is standard for forward passes on H100+ hardware, LoRA makes fine-tuning accessible on a single consumer GPU, and FlashAttention removes the memory bottleneck from attention. Whether you're training from scratch on a B200 cluster or running a quantized 70B model on two RTX 4090s, the same principles apply — keep your Tensor Cores fed, keep your VRAM footprint minimal, and understand which dimension of the hardware you're hitting first.