GPU for AI Explained: VRAM, CUDA Cores, Tensor Cores, and Everything In Between¶

You've heard it countless times: "You need a GPU to train AI models." But why? What is a GPU actually doing that a CPU can't? What are CUDA Cores, Tensor Cores, and VRAM — and why do AI engineers obsess over these numbers?

This guide starts from scratch and builds a complete mental model of GPU hardware for AI. By the end, you'll understand exactly what's happening inside the chip when your model trains — and how to pick the right hardware for the job.

Part 1: Why AI Runs on GPUs — Not CPUs¶

The Core Problem: AI Is Matrix Math at Scale¶

At its heart, every neural network layer performs one operation over and over:

output = activation(weights × input + bias)

That weights × input is a matrix multiplication — a dense grid of multiply-add operations. Training a single transformer layer on a single batch involves billions of these operations.
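To put a number on "billions", here is a minimal sketch that counts the floating-point operations in one dense matrix multiply. The layer shape (4,096 tokens in the batch, a 4,096 → 4,096 projection) is an illustrative assumption, not a figure taken from any particular model.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) product.

    Every output element needs k multiplies and k adds,
    so the total is 2 * m * k * n.
    """
    return 2 * m * k * n


# Hypothetical layer: 4,096 tokens, 4,096 input features, 4,096 output features
flops = matmul_flops(4096, 4096, 4096)
print(f"{flops:,} FLOPs")  # 137,438,953,472 FLOPs
```

A single projection at this size already costs about 137 billion operations, and a transformer layer contains several of them in both the forward and the backward pass.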

A CPU is designed for serial, branching work — running your operating system, handling database transactions, executing business logic. It has 16–128 high-frequency cores, each optimized to run one complex task very fast.

A GPU is designed for parallel, uniform work — doing the same simple math across thousands of data points simultaneously.

CPU: 128 powerful cores
     → great at complex, sequential tasks
     → on the order of 128 independent threads of work at once

GPU: 16,384 simpler cores
     → great at simple, repetitive tasks
     → on the order of 16,384 multiply-adds per clock cycle

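For dense matrix math, the GPU wins by one to two orders of magnitude. With Tensor Cores and reduced precision, a modern data center GPU can sustain roughly 100× or more the floating-point throughput of a server CPU on the same matrix workload; the exact ratio depends on the precision used and on how well the CPU's vector units are exploited.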

The Analogy¶

CPU = a team of 16 expert surgeons
      Each can do complex, multi-step procedures.
      Sequential, specialized.

GPU = a factory floor with 16,000 workers
      Each does one simple repetitive action.
      Parallel, coordinated.

Neural network training is a factory job, not surgery.

Part 2: Inside the GPU — The Architecture from Top to Bottom¶

Before diving into CUDA Cores and Tensor Cores, you need to understand the structural unit they live in: the Streaming Multiprocessor (SM).

The Streaming Multiprocessor (SM)¶

The SM is the fundamental building block of an NVIDIA GPU. Think of it as a mini-processor that contains multiple types of compute units, shared memory, and a scheduler.

┌─────────────────────────────────────────────────────┐
│              Streaming Multiprocessor (SM)           │
│                                                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────┐  │
│  │  CUDA    │  │  Tensor  │  │  Special Function│  │
│  │  Cores   │  │  Cores   │  │  Units (SFU)     │  │
│  │ (FP32)   │  │ (MMA)    │  │  (sin, cos, etc) │  │
│  └──────────┘  └──────────┘  └──────────────────┘  │
│                                                     │
│  ┌──────────────────────────────────────────────┐   │
│  │  Shared Memory / L1 Cache  (256 KB on H100)  │   │
│  └──────────────────────────────────────────────┘   │
│                                                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────┐  │
│  │  Warp    │  │  Register│  │  Load/Store      │  │
│  │ Scheduler│  │  File    │  │  Units (LSU)     │  │
│  └──────────┘  └──────────┘  └──────────────────┘  │
└─────────────────────────────────────────────────────┘

A full GPU is a grid of SMs connected to a shared L2 cache and VRAM via a high-bandwidth memory bus:

GPU Die:
┌─────────────────────────────────────────────┐
│  SM  SM  SM  SM  SM  SM  SM  SM  SM  SM     │
│  SM  SM  SM  SM  SM  SM  SM  SM  SM  SM     │
│           ...  (132 SMs on H100)            │
│                                             │
│  ████████████  L2 Cache  ████████████████  │
│                                             │
│  ════════════  Memory Bus (5120-bit)  ════  │
│                                             │
│  HBM3 HBM3 HBM3 HBM3 HBM3  (VRAM)         │
└─────────────────────────────────────────────┘

Key GPU Specs and What They Mean¶

Spec	What It Measures	Why It Matters
CUDA Core count	FP32 parallelism	General compute throughput
Tensor Core count	Matrix multiply speed	AI training/inference throughput
VRAM capacity (GB)	How much data fits in GPU memory	Model size limit
Memory bandwidth (TB/s)	How fast data moves to cores	Speed of memory-bound ops
TFLOPs	Trillion floating-point ops/sec	Peak compute performance
TDP (Watts)	Power draw	Cooling and infrastructure cost

Part 3: CUDA Cores — The GPU's General-Purpose Workhorses¶

What Is a CUDA Core?¶

A CUDA Core is a single floating-point arithmetic unit — specifically, an FP32 (32-bit float) multiply-add unit. Each CUDA Core can perform one FP32 multiply-add per clock cycle.

One CUDA Core, one clock cycle:
  Input:   a = 3.14,  b = 2.71,  c = 1.0
  Output:  a × b + c = 9.51

An H100 SXM GPU has 16,896 CUDA Cores. At a clock of roughly 1.8 GHz:

16,896 cores × 1.8 GHz × 2 ops/cycle (FMA = multiply + add)
= ~61 TFLOPS (FP32); NVIDIA's datasheet quotes 67 TFLOPS at the higher boost clock
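The same arithmetic works for any GPU. A minimal sketch, assuming peak throughput is simply cores × clock × 2 (one fused multiply-add per core per cycle); the function name is made up for illustration, and the RTX 4090 inputs (16,384 cores, 2.52 GHz boost) are its published specs.

```python
def peak_fp32_tflops(cuda_cores: int, clock_ghz: float, ops_per_cycle: int = 2) -> float:
    """Theoretical peak FP32 throughput in TFLOPS.

    ops_per_cycle=2 because one fused multiply-add counts as two FLOPs.
    """
    return cuda_cores * clock_ghz * 1e9 * ops_per_cycle / 1e12


print(f"H100 SXM @ 1.8 GHz:  {peak_fp32_tflops(16_896, 1.8):.1f} TFLOPS")   # 60.8 TFLOPS
print(f"RTX 4090 @ 2.52 GHz: {peak_fp32_tflops(16_384, 2.52):.1f} TFLOPS")  # 82.6 TFLOPS
```

These are ceilings: real kernels only approach them when every core receives a fused multiply-add on every cycle.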

How CUDA Cores Execute Work: Warps and SIMT¶

CUDA Cores don't execute threads one by one — they use a model called SIMT (Single Instruction, Multiple Threads). The scheduler groups 32 threads into a warp and executes them all with the same instruction simultaneously.

Warp (32 threads):
  Thread  0:  output[0]  = weight[0]  × input[0]
  Thread  1:  output[1]  = weight[1]  × input[1]
  Thread  2:  output[2]  = weight[2]  × input[2]
  ...
  Thread 31:  output[31] = weight[31] × input[31]

All 32 threads run the SAME instruction,
on DIFFERENT data, at the SAME time.

This is why branching code is bad on GPUs: if half the threads in a warp take an if branch and half take else, the GPU must execute both paths sequentially for the whole warp — called warp divergence.
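A toy model makes the cost visible. The sketch below is a simplification, not how the hardware scheduler is implemented: it assumes a warp needs one serialized pass per distinct branch outcome among its 32 threads.

```python
WARP_SIZE = 32


def warp_passes(branch_taken: list) -> int:
    """Serialized passes one warp needs for a single if/else.

    1 if every thread takes the same side, 2 if the warp diverges.
    """
    if len(branch_taken) != WARP_SIZE:
        raise ValueError("a warp has exactly 32 threads")
    return len(set(branch_taken))


uniform = [True] * WARP_SIZE                          # all threads take `if`
diverged = [i % 2 == 0 for i in range(WARP_SIZE)]     # even threads `if`, odd threads `else`

print(warp_passes(uniform))   # 1
print(warp_passes(diverged))  # 2
```

In the diverged case each pass runs with half the threads masked off, so the warp does the same useful work in twice the time.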

Occupancy: Keeping the Cores Busy¶

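Occupancy is the ratio of active warps to the maximum number of warps an SM can hold. With low occupancy the scheduler has too few warps to switch to while others wait on memory, so the cores sit idle. Raising occupancy is a common CUDA optimization technique, although higher occupancy does not always translate into higher performance.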

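# PyTorch example: checking the GPU and profiling a kernel
import torch
from torch.profiler import profile, ProfilerActivity

# Check CUDA availability
print(torch.cuda.is_available())        # True on a machine with a CUDA GPU
print(torch.cuda.get_device_name(0))    # e.g. NVIDIA H100 80GB HBM3

# Profile kernel time for a large matmul. The table reports per-kernel
# CUDA time, not occupancy; use Nsight Compute for occupancy metrics.
x = torch.randn(4096, 4096, device='cuda')
y = torch.randn(4096, 4096, device='cuda')
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    z = torch.matmul(x, y)
    torch.cuda.synchronize()            # wait for the kernel to finish

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))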
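Theoretical occupancy can also be estimated on paper from a kernel's resource usage. The sketch below is deliberately simplified: it assumes 64 resident warps and 65,536 registers per SM (the limits on recent data center parts), and it ignores shared memory, the per-SM block limit, and register allocation granularity.

```python
def theoretical_occupancy(
    threads_per_block: int,
    regs_per_thread: int,
    max_warps_per_sm: int = 64,   # assumed SM limit
    regs_per_sm: int = 65_536,    # assumed register file size
    warp_size: int = 32,
) -> float:
    """Active warps / maximum warps, limited by warps and registers only."""
    warps_per_block = -(-threads_per_block // warp_size)      # ceiling division
    regs_per_block = regs_per_thread * warp_size * warps_per_block
    blocks_by_regs = regs_per_sm // regs_per_block
    blocks_by_warps = max_warps_per_sm // warps_per_block
    active_warps = min(blocks_by_regs, blocks_by_warps) * warps_per_block
    return active_warps / max_warps_per_sm


print(theoretical_occupancy(256, 32))    # 1.0
print(theoretical_occupancy(256, 64))    # 0.5
print(theoretical_occupancy(256, 128))   # 0.25
```

Doubling the registers each thread uses halves the number of warps that fit, which is why register pressure is one of the first things to check when occupancy is low.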

Part 4: Tensor Cores — The AI Accelerator Inside the GPU¶

What Is a Tensor Core?¶

A Tensor Core is a specialized matrix multiply-accumulate (MMA) unit built specifically for AI workloads. While a CUDA Core handles one FP32 multiply-add per clock cycle, a Tensor Core multiplies and accumulates small matrix tiles, completing hundreds of multiply-adds per clock cycle.

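The operation a Tensor Core performs:

D = A × B + C

Where (using the 16×16×16 tile shape that CUDA's WMMA API exposes to a warp):
  A = 16×16 matrix  (input activations)
  B = 16×16 matrix  (weight matrix)
  C = 16×16 matrix  (accumulator / bias)
  D = 16×16 matrix  (output)

That's 16×16×16 = 4,096 multiply-adds in ONE warp-level instruction.
The hardware works through the tile over a few clock cycles:
a CUDA Core does 1 multiply-add per clock, a 4th-gen Tensor Core about 512 (dense FP16/BF16).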
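To make the 16×16×16 arithmetic concrete, here is a plain-Python sketch of the tile operation D = A × B + C that counts every multiply-add. It only illustrates the math; real Tensor Cores do this in hardware, and the constant-filled matrices are arbitrary test values.

```python
def mma_tile(A, B, C):
    """Compute D = A @ B + C for square tiles and count the multiply-adds."""
    n = len(A)
    D = [row[:] for row in C]          # start from the accumulator C
    fmas = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                D[i][j] += A[i][k] * B[k][j]   # one fused multiply-add
                fmas += 1
    return D, fmas


n = 16
A = [[1.0] * n for _ in range(n)]
B = [[2.0] * n for _ in range(n)]
C = [[0.5] * n for _ in range(n)]

D, fmas = mma_tile(A, B, C)
print(fmas)      # 4096
print(D[0][0])   # 32.5  (0.5 + 16 * 1.0 * 2.0)
```

The triple loop is exactly the work a scalar core would have to issue one instruction at a time.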

CUDA Core vs Tensor Core: The Real Difference¶

CUDA Core:
  1 multiply-add per clock
  Works on scalars
  Supports FP32, FP64, INT32

Tensor Core:
  ~512 multiply-adds per clock (4th gen, dense FP16/BF16)
  Works on matrices
  Supports FP16, BF16, TF32, FP8, INT8, INT4
  → ~16× throughput vs CUDA Cores on matrix workloads

This is why Tensor Core TFLOPs are the number that matters for AI, not CUDA Core TFLOPs:

GPU	CUDA Core TFLOPs (FP32)	Tensor Core TFLOPs (BF16, dense)
RTX 4090	82.6	165
A100 80GB	19.5	312
H100 SXM	67	989
H200 SXM	67	989
B200 SXM	~90	~2,250

Tensor Core Precision Formats¶

Each generation of Tensor Cores adds new precision formats. Lower precision = more operations per second = faster training/inference:

FP32  (32-bit float): Full precision. Used for optimizer states.
TF32  (19-bit float): NVIDIA's "free" speedup: same range as FP32,
                      less mantissa. Opt-in for matmuls in PyTorch;
                      on by default for cuDNN convolutions.
BF16  (16-bit float): Same exponent range as FP32, less precision.
                      Training standard in 2026.
FP16  (16-bit float): Smaller range than BF16. Needs loss scaling.
FP8   (8-bit float):  Used for inference and forward pass in 2026.
INT8  (8-bit int):    Quantized inference. ~2× speedup over FP16.
INT4  (4-bit int):    Ultra-compressed inference (GPTQ, AWQ).

Precision impact on throughput (H100, relative to FP32):
  FP32  →  1×   (baseline)
  TF32  →  8×
  BF16  →  16×
  FP8   →  32×
  INT8  →  32×

Why BF16 Beat FP16 for LLM Training¶

BF16 and FP16 both use 16 bits, but they split them differently:

FP32:  1 sign | 8 exponent | 23 mantissa
BF16:  1 sign | 8 exponent |  7 mantissa  ← same exponent range as FP32
FP16:  1 sign | 5 exponent | 10 mantissa  ← smaller range → overflow/underflow

BF16 keeps the wide dynamic range of FP32 (crucial for gradient stability)
while halving the memory footprint.
LLMs train in BF16 by default in 2026.
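The range and precision trade-off follows directly from the bit layout. A small sketch using the standard IEEE-style formulas (bias = 2^(e-1) - 1, with the top exponent code reserved for infinity and NaN); the helper name is made up for illustration.

```python
def float_format_stats(exp_bits: int, man_bits: int) -> dict:
    """Range and precision of an IEEE-style binary float format."""
    bias = 2 ** (exp_bits - 1) - 1
    return {
        "max": (2 - 2.0 ** -man_bits) * 2.0 ** bias,   # largest finite value
        "min_normal": 2.0 ** (1 - bias),               # smallest normal value
        "epsilon": 2.0 ** -man_bits,                   # spacing just above 1.0
    }


formats = {"FP32": (8, 23), "BF16": (8, 7), "FP16": (5, 10)}
for name, (e, m) in formats.items():
    s = float_format_stats(e, m)
    print(f"{name}: max={s['max']:.3e}  "
          f"min_normal={s['min_normal']:.3e}  eps={s['epsilon']:.3e}")
```

FP16 tops out at 65,504, so a large gradient or activation overflows to infinity, while BF16 reaches about 3.4 × 10^38 like FP32 at the cost of a coarser epsilon (2^-7 instead of 2^-10).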

Using Tensor Cores in PyTorch¶

import torch

# Opt in to TF32 for FP32 matmuls (off by default since PyTorch 1.12;
# cuDNN convolutions allow TF32 by default)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# BF16 mixed precision training: matmuls run on Tensor Cores.
# MyTransformer and dataloader are placeholders for your own model and data.
model = MyTransformer().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# BF16 doesn't need a GradScaler (same exponent range as FP32)
for batch in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):   # Tensor Cores kick in here
        loss = model(batch)
    loss.backward()
    optimizer.step()

# Recent cuBLAS versions use Tensor Cores for any shape, but they run most
# efficiently when matrix dimensions are multiples of 8 (FP16/BF16)
# or 16 (FP8/INT8), i.e. 16-byte aligned.
print(f"Input shape divisible by 16: {4096 % 16 == 0}")  # True

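The key rule: Tensor Cores run at full efficiency when matrix dimensions are multiples of 8 (BF16/FP16) or 16 (FP8/INT8), which corresponds to 16-byte alignment. Recent cuBLAS versions still use Tensor Cores for other shapes, only less efficiently, so pad your dimensions accordingly; larger multiples such as 64 are a safe choice.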
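Padding a dimension up to the next multiple is a one-liner. A minimal sketch; the 50,257 input is GPT-2's vocabulary size, used here only as a familiar example of an awkward dimension.

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple (ceiling division, then scale back)."""
    return -(-n // multiple) * multiple


print(pad_to_multiple(50_257, 8))    # 50264
print(pad_to_multiple(50_257, 64))   # 50304
print(pad_to_multiple(4_096, 16))    # 4096 (already aligned)
```

The extra rows or columns are usually filled with zeros (or unused vocabulary slots), so the result of the computation is unchanged.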

Part 5: VRAM — The Memory That Makes or Breaks Your Model¶

What Is VRAM?¶

VRAM (Video RAM) is the dedicated memory attached to the GPU, either as separate chips on the board (GDDR) or stacked inside the same package as the die (HBM). Unlike system RAM, which the GPU can only reach across the PCIe bus, VRAM is directly accessible at full memory bandwidth.

Everything the GPU works on must fit in VRAM:

- Model weights
- Activations (intermediate layer outputs)
- Gradients (during training)
- Optimizer states (Adam's m and v vectors)
- KV cache (during LLM inference)
- Input batch data

VRAM is the GPU's workspace.
If your model + data don't fit, training fails with:
  RuntimeError: CUDA out of memory.
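Of the items listed above, the KV cache is the one that grows with context length. A rough sizing sketch, assuming the cache stores one key and one value vector per layer, per KV head, per token. The example configuration (32 layers, 8 KV heads, head dimension 128) is assumed to match Llama 3.1 8B; check your model's config before relying on it.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch: int = 1,
    bytes_per_el: int = 2,   # BF16/FP16
) -> int:
    """Bytes held by the KV cache: 2 tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_el


per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
full_ctx = kv_cache_bytes(32, 8, 128, seq_len=8192)
print(f"{per_token:,} bytes per token")          # 131,072 bytes per token
print(f"{full_ctx / 1e9:.2f} GB at 8,192 tokens")  # 1.07 GB at 8,192 tokens
```

The cache scales linearly with both sequence length and batch size, so serving many long conversations at once can need more VRAM for the cache than for the weights.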

What Eats Your VRAM? (The Full Accounting)¶

For a model with P parameters, training with AdamW in BF16:

Weights:           P × 2 bytes  (BF16)
Gradients:         P × 2 bytes  (BF16)
Optimizer states:  P × 8 bytes  (Adam m + v in FP32 = 2 × 4 bytes each)
─────────────────────────────────────────────────────
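Total model state: P × 12 bytes
(Mixed-precision setups that also keep an FP32 master copy of the weights add P × 4 bytes, for 16 bytes per parameter.)

Plus activations (batch-size and sequence-length dependent). A rough lower bound that counts one saved tensor per layer:
Activations ≈ batch_size × seq_len × hidden_dim × num_layers × 2 bytes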

Practical examples:

Model	Parameters	Min VRAM (Inference BF16)	Min VRAM (Training BF16)
Llama 3.2 3B	3B	~6 GB	~36 GB
Llama 3.1 8B	8B	~16 GB	~96 GB
Llama 3.1 70B	70B	~140 GB	~840 GB
GPT-4 class	~1.8T	~3.6 TB	Not feasible single-node

# Estimate VRAM needed for the weights alone at inference time
def vram_estimate_inference(params_billions, dtype_bytes=2):
    """BF16 = 2 bytes, FP32 = 4 bytes, INT8 = 1 byte, INT4 = 0.5 bytes"""
    params = params_billions * 1e9
    return (params * dtype_bytes) / 1e9  # decimal GB, matching the table above

print(f"7B  BF16: {vram_estimate_inference(7):.1f} GB")    # 14.0 GB
print(f"13B BF16: {vram_estimate_inference(13):.1f} GB")   # 26.0 GB
print(f"70B INT4: {vram_estimate_inference(70, 0.5):.1f} GB")  # 35.0 GB
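The training column of the table can be reproduced the same way. This is a sketch of the 12-bytes-per-parameter accounting described above plus the rough activation formula; the function names and the example batch shape are illustrative assumptions, and real activation memory is usually several times larger than this lower bound.

```python
def training_state_gb(params_billions, weight_bytes=2, grad_bytes=2, optimizer_bytes=8):
    """Weights + gradients + AdamW states, in decimal GB."""
    return params_billions * (weight_bytes + grad_bytes + optimizer_bytes)


def activation_gb(batch_size, seq_len, hidden_dim, num_layers, bytes_per_el=2):
    """Rough activation estimate: one saved tensor per layer, in decimal GB."""
    return batch_size * seq_len * hidden_dim * num_layers * bytes_per_el / 1e9


for name, p in [("3B", 3), ("8B", 8), ("70B", 70)]:
    print(f"{name}: {training_state_gb(p)} GB of model state")
# 3B: 36 GB, 8B: 96 GB, 70B: 840 GB

# Hypothetical batch: 8 sequences of 4,096 tokens, hidden size 4,096, 32 layers
print(f"activations: {activation_gb(8, 4096, 4096, 32):.1f} GB")  # 8.6 GB
```

Passing `optimizer_bytes=12` models a setup that also keeps an FP32 master copy of the weights, which gives the commonly quoted 16 bytes per parameter.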

VRAM Types: GDDR6X vs HBM3¶

Not all VRAM is equal. Consumer GPUs use GDDR6X; data center GPUs use HBM (High Bandwidth Memory):

GDDR6X (RTX 4090):
  Capacity:  24 GB
  Bandwidth: 1,008 GB/s
  Location:  Separate chips on PCB, connected via 384-bit bus

HBM3 (H100 SXM):
  Capacity:  80 GB
  Bandwidth: 3,350 GB/s
  Location:  Stacked directly beside/on the GPU die (2.5D/3D packaging)

HBM is stacked in layers like a skyscraper, with thousands of tiny connections to the GPU die. This is why H100 bandwidth is 3.3× that of the RTX 4090 even though the two chips have a similar number of CUDA Cores (16,896 vs 16,384).
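Bandwidth is just bus width times per-pin data rate, which shows why a very wide bus wins even at a lower clock. A minimal sketch: the 21 Gbps GDDR6X pin rate for the RTX 4090 is its published spec, while the H100 pin rate is not quoted in this article and is derived here from the bandwidth and the 5120-bit bus.

```python
def memory_bandwidth_gb_s(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Peak memory bandwidth in GB/s (8 bits per byte)."""
    return bus_width_bits * gbps_per_pin / 8


# RTX 4090: narrow bus, very fast pins
print(memory_bandwidth_gb_s(384, 21.0))          # 1008.0

# H100 SXM: back out the pin rate implied by 3,350 GB/s on a 5120-bit bus
implied_pin_rate = 3350 * 8 / 5120
print(f"{implied_pin_rate:.2f} Gbps per pin")    # 5.23 Gbps per pin
```

HBM runs each pin about four times slower than GDDR6X but has more than thirteen times as many of them.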

Memory Bandwidth: The Hidden Bottleneck¶

For many AI workloads — especially inference and attention — the bottleneck isn't compute (Tensor Cores) but memory bandwidth: how fast you can move model weights from VRAM to the compute units.

Roofline model:
  Arithmetic Intensity (AI) = FLOPs ÷ bytes_accessed

  If your operation's AI is BELOW the hardware ridge point:
    → Memory-bound: bandwidth is the limit
  If your operation's AI is ABOVE the hardware ridge point:
    → Compute-bound: Tensor Cores are the limit

LLM inference (small batch):  Memory-bound
LLM training (large batch):   Compute-bound
Attention with long sequences: Memory-bound
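The roofline classification above can be checked with a few lines of arithmetic. This sketch uses the H100 figures quoted in this article (989 TFLOPS BF16, 3.35 TB/s) and assumes each matrix is read or written exactly once, which ignores caching; the shapes are illustrative.

```python
def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where compute and memory limits meet."""
    return peak_flops / bandwidth_bytes_per_s


def matmul_intensity(m: int, k: int, n: int, bytes_per_el: int = 2) -> float:
    """FLOPs per byte for (m x k) @ (k x n), each matrix touched once."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return flops / bytes_moved


h100 = ridge_point(989e12, 3.35e12)
decode = matmul_intensity(1, 4096, 4096)       # one token at a time
train = matmul_intensity(8192, 4096, 4096)     # 8,192 tokens per batch

for name, ai in [("decode (m=1)", decode), ("training (m=8192)", train)]:
    bound = "compute-bound" if ai > h100 else "memory-bound"
    print(f"{name}: {ai:.1f} FLOPs/byte vs ridge {h100:.0f} -> {bound}")
```

With a single token the weights are loaded once per multiply-add pair, so intensity sits near 1 FLOP/byte; batching thousands of tokens reuses each weight thousands of times and pushes the same layer well past the ridge point.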

This is why FlashAttention exists: it tiles the computation so the N×N score matrix never has to be written to or read back from VRAM, which moves attention from memory-bound toward compute-bound.

# FlashAttention in PyTorch 2.x (SDPA dispatches to a Flash kernel when it can)
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel  # PyTorch >= 2.3

# Shape: (batch, heads, seq_len, head_dim)
q = torch.randn(1, 32, 2048, 128, device='cuda', dtype=torch.bfloat16)
k = torch.randn(1, 32, 2048, 128, device='cuda', dtype=torch.bfloat16)
v = torch.randn(1, 32, 2048, 128, device='cuda', dtype=torch.bfloat16)

# SDPA picks a backend automatically; this context manager restricts it to
# FlashAttention and raises an error if that kernel can't handle the inputs
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    output = F.scaled_dot_product_attention(q, k, v)

# Result: same output up to rounding, typically faster and with much less VRAM
# (no N×N attention matrix stored)
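The memory saved is easy to quantify: a naive implementation materializes one N×N score matrix per head. A short sketch of that cost, using the same batch and head count as the tensors above; the 32,768-token case is an added illustration.

```python
def attention_scores_bytes(batch: int, heads: int, seq_len: int, bytes_per_el: int = 2) -> int:
    """Bytes needed to store the full (seq_len x seq_len) score matrix per head."""
    return batch * heads * seq_len * seq_len * bytes_per_el


for n in (2048, 32_768):
    gb = attention_scores_bytes(1, 32, n) / 1e9
    print(f"seq_len={n:>6}: {gb:.2f} GB for one layer's attention scores")
# seq_len=  2048: 0.27 GB
# seq_len= 32768: 68.72 GB
```

Because the cost grows with the square of the sequence length, a 16× longer context needs 256× the memory, which is why long-context models depend on kernels that never store this matrix.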

Part 6: GPU Generations and the 2026 Landscape¶

NVIDIA Architecture Timeline¶

Ampere  (2020) → A100:  312 TFLOPS BF16, 80GB HBM2e, 2TB/s bandwidth
                         The workhorse of 2021–2024 cloud AI
Hopper  (2022) → H100:  989 TFLOPS BF16, 80GB HBM3,  3.35TB/s bandwidth
                         NVLink 4.0, Transformer Engine (FP8 dynamic scaling)
                → H200:  989 TFLOPS BF16, 141GB HBM3e, 4.8TB/s bandwidth
                         Same compute as H100, ~1.8× the memory and ~1.4× the bandwidth
Blackwell(2024)→ B100:  1,800 TFLOPS BF16, 192GB HBM3e, 8TB/s bandwidth
                → B200:  ~2,250 TFLOPS BF16 (dense), 192GB HBM3e, 8TB/s bandwidth
                → GB200: two B200 GPUs + one Grace ARM CPU per superchip;
                         36 superchips make up one NVL72 rack

The Transformer Engine (H100 and Later)¶

Starting with Hopper, NVIDIA added the Transformer Engine: FP8-capable Tensor Cores paired with a software library that chooses FP8 or BF16 per layer during training and manages the FP8 scaling factors for you:

Transformer Engine:
  → Tracks per-tensor statistics (amax history) across iterations
  → Scales tensors into FP8 range where safe
  → Keeps precision-sensitive operations in BF16
  → Up to ~2× matmul throughput vs BF16-only training, typically with minimal accuracy loss

Used from PyTorch when:
  - GPU is H100 or later
  - Layers come from the transformer-engine library and run inside its FP8 autocast context

Consumer vs Data Center: What's the Difference?¶

Feature	RTX 4090 (Consumer)	H100 SXM (Data Center)
Price	~$1,600	~$30,000
VRAM	24 GB GDDR6X	80 GB HBM3
Bandwidth	1.0 TB/s	3.35 TB/s
TFLOPs (BF16, dense)	165	989
FP64 (for science)	~1.3 TFLOPS	34 TFLOPS
NVLink	❌	✅ (900 GB/s)
ECC memory	❌	✅
Form factor	PCIe card	SXM5 board
Best for	Hobbyist training, inference	Production LLM training

NVLink and Multi-GPU Scaling¶

When one GPU isn't enough, you need multiple GPUs to communicate. The interconnect matters enormously:

PCIe 4.0 (×16):   ~32 GB/s per direction (~64 GB/s bidirectional)
                   Consumer GPUs connected to the CPU and each other
                   via the CPU's PCIe lanes

NVLink 4.0:        900 GB/s bidirectional (per GPU, summed over all links)
                   Direct GPU-to-GPU high-speed link
                   ~14× faster than PCIe 4.0 ×16

NVSwitch (DGX H100):  All 8 GPUs connected all-to-all
                      Every GPU sees every other GPU at full NVLink speed
                      = 3.6 TB/s bisection bandwidth across the 8-GPU node

The practical impact on distributed training:

8× A100 via PCIe:    all-reduce bottleneck ≈ 32 GB/s per direction
                     → data parallelism scales poorly

8× H100 via NVLink:  all-reduce bottleneck ≈ 900 GB/s per GPU
                     → near-linear scaling in data parallelism
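A back-of-the-envelope estimate shows how large the gap is per training step. This is a bandwidth-only sketch of a ring all-reduce, in which each GPU transfers 2 × (n − 1) / n times the payload; it ignores latency and overlap with compute, and it plugs in the 32 GB/s and 900 GB/s figures from above as if they were the effective link speed.

```python
def ring_allreduce_seconds(payload_gb: float, n_gpus: int, link_gb_per_s: float) -> float:
    """Bandwidth-only time for one ring all-reduce across n_gpus."""
    traffic_gb = 2 * (n_gpus - 1) / n_gpus * payload_gb
    return traffic_gb / link_gb_per_s


grads_gb = 16.0  # BF16 gradients of an 8B-parameter model (8e9 params x 2 bytes)

for name, bw in [("PCIe, 32 GB/s", 32.0), ("NVLink, 900 GB/s", 900.0)]:
    ms = ring_allreduce_seconds(grads_gb, 8, bw) * 1000
    print(f"{name}: {ms:.0f} ms per all-reduce")
# PCIe, 32 GB/s: 875 ms per all-reduce
# NVLink, 900 GB/s: 31 ms per all-reduce
```

If a training step takes a few hundred milliseconds of compute, the PCIe case spends more time synchronizing gradients than computing them.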

Part 7: Practical Guide — Choosing the Right GPU¶

For Inference: Maximize Bandwidth-per-Dollar¶

Inference is typically memory-bound. You want the highest memory bandwidth you can afford at the VRAM capacity you need.

Rule of thumb (inference, BF16 weights):
  Model params (in billions) × 2 = minimum VRAM in GB for the weights alone;
  leave headroom for the KV cache and activations

  Llama 3.1 8B  → 16 GB of weights → 24 GB card (e.g. RTX 4090)
  Llama 3.1 70B → 140 GB of weights → 2× H100 80GB (160GB total)
  Llama 3.1 70B → ~35–40 GB in INT4 → 2× A100 40GB with GPTQ/AWQ
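Because single-stream decoding has to read every weight once per generated token, memory bandwidth sets a hard ceiling on tokens per second. A rough sketch of that ceiling, using bandwidth figures quoted earlier in this article; it ignores KV-cache reads and kernel overheads, so real throughput is lower.

```python
def decode_tokens_per_s_upper_bound(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Batch-size-1 ceiling: one full pass over the weights per token."""
    return bandwidth_gb_s / weight_gb


weights_gb = 16.0  # Llama 3.1 8B in BF16
print(decode_tokens_per_s_upper_bound(1008, weights_gb))   # 63.0   (RTX 4090)
print(decode_tokens_per_s_upper_bound(3350, weights_gb))   # 209.375 (H100 SXM)
```

Quantizing the weights to 4 bits shrinks `weight_gb` by 4× and raises this ceiling by the same factor, which is a large part of why quantized inference feels faster.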

# Run quantized inference to fit larger models
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit (NF4) quantization: fits 70B in roughly 40GB VRAM
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,   # Tensor Cores for compute
    bnb_4bit_quant_type="nf4",               # NormalFloat4: best accuracy
    bnb_4bit_use_double_quant=True,          # Extra compression
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    quantization_config=quant_config,
    device_map="auto",
)
# 70B model now fits in roughly 40GB VRAM, usually with only a small accuracy drop

For Training: Maximize TFLOPS and VRAM Capacity¶

Training is often compute-bound for large batches. Use BF16, keep the Tensor Cores busy, and use gradient checkpointing to trade compute for memory when activations don't fit.

import torch

# MyLargeTransformer and dataloader are placeholders. gradient_checkpointing_enable()
# exists on Hugging Face PreTrainedModel subclasses; for a plain nn.Module,
# wrap blocks with torch.utils.checkpoint.checkpoint instead.
model = MyLargeTransformer(num_layers=48).cuda()

# Gradient checkpointing: recompute activations on backward pass
# Trades roughly a third more compute for a large cut in activation memory
model.gradient_checkpointing_enable()

# Combine with BF16 mixed precision
optimizer = torch.optim.AdamW(model.parameters())

for batch in dataloader:
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss   # Hugging Face models return an output object
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

For Fine-Tuning: Use LoRA to Slash VRAM¶

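Full fine-tuning of a 7B model requires ~84GB VRAM for weights, gradients, and optimizer states alone (7B × 12 bytes). LoRA (Low-Rank Adaptation) freezes the base weights and trains small adapter matrices, which cuts this to roughly 16–24GB: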

from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,   # Half precision weights
    device_map="auto",
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # Rank: low-rank matrices
    lora_alpha=32,                 # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.1,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 6,815,744 || all params: 8,037,076,992 || trainable%: 0.0848
# Only about 0.08% of the weights are trained!
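The trainable-parameter count can be derived by hand: for each adapted weight of shape d_in × d_out, LoRA adds two matrices of shape d_in × r and r × d_out. The sketch below assumes the Llama 3.1 8B shapes (32 layers, q_proj 4096 → 4096, v_proj 4096 → 1024 because of grouped-query attention); verify them against the model config if you adapt this to another model.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (d_in x r), B is (r x d_out)."""
    return r * (d_in + d_out)


layers, r = 32, 16
q_proj = lora_params(4096, 4096, r)   # 131,072
v_proj = lora_params(4096, 1024, r)   # 81,920

total = layers * (q_proj + v_proj)
print(f"{total:,} trainable parameters")  # 6,815,744 trainable parameters
```

Gradients and optimizer states are only kept for these adapter weights, which is where most of the VRAM saving comes from.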

GPU Decision Matrix for 2026¶

Use Case	VRAM Needed	Recommended GPU	Approx Cost
Learning / experiments	8–16 GB	RTX 4070 Ti Super (16GB)	$800
7B–13B inference	16–24 GB	RTX 4090 (24GB)	$1,600
7B fine-tuning (LoRA)	24 GB	RTX 4090	$1,600
70B inference (quantized)	48 GB	2× RTX 4090	$3,200
7B full fine-tuning	80 GB	H100 (cloud)	~$3/hr
70B full fine-tuning	840 GB+	16× H100 (two 8-GPU nodes)	~$25/hr per node
LLM pre-training	TB-scale	H100/B200 cluster	Enterprise

Part 8: GPU Memory Management in Practice¶

Monitoring VRAM in Real Time¶

import torch

def vram_status():
    if not torch.cuda.is_available():
        return "No CUDA GPU available"
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved  = torch.cuda.memory_reserved()  / 1024**3
    total     = torch.cuda.get_device_properties(0).total_memory / 1024**3
    return (
        f"Allocated: {allocated:.2f} GB\n"
        f"Reserved:  {reserved:.2f} GB\n"
        f"Total:     {total:.2f} GB\n"
        f"Free:      {total - reserved:.2f} GB"
    )

print(vram_status())

# After a CUDA OOM, clear cache before retrying
torch.cuda.empty_cache()

Common CUDA OOM Patterns and Fixes¶

# ❌ Problem 1: Accumulating computation graph across batches
losses = []
for batch in dataloader:
    loss = model(batch)
    losses.append(loss)          # Keeps the entire graph in VRAM!
total = sum(losses)

# ✅ Fix: Detach the scalar value
losses.append(loss.item())       # .item() breaks the graph

# ❌ Problem 2: Not freeing VRAM between evaluation and training
with torch.no_grad():
    val_output = model(val_batch)
# val_output still lives in VRAM

# ✅ Fix: Delete explicitly or scope tightly
del val_output
torch.cuda.empty_cache()

# ❌ Problem 3: Full precision where half is fine
embeddings = model.embed(text)   # FP32 by default

# ✅ Fix: Move model to BF16 upfront
model = model.to(torch.bfloat16)

Part 9: The 2026 AI Hardware Landscape at a Glance¶

NVIDIA's Blackwell Generation¶

The B200 and GB200 NVL72 are the dominant training platforms in 2026:

B200 SXM (single GPU):
  Tensor Cores TFLOPs: ~2,250 BF16 / ~4,500 FP8 (dense; double with structured sparsity)
  VRAM: 192 GB HBM3e
  Bandwidth: 8 TB/s
  NVLink 5.0: 1.8 TB/s bidirectional

GB200 NVL72 (full rack):
  72× B200 GPUs + 36× Grace CPUs
  Total VRAM: 13.8 TB
  Aggregate NVLink bandwidth: 130 TB/s
  Used for: frontier-scale LLM pre-training

AMD MI300X — The Challenger¶

AMD's MI300X has made significant inroads in 2026:

MI300X:
  VRAM: 192 GB HBM3  ← most of any single-package GPU (tied with B200)
  Bandwidth: 5.3 TB/s
  TFLOPs: 1,307 BF16
  Best at: inference of very large models due to massive VRAM
  Software: ROCm 6.x, PyTorch 2.x full support

Cloud GPU Availability (2026)¶

Provider	Available GPUs	Best For
AWS	H100, H200, A100	Training at scale
Google Cloud	H100, A100, TPU v5	JAX/TensorFlow workloads
Azure	H100, ND B200 (2026)	Enterprise, OpenAI partnership
Lambda Labs	H100, A100	Cost-effective training
RunPod	H100, A100, RTX 4090	Budget inference / fine-tuning

Summary¶

GPUs dominate AI because neural networks are matrix math at massive scale, and GPUs are hardware-optimized for exactly that.

CUDA Cores are the general-purpose FP32 arithmetic units — thousands of them executing the same instruction across thousands of data points simultaneously via the SIMT model and warp execution. They handle all non-matrix operations and are the fallback when Tensor Cores can't be used.

Tensor Cores are the real AI accelerators inside the GPU: matrix multiply-accumulate units that complete hundreds of multiply-adds per clock cycle instead of one. They are the reason an H100 achieves 989 TFLOPS in BF16 rather than the 60–67 TFLOPS of its FP32 CUDA Cores. Libraries route BF16, FP16, TF32, and FP8 matmuls through them automatically, and they run most efficiently when matrix dimensions are multiples of 8 or 16, which is why aligning your model's dimensions matters.

VRAM is the workspace every GPU operation draws from. Capacity determines what model fits; bandwidth determines how fast computation flows. HBM (used in data center GPUs) achieves 3–8× the bandwidth of GDDR6X (consumer) by physically stacking memory beside the die. The single most important GPU metric for LLM inference is memory bandwidth; for training at large batch sizes it is Tensor Core TFLOPs.

In 2026, the stack has matured: BF16 is the training default, FP8 is standard for forward passes on H100+ hardware, LoRA makes fine-tuning accessible on a single consumer GPU, and FlashAttention removes the memory bottleneck from attention. Whether you're training from scratch on a B200 cluster or running a quantized 70B model on two RTX 4090s, the same principles apply — keep your Tensor Cores fed, keep your VRAM footprint minimal, and understand which dimension of the hardware you're hitting first.

Questions or discussion? Connect on LinkedIn, X or reach out via email.
