LLMs and the Transformer Architecture: A Beginner's Complete Guide

You've chatted with ChatGPT. You've asked Claude for help. You've seen GitHub Copilot finish your sentences. But have you ever wondered what is actually happening inside these systems? How does a computer — a machine that ultimately only understands 0s and 1s — produce text that reads like it was written by a thoughtful human?

This guide answers that question from the ground up. No PhD required. We'll start with an analogy a child could follow, then gradually build up to a precise technical understanding of the Transformer architecture that powers every major LLM today.

Part 1: What Is a Language Model?

The Autocomplete Intuition

You already understand the core idea of a language model — you use one every day on your phone.

When you type "I'm on my way" into a message, your keyboard suggests "home" or "there" as the next word. That's a tiny language model. It has learned, from millions of messages, which words tend to follow other words.

A Large Language Model (LLM) is this idea taken to an extreme:

Instead of a phone keyboard that looks at your last 3 words, an LLM looks at thousands of words of context
Instead of suggesting one next word, it can generate entire essays, write code, translate languages, and reason through complex problems
Instead of being trained on a few million messages, it's trained on hundreds of billions of words from the internet, books, and scientific papers

But the core idea is the same: predict what comes next.

Language as Probability

Here's the mathematical intuition. A language model learns to answer the question:

"Given everything I've read so far, what is the most likely next word?"

In math notation:

P(next word | all previous words)

Example:
P("sun"  | "The sky is blue and the")  → high probability
P("soup" | "The sky is blue and the")  → low probability
P("cat"  | "The sky is blue and the")  → medium probability

During generation, the model picks a word (or more precisely, a token) based on these probabilities, then adds it to the context, and repeats. One token at a time. That's how it "writes" — it's completing the sequence, over and over.
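The predict-append-repeat loop can be sketched with a toy model that only counts which word follows which. The corpus and the helper names (`next_word_probs`, `generate`) are made up for illustration; a real LLM replaces the counting table with a neural network and looks at far more than one previous word.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real model trains on billions of words.
corpus = "the sky is blue and the sun is bright and the sky is clear".split()

# Count how often each word follows each other word.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Estimate P(next word | previous word) from the counts."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def generate(start, steps):
    """Greedy generation: repeatedly append the most likely next word."""
    words = [start]
    for _ in range(steps):
        probs = next_word_probs(words[-1])
        if not probs:              # no known continuation
            break
        words.append(max(probs, key=probs.get))
    return " ".join(words)

print(next_word_probs("the"))
print(generate("the", 4))
```

Picking the single most likely word every time is the simplest strategy; sampling from the probabilities instead gives more varied text.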

What Makes a Language Model "Large"?

The word "Large" refers to the number of parameters — the numerical values the model learns during training.

Model	Year	Parameters	Context Window
GPT-2	2019	1.5 billion	1,024 tokens
GPT-3	2020	175 billion	2,048 tokens
LLaMA 2	2023	70 billion	4,096 tokens
GPT-4	2023	Undisclosed (~1.8 trillion is an unconfirmed estimate)	8,192–32,768 tokens (128,000 in GPT-4 Turbo)
Claude 3.5	2024	Unknown	200,000 tokens
Gemini 1.5 Pro	2024	Unknown	1,000,000 tokens

These parameters are like the model's "memory of everything it has read." They are adjusted during training so that the model gets better and better at predicting the next word.

Part 2: The Road to Transformers

Why We Needed Something Better

Language models existed long before Transformers. Understanding their limitations helps you appreciate why the Transformer was such a breakthrough.

Era 1: N-gram Models (1990s–2000s)

The earliest language models counted word sequences. An n-gram model looks at the previous n words:

Bigram (2-gram):  P(word | previous 1 word)
Trigram (3-gram): P(word | previous 2 words)
5-gram:           P(word | previous 4 words)

Problem: You can only look back 5–6 words. Real language has dependencies that span entire paragraphs.

"The bank where I deposited my salary last Tuesday is ____"
                                                       ↑
                                     needs to know "bank" = financial institution
                                     but "bank" is 9 words back, too far for a 5-gram

Era 2: Recurrent Neural Networks (2010s)

RNNs and their variants (LSTMs, GRUs) tried to solve this by maintaining a "hidden state" — a memory vector that carries information from earlier in the sequence.

Input:    "The"  "cat"  "sat"  "on"  "the"  "mat"
           ↓      ↓      ↓      ↓      ↓      ↓
RNN:      h₁ →  h₂  →  h₃  →  h₄  →  h₅  →  h₆  → output
          (hidden state passed left to right)

Problem 1: Vanishing Gradient — When you backpropagate through hundreds of steps, the learning signal shrinks to near zero. The model effectively forgets things from the beginning of long sequences.

Problem 2: Sequential, not parallel — Each step depends on the previous one. You cannot process "cat", "sat", "on" at the same time. This made training on large datasets brutally slow.
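The sequential bottleneck is easy to see in code. Below is a minimal sketch of a plain RNN forward pass in NumPy; the sizes and the random weights are arbitrary stand-ins for trained values, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size, seq_len = 4, 3, 6

# Randomly initialised weights stand in for trained ones.
W_xh = rng.normal(scale=0.5, size=(embed_size, hidden_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

inputs = rng.normal(size=(seq_len, embed_size))  # one vector per token
h = np.zeros(hidden_size)                        # initial hidden state

hidden_states = []
for x_t in inputs:
    # The new h depends on the previous h, so step t cannot start
    # until step t-1 has finished: the loop cannot be parallelised.
    h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
    hidden_states.append(h)

hidden_states = np.stack(hidden_states)
print(hidden_states.shape)
```

Everything the model knows about earlier tokens has to be squeezed into the single vector `h`, which is why information from far back in the sequence tends to fade.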

The 2017 Breakthrough

In 2017, a team at Google published a paper titled "Attention Is All You Need" — one of the most cited papers in all of computer science. The key insight:

You don't need to process words one at a time. You can look at all words simultaneously and let each word "attend to" (pay attention to) every other word. The relationships are explicit, not buried in a hidden state.

This architecture — the Transformer — solved both problems of the RNN era and made it possible to train models on vastly more data.

Part 3: The Transformer Architecture

The Big Picture

The Transformer is built around one central idea: the attention mechanism. Everything else is infrastructure that makes attention work well at scale.

Here is the complete Transformer architecture at a glance:

INPUT TEXT: "The cat sat on the mat"
      │
      ▼
┌─────────────────────┐
│    Tokenization     │  "The" → 464  "cat" → 5765  "sat" → 3290 ...
└──────────┬──────────┘
           │
      ▼
┌─────────────────────┐
│  Token Embeddings   │  Each token ID → 768-dimensional vector
│  +                  │
│  Positional         │  Add position information (word order)
│  Encoding           │
└──────────┬──────────┘
           │
      ▼
┌─────────────────────┐ ┐
│  Multi-Head         │ │
│  Self-Attention     │ │ × N layers
│                     │ │ (e.g., 12 in GPT-2 small,
│  Feed-Forward       │ │       96 in GPT-3)
│  Network            │ │
│                     │ │
│  Layer Norm         │ │
└──────────┬──────────┘ ┘
           │
      ▼
┌─────────────────────┐
│  Output Head        │  768-dim vector → 50,000-token vocabulary
│  (Linear + Softmax) │  Each token gets a probability
└──────────┬──────────┘
           │
      ▼
NEXT TOKEN PROBABILITY:  "the" 0.001, "into" 0.003, "." 0.82 ...

Let's understand each piece.

Step 1: Tokenization

Computers don't understand words. They understand numbers. Tokenization converts text into a sequence of integer IDs.

A token is not exactly a word — it's a chunk of text, often a word or part of a word:

"Transformer"     → ["Transform", "er"]       → [37485, 263]
"unbelievable"    → ["un", "believ", "able"]  → [912, 17482, 934]
"Hello"           → ["Hello"]                 → [15496]
"ChatGPT"         → ["Chat", "G", "PT"]       → [14055, 38, 2898]

Many modern LLMs use Byte-Pair Encoding (BPE) or a close variant, with vocabularies ranging from about 50,000 subwords (GPT-2) to 100,000 or more in newer models. Common words are single tokens; rare words are split into pieces.
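BPE builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols into one new symbol. The following is a toy sketch of that merge loop, not a production tokenizer: the word frequencies are invented, and the helpers `most_frequent_pair` and `merge_pair` are illustrative names.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical word frequencies, each word split into characters.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

After a few merges the frequent ending "est" has become a single symbol, which is how common word pieces end up as single tokens.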

# Example: Using the tiktoken library (OpenAI's tokenizer)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's tokenizer

text = "The Transformer is an amazing architecture!"
tokens = enc.encode(text)
print(tokens)
# [791, 42013, 374, 459, 8056, 18112, 0]

print(len(tokens))  # 7 tokens
print(f"~{len(text)/len(tokens):.1f} chars per token")  # 43 chars / 7 tokens ≈ 6.1 chars/token

Why does token count matter?

The context window (e.g., 128K tokens for GPT-4 Turbo) limits how much text you can give the model at once
You're billed per token on API usage
Longer sequences = more computation = slower responses

Step 2: Token Embeddings

Once we have token IDs, we convert each ID into a vector — a list of floating-point numbers.

Think of a vector as coordinates in space:

If words were points in 2D space (simplified):
                    ↑ (royalty dimension)
             King   ●                ● Queen
                    │
─────────────────────────────────────────→ (gender dimension)
            Man ●   │   ● Woman
                    │
            Dog ●

In reality, LLMs use 768 to 12,288 dimensions (not 2). Each dimension captures some abstract linguistic feature. No human defined these dimensions — they emerge from training.

# Conceptually, an embedding table looks like this:
# vocabulary_size = 50,257 tokens
# embedding_dim   = 768

embedding_table = {
    464:   [0.21, -0.53, 0.08, ..., 0.19],   # "The"    → 768 numbers
    5765:  [0.88,  0.12, -0.77, ..., 0.44],  # "cat"    → 768 numbers
    3290:  [0.15,  0.67,  0.23, ..., -0.31], # "sat"    → 768 numbers
    # ... 50,254 more rows
}

The model learns these numbers during training. Similar words end up with similar vectors (close together in the 768-dimensional space).
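"Close together" is usually measured with cosine similarity, the cosine of the angle between two vectors. Here is a small sketch with made-up 4-dimensional vectors; the numbers are invented for illustration and do not come from any real model.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings; real models learn hundreds of dimensions.
embeddings = {
    "king":  np.array([0.90, 0.80, 0.10, 0.30]),
    "queen": np.array([0.85, 0.75, 0.20, 0.35]),
    "dog":   np.array([0.10, 0.20, 0.90, 0.70]),
}

def cosine_similarity(a, b):
    """1.0 means same direction, 0.0 means unrelated (orthogonal)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine_similarity(embeddings["king"], embeddings["dog"]))    # much lower
```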

Step 3: Positional Encoding

Attention (which we'll cover next) has no built-in sense of order: on its own it treats the input as an unordered set of tokens, so it cannot tell that "cat" comes before "sat". We need to inject position information.

Positional encoding adds a unique signal to each token's vector based on its position:

Token "cat" at position 2:
  embedding vector:   [0.88,  0.12, -0.77, ...]
  positional signal:  [0.01,  0.84,  0.00, ...]
  combined:           [0.89,  0.96, -0.77, ...]

Token "cat" at position 5 (if it appeared again):
  embedding vector:   [0.88,  0.12, -0.77, ...]
  positional signal:  [0.48, -0.21,  0.32, ...]
  combined:           [1.36, -0.09, -0.45, ...]  ← different!

This way, the model can tell "cat at position 2" from "cat at position 5" even though they're the same word.
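One concrete way to produce such a positional signal is the sinusoidal scheme from the original Transformer paper, where each dimension is a sine or cosine wave of a different frequency. The sketch below assumes that scheme; the function name and the stand-in embedding for "cat" are illustrative.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings, shape (seq_len, d_model); d_model must be even."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positions(seq_len=6, d_model=8)
cat = np.full(8, 0.5)              # stand-in embedding for "cat"

cat_at_1 = cat + pe[1]
cat_at_4 = cat + pe[4]
print(np.allclose(cat_at_1, cat_at_4))  # same word, different position
```

The two combined vectors differ even though the word embedding is identical, which is what lets later layers distinguish the two occurrences.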

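The scheme shown above adds a position vector to each embedding, as in the original Transformer (GPT-2 and GPT-3 learn these position vectors during training). Many newer models, including LLaMA, instead use Rotary Positional Embeddings (RoPE), which rotate the query and key vectors inside attention by an angle that depends on position and tend to handle long sequences better. The idea is the same: inject position into the representation.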

Step 4: The Attention Mechanism ← The Core Innovation

This is the heart of the Transformer. Everything else is scaffolding.

The Intuition: Who Should I Listen To?

Imagine you're translating this sentence:

"The animal didn't cross the street because it was too tired."

What does "it" refer to? The animal or the street? A human reader immediately knows it's the animal, because animals get tired, streets don't. The model needs to make this same connection.

Attention lets each word ask: "Which other words in this sentence are most relevant to understanding me?"

For the word "it":

"animal" → HIGH attention (likely what "it" refers to)
"street" → LOW attention
"tired" → HIGH attention (describes the referent)
"cross" → MEDIUM attention (the action involving the referent)

The Query–Key–Value Mechanism

Attention uses three vectors for each token:

For each token, we compute three vectors:
  Q (Query):  "What am I looking for?"
  K (Key):    "What do I contain / advertise?"
  V (Value):  "What information do I actually pass along?"

Think of it like a search engine:

You type a search query:  Q = "best pizza in Bangkok"
                                    ↓
Website 1 has key:        K₁ = "pizza delivery Bangkok"    → relevance: 0.92
Website 2 has key:        K₂ = "Thai food recipes"         → relevance: 0.01
Website 3 has key:        K₃ = "restaurant Bangkok review" → relevance: 0.78
                                    ↓
Results pulled:           V₁ (most content) + V₃ (some content)

Attention Computation (Step by Step)

import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    """
    Q: queries, shape (seq_len, d_k)
    K: keys,    shape (seq_len, d_k)
    V: values,  shape (seq_len, d_v)
    d_k: dimension of keys/queries
    """
    d_k = Q.shape[-1]

    # Step 1: Compute similarity scores between every Q and every K
    # Shape: (seq_len, seq_len), "how relevant is each word to each other word?"
    scores = Q @ np.swapaxes(K, -2, -1)  # dot product of each query with each key

    # Step 2: Scale down. Without this, large d_k gives large scores, the
    # softmax saturates, and its gradients become very small.
    scores = scores / np.sqrt(d_k)

    # Step 3: Softmax turns each row of scores into probabilities (sum to 1)
    # This gives us the attention weights: "how much to attend to each word"
    attention_weights = softmax(scores, axis=-1)

    # Step 4: Weighted sum of values
    # Each token gets a mix of other tokens' values, weighted by attention
    output = attention_weights @ V

    return output, attention_weights

# Visualizing attention weights for "The animal didn't cross the street because it was tired"
# For the token "it" (simplified example):
attention_for_it = {
    "The":     0.01,
    "animal":  0.42,   # ← high: "it" likely refers to "animal"
    "didn't":  0.03,
    "cross":   0.08,
    "the":     0.01,
    "street":  0.06,
    "because": 0.04,
    "it":      0.12,
    "was":     0.05,
    "tired":   0.18,   # ← high: describes the referent's state
}

Visually, the attention matrix for a 6-token sequence looks like this:

Attention weights (each row sums to 1.0):

             The   cat   sat   on   the   mat
        The [0.35  0.15  0.10  0.05 0.25  0.10]
        cat [0.12  0.45  0.18  0.04 0.08  0.13]
        sat [0.05  0.28  0.38  0.12 0.06  0.11]
        on  [0.08  0.09  0.12  0.42 0.15  0.14]
        the [0.22  0.07  0.06  0.11 0.30  0.24]
        mat [0.06  0.14  0.10  0.13 0.28  0.29]

"cat" heavily attends to itself (0.45) but also to "sat" (0.18)
"mat" attends to "the" (0.28) — the article before it

Multi-Head Attention

One set of Q/K/V learns one type of relationship. But language has many simultaneously relevant relationships — grammatical, semantic, coreference, positional.

Multi-Head Attention runs the attention mechanism h times in parallel, each with its own learned Q/K/V projection matrices:

Input embeddings (768-dim)
        │
   ┌────┴────┐
   ▼         ▼         ...         ▼
 Head 1    Head 2               Head 12
(grammar) (meaning)           (position)
   │         │                    │
   ▼         ▼                    ▼
attention  attention           attention
output     output              output
   │         │                    │
   └────┬────┘
        │ concatenate all 12 heads
        ▼
  Linear projection back to 768-dim
        │
        ▼
  Combined representation (768-dim)

Each head "specializes" in different patterns — some learn syntactic relationships, others semantic ones. The model discovers these specializations automatically during training.

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0

        self.d_model = d_model      # e.g., 768
        self.num_heads = num_heads  # e.g., 12
        self.d_k = d_model // num_heads  # 64 per head

        # Learned projection matrices
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # output projection

    def split_heads(self, x: torch.Tensor) -> torch.Tensor:
        """Reshape (batch, seq, d_model) → (batch, heads, seq, d_k)"""
        batch, seq, _ = x.shape
        x = x.view(batch, seq, self.num_heads, self.d_k)
        return x.transpose(1, 2)

    def forward(self, x: torch.Tensor, mask=None) -> torch.Tensor:
        batch, seq, _ = x.shape

        # Project to Q, K, V and split into heads
        Q = self.split_heads(self.W_q(x))  # (batch, heads, seq, d_k)
        K = self.split_heads(self.W_k(x))
        V = self.split_heads(self.W_v(x))

        # Scaled dot-product attention (all heads in parallel)
        scores = (Q @ K.transpose(-2, -1)) / (self.d_k ** 0.5)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        weights = torch.softmax(scores, dim=-1)
        attended = weights @ V  # (batch, heads, seq, d_k)

        # Concatenate heads and project back
        attended = attended.transpose(1, 2).contiguous()
        attended = attended.view(batch, seq, self.d_model)
        return self.W_o(attended)

Step 5: The Feed-Forward Network

After attention, each token's representation passes through a feed-forward network — a simple two-layer neural network applied independently to each token position:

Input (768-dim)
      │
      ▼
Linear (768 → 3072)  ← expand to 4× width
      │
      ▼
GELU activation      ← non-linearity (like ReLU but smoother)
      │
      ▼
Linear (3072 → 768)  ← compress back
      │
      ▼
Output (768-dim)

The expansion-then-compression pattern lets the network "think" in a higher-dimensional space before synthesizing. This is where a lot of the model's factual knowledge is thought to be stored.

class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # 768 → 3072
            nn.GELU(),
            nn.Linear(d_ff, d_model),   # 3072 → 768
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

Step 6: Layer Normalization and Residual Connections

Two engineering tricks make deep Transformers trainable:

Residual connections (skip connections): Add the input directly to the output of each sublayer. This means gradients can flow freely during training without vanishing.

x → [Multi-Head Attention] → x'
↓                              ↓
└──────────────────────────── + ← ADD input x to output x'
                                ↓
                           Layer Norm
                                ↓
                           [Feed-Forward]
                                ↓ + (add x again)
                           Layer Norm

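Layer Normalization stabilizes training by normalizing the activations within each token's representation to have mean≈0 and std≈1. The diagram above shows the original post-norm ordering (normalize after the residual add). The code in Step 7 uses the pre-norm ordering (normalize before each sublayer), which many recent LLMs prefer because it tends to train more stably in deep stacks.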
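Both tricks fit in a few lines of NumPy. This is a simplified sketch: it omits the learned scale and shift parameters that real layer norm includes, and the "sublayer" is a toy function standing in for attention or the feed-forward network.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each row (one token's vector) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer_with_residual(x, sublayer):
    """Post-norm layout, as in the diagram above: LayerNorm(x + sublayer(x))."""
    return layer_norm(x + sublayer(x))

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 0.0, -10.0, 20.0]])   # two tokens on very different scales
out = sublayer_with_residual(x, lambda t: 0.1 * t)  # toy sublayer

print(out.mean(axis=-1))  # each close to 0
print(out.std(axis=-1))   # each close to 1
```

Note that the two input rows have very different magnitudes, yet after normalization both land on the same scale.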

Step 7: Putting It Together: The Transformer Block

One Transformer block combines all the above:

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask=None) -> torch.Tensor:
        # Sub-layer 1: Self-Attention with residual
        attended = self.attention(self.norm1(x), mask)  # Pre-norm variant
        x = x + self.dropout(attended)                  # Residual connection

        # Sub-layer 2: Feed-Forward with residual
        ff_out = self.feed_forward(self.norm2(x))
        x = x + self.dropout(ff_out)                    # Residual connection

        return x

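A complete LLM stacks many of these blocks on top of each other: 12 in GPT-2 small, 96 in GPT-3, and more than 100 in some recent models. Each layer refines the representation. As a rough tendency, early layers capture more surface-level features such as spelling and syntax, while later layers capture more abstract, semantic ones.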
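To see where a model's parameters come from, here is a back-of-the-envelope count for a GPT-2-small-sized model built from blocks like the one above. The configuration is an assumption for this sketch: 768 dimensions, 12 layers, a 50,257-token vocabulary, 1,024 learned positions, and an output head that reuses the token embedding weights.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def block_params(d_model, d_ff):
    attention = 4 * linear_params(d_model, d_model)    # W_q, W_k, W_v, W_o
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 2 * (2 * d_model)                    # scale and shift, twice
    return attention + feed_forward + layer_norms

d_model, d_ff, n_layers = 768, 3072, 12
vocab_size, max_positions = 50257, 1024

per_block = block_params(d_model, d_ff)
total = (
    n_layers * per_block
    + vocab_size * d_model       # token embeddings (assumed shared with the output head)
    + max_positions * d_model    # learned position embeddings
    + 2 * d_model                # final layer norm
)
print(f"{per_block:,} per block, {total:,} total")
```

Roughly two thirds of each block's parameters sit in the feed-forward network, and about a third of this small model is the embedding table.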

Encoder vs. Decoder vs. Encoder-Decoder

The original Transformer paper had two sides — an encoder and a decoder. Modern LLMs use different subsets:

┌──────────────────────────────────────────────────────────────────┐
│                      Architecture Types                           │
├─────────────────┬────────────────────┬───────────────────────────┤
│   Encoder-only  │  Decoder-only      │  Encoder-Decoder          │
│   (BERT family) │  (GPT family)      │  (T5, BART, original)     │
├─────────────────┼────────────────────┼───────────────────────────┤
│ Reads the whole │ Reads left-to-     │ Encoder reads full input; │
│ input at once   │ right; can only    │ Decoder generates output  │
│ (bidirectional) │ see past tokens    │ token by token            │
│                 │ (causal/unidirec.) │                           │
├─────────────────┼────────────────────┼───────────────────────────┤
│ Best for:       │ Best for:          │ Best for:                 │
│ Classification  │ Text generation    │ Translation               │
│ NER             │ Completion         │ Summarization             │
│ Question ans.   │ Chat               │ Question answering        │
│ Embeddings      │ Code generation    │                           │
├─────────────────┼────────────────────┼───────────────────────────┤
│ Examples:       │ Examples:          │ Examples:                 │
│ BERT, RoBERTa   │ GPT-2/3/4          │ T5, BART, mT5             │
│ DistilBERT      │ Claude, LLaMA      │                           │
│ ALBERT          │ Gemini, Mistral    │                           │
└─────────────────┴────────────────────┴───────────────────────────┘

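Open chat models such as LLaMA and Mistral are decoder-only Transformers, and GPT-4, Claude, and Gemini are widely understood to be as well, although their exact architectures are not public. They generate text by predicting one token at a time, appending each predicted token to the context, and repeating.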

Causal (decoder) masking prevents the model from "cheating" by looking ahead:

Tokens:   "The"  "cat"  "sat"  "on"  "the"  "mat"

When predicting "sat", the model can only see:
  ✓  "The"   (position 0)
  ✓  "cat"   (position 1)
  ✗  "on"    (future — masked)
  ✗  "the"   (future — masked)
  ✗  "mat"   (future — masked)

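Attention mask (row = the token doing the attending, 1 = visible). The prediction for "sat" is read from the "cat" row: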
      The  cat  sat  on   the  mat
The  [ 1    0    0    0    0    0 ]
cat  [ 1    1    0    0    0    0 ]
sat  [ 1    1    1    0    0    0 ]
on   [ 1    1    1    1    0    0 ]
the  [ 1    1    1    1    1    0 ]
mat  [ 1    1    1    1    1    1 ]
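The mask above is just a lower-triangular matrix. A minimal NumPy sketch of how it is built and applied follows; the attention scores here are random stand-ins, and masked positions are set to negative infinity so that softmax gives them zero weight.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 6
mask = np.tril(np.ones((seq_len, seq_len), dtype=int))  # 1 = visible, 0 = future

rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))   # stand-in attention scores

# Future positions get -inf, so exp(-inf) = 0 after softmax.
masked_scores = np.where(mask == 1, scores, -np.inf)
weights = softmax(masked_scores, axis=-1)

print(mask)
print(np.round(weights, 2))
```

Every row still sums to 1, but all weight above the diagonal is exactly zero, so no token can draw information from tokens that come after it.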

Part 4: How LLMs Are Trained

Training a large language model happens in three distinct phases. Each one builds on the last.

Phase 1: Pre-training        Phase 2: Fine-tuning      Phase 3: RLHF
─────────────────────        ────────────────────       ────────────────
Raw internet text        →   Instruction datasets  →   Human preferences
Billions of tokens           Millions of examples       Reward model
Self-supervised              Supervised learning        PPO algorithm
(no human labels)            (human-curated)            (human-aligned)

Result: Knows a lot          Result: Follows            Result: Helpful,
but can't follow             instructions               harmless, honest
instructions well

Phase 1: Pre-Training

Pre-training is the most expensive phase. It requires thousands of GPUs running for weeks to months and costs from millions to hundreds of millions of dollars, depending on model size.

The training objective is called Causal Language Modeling (for decoder models):

Given: "The cat sat on the"
Predict: "mat"

Given: "The cat sat on the mat"
Predict: "."

This is called "self-supervised" because the labels come from the text itself.
No human labeling is needed — the internet is the training data and the answer key.

Training data sources for modern LLMs:

Source	Examples	Rough % of pre-training data (varies widely by model)
Web crawl	CommonCrawl, C4	60–80%
Books	Project Gutenberg, Books1/2	5–15%
Wikipedia	100+ languages	2–5%
Code	GitHub, Stack Overflow	5–15%
Scientific papers	ArXiv, PubMed	2–5%
Curated datasets	OpenWebText, Pile	5–10%

What "learning" actually means:

During training, for every sequence in the training data, the model:

1. Predicts the next token
2. Compares the prediction to the actual next token
3. Computes the loss (how wrong it was)
4. Backpropagates: adjusts all parameters slightly to reduce the loss
5. Repeats billions of times
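The "how wrong it was" in step 3 is measured with cross-entropy: the negative log of the probability the model assigned to the token that actually came next. A small sketch with invented probabilities (the distribution and the helper name `token_loss` are illustrative):

```python
import math

# Hypothetical model output for the context "The cat sat on the".
predicted = {"mat": 0.60, "floor": 0.25, "roof": 0.10, "moon": 0.05}

def token_loss(probs, correct_token):
    """Cross-entropy for one prediction: -log(probability of the right answer)."""
    return -math.log(probs[correct_token])

print(round(token_loss(predicted, "mat"), 3))   # right answer got 60%: small loss
print(round(token_loss(predicted, "moon"), 3))  # right answer got 5%: large loss
```

The loss is zero only when the model puts 100% on the correct token, and it grows quickly as that probability shrinks, which is what pushes the parameters toward better predictions.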

# Conceptual training loop (simplified)
import torch
import torch.nn.functional as F

def pretrain_step(model, batch_tokens, optimizer):
    """
    batch_tokens: (batch_size, seq_len) integer token IDs
    """
    # Inputs: all tokens except the last
    inputs = batch_tokens[:, :-1]   # "The cat sat on the"

    # Targets: all tokens except the first (shifted by 1)
    targets = batch_tokens[:, 1:]   # "cat sat on the mat"

    # Forward pass: model predicts next token at each position
    logits = model(inputs)  # (batch, seq-1, vocab_size)
    vocab_size = logits.size(-1)

    # Cross-entropy loss: how wrong were our predictions?
    # reshape (not view) because the sliced targets are not contiguous in memory
    loss = F.cross_entropy(
        logits.reshape(-1, vocab_size),
        targets.reshape(-1)
    )

    # Backward pass: compute gradients
    optimizer.zero_grad()
    loss.backward()

    # Gradient clipping (prevents exploding gradients)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Update parameters
    optimizer.step()

    return loss.item()

The scale of pre-training:

GPT-3 training (2020):
  Parameters:       175 billion
  Training tokens:  300 billion
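  Hardware:         V100 GPU cluster (~10,000 GPUs is the reported cluster size; the number used was not published)
  Training time:    Not officially disclosed
  Estimated cost:   ~$4.6 million (third-party estimate of compute alone)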

LLaMA 3.1 (Meta, 2024):
  Parameters:       405 billion (largest variant)
  Training tokens:  15 trillion
  Hardware:         16,000+ H100 GPUs
  Estimated cost:   > $100 million

Phase 2: Supervised Fine-Tuning (SFT)

A pre-trained model is a next-token predictor. It will happily complete "Write a poem about death: " with a poem... but also complete "How do I make a bomb: " with instructions. It has no concept of being a helpful assistant.

Supervised Fine-Tuning teaches the model to follow instructions by training it on a dataset of (instruction, ideal response) pairs:

Training examples look like:

Example 1:
  Human: What is the capital of France?
  Assistant: The capital of France is Paris.

Example 2:
  Human: Write a Python function to reverse a string.
  Assistant: Here's a Python function that reverses a string:
             def reverse_string(s: str) -> str:
                 return s[::-1]

Example 3:
  Human: Summarize this article: [long article text]
  Assistant: [concise summary]

The model is trained to generate the "Assistant" part given the "Human" part. This is still supervised learning — but now the labels are human-written ideal responses.

Where does the SFT data come from?

Human annotators write high-quality responses (expensive)
GPT-4 / Claude generates responses that humans then filter (cheaper)
Open datasets: Alpaca, ShareGPT, OpenAssistant, Dolly

Fine-tuning a 7B model typically costs hundreds to thousands of dollars on rented GPUs — vastly cheaper than pre-training.

Phase 3: Reinforcement Learning from Human Feedback (RLHF)

SFT produces a model that follows instructions, but it may still produce responses that are technically correct but unhelpful, verbose, or subtly misleading. RLHF aligns the model more closely with human preferences.

RLHF has two main steps (3a and 3b); step 3c covers newer alternatives:

Step 3a: Train a Reward Model

Generate multiple responses to the same prompt. Have human raters rank them from best to worst. Train a separate neural network to predict human preference scores:

Prompt: "Explain quantum entanglement to a 10-year-old"

Response A: "Quantum entanglement occurs when particles become
             correlated in their quantum states such that..."
             → Human score: 3/10 (too technical)

Response B: "Imagine you have two magic coins that are best friends.
             Whenever you flip one and it lands heads, the other
             coin — even if it's on the other side of the world —
             always lands tails. That's quantum entanglement!"
             → Human score: 9/10 (perfect for audience)

Reward model learns: Response B >> Response A for this prompt
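One common way to turn such rankings into a training signal is a pairwise loss: the reward model is penalized whenever it scores the rejected response higher than the preferred one. The sketch below assumes that pairwise (Bradley-Terry style) formulation; the scores are hypothetical reward-model outputs.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise loss -log(sigmoid(chosen - rejected)); small when chosen scores higher."""
    margin = score_chosen - score_rejected
    return math.log(1.0 + math.exp(-margin))

# Hypothetical reward-model scores for Response B (preferred) and Response A.
good_ranking = preference_loss(score_chosen=2.0, score_rejected=-1.0)
bad_ranking = preference_loss(score_chosen=-1.0, score_rejected=2.0)
print(good_ranking, bad_ranking)
```

Only the difference between the two scores matters, so the reward model learns a relative ordering rather than an absolute 1-to-10 scale.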

Step 3b: Optimize the LLM with Reinforcement Learning

Use the reward model as a "grader" and use Proximal Policy Optimization (PPO) — a reinforcement learning algorithm — to fine-tune the LLM to produce responses that score highly:

For each prompt:
  1. LLM generates a response
  2. Reward model scores the response (e.g., 7.2/10)
  3. PPO updates the LLM parameters to increase future scores
  4. A "KL penalty" prevents the model from drifting too far from SFT
     (prevents reward hacking — generating nonsense with high scores)

Step 3c: Modern Variants: DPO and GRPO

RLHF with PPO is complex and unstable. Recent work has proposed simpler alternatives:

DPO (Direct Preference Optimization, 2023): Instead of training a separate reward model, directly optimize the LLM on preference data. Simpler, more stable, nearly as effective.
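The DPO objective for a single preference pair can be sketched as below. The inputs are log-probabilities that the model being trained (the "policy") and a frozen reference copy assign to the preferred and rejected responses; the numbers and the `beta` value are invented for illustration.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair; inputs are log-probabilities of whole responses."""
    chosen_gain = policy_chosen - ref_chosen        # how much more the policy likes the winner
    rejected_gain = policy_rejected - ref_rejected  # how much more it likes the loser
    margin = beta * (chosen_gain - rejected_gain)
    return math.log(1.0 + math.exp(-margin))        # equals -log(sigmoid(margin))

# Before training the policy equals the reference, so the margin is zero.
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After some training the policy favours the chosen response more strongly.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(start, improved)
```

The loss falls as the policy raises the preferred response's probability relative to the reference and lowers the rejected one's, with no separate reward model involved.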

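GRPO (Group Relative Policy Optimization, 2024): Introduced in DeepSeekMath and later used in DeepSeek-R1. It generates multiple responses per prompt and scores each one relative to the group average. This removes the separate value (critic) network that PPO needs. A reward signal is still required, but it can come from simple rule-based checks, such as whether a math answer is correct, rather than from a learned reward model.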
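The "relative quality" idea in GRPO can be sketched as normalizing each response's reward against the other responses sampled for the same prompt. The rewards below are hypothetical 0/1 correctness scores, and using the population standard deviation is a simplifying choice for this sketch.

```python
import statistics

def group_relative_advantages(rewards):
    """Score each response relative to the others sampled for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:                       # all responses scored the same: no signal
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Eight sampled answers to one prompt: 1.0 = correct, 0.0 = wrong.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

Responses that beat the group average get a positive advantage and are reinforced; those below average are discouraged.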

Part 5: Key Concepts Every LLM User Should Know

Context Window

The context window is the maximum number of tokens the model can "see" at once. Everything outside the context window is invisible to the model.

Context window = 8,192 tokens ≈ 6,000 words ≈ 20 pages

Everything within this window:
  Your system prompt    ← always included
  Conversation history  ← included until window fills up
  Your current message  ← included
  Model's response      ← generated here

Everything outside: model has no access to it

When the conversation exceeds the context window, older messages fall out — the model doesn't "remember" them unless summarized.
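A chat application has to decide what to drop when the window fills up. The sketch below shows one simple policy: always keep the system prompt, then keep as many of the newest messages as fit. The `count_tokens` helper is a crude stand-in (one token per word) for a real tokenizer, and the function names are illustrative.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def fit_to_window(system_prompt, history, max_tokens):
    """Keep the system prompt plus as many of the newest messages as fit."""
    budget = max_tokens - count_tokens(system_prompt)
    kept = []
    for message in reversed(history):      # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break                           # this and all older messages fall out
        kept.append(message)
        budget -= cost
    return [system_prompt] + kept[::-1]     # restore chronological order

history = [
    "user: tell me about transformers",
    "assistant: they are built around attention",
    "user: and what is attention",
]
window = fit_to_window("You are a helpful assistant.", history, max_tokens=16)
print(window)
```

A real application would also reserve part of the budget for the model's response, and might summarize the dropped messages instead of discarding them.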

Temperature and Sampling

When the model produces a probability distribution over the vocabulary, how do you pick the next token?

Temperature controls the randomness:

import torch

logits = torch.tensor([3.0, 1.0, 0.5, 0.1])   # raw model output

# Temperature = 0.0 (greedy: always pick the top token)
#   Deterministic, can get repetitive. Implemented as an argmax,
#   since dividing the logits by zero is undefined.
token_greedy = torch.argmax(logits)   # index 0

# Temperature = 1.0 (default: use probabilities as-is)
probs_1 = torch.softmax(logits / 1.0, dim=-1)
# ≈ [0.79, 0.11, 0.06, 0.04] → usually pick first, occasionally others

# Temperature = 0.3 (more focused, less creative)
probs_low = torch.softmax(logits / 0.3, dim=-1)
# ≈ [0.998, 0.001, 0.000, 0.000] → almost always first token

# Temperature = 1.5 (more creative, more random)
probs_high = torch.softmax(logits / 1.5, dim=-1)
# ≈ [0.63, 0.17, 0.12, 0.09] → more variety

Top-p (nucleus) sampling: Sort tokens by probability and keep the smallest set whose cumulative probability reaches p (for example, p = 0.9), then sample only from that set. This prevents the model from picking very unlikely garbage tokens.

Top-k sampling: Only consider the top k most likely tokens.
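Both filters can be sketched in NumPy as "zero out the tokens you do not want, then renormalize". The probability vector is invented, and the function names are illustrative rather than taken from any library.

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero the rest, renormalise."""
    keep = np.argsort(probs)[-k:]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]                    # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # number of tokens to keep
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.50, 0.30, 0.15, 0.04, 0.01])
print(top_k_filter(probs, k=2))    # only the two most likely tokens survive
print(top_p_filter(probs, p=0.9))  # three tokens are needed to reach 90%

# Sample the next token from the filtered distribution.
rng = np.random.default_rng(0)
next_token = rng.choice(len(probs), p=top_p_filter(probs, p=0.9))
```

Top-k always keeps a fixed number of candidates, while top-p adapts: it keeps few tokens when the model is confident and more when the distribution is flat.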

Tokens per Second

LLMs generate one token at a time. Generation speed is measured in tokens per second (t/s):

GPT-4 API:          ~50–100 t/s
Claude Sonnet:      ~80–120 t/s
Local LLaMA 3 8B:   ~30–50 t/s (on consumer GPU)
Local LLaMA 3 70B:  ~5–15 t/s (on consumer GPU)

At 80 t/s, generating 1,000 words (~1,300 tokens, at roughly 0.75 words per token) takes about 17 seconds.

Emergent Abilities

One of the strangest things about large language models is emergence — capabilities that appear suddenly as model size crosses certain thresholds, with no obvious precursor in smaller models.

Ability               Approximate scale where it emerges
──────────────────────────────────────────────────────
Arithmetic (3-digit)  ~13B parameters
Chain-of-thought      ~100B parameters
Instruction-following ~175B parameters
In-context learning   ~6B parameters
Code generation       ~12B parameters
Multi-step reasoning  ~540B parameters

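Nobody fully understands why this happens, and the thresholds above are rough figures that vary with the benchmark, training data, and prompting method. Some researchers also argue that part of the apparent suddenness comes from how performance is measured. It's one of the most active research areas in AI.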

Part 6: The LLM Landscape (2024–2026)

Major Models

Open Source:
  LLaMA 3.1 (Meta)     — 8B, 70B, 405B  — Open weights, custom community license
  Mistral/Mixtral       — 7B, 8x7B MoE   — Strong at its size
  Qwen 2.5 (Alibaba)   — 0.5B to 72B    — Excellent multilingual
  DeepSeek V3/R1        — 671B MoE       — Top reasoning performance
  Gemma 3 (Google)      — 1B to 27B      — Small, efficient

Closed Source (API):
  GPT-4o (OpenAI)      — Best overall general capability
  Claude 3.5/4 (Anthropic) — Best writing, reasoning, safety
  Gemini 2.0 (Google)  — Best multimodal, long context
  Grok (xAI)           — Fast, unfiltered

Specialized:
  Codestral (Mistral)  — Best for code
  DeepSeek-Coder       — Code, math
  Whisper (OpenAI)     — Speech to text
  DALL-E 3 / Flux      — Image generation

Model Size vs. Capability

Model Size  │ Use Case                        │ Hardware
────────────┼────────────────────────────────┼──────────────────────
1B–3B       │ Edge devices, simple tasks     │ Phone, Raspberry Pi
7B–13B      │ Local dev, most daily tasks    │ 8GB GPU (RTX 3070), 4-bit quantized
70B         │ Near-GPT-4 quality             │ 48GB GPU or 2×24GB, 4-bit quantized
405B+       │ Frontier capability            │ Multi-GPU cluster
API only    │ Best quality, no local setup   │ Cloud (pay per token)

Mixture of Experts (MoE)

Modern large models often use Mixture of Experts (MoE) — instead of using all parameters for every token, the model routes each token through only a subset of "expert" networks:

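Dense model (e.g., GPT-3):
  175B parameters, ALL used for every token
  → ~350 GFLOPs per token (about 2 FLOPs per parameter for a forward pass)

MoE model (e.g., Mixtral 8x7B):
  ~47B total parameters (less than 8 × 7B = 56B, because only the feed-forward experts are replicated; the attention layers are shared)
  2 of 8 experts active per token, so only ~13B parameters are used
  → ~26 GFLOPs per token, with quality competitive with much larger dense models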

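Benefit: More total capacity at lower compute cost per token. The trade-off is memory: all experts must still be loaded, even though only a few run for each token.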
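
The routing step described above can be sketched in a few lines. This is a toy illustration, not Mixtral's implementation: the router scores are hard-coded (in a real model they come from a learned layer applied to each token's hidden state), the "experts" are simple scaling functions, and the names route_top_k and moe_layer are made up for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = 0.0
    for idx, weight in route_top_k(router_logits, k):
        out += weight * experts[idx](x)  # the other experts never run
    return out


# Toy experts: expert i multiplies its input by (i + 1).
experts = [lambda x, s=s: s * x for s in range(1, 9)]        # 8 experts
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]   # made-up scores

chosen = route_top_k(router_logits)
print("Active experts:", [i for i, _ in chosen])
print("Layer output:", moe_layer(1.0, experts, router_logits))
```

Only two of the eight expert functions are called for this input, which is where the compute saving comes from.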

Part 7: Hands-On — Using an LLM with Python¶

Calling the API¶

from anthropic import Anthropic

# The client reads the API key from the ANTHROPIC_API_KEY environment variable
client = Anthropic()

# Simple completion
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Explain the attention mechanism in one paragraph."
        }
    ]
)

print(message.content[0].text)
print(f"\nTokens used: {message.usage.input_tokens} in, {message.usage.output_tokens} out")

Running a Local Model¶

# Using the transformers library (Hugging Face)
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Gated model: accept Meta's license on Hugging Face and log in before downloading
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,   # bfloat16 halves memory relative to float32
    device_map="auto"              # Place layers on available devices (requires the accelerate package)
)

# Prepare a chat message
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the Transformer architecture?"}
]

# Apply the chat template (converts messages to the model's expected format)
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )

# Decode
response = tokenizer.decode(
    output_ids[0][inputs.input_ids.shape[1]:],
    skip_special_tokens=True
)
print(response)
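
To make the top_p=0.9 setting in the generate call less opaque, here is a rough sketch of the idea behind top-p (nucleus) sampling: keep the smallest set of most likely tokens whose combined probability reaches top_p, renormalize, then sample from that set. It uses only the standard library and made-up probabilities, and it is not how the transformers library implements it internally. The function names are illustrative.

```python
import random


def top_p_filter(probs, top_p=0.9):
    """Keep the most likely tokens until their total probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}  # renormalize to sum to 1


def sample(probs, rng=random):
    """Draw one token according to the given probabilities."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Made-up next-token probabilities for "The sky is blue and the ..."
probs = {"sun": 0.55, "sky": 0.25, "cat": 0.15, "soup": 0.05}

nucleus = top_p_filter(probs, top_p=0.9)
print("Candidates kept:", list(nucleus))
print("Sampled token:", sample(nucleus))
```

With these numbers the unlikely "soup" is dropped before sampling, so it can never be chosen, while the remaining three tokens stay in play.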

Token Counting¶

import tiktoken

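def count_tokens(text: str, model: str = "gpt-4",
                 usd_per_1k_tokens: float = 0.03) -> dict:
    """Count tokens and estimate input cost using the given model's tokenizer."""
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    n = len(tokens)

    return {
        "token_count": n,
        "char_count": len(text),
        # Guard against division by zero for empty input
        "chars_per_token": len(text) / n if n else 0.0,
        # Default is the original GPT-4 input price; prices change, so check current rates
        "estimated_cost_usd": n / 1000 * usd_per_1k_tokens,
    }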

# Example
article = """
The Transformer architecture, introduced by Vaswani et al. in 2017,
revolutionized natural language processing by replacing recurrent networks
with a purely attention-based approach.
"""

stats = count_tokens(article)
print(f"Tokens: {stats['token_count']}")
print(f"Est. cost: ${stats['estimated_cost_usd']:.4f}")

Part 8: Mental Models and Common Misconceptions¶

What LLMs Actually Are (and Aren't)¶

Claim	Reality
"LLMs understand language like humans do"	LLMs compress statistical patterns from text. Whether that amounts to understanding is debated.
"LLMs think step by step internally"	They generate token by token. Chain-of-thought is in the output, not the process.
"LLMs have a fixed knowledge cutoff"	True. They don't know about events after their training data ends unless that information is supplied in the prompt.
"Bigger is always better"	A well-trained 7B model can beat a poorly trained 70B model.
"LLMs hallucinate because they lie"	They hallucinate because they're completing patterns, not retrieving facts.
"The model 'reads' your whole prompt at once"	True. Attention processes all input tokens in parallel; only the output is generated one token at a time.
"Temperature makes the model smarter"	Temperature controls randomness, not capability.

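The last row of the table is easy to check numerically. Temperature divides the model's raw scores (logits) before the softmax, which reshapes the probability distribution without changing which token ranks highest. A small sketch with made-up logits for three candidate tokens (the function name is illustrative):

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Low temperatures concentrate probability on the top token and high temperatures flatten the distribution, but the ranking of the tokens is the same in every case. The model's knowledge has not changed, only how adventurous the sampling is.
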
The Hallucination Problem¶

LLMs generate likely next tokens given their training data and the prompt. This means:

When asked something they don't know, they often generate plausible-sounding text rather than saying "I don't know"
Facts and patterns are stored the same way, so the model can't reliably distinguish what it knows from what merely sounds right
A confident tone in the output is not a reliable indicator of accuracy

Mitigation: Retrieval-Augmented Generation (RAG), which grounds the LLM's responses in retrieved documents, reduces hallucination significantly but doesn't eliminate it.
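
As a rough illustration of the RAG idea, the sketch below retrieves the most relevant document for a question and places it in the prompt. Everything here is a simplifying assumption: real systems typically retrieve with embedding similarity rather than word overlap, and the function names, prompt wording, and sample documents are invented for this example. The resulting prompt would then be sent to an LLM.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words, dropping punctuation."""
    return {w.strip(".,?!").lower() for w in text.split()}


def retrieve(query, documents, k=1):
    """Rank documents by how many words they share with the query (toy retriever)."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Assemble a prompt that asks the model to answer from retrieved context only."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


docs = [
    "The Transformer architecture was introduced by Vaswani et al. in 2017.",
    "Mixture of Experts routes each token through a subset of expert networks.",
    "Temperature controls randomness in sampling.",
]

prompt = build_prompt("Who introduced the Transformer architecture?", docs)
print(prompt)
```

Because the answer now sits in the prompt, the model can copy it from the context instead of reconstructing it from patterns, which is why grounding helps. It does not eliminate hallucination, since the model can still misread or ignore the context.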

Summary: How an LLM is Built and Trained¶

┌──────────────────────────────────────────────────────────────────┐
│                  LLM: From Idea to Conversation                   │
├──────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ARCHITECTURE                                                      │
│  Input text → Tokenize → Embed → Positional Encoding              │
│      → [Transformer Block × N]                                    │
│          ├── Multi-Head Self-Attention (WHO to attend to)         │
│          ├── Feed-Forward Network (WHAT to transform it to)       │
│          └── Layer Norm + Residual Connections (stability)        │
│      → Linear + Softmax → Next token probability                  │
│                                                                    │
│  TRAINING — THREE PHASES                                          │
│  Phase 1: Pre-training                                            │
│    Data: Trillions of tokens from the internet                    │
│    Task: Predict next token (self-supervised)                     │
│    Cost: $10M–$100M+, months of GPU time                         │
│    Result: Knows language, facts, reasoning patterns              │
│                                                                    │
│  Phase 2: Supervised Fine-Tuning (SFT)                           │
│    Data: Human-written (instruction, response) pairs              │
│    Task: Learn to follow instructions                             │
│    Cost: $1K–$100K                                                │
│    Result: Follows instructions, answers questions                │
│                                                                    │
│  Phase 3: RLHF / DPO                                             │
│    Data: Human preferences (which response is better?)            │
│    Task: Maximize reward (human approval)                         │
│    Cost: $10K–$1M                                                 │
│    Result: Helpful, harmless, honest responses                    │
│                                                                    │
└──────────────────────────────────────────────────────────────────┘

The Transformer architecture is remarkably simple at its core — embed tokens, let them attend to each other, transform them through feed-forward layers, predict what comes next. The magic emerges not from the architecture's complexity, but from its scale, the quality of training data, and the ingenuity of the training process.

Every major language model you use today — GPT-4, Claude, Gemini, LLaMA — is a variant of the same architecture introduced in that 2017 paper. The field has advanced enormously through better training techniques, larger datasets, and architectural refinements, but the fundamental insight of "attention is all you need" has stood the test of time.
