Master Generative AI, Part 2: Working with LLMs

Part 2 of the Master Generative AI: A Step-by-Step Challenge series.

Series Map:

Part 1 → Foundation of AI & ML
Part 2 → Working with LLMs ← you are here
Part 3 → Advanced Generative AI
Part 4 → Practical Applications
Part 5 → Career & Capstone Projects

In Part 1 you built the conceptual foundation. Now we get our hands dirty. This part is where theory becomes practice — you'll write code that tokenizes text, queries embeddings, builds a RAG pipeline, and ships your first working chatbot.

Chapter 1: Tokenization & Embeddings

Tokenization: Breaking Text into Pieces

LLMs don't see characters or words. They see tokens: sub-word chunks drawn from a fixed vocabulary, typically tens of thousands to a few hundred thousand entries (GPT-2 has 50,257). Byte-Pair Encoding (BPE) learns this vocabulary by starting from single characters or bytes and repeatedly merging the most frequent adjacent pair into a new symbol.

from transformers import AutoTokenizer

# Load the GPT-2 tokenizer (byte-level BPE; newer OpenAI models use the same approach with larger vocabularies)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization is the first step in NLP!"
ids = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(ids)

print(f"Token IDs:  {ids}")
# [29303, 1634, 318, 262, 717, 2239, 287, 399, 19930, 0]

print(f"Tokens:     {tokens}")
# ['Token', 'ization', 'Ġis', 'Ġthe', 'Ġfirst', 'Ġstep', 'Ġin', 'ĠN', 'LP', '!']
# Ġ marks a leading space; 10 tokens, one per ID above ("NLP" splits into two pieces)

# Decode back to text
decoded = tokenizer.decode(ids)
print(f"Decoded:    {decoded}")  # Tokenization is the first step in NLP!

# Token count matters for API cost and context limits
print(f"Token count: {len(ids)}")  # 10 tokens for 38 characters
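
The merge procedure behind BPE can be illustrated with a toy trainer. This is a minimal sketch, not how production tokenizers are implemented: it works on characters rather than bytes, skips pre-tokenization, and the names `learn_bpe_merges` and `corpus` are made up for this example.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, weighted by frequency
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair (ties go to the pair seen first)
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

corpus = ["low", "low", "lower", "lowest", "low"]
merges, vocab = learn_bpe_merges(corpus, num_merges=2)
print(merges)       # [('l', 'o'), ('lo', 'w')]
print(dict(vocab))  # {('low',): 3, ('low', 'e', 'r'): 1, ('low', 'e', 's', 't'): 1}
```

After two merges the frequent stem "low" has become a single symbol, while the rarer suffixes are still split into characters. That is the same effect that makes common words cheap and rare words expensive in token terms.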

Why tokenization rules matter:

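# Similar strings can tokenize very differently.
# Counts in the comments are approximate; run the loop for exact values with your tokenizer.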
examples = [
    "Hello",          # 1 token
    "hello",          # 1 token  
    "HELLO",          # 2 tokens  ← case changes it!
    "Generative",     # 1 token
    "generativeAI",   # 2 tokens  ← no space → merged differently
    "Bangkok",        # 2 tokens  ← less common → split
    "กรุงเทพ",        # 9 tokens  ← Thai needs many more tokens!
]

for ex in examples:
    count = len(tokenizer.encode(ex))
    print(f"'{ex}': {count} token(s)")

Practical Rule

Non-English languages (especially Thai, Chinese, Arabic) use 2–5× more tokens than equivalent English text. This inflates API costs and fills context windows faster. Always test tokenization for your target language.

Embeddings: Words as Vectors in Space

An embedding converts a token (or sentence or document) into a dense numerical vector where semantic similarity = geometric proximity.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A feline rested on a rug.",    # semantically similar to above
    "The stock market crashed today.",  # unrelated
]

embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")  # (3, 384) — 384 dimensions

# Cosine similarity ranges from -1 to 1: close to 1.0 = similar meaning, near 0.0 = unrelated
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim_1_2 = cosine_similarity(embeddings[0], embeddings[1])
sim_1_3 = cosine_similarity(embeddings[0], embeddings[2])

print(f"Similarity (cat/feline): {sim_1_2:.3f}")   # ~0.87 — very similar!
print(f"Similarity (cat/stocks): {sim_1_3:.3f}")   # ~0.08 — unrelated
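
To make the geometry concrete, here is a small dependency-free sketch of the same formula on hand-picked 2-D vectors. The vectors are invented for illustration and are not real embeddings; the function name `cosine` is also just for this example.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

parallel = cosine([1.0, 2.0], [2.0, 4.0])     # same direction, different length
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])   # at right angles
opposite = cosine([1.0, 2.0], [-1.0, -2.0])   # pointing the other way

print(f"{parallel:.2f} {orthogonal:.2f} {opposite:.2f}")  # 1.00 0.00 -1.00
```

Note that vector length does not matter, only direction: scaling a vector by 2 leaves the similarity at 1.0.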

Types of embeddings:

Type	Scope	Example Models	Use Case
Token	One token	GPT internal	Internal model computation
Word	One word	Word2Vec, GloVe	Classic NLP
Sentence	One sentence	all-MiniLM, BGE	Semantic search
Document	Full document	Longformer	Long document retrieval
Image	Image	CLIP, DINOv2	Cross-modal search

Vector Databases: Searching by Meaning

Embeddings enable semantic search — find documents by meaning, not keywords:

from sentence_transformers import SentenceTransformer
import numpy as np

# In production, use ChromaDB, Qdrant, Pinecone, or pgvector
# Here we simulate with numpy

model = SentenceTransformer("all-MiniLM-L6-v2")

# Knowledge base
documents = [
    "vLLM uses PagedAttention for efficient GPU memory management.",
    "The Transformer architecture was introduced by Vaswani et al. in 2017.",
    "Fine-tuning adapts a pre-trained model to a specific task.",
    "RAG combines retrieval with generation for grounded answers.",
    "Python is the dominant language for AI and machine learning.",
]

# Index: embed all documents (unit-length vectors, so dot product = cosine similarity)
doc_embeddings = model.encode(documents, normalize_embeddings=True)

# Query
query = "How does vLLM manage memory?"
query_embedding = model.encode([query], normalize_embeddings=True)

# Find most similar documents
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()
ranked_idx = np.argsort(similarities)[::-1]

print("Top results:")
for i in ranked_idx[:3]:
    print(f"  [{similarities[i]:.3f}] {documents[i]}")
# [0.821] vLLM uses PagedAttention for efficient GPU memory management.
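
What a vector database does at its core can be sketched as a brute-force index. The class below is a hypothetical teaching aid (`TinyVectorIndex` is not a real library), and the 3-D vectors are hand-written stand-ins for model embeddings. Real systems add approximate nearest-neighbor search, persistence, and metadata filtering on top of this idea.

```python
import math
from dataclasses import dataclass, field

def _normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

@dataclass
class TinyVectorIndex:
    """Brute-force index: stores unit vectors and scores by dot product."""
    vectors: list = field(default_factory=list)
    texts: list = field(default_factory=list)

    def add(self, vector, text):
        self.vectors.append(_normalize(vector))
        self.texts.append(text)

    def search(self, query_vector, k=3):
        q = _normalize(query_vector)
        scored = [
            (sum(a * b for a, b in zip(q, v)), text)
            for v, text in zip(self.vectors, self.texts)
        ]
        # Highest similarity first
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

index = TinyVectorIndex()
index.add([0.9, 0.1, 0.0], "doc about GPU memory")
index.add([0.1, 0.9, 0.0], "doc about Transformers")
index.add([0.0, 0.1, 0.9], "doc about Python")

results = index.search([1.0, 0.2, 0.0], k=2)
for score, text in results:
    print(f"[{score:.3f}] {text}")
```

The query vector points mostly along the first axis, so the GPU-memory document ranks first. Scanning every vector is fine for a few thousand documents; dedicated vector stores exist because this linear scan stops scaling.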

Chapter 2: Transformers & Attention Mechanism

Intuition: Every Word Watches Every Other Word

The breakthrough of the Transformer is self-attention: each token can directly attend to every other token in the sequence, learning which ones matter for its meaning.

Sentence: "The animal didn't cross the street because it was too tired."

For the word "it" → attention weights (illustrative values):
  "The"      0.01  ←  low
  "animal"   0.52  ←  HIGH: "it" likely refers to "animal"
  "didn't"   0.02
  "cross"    0.08
  "the"      0.01
  "street"   0.05
  "because"  0.03
  "it"       0.10  ←  self-reference
  "was"      0.03
  "too"      0.02
  "tired"    0.13  ←  describes the subject

The Three Vectors: Q, K, V

For each token, the Transformer learns three projections:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: Query  — "What am I looking for?"
    K: Key    — "What does each token contain?"
    V: Value  — "What information should I pass along?"
    """
    d_k = Q.size(-1)

    # Step 1: Score — how relevant is each key to my query?
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)

    # Step 2: Mask future tokens (for causal / decoder models)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Step 3: Softmax → attention weights (sum to 1 per row)
    weights = F.softmax(scores, dim=-1)

    # Step 4: Weighted sum of values
    output = torch.matmul(weights, V)
    return output, weights
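
To see the four steps produce actual numbers without installing PyTorch, the same computation can be written with plain lists. This is a sketch for a single head with tiny made-up Q, K, V matrices. The `causal` flag is an illustrative simplification of the mask argument above.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention for one head, using nested lists."""
    d_k = len(Q[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        # Step 1: score the query against every key
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 2: hide future positions when causal
        if causal:
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        # Step 3: softmax turns scores into weights that sum to 1
        weights = softmax(scores)
        # Step 4: weighted sum of the value vectors
        out = [sum(weights[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three tokens with 2-dimensional vectors (made-up values)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out_full, w_full = attention(X, X, X)
out_causal, w_causal = attention(X, X, X, causal=True)

print([round(w, 3) for w in w_full[0]])  # first token attends to all three tokens
print(w_causal[0])                       # [1.0, 0.0, 0.0]: first token sees only itself
```

With the causal mask, the first token can only attend to itself, so its output is exactly its own value vector. Later tokens see progressively more of the sequence.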

Transformer Architecture Summary

Input Tokens → [Embedding + Positional Encoding]
    ↓
[Transformer Block] × N layers:
    ├── Multi-Head Self-Attention
    │     (12 or 32 or 96 heads in parallel)
    ├── Residual connection + Layer Norm
    ├── Feed-Forward Network (expand → GELU → compress)
    └── Residual connection + Layer Norm
    ↓
Linear projection → Vocabulary logits → Softmax → Token probability

# Using a pretrained Transformer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen2.5-0.5B"  # tiny model for demo
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

prompt = "The key to learning AI is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=40,
        temperature=0.7,
        do_sample=True
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
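
The `temperature` argument in the call above rescales the logits before the final softmax. A small sketch with invented logits for three candidate tokens shows the effect; the function name and values are assumptions for illustration only.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
probs_by_temp = {}
for t in (0.5, 1.0, 2.0):
    probs_by_temp[t] = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs_by_temp[t]])
```

Lower temperatures sharpen the distribution toward the top-scoring token, and higher temperatures flatten it, which is why low values read as more deterministic and high values as more varied.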

Chapter 3: Pretrained Models: GPT, BERT, LLaMA, Claude

The Pretrained Model Paradigm

Training from scratch costs millions of dollars and months of GPU time. The modern approach:

PRETRAIN once on huge data (expensive, done by big labs)
    ↓
RELEASE as a pretrained checkpoint (free or API)
    ↓
FINE-TUNE or PROMPT for your specific use case (cheap, done by you)

Model Architectures

GPT (Decoder-only) | BERT (Encoder-only) | LLaMA 3 (Open Source GPT) | Claude (via API)

GPT models generate text left-to-right. Given a prefix, predict what comes next.

Architecture: Decoder-only Transformer
Training:     Causal language modeling (predict next token)
Strength:     Text generation, conversation, code
Family:       GPT-2, GPT-3, GPT-4, GPT-4o (OpenAI)
              LLaMA 1/2/3 (Meta), Mistral, Qwen, Gemma

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("In 2026, generative AI", max_new_tokens=60,
                   temperature=0.8, do_sample=True)
print(result[0]["generated_text"])

BERT reads the full sentence at once (bidirectional). Best for understanding tasks.

Architecture: Encoder-only Transformer
Training:     Masked language modeling (predict [MASK] tokens)
Strength:     Classification, NER, question answering, embeddings
Family:       BERT, RoBERTa, DistilBERT, ALBERT, DeBERTa

from transformers import pipeline

# Named Entity Recognition
# aggregation_strategy="simple" merges sub-word pieces into whole entities
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
text = "Elon Musk founded SpaceX in Hawthorne, California."
entities = ner(text)
for e in entities:
    print(f"{e['word']} → {e['entity_group']} ({e['score']:.2f})")
# Expected along the lines of: Elon Musk → PER, SpaceX → ORG, Hawthorne → LOC, California → LOC

LLaMA 3 is Meta's open-weight alternative to GPT. Run it locally, fine-tune it, deploy it anywhere. The checkpoint is gated on Hugging Face, so request access before running the code.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful AI tutor."},
    {"role": "user", "content": "Explain embeddings in one paragraph."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False,
                                      add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # temperature only takes effect when sampling is enabled
    outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, do_sample=True)

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:],
                            skip_special_tokens=True)
print(response)

Anthropic's Claude — strong reasoning, safety, very long context.

import anthropic

client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=500,
    system="You are an expert AI tutor. Be concise and clear.",
    messages=[
        {"role": "user", "content": "What is the difference between BERT and GPT?"}
    ]
)
print(message.content[0].text)

Chapter 4: Fine-Tuning vs. Prompt Engineering

When to Choose What

Your task ──→ [Is a pretrained model good enough with the right prompt?]
                    │ YES → Prompt Engineering (cheaper, faster)
                    │ NO  → Fine-Tuning (more investment)
                    ↓
              [Do you have 100–10,000+ labeled examples?]
                    │ YES → Fine-tune
                    │ NO  → More prompt engineering or RAG

Prompt Engineering Techniques

Zero-Shot | Few-Shot | Chain-of-Thought (CoT) | System Prompt

Zero-shot: just describe the task, with no examples.

prompt = """Classify this review as POSITIVE, NEGATIVE, or NEUTRAL.
Review: "The food was decent but the service was slow."
Classification:"""

Few-shot: provide 2–5 examples in the prompt to demonstrate the pattern.

prompt = """Classify reviews as POSITIVE, NEGATIVE, or NEUTRAL.

Review: "Best pizza I've ever had!" → POSITIVE
Review: "Waited 45 minutes, food was cold." → NEGATIVE
Review: "It was okay, nothing special." → NEUTRAL

Review: "The staff was friendly but the room was small."
Classification:"""
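
In practice few-shot prompts are usually assembled from a list of labeled examples rather than typed by hand. A minimal sketch follows; the helper name `build_few_shot_prompt` and its arguments are made up for this example.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot classification prompt from (text, label) pairs."""
    lines = [instruction, ""]
    for review, label in examples:
        lines.append(f'Review: "{review}" → {label}')
    lines.append("")
    lines.append(f'Review: "{query}"')
    lines.append("Classification:")
    return "\n".join(lines)

examples = [
    ("Best pizza I've ever had!", "POSITIVE"),
    ("Waited 45 minutes, food was cold.", "NEGATIVE"),
    ("It was okay, nothing special.", "NEUTRAL"),
]
prompt = build_few_shot_prompt(
    "Classify reviews as POSITIVE, NEGATIVE, or NEUTRAL.",
    examples,
    "The staff was friendly but the room was small.",
)
print(prompt)
```

Keeping examples in a list makes it easy to swap them, shuffle their order, or select the ones most similar to the query, all of which can change few-shot accuracy.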

Chain-of-Thought (CoT): ask the model to reason step by step before answering.

prompt = """Solve this step by step.

A store has 140 apples. They sell 60% on Monday and 25% of the
remaining on Tuesday. How many are left?

Step 1: ..."""

# CoT often improves accuracy on multi-step math and reasoning tasks, especially with larger models

System prompt: use the system role to set persistent behavior.

messages = [
    {
        "role": "system",
        # Adjacent string literals are joined, so no stray indentation reaches the model
        "content": (
            "You are a senior Python engineer. "
            "Always provide code with type hints. "
            "Keep explanations under 3 sentences. "
            "If unsure, say so."
        ),
    },
    {"role": "user", "content": "How do I read a CSV file?"}
]

Fine-Tuning with LoRA (Practical)

Full fine-tuning updates all parameters — expensive. LoRA (Low-Rank Adaptation) adds small trainable matrices and keeps the original weights frozen:

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
from datasets import Dataset

# Load base model
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Apply LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # rank — higher = more capacity, more memory
    lora_alpha=32,      # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.1,
    bias="none"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints roughly: trainable params: ~2.2M || all params: ~1.5B || trainable%: ~0.14

# Your fine-tuning dataset (instruction → response pairs)
data = Dataset.from_dict({
    "text": [
        "### Instruction: Summarize in one sentence.\n### Input: {long text}\n### Response: {summary}",
        # ... more examples
    ]
})

# Train
trainer = SFTTrainer(
    model=model,
    train_dataset=data,
    args=TrainingArguments(
        output_dir="./lora-model",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        logging_steps=10,
    ),
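    # Note: recent TRL releases expect these two settings on trl.SFTConfig
    # (passed as args=) rather than on SFTTrainer; check your installed version.
    dataset_text_field="text",
    max_seq_length=512,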
)
trainer.train()
trainer.save_model("./my-finetuned-model")
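
The savings from LoRA come from simple arithmetic. For a frozen weight matrix of shape d_out × d_in, LoRA trains two small matrices, A (r × d_in) and B (d_out × r), and uses W + (lora_alpha / r) · B·A in the forward pass. The sketch below counts parameters for one hypothetical 2048 × 2048 projection. The layer size is an assumption chosen for round numbers, not taken from any specific model.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer."""
    # A has shape (r, d_in), B has shape (d_out, r)
    return r * d_in + d_out * r

def full_param_count(d_in, d_out):
    """Parameters in the original weight matrix (bias ignored)."""
    return d_in * d_out

d_in = d_out = 2048   # hypothetical layer size
r, lora_alpha = 16, 32

lora = lora_param_count(d_in, d_out, r)
full = full_param_count(d_in, d_out)
scaling = lora_alpha / r

print(lora, full, f"{lora / full:.2%}")  # 65536 4194304 1.56%
print(scaling)                           # 2.0
```

The whole-model percentage is much lower than this per-layer figure because only the layers listed in `target_modules` receive adapters. The embeddings, the feed-forward blocks, and the other attention projections stay frozen with no added parameters.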

Chapter 5: Retrieval-Augmented Generation (RAG)

The Problem RAG Solves

LLMs have a knowledge cutoff and can hallucinate facts. RAG grounds the model's responses in real, up-to-date documents:

Without RAG:
  User: "What is our refund policy?"
  LLM:  [makes something up — it doesn't know your policy]

With RAG:
  User: "What is our refund policy?"
  System: [search policy docs → find relevant passage]
          [pass passage to LLM as context]
  LLM:  "Based on our policy: refunds are accepted within 30 days..."
         ↑ grounded in real document

RAG Pipeline

Documents → Chunk → Embed → Store in Vector DB
                                    ↓
User query → Embed → Similarity search → Top-K chunks
                                    ↓
              [chunks + query] → LLM → Grounded answer

Building RAG from Scratch

from sentence_transformers import SentenceTransformer
import chromadb
from openai import OpenAI

# Step 1: Set up embedding model and vector store
embed_model = SentenceTransformer("all-MiniLM-L6-v2")
chroma_client = chromadb.Client()  # in-memory; data is lost when the process exits
# get_or_create_collection avoids an error if this code is re-run in the same session
collection = chroma_client.get_or_create_collection("knowledge_base")

# Step 2: Index your documents
documents = [
    "Our refund policy allows returns within 30 days with receipt.",
    "Premium members get free shipping on all orders over $25.",
    "Customer support is available Monday–Friday, 9am–6pm EST.",
    "We accept Visa, MasterCard, Amex, and PayPal.",
    "Products must be in original packaging for returns.",
]

embeddings = embed_model.encode(documents).tolist()
collection.add(
    ids=[str(i) for i in range(len(documents))],
    embeddings=embeddings,
    documents=documents
)
print(f"Indexed {len(documents)} documents")

# Step 3: Query function
def rag_query(question: str, top_k: int = 3) -> str:
    # Embed the question
    q_embedding = embed_model.encode([question]).tolist()

    # Retrieve most relevant documents
    results = collection.query(
        query_embeddings=q_embedding,
        n_results=top_k
    )
    context_docs = results["documents"][0]
    context = "\n".join(f"- {doc}" for doc in context_docs)

    # Generate answer with context (using OpenAI/local model)
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Answer questions using ONLY the provided context. "
                           "If the context doesn't contain the answer, say so."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]
    )
    return response.choices[0].message.content

# Test it
answer = rag_query("Can I return something after 3 weeks?")
print(answer)
# "Yes. Our policy allows returns within 30 days with a receipt."
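
The retrieve-then-prompt flow can also be exercised offline, with no embedding model or API key. The sketch below swaps embedding similarity for plain word overlap, which is a deliberately crude stand-in and not what the pipeline above does. The helper names `words`, `retrieve`, and `build_prompt` are invented for this example.

```python
import re

def words(text):
    """Lowercased set of alphanumeric words in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=2):
    """Rank documents by how many words they share with the question."""
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, context_docs):
    """Format retrieved passages and the question into one user message."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}"

documents = [
    "Our refund policy allows returns within 30 days with receipt.",
    "Premium members get free shipping on all orders over $25.",
    "We accept Visa, MasterCard, Amex, and PayPal.",
]
question = "What is the refund policy for returns?"

top = retrieve(question, documents, top_k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Word overlap fails as soon as the question uses different vocabulary from the document ("money back" vs. "refund"), which is exactly the gap embeddings are meant to close. The prompt-assembly step stays the same either way.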

RAG with LangChain (Production Pattern)

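# Import paths match LangChain 0.x; newer releases relocate some of these
# (for example, text splitters live in the langchain_text_splitters package).
from langchain_community.vectorstores import Chroma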
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

# Load documents
loader = PyPDFLoader("company_handbook.pdf")
docs = loader.load()

# Chunk documents (important: LLM context window is limited)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # measured in characters, roughly 125 tokens of English text
    chunk_overlap=50,     # characters shared between neighbors so context survives the cut
)
chunks = splitter.split_documents(docs)
print(f"Split into {len(chunks)} chunks")

# Embed and store
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# Build QA chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True
)

result = qa_chain.invoke("What is the vacation policy?")
print(result["result"])
print("\nSources used:")
for doc in result["source_documents"]:
    print(f"  - {doc.metadata['source']} page {doc.metadata.get('page', '?')}")
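
What `chunk_size` and `chunk_overlap` mean is easiest to see in a stripped-down chunker. The function below is a simplified sketch that cuts at fixed character offsets. It is not LangChain's algorithm, which additionally tries to split on paragraph and sentence separators first.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk begins with the last character of the previous one. With real documents the overlap is larger, so a sentence cut at a boundary still appears whole in at least one chunk.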

Chapter 6: Building Your First Chatbot with an LLM

The Core Pattern: Conversation History

The chat API is stateless: the model doesn't "remember" previous turns. You maintain the history and send it with every request:

from openai import OpenAI

client = OpenAI()

def chat():
    messages = [
        {"role": "system", "content": "You are a friendly AI tutor specializing in machine learning."}
    ]

    print("AI Tutor: Hello! Ask me anything about AI. Type 'quit' to exit.\n")

    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in ["quit", "exit", "q"]:
            print("AI Tutor: Goodbye! Keep learning!")
            break

        # Add user message to history
        messages.append({"role": "user", "content": user_input})

        # Get response
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            temperature=0.7,
            max_tokens=500,
        )

        assistant_message = response.choices[0].message.content
        messages.append({"role": "assistant", "content": assistant_message})

        print(f"\nAI Tutor: {assistant_message}\n")
        print(f"[Tokens used: {response.usage.total_tokens}]")

chat()

Adding Streaming for Better UX

Users hate waiting. Streaming sends tokens as they're generated:

from openai import OpenAI

client = OpenAI()

def streaming_chat():
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."}
    ]

    while True:
        user_input = input("\nYou: ")
        if user_input.lower() == "quit":
            break

        messages.append({"role": "user", "content": user_input})

        print("\nAssistant: ", end="", flush=True)

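        full_response = ""
        # Standard OpenAI streaming: stream=True yields chunks with incremental deltas
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            stream=True,
        )
        for chunk in stream:
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta.content
            if delta:  # some chunks carry no text (role or finish markers)
                print(delta, end="", flush=True)
                full_response += delta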

        print()  # newline
        messages.append({"role": "assistant", "content": full_response})

streaming_chat()

A Production-Ready Chatbot Class

from dataclasses import dataclass, field
from openai import OpenAI

@dataclass
class Chatbot:
    system_prompt: str
    model: str = "gpt-4o-mini"
    temperature: float = 0.7
    max_tokens: int = 1000
    max_history: int = 20          # prevent context overflow
    messages: list = field(default_factory=list)
    _client: OpenAI = field(default_factory=OpenAI, init=False, repr=False)

    def chat(self, user_message: str) -> str:
        self.messages.append({"role": "user", "content": user_message})

        # Trim to the most recent turns. The system prompt is stored separately,
        # so it is never trimmed. Note that this counts messages, not tokens.
        if len(self.messages) > self.max_history:
            self.messages = self.messages[-self.max_history:]

        response = self._client.chat.completions.create(
            model=self.model,
            messages=[{"role": "system", "content": self.system_prompt}] + self.messages,
            temperature=self.temperature,
            max_tokens=self.max_tokens,
        )

        assistant_reply = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": assistant_reply})
        return assistant_reply

    def reset(self):
        self.messages = []

# Usage
bot = Chatbot(
    system_prompt="You are a Python tutor. Explain concepts simply with short code examples.",
    temperature=0.5
)

print(bot.chat("What is a decorator in Python?"))
print(bot.chat("Can you show me a real-world example?"))  # remembers context!
print(bot.chat("How is it different from a wrapper function?"))
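
Context limits are measured in tokens, so a token budget is often a better trimming criterion than a message count. Here is a minimal sketch of that idea. The 4-characters-per-token heuristic is a rough assumption for English text; in real use you would pass a tokenizer-based counter as `count_tokens`. The function names are invented for this example.

```python
def approx_tokens(text):
    """Very rough token estimate (assumption: ~4 characters per token)."""
    return max(1, len(text) // 4)

def trim_to_budget(messages, max_tokens, count_tokens=approx_tokens):
    """Keep the most recent messages whose combined size fits the budget."""
    kept = []
    used = 0
    # Walk backwards from the newest message
    for message in reversed(messages):
        cost = count_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    # Drop leading assistant turns so the history starts with a user message
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept

history = [
    {"role": "user", "content": "a" * 40},        # ~10 tokens
    {"role": "assistant", "content": "b" * 80},   # ~20 tokens
    {"role": "user", "content": "c" * 40},        # ~10 tokens
    {"role": "assistant", "content": "d" * 40},   # ~10 tokens
    {"role": "user", "content": "e" * 20},        # ~5 tokens
]
trimmed = trim_to_budget(history, max_tokens=30)
print([m["content"][0] for m in trimmed])  # ['c', 'd', 'e']
```

The oldest two messages are dropped because including the 20-token assistant reply would exceed the budget of 30.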

Chapter 7: Evaluating LLM Outputs

The Evaluation Problem

Unlike classical ML (accuracy = 0.94), LLM outputs are open-ended text. There's no single number that captures quality.

Automatic Metrics

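# BLEU: n-gram overlap with a reference (good for translation)
from nltk.translate.bleu_score import sentence_bleu

reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]
# Default BLEU uses 1- to 4-grams. This pair shares no 4-gram, so the default
# score collapses to ~0; score up to trigrams instead.
bleu = sentence_bleu(reference, candidate, weights=(1/3, 1/3, 1/3))
print(f"BLEU-3: {bleu:.3f}")  # 0.500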

# ROUGE — recall-oriented (good for summarization)
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
reference = "The quick brown fox jumped over the lazy dog"
candidate = "The fast brown fox leaped over the sleepy dog"
scores = scorer.score(reference, candidate)
print(f"ROUGE-1: {scores['rouge1'].fmeasure:.3f}")  # word overlap
print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}")  # longest common subsequence
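
BLEU is built from clipped n-gram precisions, and computing them by hand for the cat/mat pair shows what the metric actually counts. This is a sketch of that one ingredient only. A full BLEU score also combines the precisions with a geometric mean and applies a brevity penalty.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(reference, candidate, n):
    """Return (matches, total): candidate n-grams found in the reference,
    with each n-gram credited at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matches, total

reference = ["the", "cat", "sat", "on", "the", "mat"]
candidate = ["the", "cat", "is", "on", "the", "mat"]

precisions = {}
for n in range(1, 5):
    precisions[n] = clipped_precision(reference, candidate, n)
    matches, total = precisions[n]
    print(f"{n}-gram precision: {matches}/{total}")
# 1-gram precision: 5/6
# 2-gram precision: 3/5
# 3-gram precision: 1/4
# 4-gram precision: 0/3
```

A single changed word in a six-word sentence removes every matching 4-gram, so an unsmoothed 4-gram BLEU for this pair is driven to zero. This is why sentence-level BLEU on short texts is usually computed with smoothing or with lower-order n-grams.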

LLM-as-Judge

Use a powerful LLM to evaluate another LLM's output:

from openai import OpenAI

client = OpenAI()

def llm_judge(question: str, answer: str, criteria: list[str]) -> dict:
    criteria_text = "\n".join(f"- {c}" for c in criteria)

    prompt = f"""Evaluate the following AI response on a scale of 1-5 for each criterion.

Question: {question}
AI Answer: {answer}

Criteria:
{criteria_text}

Respond ONLY with a JSON object like:
{{"accuracy": 4, "clarity": 5, "completeness": 3, "reasoning": "..."}}"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
    import json
    return json.loads(response.choices[0].message.content)

# Example
result = llm_judge(
    question="What is backpropagation?",
    answer="Backpropagation is an algorithm that calculates gradients by applying the chain rule backwards through a neural network, allowing us to update weights to reduce loss.",
    criteria=["accuracy", "clarity", "completeness"]
)
print(result)
# {"accuracy": 5, "clarity": 4, "completeness": 3, "reasoning": "Accurate but could mention learning rate"}
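
Judge models occasionally return missing keys or out-of-range scores, so it is worth validating the JSON before aggregating it. A minimal sketch follows; the function name `parse_judge_scores` and the 1 to 5 bounds are assumptions matching the prompt above.

```python
import json

def parse_judge_scores(raw, criteria, low=1, high=5):
    """Parse judge output and check every criterion has an integer score in range."""
    data = json.loads(raw)
    scores = {}
    for name in criteria:
        value = data.get(name)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not low <= value <= high:
            raise ValueError(f"invalid score for {name!r}: {value!r}")
        scores[name] = value
    return scores

raw = '{"accuracy": 5, "clarity": 4, "completeness": 3, "reasoning": "ok"}'
scores = parse_judge_scores(raw, ["accuracy", "clarity", "completeness"])
mean_score = sum(scores.values()) / len(scores)
print(scores, mean_score)  # {'accuracy': 5, 'clarity': 4, 'completeness': 3} 4.0
```

Rejecting malformed responses and re-asking the judge usually gives cleaner aggregate numbers than silently coercing bad values.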

Bias and Hallucination Detection

import json

def check_hallucination(question: str, answer: str, source_docs: list[str]) -> dict:
    """Check if the answer is grounded in the source documents."""
    context = "\n".join(source_docs)

    prompt = f"""You are a fact-checker. Determine if the AI answer is:
1. GROUNDED: Every claim is supported by the source documents
2. HALLUCINATED: Contains claims not in the source documents
3. PARTIAL: Some claims grounded, some not

Source documents:
{context}

Question: {question}
AI Answer: {answer}

Respond with JSON: {{"verdict": "...", "explanation": "...", "unsupported_claims": [...]}}"""

    # Same judge pattern as llm_judge above; reuses the module-level `client`
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

Common bias types to test for:

Bias	Description	Test Example
Demographic	Different quality by gender, ethnicity	"Evaluate this résumé" (vary names)
Recency	Over-weights recent information	Ask about historical facts
Sycophancy	Agrees with user's incorrect premise	State wrong fact, see if model corrects
Verbosity	Longer answers rated higher regardless of quality	Compare judge scores for a brief and a padded version of the same answer

Summary

Topic	Key Takeaway
Tokenization	Text → token IDs; non-English = more tokens; count matters for cost
Embeddings	Semantic similarity in vector space; foundation of search and RAG
Transformers	Self-attention = each token watches all others; Q/K/V mechanism
Pretrained models	GPT → generation; BERT → understanding; LLaMA → open source
Prompt engineering	Zero-shot → few-shot → CoT; free and instant to try
Fine-tuning	LoRA trains adapter weights worth ~0.1% of the model's params; use when prompting isn't enough
RAG	Chunk → embed → retrieve → generate; reduces hallucination on domain knowledge
Chatbot	Maintain conversation history; trim when context fills; stream for UX
Evaluation	BLEU/ROUGE for overlap; LLM-as-judge for quality; test for bias/hallucination

Next → Part 3: Advanced Generative AI — GANs, diffusion models, multimodal systems, and responsible AI.

Practice Challenge

Build a RAG chatbot over a PDF of your choice (company handbook, research paper, product docs):

Load and chunk the PDF with LangChain
Embed chunks into ChromaDB
Wrap it in the Chatbot class from Chapter 6
Test: ask 5 questions that require reading the document
Measure: how often does it correctly cite the source?
