Master Generative AI — Part 2: Working with LLMs
Part 2 of the Master Generative AI: A Step-by-Step Challenge series.
Series Map:
- Part 1 → Foundation of AI & ML
- Part 2 → Working with LLMs ← you are here
- Part 3 → Advanced Generative AI
- Part 4 → Practical Applications
- Part 5 → Career & Capstone Projects
In Part 1 you built the conceptual foundation. Now we get our hands dirty. This part is where theory becomes practice — you'll write code that tokenizes text, queries embeddings, builds a RAG pipeline, and ships your first working chatbot.
Chapter 1: Tokenization & Embeddings
Tokenization: Breaking Text into Pieces
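LLMs don't see characters or words. They see tokens: sub-word chunks drawn from a fixed vocabulary (GPT-2's has about 50,000 entries; newer models often use 100,000 or more). Byte-Pair Encoding (BPE) learns this vocabulary by starting from individual characters or bytes and repeatedly merging the most frequent adjacent pair into a new token.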
from transformers import AutoTokenizer
# Load GPT-2 tokenizer (byte-level BPE; later OpenAI models use the same approach with larger vocabularies)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Tokenization is the first step in NLP!"
ids = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(ids)
print(f"Token IDs: {ids}")
# [29303, 1634, 318, 262, 717, 2239, 287, 399, 19930, 0]
print(f"Tokens: {tokens}")
# ['Token', 'ization', 'Ġis', 'Ġthe', 'Ġfirst', 'Ġstep', 'Ġin', 'ĠN', 'LP', '!']
# Ġ = space before word; note that "NLP" is split into two tokens
# Decode back to text
decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}") # Tokenization is the first step in NLP!
# Token count matters for API cost and context limits
print(f"Token count: {len(ids)}") # 10 tokens for 38 characters
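To make the BPE merge rule concrete, here is a toy sketch of the training loop on a five-word corpus. It is a simplified illustration, not GPT-2's actual trainer: real tokenizers work on bytes, pre-split text with a regex, and run tens of thousands of merges. The function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) are made up for this example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # max() returns the first pair seen among ties, so the result is deterministic
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into a single symbol."""
    merged = Counter()
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

def learn_bpe(corpus, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

merges, vocab = learn_bpe("low low low lower lowest", num_merges=3)
print(merges)       # learned merge rules, in order
print(dict(vocab))  # how each word is now segmented
```

After three merges the frequent word "low" has become a single symbol, while the rarer "lower" and "lowest" are still split into pieces. That is the same effect you see when a common English word costs one token and a rare word costs several.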
Why tokenization rules matter:
# Same word, different token counts
# (counts in the comments are indicative for the GPT-2 tokenizer; run the loop to confirm them)
examples = [
    "Hello", # 1 token
    "hello", # 1 token
    "HELLO", # 2 tokens ← case changes it!
    "Generative", # 1 token
    "generativeAI", # 2 tokens ← no space → merged differently
    "Bangkok", # 2 tokens ← less common → split
    "กรุงเทพ", # 9 tokens ← Thai needs many more tokens!
]
for ex in examples:
    count = len(tokenizer.encode(ex))
    print(f"'{ex}': {count} token(s)")
Practical Rule
With English-centric tokenizers, non-English text (especially Thai, Chinese, and Arabic) often uses 2–5× more tokens than equivalent English text; the exact ratio depends on the tokenizer. This inflates API costs and fills context windows faster. Always test tokenization for your target language.
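The arithmetic behind that rule is simple enough to sketch. The helper names, the price, the context window size, and the 3× multiplier below are all hypothetical placeholders chosen for illustration; substitute your provider's real pricing and your model's real limit.

```python
def estimate_cost(num_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost in USD for a given token count at a flat per-million-token price."""
    return num_tokens / 1_000_000 * usd_per_million_tokens

def fits_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """The prompt and the reserved output budget must both fit in the window."""
    return prompt_tokens + max_output_tokens <= context_window

# Hypothetical numbers, for illustration only
english_tokens = 2_000
token_multiplier = 3            # assumed inflation for a non-English translation
price = 0.50                    # assumed USD per million input tokens
window = 8_192                  # assumed context window

translated_tokens = english_tokens * token_multiplier
print(f"English cost:    ${estimate_cost(english_tokens, price):.6f}")
print(f"Translated cost: ${estimate_cost(translated_tokens, price):.6f}")
print(f"Fits with 1,000 output tokens reserved: {fits_context(translated_tokens, 1_000, window)}")
```

In practice you would replace `english_tokens * token_multiplier` with a real count from `len(tokenizer.encode(text))` for the tokenizer your model actually uses.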
Embeddings: Words as Vectors in Space
An embedding converts a token (or sentence or document) into a dense numerical vector where semantic similarity = geometric proximity.
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
"The cat sat on the mat.",
"A feline rested on a rug.", # semantically similar to above
"The stock market crashed today.", # unrelated
]
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}") # (3, 384) — 384 dimensions
# Cosine similarity ranges from -1 to 1: values near 1 mean very similar, values near 0 mean unrelated
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
sim_1_2 = cosine_similarity(embeddings[0], embeddings[1])
sim_1_3 = cosine_similarity(embeddings[0], embeddings[2])
print(f"Similarity (cat/feline): {sim_1_2:.3f}") # noticeably higher: the sentences are paraphrases
print(f"Similarity (cat/stocks): {sim_1_3:.3f}") # close to 0: unrelated
Types of embeddings:
| Type | Scope | Example Models | Use Case |
|---|---|---|---|
| Token | One token | GPT internal | Internal model computation |
| Word | One word | Word2Vec, GloVe | Classic NLP |
| Sentence | One sentence | all-MiniLM, BGE | Semantic search |
| Document | Full document | Longformer | Long document retrieval |
| Image | Image | CLIP, DINOv2 | Cross-modal search |
Vector Databases: Searching by Meaning
Embeddings enable semantic search — find documents by meaning, not keywords:
from sentence_transformers import SentenceTransformer
import numpy as np
# In production, use ChromaDB, Qdrant, Pinecone, or pgvector
# Here we simulate with numpy
model = SentenceTransformer("all-MiniLM-L6-v2")
# Knowledge base
documents = [
"vLLM uses PagedAttention for efficient GPU memory management.",
"The Transformer architecture was introduced by Vaswani et al. in 2017.",
"Fine-tuning adapts a pre-trained model to a specific task.",
"RAG combines retrieval with generation for grounded answers.",
"Python is the dominant language for AI and machine learning.",
]
# Index: embed all documents as unit-length vectors, so dot product = cosine similarity
doc_embeddings = model.encode(documents, normalize_embeddings=True)
# Query
query = "How does vLLM manage memory?"
query_embedding = model.encode([query], normalize_embeddings=True)
# Find most similar documents
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()
ranked_idx = np.argsort(similarities)[::-1]
print("Top results:")
for i in ranked_idx[:3]:
    print(f" [{similarities[i]:.3f}] {documents[i]}")
# The vLLM / PagedAttention sentence should rank first (exact scores depend on the model version)
Chapter 2: Transformers & Attention Mechanism
Intuition: Every Word Watches Every Other Word
The breakthrough of the Transformer is self-attention: each token can directly attend to every other token in the sequence, learning which ones matter for its meaning.
Sentence: "The animal didn't cross the street because it was too tired."
For the word "it" → attention weights (illustrative values, not taken from a real model):
"The" 0.01 ← low
"animal" 0.52 ← HIGH: "it" likely refers to "animal"
"didn't" 0.02
"cross" 0.08
"the" 0.01
"street" 0.05
"because" 0.03
"it" 0.10 ← self-reference
"was" 0.05
"tired" 0.13 ← describes the subject
The Three Vectors: Q, K, V
For each token, the Transformer learns three projections:
import torch
import torch.nn.functional as F
def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: Query — "What am I looking for?"
    K: Key — "What does each token contain?"
    V: Value — "What information should I pass along?"
    """
    d_k = Q.size(-1)
    # Step 1: Score — how relevant is each key to my query?
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    # Step 2: Mask future tokens (for causal / decoder models)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    # Step 3: Softmax → attention weights (sum to 1 per row)
    weights = F.softmax(scores, dim=-1)
    # Step 4: Weighted sum of values
    output = torch.matmul(weights, V)
    return output, weights
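To see those four steps with actual numbers, here is the same computation written with plain Python lists for a three-token toy sequence. This is a sketch for building intuition, not an efficient implementation; the tiny 2-dimensional vectors are made up, and a real model would derive Q, K, and V from learned projection matrices.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(Q[0])
    weights = []
    for i, q in enumerate(Q):
        # Step 1: scaled dot product of this query with every key
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 2: a causal mask hides positions after the current token
        if causal:
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        # Step 3: softmax turns scores into weights that sum to 1
        weights.append(softmax(scores))
    # Step 4: each output is a weighted sum of the value vectors
    dim = len(V[0])
    output = [[sum(w * v[c] for w, v in zip(row, V)) for c in range(dim)] for row in weights]
    return output, weights

# Made-up vectors; Q = K = V only to keep the example short
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(X, X, X, causal=True)
for i, row in enumerate(w):
    print(f"token {i} attends with weights {[round(x, 3) for x in row]}")
```

With the causal mask on, the first token can only attend to itself, so its weight row is `[1, 0, 0]` and its output equals its own value vector. Later tokens spread their weight over everything up to and including their own position.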
Transformer Architecture Summary
Input Tokens → [Embedding + Positional Encoding]
↓
[Transformer Block] × N layers:
├── Multi-Head Self-Attention
│ (12 or 32 or 96 heads in parallel)
├── Residual connection + Layer Norm
├── Feed-Forward Network (expand → GELU → compress)
└── Residual connection + Layer Norm
↓
Linear projection → Vocabulary logits → Softmax → Token probability
# Using a pretrained Transformer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "Qwen/Qwen2.5-0.5B" # tiny model for demo
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
prompt = "The key to learning AI is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=40,
        temperature=0.7,
        do_sample=True
    )
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
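The last line of the architecture summary (logits → softmax → token probability) and the `temperature=0.7` argument fit together as follows. This is a minimal sketch of temperature sampling with made-up logits over a four-token vocabulary; `generate()` additionally supports top-k, top-p, and other strategies that are not shown here.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index according to the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Made-up logits for a four-token vocabulary
logits = [2.0, 1.0, 0.5, -1.0]
for t in (0.2, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")

rng = random.Random(0)  # seeded so repeated runs draw the same samples
print([sample_token(logits, temperature=0.7, rng=rng) for _ in range(10)])
```

Lower temperatures concentrate probability on the highest-scoring token (more predictable output), while higher temperatures flatten the distribution (more varied output).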
Chapter 3: Pretrained Models — GPT, BERT, LLaMA, Claude
The Pretrained Model Paradigm
Training from scratch costs millions of dollars and months of GPU time. The modern approach:
PRETRAIN once on huge data (expensive, done by big labs)
↓
RELEASE as a pretrained checkpoint (free or API)
↓
FINE-TUNE or PROMPT for your specific use case (cheap, done by you)
Model Architectures
GPT models generate text left-to-right. Given a prefix, predict what comes next.
BERT reads the full sentence at once (bidirectional). Best for understanding tasks.
Architecture: Encoder-only Transformer
Training: Masked language modeling (predict [MASK] tokens)
Strength: Classification, NER, question answering, embeddings
Family: BERT, RoBERTa, DistilBERT, ALBERT, DeBERTa
from transformers import pipeline
# Named Entity Recognition
# aggregation_strategy="simple" merges word pieces back into whole entities
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
text = "Elon Musk founded SpaceX in Hawthorne, California."
entities = ner(text)
for e in entities:
    print(f"{e['word']} → {e['entity_group']} ({e['score']:.2f})")
# Expected along the lines of: Elon Musk → PER, SpaceX → ORG, Hawthorne → LOC, California → LOC
LLaMA is Meta's openly available alternative to GPT. You can run it locally, fine-tune it, and deploy it on your own infrastructure, subject to Meta's license.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
messages = [
{"role": "system", "content": "You are a helpful AI tutor."},
{"role": "user", "content": "Explain embeddings in one paragraph."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False,
add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
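with torch.no_grad():
    # Sampling must be enabled for temperature to take effect, so set it explicitly
    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the prompt
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:],
                            skip_special_tokens=True)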
print(response)
Claude is Anthropic's model family, known for strong reasoning, a focus on safety, and very long context windows.
import anthropic
client = anthropic.Anthropic() # uses ANTHROPIC_API_KEY env var
message = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=500,
system="You are an expert AI tutor. Be concise and clear.",
messages=[
{"role": "user", "content": "What is the difference between BERT and GPT?"}
]
)
print(message.content[0].text)
Chapter 4: Fine-Tuning vs. Prompt Engineering
When to Choose What
Your task ──→ [Is a pretrained model good enough with the right prompt?]
│ YES → Prompt Engineering (cheaper, faster)
│ NO → Fine-Tuning (more investment)
↓
[Do you have 100–10,000+ labeled examples?]
│ YES → Fine-tune
│ NO → More prompt engineering or RAG
Prompt Engineering Techniques
Zero-shot: Just describe the task. No examples.
Few-shot: Provide 2–5 examples in the prompt to demonstrate the pattern.
Chain-of-thought (CoT): Ask the model to reason step by step before answering.
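The three techniques differ only in how the prompt string is assembled, which the sketch below makes explicit. The helper names, the `Input:`/`Output:` layout, and the sentiment examples are illustrative choices, not a required format; any consistent layout works.

```python
def zero_shot(task: str, text: str) -> str:
    """Task description only, no demonstrations."""
    return f"{task}\n\nInput: {text}\nOutput:"

def few_shot(task: str, examples: list[tuple[str, str]], text: str) -> str:
    """Task description plus worked input/output pairs that show the pattern."""
    shots = "\n\n".join(f"Input: {inp}\nOutput: {out}" for inp, out in examples)
    return f"{task}\n\n{shots}\n\nInput: {text}\nOutput:"

def chain_of_thought(task: str, text: str) -> str:
    """Ask for visible reasoning first, then a clearly marked final answer."""
    return (
        f"{task}\n\nInput: {text}\n"
        "Think through the problem step by step, then give the final answer "
        "on a new line starting with 'Answer:'."
    )

task = "Classify the sentiment of the review as positive or negative."
examples = [
    ("The battery lasts all day.", "positive"),
    ("It broke after a week.", "negative"),
]
review = "Setup was painless and support was quick."

print(zero_shot(task, review))
print("---")
print(few_shot(task, examples, review))
print("---")
print(chain_of_thought(task, review))
```

Each function returns a plain string, so the result can be passed as the user message to any of the chat APIs shown in this part.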
Fine-Tuning with LoRA (Practical)
Full fine-tuning updates all parameters — expensive. LoRA (Low-Rank Adaptation) adds small trainable matrices and keeps the original weights frozen:
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
from datasets import Dataset
# Load base model
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Apply LoRA
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16, # rank — higher = more capacity, more memory
lora_alpha=32, # scaling factor
target_modules=["q_proj", "v_proj"], # which layers to adapt
lora_dropout=0.1,
bias="none"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; with this config roughly 0.14% of parameters are trainable
# Your fine-tuning dataset (instruction → response pairs)
data = Dataset.from_dict({
"text": [
"### Instruction: Summarize in one sentence.\n### Input: {long text}\n### Response: {summary}",
# ... more examples
]
})
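# Train
# Note: where dataset_text_field and the sequence-length limit are set depends on your TRL version;
# recent releases expect them on trl.SFTConfig rather than as SFTTrainer arguments.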
trainer = SFTTrainer(
model=model,
train_dataset=data,
args=TrainingArguments(
output_dir="./lora-model",
num_train_epochs=3,
per_device_train_batch_size=4,
learning_rate=2e-4,
logging_steps=10,
),
dataset_text_field="text",
max_seq_length=512,
)
trainer.train()
trainer.save_model("./my-finetuned-model")
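To show why so few parameters end up trainable, the sketch below works through the LoRA idea on a single weight matrix using plain Python lists. It assumes the standard formulation, in which a frozen matrix W is augmented by a scaled low-rank product B·A; the layer size (1536 × 1536) and the toy matrices are hypothetical values picked for illustration, not read from the model above.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """A is r x d_in and B is d_out x r, so LoRA adds r * (d_in + d_out) parameters."""
    return r * (d_in + d_out)

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Parameter arithmetic for one hypothetical 1536 x 1536 projection at rank 16
full = 1536 * 1536
added = lora_param_count(1536, 1536, 16)
print(f"full matrix: {full:,} params, LoRA adapter: {added:,} ({added / full:.2%})")

# Tiny numeric example: d_in=3, d_out=2, r=1
W = [[1, 0, 2], [0, 1, 0]]
A = [[1, 1, 1]]
x = [1, 2, 3]
B_init = [[0], [0]]      # B starts at zero, so the adapter is a no-op before training
B_trained = [[1], [0]]   # pretend training moved B away from zero
print(lora_forward(W, A, B_init, x, alpha=2, r=1))
print(lora_forward(W, A, B_trained, x, alpha=2, r=1))
```

Because B is initialized to zero, the adapted layer produces exactly the base model's output at the start of training, and the update grows only as A and B are learned. The `lora_alpha / r` ratio in `LoraConfig` plays the role of `scale` here.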
Chapter 5: Retrieval-Augmented Generation (RAG)
The Problem RAG Solves
LLMs have a knowledge cutoff and can hallucinate facts. RAG grounds the model's responses in real, up-to-date documents:
Without RAG:
User: "What is our refund policy?"
LLM: [makes something up — it doesn't know your policy]
With RAG:
User: "What is our refund policy?"
System: [search policy docs → find relevant passage]
[pass passage to LLM as context]
LLM: "Based on our policy: refunds are accepted within 30 days..."
↑ grounded in real document
RAG Pipeline
Documents → Chunk → Embed → Store in Vector DB
↓
User query → Embed → Similarity search → Top-K chunks
↓
[chunks + query] → LLM → Grounded answer
Building RAG from Scratch
from sentence_transformers import SentenceTransformer
import chromadb
from openai import OpenAI
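# Step 1: Set up embedding model and vector store
embed_model = SentenceTransformer("all-MiniLM-L6-v2")
chroma_client = chromadb.Client() # in-memory; data is lost when the process exits
# get_or_create avoids an error if the collection already exists (e.g. on notebook re-runs)
collection = chroma_client.get_or_create_collection("knowledge_base")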
# Step 2: Index your documents
documents = [
"Our refund policy allows returns within 30 days with receipt.",
"Premium members get free shipping on all orders over $25.",
"Customer support is available Monday–Friday, 9am–6pm EST.",
"We accept Visa, MasterCard, Amex, and PayPal.",
"Products must be in original packaging for returns.",
]
embeddings = embed_model.encode(documents).tolist()
collection.add(
ids=[str(i) for i in range(len(documents))],
embeddings=embeddings,
documents=documents
)
print(f"Indexed {len(documents)} documents")
# Step 3: Query function
client = OpenAI() # create the client once, not on every call
def rag_query(question: str, top_k: int = 3) -> str:
    # Embed the question
    q_embedding = embed_model.encode([question]).tolist()
    # Retrieve most relevant documents
    results = collection.query(
        query_embeddings=q_embedding,
        n_results=top_k
    )
    context_docs = results["documents"][0]
    context = "\n".join(f"- {doc}" for doc in context_docs)
    # Generate answer with context (using OpenAI/local model)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Answer questions using ONLY the provided context. "
                           "If the context doesn't contain the answer, say so."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]
    )
    return response.choices[0].message.content
# Test it
answer = rag_query("Can I return something after 3 weeks?")
print(answer)
# Example of the kind of answer to expect (wording will vary):
# "Yes. Our policy allows returns within 30 days with a receipt."
RAG with LangChain (Production Pattern)
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
# Load documents
loader = PyPDFLoader("company_handbook.pdf")
docs = loader.load()
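# Chunk documents (retrieval works on small passages, and the LLM context window is limited)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, # measured in characters: roughly 100–125 English tokens per chunk
    chunk_overlap=50, # characters shared between neighboring chunks so context isn't lost at boundaries
)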
chunks = splitter.split_documents(docs)
print(f"Split into {len(chunks)} chunks")
# Embed and store
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
# Build QA chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
return_source_documents=True
)
result = qa_chain.invoke("What is the vacation policy?")
print(result["result"])
print("\nSources used:")
for doc in result["source_documents"]:
    print(f" - {doc.metadata['source']} page {doc.metadata.get('page', '?')}")
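To show what `chunk_size` and `chunk_overlap` do, here is a deliberately simplified character-window chunker. It is a stand-in for illustration, not how `RecursiveCharacterTextSplitter` works internally: the LangChain splitter also tries to break on paragraph, sentence, and word boundaries, which this sketch ignores. The function name `chunk_text` is made up for this example.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Slide a fixed-size character window over the text with the given overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "abcdefghij"
for chunk in chunk_text(sample, chunk_size=4, chunk_overlap=1):
    print(repr(chunk))
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk as long as it is shorter than the overlap.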
Chapter 6: Building Your First Chatbot with an LLM
The Core Pattern: Conversation History
The model itself is stateless: it doesn't "remember" previous turns. Your code maintains the conversation history and sends it with every request:
from openai import OpenAI
client = OpenAI()
def chat():
    messages = [
        {"role": "system", "content": "You are a friendly AI tutor specializing in machine learning."}
    ]
    print("AI Tutor: Hello! Ask me anything about AI. Type 'quit' to exit.\n")
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in ["quit", "exit", "q"]:
            print("AI Tutor: Goodbye! Keep learning!")
            break
        # Add user message to history
        messages.append({"role": "user", "content": user_input})
        # Get response
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            temperature=0.7,
            max_tokens=500,
        )
        assistant_message = response.choices[0].message.content
        messages.append({"role": "assistant", "content": assistant_message})
        print(f"\nAI Tutor: {assistant_message}\n")
        print(f"[Tokens used: {response.usage.total_tokens}]")
chat()
Adding Streaming for Better UX
Users hate waiting. Streaming sends tokens as they're generated:
from openai import OpenAI
client = OpenAI()
def streaming_chat():
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."}
    ]
    while True:
        user_input = input("\nYou: ")
        if user_input.lower() == "quit":
            break
        messages.append({"role": "user", "content": user_input})
        print("\nAssistant: ", end="", flush=True)
        full_response = ""
        # stream=True makes create() return an iterator of incremental chunks
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            stream=True,
        )
        for chunk in stream:
            # Some chunks carry no choices or an empty delta; skip those
            if not chunk.choices:
                continue
            text = chunk.choices[0].delta.content
            if text:
                print(text, end="", flush=True)
                full_response += text
        print() # newline
        messages.append({"role": "assistant", "content": full_response})
streaming_chat()
A Production-Ready Chatbot Class
from dataclasses import dataclass, field
from openai import OpenAI
@dataclass
class Chatbot:
    system_prompt: str
    model: str = "gpt-4o-mini"
    temperature: float = 0.7
    max_tokens: int = 1000
    max_history: int = 20 # prevent context overflow
    messages: list = field(default_factory=list)
    _client: OpenAI = field(default_factory=OpenAI, init=False, repr=False)
    def chat(self, user_message: str) -> str:
        self.messages.append({"role": "user", "content": user_message})
        # Trim the oldest turns if history is too long. The system prompt is
        # stored separately, so it is never trimmed.
        if len(self.messages) > self.max_history:
            self.messages = self.messages[-self.max_history:]
            # Don't start the window on an orphaned assistant reply
            if self.messages[0]["role"] == "assistant":
                self.messages = self.messages[1:]
        response = self._client.chat.completions.create(
            model=self.model,
            messages=[{"role": "system", "content": self.system_prompt}] + self.messages,
            temperature=self.temperature,
            max_tokens=self.max_tokens,
        )
        assistant_reply = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": assistant_reply})
        return assistant_reply
    def reset(self):
        self.messages = []
# Usage
bot = Chatbot(
    system_prompt="You are a Python tutor. Explain concepts simply with short code examples.",
    temperature=0.5
)
print(bot.chat("What is a decorator in Python?"))
print(bot.chat("Can you show me a real-world example?")) # remembers context!
print(bot.chat("How is it different from a wrapper function?"))
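Trimming by message count is simple, but context limits are measured in tokens, so a few long messages can still overflow. The sketch below shows one possible alternative: trimming to a token budget. The `approx_tokens` heuristic (about four characters per token) is a rough assumption for English text, and both function names are made up for this example; in practice you would pass a real tokenizer-based counter as `count_tokens`.

```python
def approx_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)

def trim_history(messages, max_tokens, count_tokens=approx_tokens):
    """Keep the most recent messages that fit within max_tokens.

    The newest message is always kept, even if it alone exceeds the budget.
    The result never starts with an assistant message.
    """
    kept = []
    total = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if kept and total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    kept.reverse()  # restore chronological order
    while kept and kept[0]["role"] != "user":
        kept.pop(0)  # drop a reply whose question was trimmed away
    return kept

history = [
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six"},
    {"role": "assistant", "content": "seven eight"},
    {"role": "user", "content": "nine ten"},
]
# Count words instead of tokens here so the example is easy to follow
word_count = lambda text: len(text.split())
for m in trim_history(history, max_tokens=5, count_tokens=word_count):
    print(m["role"], "->", m["content"])
```

A function like this could replace the count-based slice inside `Chatbot.chat`, with the budget set to the model's context window minus the system prompt and `max_tokens` reserved for the reply.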
Chapter 7: Evaluating LLM Outputs
The Evaluation Problem
Unlike classical ML (accuracy = 0.94), LLM outputs are open-ended text. There's no single number that captures quality.
Automatic Metrics
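# BLEU: n-gram overlap with reference (good for translation)
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]
# These short sentences share no 4-gram, so unsmoothed BLEU collapses to ~0.
# Smoothing gives a usable score for sentence-level comparisons.
bleu = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}")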
# ROUGE — recall-oriented (good for summarization)
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
reference = "The quick brown fox jumped over the lazy dog"
candidate = "The fast brown fox leaped over the sleepy dog"
scores = scorer.score(reference, candidate)
print(f"ROUGE-1: {scores['rouge1'].fmeasure:.3f}") # word overlap
print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}") # longest common subsequence
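To see what BLEU is actually counting, the sketch below computes the clipped ("modified") n-gram precisions by hand for the cat/mat sentences used in the BLEU example. It covers only the precision part of the metric; full BLEU also takes a geometric mean over n = 1 to 4 and applies a brevity penalty. The helper names are made up for this example.

```python
from collections import Counter
from fractions import Fraction

def ngrams(tokens, n):
    """All contiguous n-token windows, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip: an n-gram is credited at most as often as it appears in the reference
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return Fraction(overlap, total) if total else Fraction(0)

ref_tokens = ["the", "cat", "sat", "on", "the", "mat"]
cand_tokens = ["the", "cat", "is", "on", "the", "mat"]
for n in range(1, 5):
    print(f"{n}-gram precision: {modified_precision(ref_tokens, cand_tokens, n)}")
```

The single changed word ("sat" → "is") breaks every 4-gram, so the 4-gram precision is zero. Since BLEU multiplies the precisions together, one zero drives the unsmoothed sentence-level score to roughly zero, which is why short-sentence BLEU is usually computed with smoothing or at corpus level.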
LLM-as-Judge
Use a powerful LLM to evaluate another LLM's output:
import json
from openai import OpenAI
client = OpenAI()
def llm_judge(question: str, answer: str, criteria: list[str]) -> dict:
    criteria_text = "\n".join(f"- {c}" for c in criteria)
    prompt = f"""Evaluate the following AI response on a scale of 1-5 for each criterion.
Question: {question}
AI Answer: {answer}
Criteria:
{criteria_text}
Respond ONLY with a JSON object like:
{{"accuracy": 4, "clarity": 5, "completeness": 3, "reasoning": "..."}}"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
# Example
result = llm_judge(
question="What is backpropagation?",
answer="Backpropagation is an algorithm that calculates gradients by applying the chain rule backwards through a neural network, allowing us to update weights to reduce loss.",
criteria=["accuracy", "clarity", "completeness"]
)
print(result)
# {"accuracy": 5, "clarity": 4, "completeness": 3, "reasoning": "Accurate but could mention learning rate"}
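A judge model can return malformed or out-of-range scores, so it helps to validate the JSON before you aggregate results. The sketch below is one possible validator, assuming the same output shape as the example above (one integer from 1 to 5 per criterion plus an optional `reasoning` string); the function name `parse_judge_scores` is made up for this example.

```python
import json

def parse_judge_scores(raw: str, criteria: list[str]) -> dict:
    """Parse judge output and check that every criterion has an integer score in 1..5."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    scores = {}
    for name in criteria:
        value = data.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid or missing score for {name!r}: {value!r}")
        scores[name] = value
    scores["reasoning"] = str(data.get("reasoning", ""))
    return scores

raw_output = '{"accuracy": 5, "clarity": 4, "completeness": 3, "reasoning": "Accurate but brief"}'
parsed = parse_judge_scores(raw_output, ["accuracy", "clarity", "completeness"])
print(parsed)
numeric = [v for k, v in parsed.items() if k != "reasoning"]
print(f"mean score: {sum(numeric) / len(numeric):.2f}")
```

Failing loudly on a bad score is a deliberate choice here: silently dropping or clamping values would hide judge failures that you probably want to count and inspect.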
Bias and Hallucination Detection
def check_hallucination(question: str, answer: str, source_docs: list[str]) -> dict:
    """Check if the answer is grounded in the source documents."""
    context = "\n".join(source_docs)
    prompt = f"""You are a fact-checker. Determine if the AI answer is:
1. GROUNDED: Every claim is supported by the source documents
2. HALLUCINATED: Contains claims not in the source documents
3. PARTIAL: Some claims grounded, some not
Source documents:
{context}
AI Answer: {answer}
Respond with JSON: {{"verdict": "...", "explanation": "...", "unsupported_claims": [...]}}"""
    # Call LLM judge here (same pattern as above)
    ...
Common bias types to test for:
| Bias | Description | Test Example |
|---|---|---|
| Demographic | Different quality by gender, ethnicity | "Evaluate this résumé" (vary names) |
| Recency | Over-weights recent information | Ask about historical facts |
| Sycophancy | Agrees with user's incorrect premise | State wrong fact, see if model corrects |
| Verbosity | Longer answers rated higher regardless of quality | Check if brief, correct answers are penalized |
Summary
| Topic | Key Takeaway |
|---|---|
| Tokenization | Text → token IDs; non-English = more tokens; count matters for cost |
| Embeddings | Semantic similarity in vector space; foundation of search and RAG |
| Transformers | Self-attention = each token watches all others; Q/K/V mechanism |
| Pretrained models | GPT → generation; BERT → understanding; LLaMA → open source |
| Prompt engineering | Zero-shot → few-shot → CoT; free and instant to try |
| Fine-tuning | LoRA trains small added matrices (~0.1% of the model's parameter count); use when prompting isn't enough |
| RAG | Chunk → embed → retrieve → generate; reduces hallucination on domain knowledge |
| Chatbot | Maintain conversation history; trim when context fills; stream for UX |
| Evaluation | BLEU/ROUGE for overlap; LLM-as-judge for quality; test for bias/hallucination |
Next → Part 3: Advanced Generative AI — GANs, diffusion models, multimodal systems, and responsible AI.
Practice Challenge
Build a RAG chatbot over a PDF of your choice (company handbook, research paper, product docs):
- Load and chunk the PDF with LangChain
- Embed chunks into ChromaDB
- Wrap it in the Chatbot class from this chapter
- Test: ask 5 questions that require reading the document
- Measure: how often does it correctly cite the source?