Blog

June 12, 2026
in General, AI, Practices
12 min read

You can't use a USB drive as VRAM — the enterprise guide to GPU memory capacity planning in 2026

Storage isn't VRAM. eGPUs aren't a data center strategy. Shared memory isn't a capacity plan. Every quarter, an AI infrastructure team somewhere asks the same question: "we're running out of GPU memory, can we just use the SSDs?" No. Here's what actually works at scale, what doesn't, and the procurement and capacity planning playbook for GPU memory in 2026.

This post is written for AI platform engineers, infrastructure architects, and FinOps leads running shared GPU clusters. It's not about a single workstation — it's about a fleet. The unit of analysis is the rack, the budget, and the quarter.

Once the fleet exists, two companion posts cover what runs on it: engine selection and quantization (which engine, which precision, when to call an API instead) and GPU Autoscaling is Broken (scaling LLM inference under real load). This one is the layer above both: how much GPU memory to buy in the first place.
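As a rough illustration of the arithmetic behind "how much GPU memory to buy", here is a minimal back-of-envelope sketch. It assumes memory is dominated by model weights plus the KV cache and ignores activations, fragmentation, and framework overhead. The function names, the 90% usable-memory headroom, and the model shape in the usage note are hypothetical, not figures from the post.

```python
import math

GIB = 2**30  # bytes per GiB


def weights_gib(params_billions: float, bytes_per_param: float) -> float:
    """Memory for model weights: parameter count times bytes per parameter."""
    return params_billions * 1e9 * bytes_per_param / GIB


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, concurrent_seqs: int,
                 bytes_per_value: int = 2) -> float:
    """KV cache size: one key and one value vector per layer, head, and token."""
    total_bytes = (2 * layers * kv_heads * head_dim
                   * context_tokens * concurrent_seqs * bytes_per_value)
    return total_bytes / GIB


def gpus_needed(total_gib: float, gib_per_gpu: float,
                headroom: float = 0.9) -> int:
    """GPUs required if only `headroom` of each card is usable (assumption)."""
    return math.ceil(total_gib / (gib_per_gpu * headroom))
```

For a hypothetical 70B-parameter model at 2 bytes per parameter, with 80 layers, 8 KV heads of dimension 128, and 32 concurrent sequences of 8,192 tokens, `weights_gib(70, 2) + kv_cache_gib(80, 8, 128, 8192, 32)` is the figure to pass to `gpus_needed` along with the per-card capacity. Note that neither term has a slot for SSD capacity.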

June 12, 2026
in Development, Architect, AI
10 min read

The backend-for-frontend pattern for AI apps: thin client, fat API, one job

Most AI apps in 2026 still ship as "browser → OpenAI" with an API key in the frontend bundle. This is broken in every dimension that matters: security, cost, observability, provider lock-in, and rate limiting. The fix is a Backend-for-Frontend (BFF) — a thin server between the browser and the LLM provider that owns auth, cost, streaming, and provider abstraction. The browser does UI. The BFF does everything else.

This is the pattern every serious AI product has converged on by 2026. Here's what it looks like, why each piece exists, and how to build it without over-engineering.
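To make the division of labour concrete, here is a minimal sketch of the BFF's core loop with no web framework attached. Everything in it is a hypothetical stand-in: `Bff`, `echo_provider`, the session table, and the character-based budget are illustrative names and simplifications, not the design from the post. The point is only that auth, cost limits, and the provider choice live on the server, and the browser sees nothing but streamed chunks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator

# Assumption: a provider is any callable that turns a prompt into text chunks.
Provider = Callable[[str], Iterator[str]]


def echo_provider(prompt: str) -> Iterator[str]:
    """Stand-in for a real LLM client; yields the prompt back word by word."""
    for word in prompt.split():
        yield word + " "


@dataclass
class Bff:
    providers: Dict[str, Provider]   # provider name -> client (abstraction)
    sessions: Dict[str, str]         # session token -> user id (auth)
    budget_chars: Dict[str, int]     # user id -> remaining output budget (cost)
    default_provider: str = "echo"

    def stream(self, session_token: str, prompt: str) -> Iterator[str]:
        """Authenticate, pick a provider, and stream until the budget runs out."""
        user = self.sessions.get(session_token)
        if user is None:
            raise PermissionError("unknown session")
        provider = self.providers[self.default_provider]
        for chunk in provider(prompt):
            if self.budget_chars.get(user, 0) < len(chunk):
                break  # budget exhausted: stop streaming to this user
            self.budget_chars[user] -= len(chunk)
            yield chunk
```

Swapping providers means changing one dictionary entry on the server. The provider API key never has to leave it.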

June 12, 2026
in Development, AI, Architect
11 min read

Streaming responses in 2026: SSE vs WebSockets vs gRPC for AI apps

Every AI app needs to stream. Most teams pick the wrong transport and spend a week debugging why the first token takes 8 seconds. The LLM is fast — the network in front of it isn't. This post compares the three transports that actually work in production (SSE, WebSockets, gRPC streaming), shows the buffering traps that kill streaming, and gives you the decision framework to pick the right one in 30 seconds.

If you've ever shipped a "ChatGPT-style" streaming endpoint and watched it work in dev, then break in staging behind nginx, this is the post for you.
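For a flavour of what the SSE option involves on the wire, here is a minimal sketch of event framing plus the response headers commonly used to keep a proxy from buffering the stream. The helper name `sse_event` and the `STREAM_HEADERS` constant are illustrative, not taken from the post. The `X-Accel-Buffering: no` response header is the nginx mechanism for disabling proxy buffering on a single response; other proxies need their own settings.

```python
from typing import Optional


def sse_event(data: str, event: Optional[str] = None) -> str:
    """Frame one Server-Sent Event.

    Each line of the payload gets its own "data:" field, and a blank
    line terminates the event.
    """
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"


# Headers a streaming endpoint typically sends with an SSE response.
STREAM_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",  # asks nginx not to buffer this response
}
```

A handler would write `sse_event(token)` for each generated token and flush after every write. If anything between the server and the browser buffers those writes, the first token arrives only when the buffer fills or the response ends.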

June 12, 2026
in Kubernetes, AI
10 min read

GPU Autoscaling is Broken: What I Learned Scaling LLM Inference to 10K QPS

Standard Kubernetes autoscaling assumes more load = more pods = more capacity. With stateless REST APIs, that works. With LLM inference, it falls apart — and it took us three months of production pain at 10K QPS to figure out why.

This post covers the four patterns that actually worked, the exact configs we run, and the numbers before and after: p99 latency from 12s down to 1.8s, OOM kills from 3% to under 0.1%, GPU utilization from 40% to 75%.

Two companions go alongside this one: engine selection and quantization (which engine and precision to run before you scale anything) and the GPU memory capacity-planning guide (how much GPU to buy in the first place).

May 30, 2026
in AI
34 min read

The Math Behind an Entire LLM: From Tokens to Fine-Tuning

Most LLM explainers either skip the math entirely or dump you into a research paper. This one is neither. We'll walk every equation with real numbers, using the same three-token sentence — "The cat sat" — all the way from raw text to gradient updates.

By the end, you'll be able to trace any weight in any layer and know exactly what changed it, why, and by how much.

May 18, 2026
in AI
26 min read

Building an LLM from Scratch in PyTorch: The Full Lifecycle Cheatsheet

Most LLM tutorials give you one of two things: a high-level diagram with boxes and arrows, or a 10,000-line codebase with no explanation of why each piece exists.

This post is neither. It's a step-by-step lifecycle — 8 phases, each with working PyTorch code, the reasoning behind every decision, and an explicit Do / Don't list that captures the mistakes that cost most beginners weeks of wasted compute.

By the end you'll have built, trained, modernised, scaled, and aligned a language model — the exact same lifecycle that produced every major LLM you've used.

Phase 1: Core Transformer    → the engine
Phase 2: Train a Tiny LLM    → prove the pipeline works
Phase 3: Modernise           → match 2026 architecture
Phase 4: Scale Efficiently   → push past toy datasets
Phase 5: Mixture of Experts  → conditional computation
Phase 6: SFT                 → turn autocomplete into an assistant
Phase 7: Reward Modelling    → teach the model what "good" looks like
Phase 8: RLHF                → optimise for human preference

Every snippet in this post has been executed end-to-end on Python 3.13 + PyTorch 2.12, CPU only — including the full 5,000-step Phase 2 training run. Where a verification test caught a bug in an earlier draft (the KV cache, the reward model head, the PPO log-prob gather), the fix is in the code below and the test that caught it is shown so you can run it on your own implementation.

Prerequisites: comfortable Python, basic tensor operations (view, transpose, broadcasting), and the chain rule. No prior transformer experience — that's what Phase 1 is for.

June 10, 2026
in AI
8 min read

The Claude Code Project Template That Closes the Loop: PRD → Plan → Build → Validate → Track

Most people use Claude Code like a chat window: type a request, accept the diff, repeat until it works or breaks. That's fine for a one-off script. For a real project it falls apart fast — the agent forgets decisions, rebuilds things it already built, and you lose track of what's actually done.

The fix isn't a smarter model. It's a project structure that gives the agent memory, process, and guardrails. I found a template that nails this: code-template — a Claude Code starter that defines the full coding loop with 5 files and 3 slash commands. This post walks through how it works and why each piece earns its place.

June 4, 2026
in AI
47 min read

Agentic Retrieval: The Complete Guide from Document Ingestion to Compiled Knowledge

Naive RAG fails roughly 40% of the time at retrieval. Not because the LLM is bad — because what you hand it is bad. Wrong chunks, missing context, no awareness of what it doesn't know. Agentic retrieval fixes the retrieval layer, not the generation layer.

This guide covers the entire pipeline: from ingesting a raw PDF to deploying an agent that queries vector stores, SQL databases, knowledge graphs, and pre-computed summaries — and knows which one to use for each question.

May 20, 2026
in ModernApps
17 min read

AI Agent Application Demo: Putting a Brain Inside Your App

Source code: github.com/pkhamdee/coffee-agent

For decades we've built applications the same way: write a function, call the next function, handle each case with an if statement. The logic is explicit, deterministic, and completely predictable — a flowchart carved into code.

That model has a hard ceiling. When a user says something ambiguous, changes their mind mid-conversation, or combines requests in ways you didn't anticipate, the rigid-logic app breaks down. You write more and more special-case handling until the code becomes unmaintainable.

AI agents flip this model. You give your application a reasoning engine — a brain — and let it figure out what to do. This post walks through a real, runnable example: Coffee Agent, a coffee shop ordering chatbot built with NestJS, React, LangGraph, and a local LLM on Ollama.

May 19, 2026
in AI
24 min read

Agentic AI Architectures: Patterns, Frameworks, and MCP for Enterprise Systems

Most AI tutorials show you how to call an API and get a response. That's not an agent. An agent is a system that perceives, plans, acts, and adapts — autonomously — using tools, memory, and other agents to complete tasks that no single LLM call could handle.

In 2026, agentic AI is the dominant paradigm for building AI into enterprise software. Not chatbots. Not search bars with AI behind them. Full autonomous systems that can research a topic, write code, test it, file a ticket, notify a Slack channel, and self-correct when something goes wrong — without a human in the loop for every step.

This is the definitive guide. We cover every design pattern, every major framework, the Model Context Protocol that is quietly unifying the entire ecosystem, and how to wire all of it into production enterprise systems.