LLMs have mostly been black boxes to me: I knew they needed GPUs to speed up the math, but I didn't fully understand why.
My goal here was to build some understanding of LLM inference. Most of the examples online didn't really help me understand it any better.
The key observation is that inference is mostly matrix multiplication, which can be expressed in SQL as:
SELECT SUM(a.val * b.val) FROM a JOIN b ON a.dim = b.dim;
The model I used is SmolLM-135M, a small but real Llama-style transformer with 30 layers, 9 attention heads, and 576-dimensional hidden states. The weights are loaded from HuggingFace's safetensors format and flattened into a single SQLite table:
weights(name TEXT, i0 INTEGER, i1 INTEGER, val REAL)
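A toy sketch of how that flattened layout works (the `weights` schema and the `embed_tokens` name follow the post; the tiny 3-token, 4-dim embedding matrix is made up, where the real model's rows are 576 wide). A 2-D tensor becomes one row per element, so fetching a token's embedding is just a `WHERE` clause:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE weights(name TEXT, i0 INTEGER, i1 INTEGER, val REAL)")

# Toy embedding matrix: 3 tokens x 4 hidden dims, value t*10+d for readability
con.executemany(
    "INSERT INTO weights VALUES ('embed_tokens', ?, ?, ?)",
    [(t, d, float(t * 10 + d)) for t in range(3) for d in range(4)],
)

# Embedding lookup for one token: select its row, ordered by hidden dim
token_id = 2
embedding = [v for (v,) in con.execute(
    "SELECT val FROM weights WHERE name = 'embed_tokens' AND i0 = ? ORDER BY i1",
    (token_id,),
)]
print(embedding)  # [20.0, 21.0, 22.0, 23.0]
```

Indexing `(name, i0)` makes this lookup cheap; the expensive part of inference is the matrix multiplies, not the embedding fetch.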
Inference is a Bash script that drives sqlite3 through 9 SQL files per layer, 30 layers per token. The operations map cleanly:
- Embedding lookup → `SELECT val FROM weights WHERE name = 'embed_tokens' AND i0 = token_id`
- Matrix multiplication → `SUM(normed.val * weights.val)` with `GROUP BY dim`
- RMSNorm → `SQRT(AVG(val * val))` per position
- RoPE → cosine/sine rotations using the `COS()` and `SIN()` built-ins
- Grouped Query Attention + Softmax → `MAX()` for numerical stability, `EXP()` for exponentiation, `SUM()` for the denominator
- SwiGLU → `gate * (1 / (1 + EXP(-gate))) * up`
- Sampling → `ORDER BY score DESC LIMIT 1`
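The softmax and sampling steps can be sketched together (this is an assumed reconstruction, not the repo's exact SQL; the `scores` table and its toy values are illustrative). Subtracting `MAX()` before `EXP()` keeps the exponentials from overflowing, and greedy sampling is just an `ORDER BY ... LIMIT 1`. One caveat: `EXP()`, `COS()`, and `SIN()` are only present in SQLite builds compiled with math functions enabled, so the sketch registers a Python fallback when `EXP()` is missing:

```python
import math
import sqlite3

con = sqlite3.connect(":memory:")

# SQLite's EXP() built-in requires a build with math functions enabled;
# fall back to a Python-defined function otherwise.
try:
    con.execute("SELECT EXP(1.0)").fetchone()
except sqlite3.OperationalError:
    con.create_function("EXP", 1, math.exp)

con.execute("CREATE TABLE scores(pos INTEGER, val REAL)")
con.executemany("INSERT INTO scores VALUES (?,?)",
                [(0, 1.0), (1, 2.0), (2, 3.0)])

# Numerically stable softmax: shift by MAX(), exponentiate, divide by SUM()
probs = con.execute("""
    WITH m AS (SELECT MAX(val) AS mx FROM scores),
         e AS (SELECT pos, EXP(val - (SELECT mx FROM m)) AS ev FROM scores)
    SELECT pos, ev / (SELECT SUM(ev) FROM e) FROM e ORDER BY pos
""").fetchall()
print(probs)

# Greedy sampling: the highest-scoring position wins
best = con.execute(
    "SELECT pos FROM scores ORDER BY val DESC LIMIT 1").fetchone()[0]
print(best)  # 2
```

Shifting by the maximum leaves the result unchanged (the factor cancels in the ratio) but keeps every `EXP()` argument at or below zero, which is the standard trick for stable softmax.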
It works, but it's extremely slow: roughly 6 minutes to generate a single token.