gaurang.dev

SQL LLM

2026-04-06

LLMs have mostly been black boxes to me. I knew they needed GPUs to speed up calculations, but I didn't fully understand why.

My goal here was to build some understanding of LLM inference. Most of the examples online didn't really help me understand it any better.

Knowing that inference is mostly matrix multiplication, which can be expressed in SQL as

SELECT SUM(a.val * b.val) FROM a JOIN b ON a.dim = b.dim;
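As a toy illustration (the table layout and numbers here are my own, not the post's), grouping that same JOIN/SUM pattern by output row turns the dot product into a full matrix-vector product:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (row INTEGER, dim INTEGER, val REAL);  -- 2x3 matrix, one cell per row
CREATE TABLE b (dim INTEGER, val REAL);               -- 3-vector
""")
con.executemany("INSERT INTO a VALUES (?,?,?)",
    [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0),
     (1, 0, 4.0), (1, 1, 5.0), (1, 2, 6.0)])
con.executemany("INSERT INTO b VALUES (?,?)",
    [(0, 1.0), (1, 1.0), (2, 1.0)])

# Each output row is a dot product over the shared 'dim' index.
rows = con.execute("""
    SELECT a.row, SUM(a.val * b.val)
    FROM a JOIN b ON a.dim = b.dim
    GROUP BY a.row
    ORDER BY a.row
""").fetchall()
print(rows)  # [(0, 6.0), (1, 15.0)]
```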

I thought of creating an inference engine purely in SQL.

The model I used is SmolLM-135M, a small but real Llama-style transformer with 30 layers, 9 attention heads, and 576-dimensional hidden states. The weights are loaded from HuggingFace's safetensors format and flattened into a single SQLite table:

weights(name TEXT, i0 INTEGER, i1 INTEGER, val REAL)

Every weight in every layer lives here. The embedding matrix, the QKV projections, the FFN, the layer norms, all of it, as rows in a table.
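To illustrate how that schema gets used (the tensor name and index layout below are my guesses for illustration, not necessarily the actual ones), an embedding lookup becomes nothing more than a filtered SELECT:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE weights (name TEXT, i0 INTEGER, i1 INTEGER, val REAL)")

# Hypothetical rows: a 3-dimensional embedding for token ids 0 and 1,
# with i0 = token id and i1 = hidden dimension.
con.executemany("INSERT INTO weights VALUES (?,?,?,?)", [
    ("model.embed_tokens.weight", 0, 0, 0.1),
    ("model.embed_tokens.weight", 0, 1, 0.2),
    ("model.embed_tokens.weight", 0, 2, 0.3),
    ("model.embed_tokens.weight", 1, 0, 0.4),
    ("model.embed_tokens.weight", 1, 1, 0.5),
    ("model.embed_tokens.weight", 1, 2, 0.6),
])

# "Look up the embedding of token 1" is a WHERE clause.
emb = con.execute("""
    SELECT i1, val FROM weights
    WHERE name = 'model.embed_tokens.weight' AND i0 = 1
    ORDER BY i1
""").fetchall()
print(emb)  # [(0, 0.4), (1, 0.5), (2, 0.6)]
```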

Inference is a Bash script that drives sqlite3 through 9 SQL files per layer, 30 layers per token. The transformer operations map cleanly onto SQL.
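As a sketch of what one of those per-layer steps might look like (my own illustration, not the author's actual SQL files), here is an attention softmax written as a query. Stock SQLite builds may lack `exp()`, so it is registered as a user-defined function:

```python
import math
import sqlite3

con = sqlite3.connect(":memory:")
# SQLite only has exp() when compiled with math functions enabled,
# so register one from Python to be safe.
con.create_function("exp", 1, math.exp)

con.execute("CREATE TABLE scores (pos INTEGER, val REAL)")
con.executemany("INSERT INTO scores VALUES (?,?)",
    [(0, 1.0), (1, 2.0), (2, 3.0)])

# Softmax: subtract the max for numerical stability, exponentiate, normalize.
probs = con.execute("""
    SELECT pos,
           exp(val - (SELECT MAX(val) FROM scores))
           / (SELECT SUM(exp(val - (SELECT MAX(val) FROM scores))) FROM scores)
    FROM scores ORDER BY pos
""").fetchall()
```

The probabilities sum to 1 and preserve the ordering of the raw scores.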

It works. Given the prompt "The capital of France is", it returns " Paris".

It's extremely slow, though, taking ~6 minutes to generate a single token.
