gaurang.dev

SQL LLM

2026-04-06

LLMs have mostly been black boxes to me. I knew they needed GPUs to speed up calculations, but I didn't fully understand why.

My goal here was to build some understanding of LLM inference. Most of the examples online didn't really help me understand it any better.

Knowing that inference is mostly matrix multiplication, which can be expressed in SQL as

SELECT SUM(a.val * b.val) FROM a JOIN b ON a.dim = b.dim;
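As a toy illustration (the table layout and numbers here are my own, not the post's), grouping that same JOIN/SUM pattern by output row turns the dot product into a full matrix-vector product:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (row INTEGER, dim INTEGER, val REAL);  -- 2x3 matrix, one cell per row
CREATE TABLE b (dim INTEGER, val REAL);               -- 3-vector
""")
con.executemany("INSERT INTO a VALUES (?,?,?)",
    [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0),
     (1, 0, 4.0), (1, 1, 5.0), (1, 2, 6.0)])
con.executemany("INSERT INTO b VALUES (?,?)",
    [(0, 1.0), (1, 1.0), (2, 1.0)])

# Each output row is a dot product over the shared 'dim' index.
rows = con.execute("""
    SELECT a.row, SUM(a.val * b.val)
    FROM a JOIN b ON a.dim = b.dim
    GROUP BY a.row
    ORDER BY a.row
""").fetchall()
print(rows)  # [(0, 6.0), (1, 15.0)]
```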

I thought of creating an inference engine purely in SQL.

The model I used is SmolLM-135M, a small but real Llama-style transformer with 30 layers, 9 attention heads, and 576-dimensional hidden states. The weights are loaded from HuggingFace's safetensors format and flattened into a single SQLite table:

weights(name TEXT, i0 INTEGER, i1 INTEGER, val REAL)

Every weight in every layer lives here. The embedding matrix, the QKV projections, the FFN, the layer norms, all of it, as rows in a table.
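To illustrate how that schema gets used (the tensor name and index layout below are my guesses for illustration, not necessarily the actual ones), an embedding lookup becomes nothing more than a filtered SELECT:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE weights (name TEXT, i0 INTEGER, i1 INTEGER, val REAL)")

# Hypothetical rows: a 3-dimensional embedding for token ids 0 and 1,
# with i0 = token id and i1 = hidden dimension.
con.executemany("INSERT INTO weights VALUES (?,?,?,?)", [
    ("model.embed_tokens.weight", 0, 0, 0.1),
    ("model.embed_tokens.weight", 0, 1, 0.2),
    ("model.embed_tokens.weight", 0, 2, 0.3),
    ("model.embed_tokens.weight", 1, 0, 0.4),
    ("model.embed_tokens.weight", 1, 1, 0.5),
    ("model.embed_tokens.weight", 1, 2, 0.6),
])

# "Look up the embedding of token 1" is a WHERE clause.
emb = con.execute("""
    SELECT i1, val FROM weights
    WHERE name = 'model.embed_tokens.weight' AND i0 = 1
    ORDER BY i1
""").fetchall()
print(emb)  # [(0, 0.4), (1, 0.5), (2, 0.6)]
```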

Inference is a Bash script that drives sqlite3 through 9 SQL files per layer, 30 layers per token. The transformer operations map cleanly onto SQL.
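As a sketch of what one of those per-layer steps might look like (my own illustration, not the author's actual SQL files), here is an attention softmax written as a query. Stock SQLite builds may lack `exp()`, so it is registered as a user-defined function:

```python
import math
import sqlite3

con = sqlite3.connect(":memory:")
# SQLite only has exp() when compiled with math functions enabled,
# so register one from Python to be safe.
con.create_function("exp", 1, math.exp)

con.execute("CREATE TABLE scores (pos INTEGER, val REAL)")
con.executemany("INSERT INTO scores VALUES (?,?)",
    [(0, 1.0), (1, 2.0), (2, 3.0)])

# Softmax: subtract the max for numerical stability, exponentiate, normalize.
probs = con.execute("""
    SELECT pos,
           exp(val - (SELECT MAX(val) FROM scores))
           / (SELECT SUM(exp(val - (SELECT MAX(val) FROM scores))) FROM scores)
    FROM scores ORDER BY pos
""").fetchall()
```

The probabilities sum to 1 and preserve the ordering of the raw scores.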

It works. Given the prompt "The capital of France is", it returns " Paris".

It's extremely slow, though, taking ~6 minutes to generate a single token.
