Homer SLM

A 4.93M-Parameter Byte-Level Transformer Trained on the Odyssey

A small causal transformer initialized from random weights and trained only on Samuel Butler's English Odyssey. The result is a checkpoint-level look at the transition from useful distribution fitting to overfitting.

Dylan Tirandaz June 18, 2026 from-scratch byte LM

This is a small controlled pretraining run: one model family, one corpus, fixed decoding settings, and saved checkpoints for qualitative comparison.

The model does not use a pretrained checkpoint or tokenizer. It is a byte-level causal language model trained directly on the body text of the Odyssey. The objective is standard next-byte prediction.

The run is not intended to produce a useful general-purpose LM. The question is narrower: how much structure can a roughly 5M-parameter model extract from a single literary corpus before held-out likelihood begins to degrade?

Model and Data

I used a decoder-only transformer with learned byte embeddings and learned positional embeddings. The vocabulary is exactly the 256 possible byte values, which avoids importing any distributional assumptions from a pretrained BPE tokenizer.

Setting	Value
Parameters	4,929,536
Layers	6
Attention heads	8
Hidden size	256
Context	256 bytes
Vocabulary	256 raw byte values
Training text	616,538 bytes
Training steps	10,000

The corpus was stripped to the poem body before byte encoding, excluding Project Gutenberg boilerplate, prefaces, and footnotes. The split was contiguous: 90% train, 5% validation, 5% test. That makes the validation set later portions of the same translation, not an unrelated held-out domain.

Training Dynamics

Validation loss improved quickly through the first few thousand updates, reached its lowest observed value at 4,000 steps, and then degraded while the training loss continued to fall.

Validation loss curve bottoming near 4,000 steps and rising by 10,000 steps. — Validation loss reached its lowest observed value at 4,000 steps. The final checkpoint had much lower train loss, but substantially worse validation loss.

Step	Validation loss
1,000	1.683
2,000	1.358
3,000	1.265
4,000	1.236
5,000	1.261
6,000	1.330
7,000	1.411
8,000	1.553
9,000	1.736
10,000	1.894

Past roughly 4k-5k updates, additional optimization mostly reduced training loss while hurting held-out likelihood. For this corpus/model size, the useful checkpoint window is narrow.

The run saved checkpoints at 2,500-step intervals, so there is no saved 4,000-step checkpoint. The best observed step by validation was 4k; the best saved checkpoint is therefore probably 5k, with 2.5k undertrained and 7.5k/10k increasingly overfit.

Checkpoint Samples

The samples below use identical decoding settings. They are not cherry-picked across temperatures; the checkpoint is the only variable.

prompt = "Tell me, O Muse,"
temperature = 0.45
top_k = 20

2.5k steps

Tell me, O Muse, sir, and I do not tell yet on who can shee the
first of the suitors and on rocks; the was as I am to be in the stranger
to asse a god diving to the sea...

5k steps

Tell me, O Muse, the son of Atreus, and take the stranger
who was his perison are him with a pigs. The then went to the gods took
the ship...

Then Ulysses answered, "King Alcinous..."

7.5k steps

Tell me, O Muse, son of Elpenor, son of Pastrowia, and she had
compassion castriors; still, and when she had come up to my own gods...

10k steps

Tell me, O Muse, and Telemachus wept for a fashion ship in
the far from lawful and in the house of king'...

At 2.5k steps, the output has high-frequency Odyssey vocabulary but weak local syntax. At 5k, the sample shows better scene-level regularities: reported speech, character names, and repeated social roles. Later checkpoints are not monotonically better; they lower train loss while producing text that is more overfit and not more reliable.

Qualitative Features

A next-byte model trained on one book should not be described as understanding the text. It has no explicit representation of plot, character goals, or thematic causality. The observable behavior is distributional: which tokens, phrases, and local structures become likely under the learned model.

The recurring features in samples are exactly the high-frequency structures one would expect from this corpus:

Names: Ulysses, Telemachus, Eumaeus, Alcinous, Calypso
Settings: ships, sea, houses, shore, Troy, Ithaca
Social scenes: strangers arriving, hosts answering, suitors speaking, people sitting down
Repeated syntax: Then X answered, the son of, the gods, on board
Register: speech, lineage, movement, return, and formulaic transitions

Checkpoint	Words	Odyssey term hits	Repeated bigrams
2.5k	151	5	22
5k	128	12	12
7.5k	131	3	6
10k	127	5	6

The 5k sample had the strongest Odyssey-term density in this fixed comparison: 12 term hits in 128 words. It also reduced repeated bigrams relative to the 2.5k sample. These diagnostics are coarse, but they line up with the qualitative read: 5k is the most useful saved checkpoint.

Memorization Check

I checked generated 8-word sequences against the training text. For these samples, there were zero exact 8-gram overlaps with the source text.

This is a coarse heuristic, not a proof. A small byte model can memorize short fragments, local orthography, and corpus-specific phrase distributions without reproducing exact 8-grams. Still, the sampled passages are not direct long-span copies of the training file.

The failure mode is better described as local imitation without durable discourse structure. The model captures byte-level and phrase-level regularities, but it does not maintain coherent entities or events over long spans.

Reproducing It

Train the larger scratch model:

make train-scratch

Generate from the best saved checkpoint:

.venv/bin/python scripts/generate_scratch.py \
  --checkpoint outputs/scratch/odyssey-byte-gpt-10k \
  --weights 005000_weights.safetensors \
  --prompt "Tell me, O Muse," \
  --tokens 1000 \
  --temperature 0.45 \
  --top-k 20

Regenerate the local supporting artifacts:

make blog-artifacts

Next Experiment

The next run should not simply increase steps. The validation curve suggests using a denser checkpoint schedule around the transition region:

save every 500 steps from 2,500 to 5,500

I would also test a tokenizer trained only on the Odyssey, longer context windows, temperature sweeps per checkpoint, and a two-corpus run on the Iliad plus the Odyssey.

The practical result: for a one-book byte-level LM, checkpoint choice matters more than final-step loss. The strongest qualitative sample came from the mid-run region, not the last checkpoint.