10K vs 32K Tokenizers Yield Similar Bytes per Token
TL;DR
- Observation: On TinyStories, 10K and 32K tokenizers produce almost identical compression (bytes/token).
- Why: The corpus is simple and repetitive; a 10K vocab already captures the common subwords, so the extra 22K merges in a 32K vocab are rarely used.
- Bigger gap when: Corpora are complex/diverse (OpenWebText, code, multilingual), where larger vocabs reduce token count noticeably.
At a glance: TinyStories results
| Tokenizer | Vocab size | Bytes/token | Sample |
|---|---|---|---|
| BPE | 5K | 3.970 | 10 docs (seed 42) |
| BPE | 10K | 4.058 | 10 docs (seed 42) |
| BPE | 32K | 4.072 | 10 docs (seed 42) |
All three are in the same ballpark; the small differences come from document sampling, the tokenizer variant (5K vs 10K/32K), and the dataset slice measured. Net: increasing the vocabulary buys little extra compression on this dataset.
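A minimal sketch of how a measurement like this can be reproduced, assuming the TinyStories text is available via the Hugging Face datasets hub (`roneneldan/TinyStories`, text column `text`) and that a trained tokenizer has been serialized to a `tokenizer.json` file. The dataset id, column name, file path, and sampling logic are illustrative assumptions, not taken from the repo script, so the seed-42 sample here will not necessarily match the table's exact sample.

```python
# Sketch: estimate bytes/token for a trained tokenizer on a small TinyStories sample.
# Assumptions: dataset id "roneneldan/TinyStories", text column "text", and a
# serialized tokenizer at "tinystories_10k/tokenizer.json" (hypothetical path).
import random

from datasets import load_dataset
from tokenizers import Tokenizer

def sample_docs(n_docs: int = 10, seed: int = 42) -> list[str]:
    """Draw a small, seeded sample of documents from the TinyStories training split."""
    ds = load_dataset("roneneldan/TinyStories", split="train")
    rng = random.Random(seed)
    indices = rng.sample(range(len(ds)), n_docs)
    return [ds[i]["text"] for i in indices]

def bytes_per_token(tokenizer: Tokenizer, docs: list[str]) -> float:
    """Total UTF-8 bytes divided by total tokens over the sampled documents."""
    total_bytes = sum(len(doc.encode("utf-8")) for doc in docs)
    total_tokens = sum(len(tokenizer.encode(doc).ids) for doc in docs)
    return total_bytes / total_tokens

if __name__ == "__main__":
    docs = sample_docs()
    tok = Tokenizer.from_file("tinystories_10k/tokenizer.json")  # hypothetical path
    print(f"bytes/token: {bytes_per_token(tok, docs):.3f}")
```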
Why the gap is small on TinyStories
- Vocabulary saturation: The lexicon is limited and repetitive. A 10K vocab already covers almost all common words/subwords (“the”, “and”, “play”, simple names), so the learned merges capture most of the attainable compression (a toy sketch follows this list).
- Diminishing returns: The extra 22K tokens in a 32K vocab skew toward rare/complex segments that barely occur in TinyStories, so they seldom apply.
- Short, simple morphology: With few long or rare words, there are fewer opportunities for a single large-vocab token to replace a multi-token fragment.
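A toy illustration of the saturation effect, using the Hugging Face `tokenizers` library to train two byte-pair encoders of different sizes on the same small, repetitive corpus. The corpus, the vocab sizes, and the whitespace pre-tokenizer are illustrative assumptions, not the repo's actual training setup; the point is only that once the corpus's frequent patterns are covered, extra vocabulary stops changing the tokenization.

```python
# Toy sketch: on a repetitive corpus, a modest BPE vocab compresses about as well
# as a much larger one, because the extra merges are rarely (or never) used.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Illustrative stand-in for TinyStories-style text: short sentences, small lexicon.
CORPUS = [
    "Once upon a time there was a little dog who liked to play in the park.",
    "The little dog and his friend played with a red ball all day.",
    "They were happy and they liked to play together every day.",
] * 200

def train_bpe(vocab_size: int) -> Tokenizer:
    """Train a small BPE tokenizer on the toy corpus."""
    tok = Tokenizer(BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tok.train_from_iterator(CORPUS, trainer)
    return tok

def bytes_per_token(tok: Tokenizer, texts: list[str]) -> float:
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(len(tok.encode(t).ids) for t in texts)
    return total_bytes / total_tokens

# Stand-ins for "10K" vs "32K" at toy scale; expect very similar bytes/token,
# because the larger budget has almost nothing left to merge.
for vocab_size in (150, 3000):
    tok = train_bpe(vocab_size)
    print(vocab_size, round(bytes_per_token(tok, CORPUS), 3))
```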
When you will see a bigger gap
- OpenWebText / web-scale prose: Broader vocabulary and topics. Larger vocabs can tokenize complex words as single units (e.g., “transformer”, “backpropagation”, “jurisdiction”).
- Code corpora: Identifiers and symbols benefit from longer subword units; larger vocabs capture common stems/snippets (e.g., “get_user_id”, “</div>”).
- Multilingual or domain jargon: Medical/legal or multilingual text has many low-frequency segments that a 10K vocab would split into many pieces.
Result: On complex corpora, a 32K tokenizer typically produces fewer tokens for the same bytes, i.e., a higher bytes/token ratio (illustrated in the sketch below).
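One way to see this direction of effect is to compare two off-the-shelf vocabularies of different sizes on more complex text. The sketch below uses tiktoken's gpt2 (~50K) and cl100k_base (~100K) encodings purely as stand-ins, since the 10K/32K tokenizers from this experiment are not bundled here; the sample strings and the expected gap are illustrative, not measured results from the repo.

```python
# Sketch: larger vocabularies tend to cover rare/technical words and code idioms
# with fewer tokens. Uses tiktoken's off-the-shelf encodings as stand-ins for
# "smaller" vs "larger" vocab; print the counts rather than assuming exact splits.
import tiktoken

small = tiktoken.get_encoding("gpt2")         # ~50K vocabulary
large = tiktoken.get_encoding("cl100k_base")  # ~100K vocabulary

samples = [
    "the dog ran to the park and played",                 # TinyStories-style prose
    "backpropagation through the transformer layers",     # web/ML prose
    "def get_user_id(session): return session.user.id",   # code-like text
]

for text in samples:
    n_bytes = len(text.encode("utf-8"))
    for name, enc in (("gpt2", small), ("cl100k_base", large)):
        n_tokens = len(enc.encode(text))
        print(f"{name:12s} tokens={n_tokens:3d} "
              f"bytes/token={n_bytes / n_tokens:.2f}  | {text}")
```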
How bytes/token is computed
- Definition: Total UTF-8 byte length of the raw text divided by the number of tokens the tokenizer produces for it, i.e., the average number of UTF-8 bytes each token covers.
- For simple English, average bytes per character ≈ 1 (ASCII), so differences mainly come from token count, not byte size.
- On multilingual text, characters may be 2–4 bytes in UTF-8; both numerator and denominator shift.
Formula: bytes_per_token = total_utf8_bytes / total_tokens
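The same formula written as a minimal, library-agnostic helper; `encode` stands for whatever tokenizer callable is in use (an assumption, not the repo script's interface). The last two lines show why UTF-8 byte counts differ from character counts on non-ASCII text.

```python
from typing import Callable, Iterable

def bytes_per_token(texts: Iterable[str], encode: Callable[[str], list[int]]) -> float:
    """bytes_per_token = total_utf8_bytes / total_tokens over a collection of texts."""
    total_bytes = 0
    total_tokens = 0
    for text in texts:
        total_bytes += len(text.encode("utf-8"))  # UTF-8 bytes, not characters
        total_tokens += len(encode(text))
    return total_bytes / total_tokens

# ASCII vs multi-byte characters: same character count, different byte count.
print(len("cafe"), len("cafe".encode("utf-8")))  # 4 4
print(len("café"), len("café".encode("utf-8")))  # 4 5  ("é" is 2 bytes in UTF-8)
```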
Code reference
- Repository: SDcodehub/assignment1-basics
- Script: cs336_basics/compute_bytes_per_token.py
Practical guidance
- If your deployment domain resembles TinyStories (simple, repetitive), a 10K vocab is often sufficient and cheaper.
- For real-world text (OpenWebText), code, or multilingual corpora, prefer larger vocabs (e.g., 32K) for better compression.
Learned points from the latest run
- A 5K TinyStories tokenizer measured 3.970 bytes/token, close to prior 10K/32K numbers.
- This reinforces vocabulary saturation on TinyStories: smaller vocabs already capture frequent patterns.
- Differences across 5K/10K/32K on TinyStories are modest and sensitive to sampling and tokenizer variant.
- Measuring OpenWebText requires explicit dataset paths; pass the script's required flags to quantify the cross-domain differences.