Subword Fragmentation: Why Certain Words Break into Many Tokens
Subword fragmentation occurs when a tokenizer splits a single word into multiple smaller subword units. It happens primarily with rare, complex, or linguistically unusual terms that have no single entry in the tokenizer’s vocabulary. Fragmentation lengthens sequences, raises token costs, and can reduce model accuracy on the affected terms.
One major cause of fragmentation is insufficient vocabulary coverage. Tokenizers learn their vocabularies from large training corpora, but no fixed-size vocabulary can cover every word in every language or technical domain. When a word is absent, the tokenizer decomposes it into smaller known pieces, sometimes producing fragments that do not align with the word’s actual morpheme boundaries.
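This decomposition can be sketched with the greedy longest-match scheme used by WordPiece-style tokenizers. The vocabulary below is invented for illustration (real vocabularies hold tens of thousands of entries); the point is only to show how an in-vocabulary word survives whole while an out-of-vocabulary word fragments:

```python
# Toy WordPiece-style greedy longest-match tokenizer.
# VOCAB is illustrative only, not taken from any real tokenizer.
VOCAB = {"token", "##izer", "frag", "##ment", "##ation"}

def tokenize_word(word: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position.

    In-vocabulary words come out as one token; out-of-vocabulary
    words fragment into several subword pieces.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation marker, as in WordPiece
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # nothing matched, not even a prefix
        start = end
    return pieces

print(tokenize_word("token"))          # → ['token']
print(tokenize_word("tokenizer"))      # → ['token', '##izer']
print(tokenize_word("fragmentation"))  # → ['frag', '##ment', '##ation']
```

Note how the fragments fall out of whatever the vocabulary happens to contain: "fragmentation" splits at frag/ment/ation here only because those pieces exist in this toy vocabulary, not because the tokenizer understands morphology.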
Morphologically rich languages such as Finnish, Turkish, and German experience fragmentation more frequently. Words built from multiple prefixes, suffixes, or compounded stems often break into several tokens, which increases sequence length and affects how embeddings represent the word’s meaning.
Fragmentation can also occur with long or specialized terminology. Scientific, medical, or other domain-specific terms often contain uncommon character sequences, prompting the tokenizer to split them into reusable subwords. While functional, such splits can obscure the term’s meaning for the model.
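Both effects can be illustrated with the same greedy-matching idea (again with an invented vocabulary): a bare stem stays whole, while an affixed form splits into prefix, stem, and suffix pieces, and a compound splits at its component boundary.

```python
# Toy illustration of morphological and compound fragmentation.
# The vocabulary is invented for this sketch.
VOCAB = {"break", "un", "##break", "##able", "rain", "##coat"}

def greedy_pieces(word: str) -> list[str]:
    """Split a word by greedy longest-match against VOCAB."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            cand = ("##" + word[start:end]) if start else word[start:end]
            if cand in VOCAB:
                pieces.append(cand)
                break
            end -= 1
        else:
            return ["[UNK]"]
        start = end
    return pieces

print(greedy_pieces("break"))        # → ['break']        (bare stem: 1 token)
print(greedy_pieces("unbreakable"))  # → ['un', '##break', '##able']  (affixed: 3)
print(greedy_pieces("raincoat"))     # → ['rain', '##coat']  (compound: 2)
```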
Developers can reduce fragmentation by extending the vocabulary when training or fine-tuning a custom tokenizer, by choosing simpler phrasing, or by preferring token-efficient wording. In some cases, rewriting a complex term as a clearer equivalent reduces both cost and ambiguity without sacrificing precision.
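As a rough sketch of that trade-off, the toy counter below (vocabulary and piece counts invented for illustration) shows a specialist term costing more subword pieces than a plain-language equivalent:

```python
# Toy token counter: greedy longest-match over a whitespace-split text.
# VOCAB is illustrative only; real tokenizers would yield different counts.
VOCAB = {"heart", "attack", "myo", "card", "ial", "infarc", "tion"}

def count_tokens(text: str, vocab: set[str]) -> int:
    """Count subword pieces via greedy longest-match;
    an unmatched character falls back to one piece by itself."""
    total = 0
    for word in text.lower().split():
        start = 0
        while start < len(word):
            end = len(word)
            while end > start and word[start:end] not in vocab:
                end -= 1
            total += 1
            start = end if end > start else start + 1  # per-char fallback
    return total

print(count_tokens("myocardial infarction", VOCAB))  # → 5
print(count_tokens("heart attack", VOCAB))           # → 2
```

Whether such a rewrite is acceptable depends on the domain: in a clinical context the precise term may be required, but in general-audience prompts the cheaper phrasing often carries the same intent.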
Understanding fragmentation helps explain unexpected token counts and prepares developers to design more efficient prompts that align with the tokenizer’s strengths.