Tokenization Drift: Why Identical Text Sometimes Tokenizes Differently
Tokenization drift refers to the phenomenon where identical or nearly identical pieces of text produce different tokenization outputs under different conditions. Although tokenization is often assumed to be a deterministic, stable process, real-world scenarios reveal that the output can vary with encoding settings, tokenizer or model updates, text formats, or invisible characters present in the input.
One cause of drift is hidden formatting. Inputs copied from websites, PDFs, or rich text editors may include invisible Unicode characters such as zero-width spaces (U+200B), non-breaking spaces (U+00A0), or directional markers (U+200E/U+200F). These hidden characters alter token boundaries, causing subtle changes in tokenization even when the visible text appears identical.
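A minimal sketch of a cleanup step that removes this kind of hidden formatting before tokenization. The helper name `strip_invisible` is hypothetical; the sketch assumes a byte- or character-level tokenizer downstream, and only handles non-breaking spaces and Unicode format characters (category Cf), not every possible invisible character.

```python
import unicodedata

def strip_invisible(text: str) -> str:
    """Replace non-breaking spaces with plain spaces and drop
    zero-width/format characters (Unicode category Cf, which covers
    U+200B and directional markers like U+200E/U+200F)."""
    out = []
    for ch in text:
        if ch == "\u00a0":          # non-breaking space -> plain space
            out.append(" ")
        elif unicodedata.category(ch) == "Cf":  # format chars: drop entirely
            continue
        else:
            out.append(ch)
    return "".join(out)

# Two strings that render identically on screen: the second uses a
# non-breaking space and ends with a zero-width space.
clean = "total cost"
dirty = "total\u00a0cost\u200b"
```

Running the helper over `dirty` yields the same string as `clean`, so both inputs reach the tokenizer with identical boundaries.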
Drift may also occur when tokenizers are updated. Model providers sometimes refine vocabulary, adjust merge rules, or update handling for rare characters. When this happens, text that once produced a specific set of tokens may tokenize differently in newer versions.
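The effect of changed merge rules can be shown with a toy byte-pair-encoding loop. This is an illustration only: the merge tables below are invented for the example and do not correspond to any real provider's vocabulary.

```python
def bpe(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Minimal BPE: start from single characters and apply each merge
    rule in priority order, left to right."""
    tokens = list(word)
    for pair in merges:
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == pair:
                tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
            else:
                i += 1
    return tokens

# Hypothetical "old" vocabulary: merges built around the prefix "low".
old_merges = [("l", "o"), ("lo", "w")]
# Hypothetical "new" vocabulary: retraining favored the suffix "er".
new_merges = [("e", "r"), ("w", "er")]

print(bpe("lower", old_merges))  # ['low', 'e', 'r']
print(bpe("lower", new_merges))  # ['l', 'o', 'wer']
```

The same word yields three tokens either way, but the boundaries differ, so any system keyed to exact token sequences or IDs would see drift after the update.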
Encoding mismatches are another source of drift. Text decoded with the wrong encoding, for example UTF-8 bytes read as Latin-1, produces altered byte sequences that change how tokenizers interpret the text. This is especially important when processing multilingual datasets or text from legacy systems.
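A short demonstration of how the same visible text yields different byte sequences under different encodings, and how decoding with the wrong one corrupts the input before the tokenizer ever sees it:

```python
text = "café"

# "é" is two bytes in UTF-8 but one byte in Latin-1, so a byte-level
# tokenizer sees different inputs depending on the encoding used.
utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9' (5 bytes)
latin1_bytes = text.encode("latin-1")  # b'caf\xe9'     (4 bytes)

# Decoding UTF-8 bytes as Latin-1 silently produces mojibake instead
# of raising an error, since every byte is valid Latin-1.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)  # 'cafÃ©'
```

Because the mis-decoded string contains different characters, every downstream tokenizer will produce different tokens for it, even though the original text never changed.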
Finally, what is sometimes described as contextual drift may occur during inference. Although the tokenizer's output does not change, the model's treatment of a given token depends on the surrounding context, so the same token sequence embedded in different prompts can be interpreted differently during generation. This form of drift is harder to observe directly, since it happens inside the model rather than in the tokenization step.
Understanding tokenization drift helps developers troubleshoot inconsistencies, especially when working with datasets, prompt templates, or systems that rely on precise token counts.
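One practical way to troubleshoot such inconsistencies is to fingerprint inputs after Unicode normalization, so that visually identical strings with different underlying code points can be flagged before they reach the tokenizer. The helper name `canonical_fingerprint` and the choice of NFC normalization are assumptions for this sketch; it does not catch the format-character issues discussed earlier, only composed-vs-decomposed differences.

```python
import hashlib
import unicodedata

def canonical_fingerprint(text: str) -> str:
    """Hash of the NFC-normalized UTF-8 bytes. Two inputs with the same
    fingerprint are byte-identical after normalization, so a deterministic
    tokenizer will tokenize them identically."""
    normalized = unicodedata.normalize("NFC", text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# "é" as one precomposed code point vs. "e" plus a combining acute accent:
# visually identical, different raw bytes, same fingerprint after NFC.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"
```

Comparing fingerprints across dataset versions or prompt-template renders gives a cheap first check: if the fingerprints match but token counts differ, the tokenizer itself changed rather than the text.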