What is a token in AI language models?
Short answer
A token is the unit of text a language model actually processes. Tokens are usually subword pieces, so a common word may be one token while a rare or long word splits into several. As a rough guide, English text averages about four characters per token, though the exact ratio depends on the tokenizer.
Tokens are not words
Before a model reads text, a tokenizer splits the text into tokens. A token is often a whole common word, but it can also be part of a word, a single character, or a piece of punctuation. Many tokenizers use an approach called byte pair encoding (BPE), which builds its vocabulary by repeatedly merging the most frequent adjacent character sequences.
- Short, common words are usually one token: the, and, cat
- Longer or rarer words split into pieces: tokenization might be token + ization
- Spaces and punctuation count too
- Numbers and code often use more tokens than plain prose
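To make the byte pair encoding idea concrete, here is a minimal sketch of how merges can be learned from a toy corpus. It is a simplification, not how any particular production tokenizer is implemented: real tokenizers typically work on bytes, pre-split text, and learn tens of thousands of merges. The function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) and the sample word counts are made up for illustration.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if no pairs remain.

    `words` maps a tuple of symbols to how often that word occurs.
    Ties go to the pair that was seen first.
    """
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


# Hypothetical toy corpus: word -> number of occurrences.
toy_counts = {"token": 6, "tokens": 4, "tokenization": 1}
merges, segmented = learn_bpe(toy_counts, num_merges=4)
print(merges)     # learned merges, in order
print(segmented)  # how each word is split after those merges
```

With only a few merges, the frequent stem is assembled first, while the rare word still ends in individual characters. That is the same effect described in the list above: common sequences become single tokens and rare ones stay split.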
Why token counts matter
Models read and generate text in tokens, and both their limits and their pricing are usually measured in tokens. Knowing the token count of a prompt helps you stay within a model's context window and predict cost.
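As a rough illustration of budgeting with tokens, the sketch below estimates a prompt's token count from the four-characters-per-token rule of thumb, checks it against a context window, and estimates cost. Everything here is an assumption for illustration: the function names, the window size, and the prices are invented, and the sketch assumes the prompt and the generated output share one context window and that input and output tokens are priced separately per million tokens. For exact counts, use the tokenizer that belongs to the model you are calling.

```python
import math

CHARS_PER_TOKEN = 4  # rough English average; the real ratio varies by tokenizer and text


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Estimate the token count from character length (heuristic, not exact)."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(prompt_tokens, max_output_tokens, context_window):
    """Check that the prompt plus the reserved output fits in the window.

    Assumes input and output share a single context window.
    """
    return prompt_tokens + max_output_tokens <= context_window


def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Estimate cost, given hypothetical prices per one million tokens."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Hypothetical numbers, not taken from any real model or price list.
prompt = "Summarize the following report in three bullet points. " * 50
prompt_tokens = estimate_tokens(prompt)
print(prompt_tokens)
print(fits_context(prompt_tokens, max_output_tokens=500, context_window=8_000))
print(estimate_cost(prompt_tokens, 500,
                    input_price_per_million=1.0, output_price_per_million=3.0))
```

Because numbers and code tend to use more tokens than prose, a character-based estimate like this will usually undercount for those inputs, so it is worth leaving some headroom.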