Tokenization is the first step in processing text for AI models: raw text is split into tokens, which may be whole words, subwords, or individual characters. The most widely used approach is Byte Pair Encoding (BPE), which starts from characters (or bytes) and repeatedly merges the most frequent adjacent pair of symbols into a new vocabulary entry. Token count matters because LLMs have context-window limits measured in tokens. As a rough rule of thumb for English, one token is about four characters, so a common word is often a single token, while longer or rarer words split into two or more.







