The Art of Tokenization
Before an AI can read, it must dissect. Tokenization is the process of breaking text into the mathematical atoms of language.

Example: Breaking down a complex sentence into distinct, numbered tokens.
Characters
Too small: can represent any word, but the model struggles to grasp "meaning" from individual letters.
"C-a-t" vs "Category"
Words
Too big: easy to understand, but the vocabulary becomes huge. Rare words or typos break the system.
"Run", "Running", "Ran" treated separately.
Subwords
Just right: breaks complex words into meaningful chunks. Efficient and flexible.
"Cannibal" → "Cann" + "ibal"
Real World Example: GPT-4o
Input Text
Every artist is a cannibal, every poet is a thief.
Step 1: Tokens (The Chunks)
Every | artist | is | a | cann | ibal | , | every | poet | is | a | thief | .
Step 2: Token IDs (The Numbers)
15745 | 12337 | 382 | 261 | 21511 | 63126 | 11 | 17534 | 8961 | 382 | 261 | 106886 | 13
Did you notice? The word "is" appears twice. Both times, it is assigned the exact same ID: 382. This is how the model recognizes patterns—it doesn't read words, it processes sequences of repeating numbers.
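This pattern is easy to reproduce with a toy tokenizer: any scheme that maps chunks to IDs deterministically will hand a repeated chunk the same number every time. (The IDs below are made up for illustration; they are not real GPT-4o IDs.)

```python
# A toy word-level tokenizer: each distinct chunk gets one ID, reused on repeats.
sentence = "Every artist is a cannibal , every poet is a thief ."
vocab = {}  # chunk -> ID, filled in on first sight

def token_id(chunk):
    if chunk not in vocab:
        vocab[chunk] = len(vocab)  # assign the next free ID
    return vocab[chunk]

ids = [token_id(chunk) for chunk in sentence.split()]
print(ids)  # "is" appears twice and maps to the same ID both times
```

Note that "is", "a", and their IDs each show up twice in the output, mirroring how the real tokenizer reuses ID 382 for both occurrences of "is".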