The Art of Tokenization
Before an AI can read, it must dissect. Tokenization is the process of breaking text into the mathematical atoms of language.

Example: Breaking down a complex sentence into distinct, numbered tokens.
Characters
Too small: can represent any word, but the model struggles to grasp "meaning" from individual letters.
"C-a-t" vs "Category"
Words
Too big: easy to understand, but the vocabulary becomes huge. Rare words or typos break the system.
"Run", "Running", "Ran" treated separately.
Subwords
Just right: breaks complex words into meaningful chunks. Efficient and flexible.
"Cannibal" → "Cann" + "ibal"
Real World Example: GPT-4o
Input Text
Every artist is a cannibal, every poet is a thief.
Step 1: Tokens (The Chunks)
Every | artist | is | a | cann | ibal | , | every | poet | is | a | thief | .
Step 2: Token IDs (The Numbers)
15745 | 12337 | 382 | 261 | 21511 | 63126 | 11 | 17534 | 8961 | 382 | 261 | 106886 | 13
Did you notice? The word "is" appears twice. Both times, it is assigned the exact same ID: 382. This is how the model recognizes patterns—it doesn't read words, it processes sequences of repeating numbers.
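This pattern is easy to reproduce with a toy tokenizer: any scheme that maps chunks to IDs deterministically will hand a repeated chunk the same number every time. (The IDs below are made up for illustration; they are not real GPT-4o IDs.)

```python
# A toy word-level tokenizer: each distinct chunk gets one ID, reused on repeats.
sentence = "Every artist is a cannibal , every poet is a thief ."
vocab = {}  # chunk -> ID, filled in on first sight

def token_id(chunk):
    if chunk not in vocab:
        vocab[chunk] = len(vocab)  # assign the next free ID
    return vocab[chunk]

ids = [token_id(chunk) for chunk in sentence.split()]
print(ids)  # "is" appears twice and maps to the same ID both times
```

Note that "is", "a", and their IDs each show up twice in the output, mirroring how the real tokenizer reuses ID 382 for both occurrences of "is".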