Interactive Talk
A guided walkthrough of the core concepts behind large language models.
A visual, non-technical walkthrough of how LLMs are built and why they behave the way they do.
Based on Andrej Karpathy's talk
Two phases to keep separate
Training: building or tuning the model by changing its parameters from examples.
Inference: using the trained model to generate one token at a time.
Training changes the model. Inference uses the model.
We turn messy web text into a cleaner dataset the model can learn from.
Web page to dataset row
Step 1
Before fetching, the crawler skips low-quality domains, spam, adult content, malware, and file types that don't belong in the training set.
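A minimal sketch of a pre-fetch filter, assuming a simple domain blocklist and a list of skipped file extensions. The names `BLOCKED_DOMAINS`, `SKIPPED_EXTENSIONS`, and `should_fetch` are hypothetical, and real crawlers rely on much larger, curated lists.

```python
from urllib.parse import urlparse

# Hypothetical blocklist and extension list, for illustration only.
BLOCKED_DOMAINS = {"spam.example", "malware.example"}
SKIPPED_EXTENSIONS = (".jpg", ".png", ".zip", ".exe", ".mp4")


def should_fetch(url: str) -> bool:
    """Return True if the URL passes the pre-fetch filter."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    # Block a listed domain and any of its subdomains.
    if any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS):
        return False
    # Skip file types that are not readable documents.
    if parsed.path.lower().endswith(SKIPPED_EXTENSIONS):
        return False
    return True
```

The check runs on the URL alone, so rejected pages cost no download.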
Step 2
A crawler starts with pages that look like documents people read.
Students often mix up cities, regions, and countries when learning geography.
The capital of France is Paris. It appears in this toy document as ordinary web text, exactly the kind of sentence a crawler might collect before cleaning.
After collection, this page would be reduced to plain text and stored alongside many other documents.
The best speeches usually sound like one person speaking sincerely to another person.
A good wedding toast should be personal, brief, and kind. It can include a small story, a thank-you, and a clear wish for the couple.
This is another toy source document that can move through the same scrape, process, deduplicate, and store pipeline.
Step 3
The crawler downloads more than the article: markup, navigation, styling, ads, and the actual text, all mixed together.
<article>
<h1>A Quick Guide To European Capitals</h1>
<p>Students often mix up cities, regions, and countries when learning geography.</p>
<p>The capital of France is Paris. It appears in this toy document as ordinary web text, exactly the kind of sentence a crawler might collect before cleaning.</p>
<p>After collection, this page would be reduced to plain text and stored alongside many other documents.</p>
</article>
<article>
<h1>How To Give A Warm Wedding Toast</h1>
<p>The best speeches usually sound like one person speaking sincerely to another person.</p>
<p>A good wedding toast should be personal, brief, and kind. It can include a small story, a thank-you, and a clear wish for the couple.</p>
<p>This is another toy source document that can move through the same scrape, process, deduplicate, and store pipeline.</p>
</article>
Step 4
Processing strips out navigation, ads, and styling, then keeps the readable text.
Students often mix up cities, regions, and countries when learning geography. The capital of France is Paris. It appears in this toy document as ordinary web text, exactly the kind of sentence a crawler might collect before cleaning. After collection, this page would be reduced to plain text and stored alongside many other documents.
The best speeches usually sound like one person speaking sincerely to another person. A good wedding toast should be personal, brief, and kind. It can include a small story, a thank-you, and a clear wish for the couple. This is another toy source document that can move through the same scrape, process, deduplicate, and store pipeline.
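The stripping step can be sketched with Python's standard-library `html.parser`. The set of skipped tags and the names `TextExtractor` and `html_to_text` are assumptions for illustration. This sketch also keeps heading text, which the pipeline above appears to store separately as the title.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping tags that usually hold non-content."""

    SKIP = {"script", "style", "nav", "aside"}  # assumed list, not exhaustive

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Production pipelines typically use more robust extractors, since real pages mix content and boilerplate in less predictable ways.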
Step 5
More passes remove boilerplate, normalize text, reduce near-copies, and drop low-value content.
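A rough sketch of normalization and deduplication, assuming exact copies are caught by hashing normalized text and near-copies by Jaccard similarity over word shingles. The threshold and function names are hypothetical. Large pipelines usually replace the pairwise comparison with an approximate method such as MinHash.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Unicode-normalize, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def shingles(text: str, n: int = 3) -> set:
    """Set of overlapping n-word windows, used to compare documents."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a: set, b: set) -> float:
    """Share of shingles the two sets have in common."""
    return len(a & b) / len(a | b) if a | b else 1.0


def deduplicate(docs, threshold=0.8):
    """Keep a document unless it is an exact or near copy of one already kept."""
    kept, seen_hashes, kept_shingles = [], set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact copy after normalization
        sh = shingles(doc)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near copy of a kept document
        seen_hashes.add(digest)
        kept_shingles.append(sh)
        kept.append(doc)
    return kept
```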
Step 6
The output is stored with metadata so it can be sampled during training.
| url | title | source | content |
|---|---|---|---|
| travel-notes.example/europe/capitals | A Quick Guide To European Capitals | Public travel notes · Geography basics | Students often mix up cities, regions, and countries when learning geography. The capital of France is Paris. It appea... |
| small-events.example/blog/wedding-toast | How To Give A Warm Wedding Toast | Personal blog · Reception advice | The best speeches usually sound like one person speaking sincerely to another person. A good wedding toast should be p... |
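One way to picture a stored row is a JSON Lines file with one record per document. The field names below mirror the table columns. The `Row` class, the JSONL format, and the seeded sampler are assumptions for illustration, not the storage format of any particular pipeline.

```python
import json
import random
from dataclasses import asdict, dataclass


@dataclass
class Row:
    url: str
    title: str
    source: str
    content: str


def to_jsonl(rows) -> str:
    """Serialize rows as one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in rows)


def sample_rows(jsonl: str, k: int, seed: int = 0):
    """Draw k rows at random, as a training loop might when building a batch."""
    rows = [json.loads(line) for line in jsonl.splitlines() if line]
    return random.Random(seed).sample(rows, k)
```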
Illustrative Split
Capital
The capital of France is Paris.
Token count
7
Token IDs
Toast
A good wedding toast should be personal, brief, and kind.
Token count
13
Token IDs
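The counts above match a toy split into words and punctuation marks, which the sketch below reproduces. This is an illustrative scheme only: production tokenizers generally use subword methods such as byte-pair encoding. The integer IDs assigned here come from a hypothetical vocabulary built on the spot and are not the IDs shown on the original page.

```python
import re


def split_tokens(text: str):
    """Split into words and single punctuation marks (a toy scheme)."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(texts):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for text in texts:
        for tok in split_tokens(text):
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(text, vocab):
    """Turn text into the list of token IDs the model would read."""
    return [vocab[tok] for tok in split_tokens(text)]
```

With this split, "The capital of France is Paris." yields 7 tokens and "A good wedding toast should be personal, brief, and kind." yields 13, because each comma and period counts as its own token.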
A token window goes in. A probability map comes out. During training, the dataset tells us which token came next.
Training Loop
Probability map
One probability for every token in the vocabulary. Here, we only show the top three.
Training nudges the neural net toward the next token that actually appeared in the training text.
Training rewards probability maps that make the observed next token more likely in similar contexts.
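A minimal sketch of one training nudge, assuming the standard cross-entropy loss on a softmax output. To keep it small, the "parameters" here are just three raw scores (logits) over a hypothetical three-token vocabulary. A real model computes those scores from the context through many layers.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability map that sums to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def training_step(logits, target, lr=0.5):
    """One gradient step on cross-entropy loss, treating the logits as parameters."""
    probs = softmax(logits)
    loss = -math.log(probs[target])
    # Gradient of the loss w.r.t. each logit: prob - 1 for the target, prob otherwise.
    new_logits = [
        x - lr * (p - (1.0 if i == target else 0.0))
        for i, (x, p) in enumerate(zip(logits, probs))
    ]
    return new_logits, loss


# Hypothetical vocabulary: index 0 = "Paris", 1 = "London", 2 = "Berlin".
logits = [0.0, 0.0, 0.0]
for _ in range(20):
    logits, loss = training_step(logits, target=0)
```

Each step raises the probability of the observed token and lowers the others, so the loss shrinks over the loop.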
The context window goes in. The parameters shape the equation. A probability map comes out.
Inside the Neural Net
Context window
A gigantic mathematical equation with many learned parameters.
Probability map
Training repeatedly tunes the parameters so this equation gets better at predicting useful next-token probability maps.
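As a stand-in for that equation, the sketch below uses a small table of scores as its parameters and sums the rows selected by the context tokens. The table layout and the name `forward` are assumptions for illustration. Real language models use far larger networks, but the shape is the same: context in, parameters applied, probability map out.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability map that sums to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(context_ids, params):
    """Map a context window to a probability map.

    params[i][j] is a learned score for token j appearing after token i.
    """
    vocab_size = len(params)
    logits = [0.0] * vocab_size
    for tok in context_ids:
        for j in range(vocab_size):
            logits[j] += params[tok][j]
    return softmax(logits)
```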
The model predicts a probability map, samples one token, appends it to the context, then runs again.
Sampling Loop
The capital of France is
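The loop can be sketched as below. `TOY_MODEL` is a hypothetical lookup table standing in for a trained network, and its probabilities are made up for illustration.

```python
import random

# Hypothetical next-token table standing in for a trained model.
TOY_MODEL = {
    "is": {"Paris": 0.9, "Lyon": 0.1},
    "Paris": {".": 1.0},
    "Lyon": {".": 1.0},
}


def predict(context):
    """Return a probability map for the next token given the context."""
    return TOY_MODEL.get(context[-1], {".": 1.0})


def generate(context, max_new_tokens=5, seed=0):
    """Predict, sample one token, append it, and repeat."""
    rng = random.Random(seed)
    context = list(context)
    for _ in range(max_new_tokens):
        probs = predict(context)
        tokens, weights = zip(*probs.items())
        next_token = rng.choices(tokens, weights=weights, k=1)[0]
        context.append(next_token)
        if next_token == ".":
            break
    return context
```

Because the next token is sampled rather than always taken from the top of the map, the same prompt can produce different continuations across runs with different seeds.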
At this point, we have a trained internet-text generator. Powerful, but not yet a helpful assistant.
June 12, 2017: Attention Is All You Need
May 28, 2020: Language Models are Few-Shot Learners
Conversation Dataset
Base model
Good at continuing internet-style text.
Assistant-shaped model
Better at replying in conversational turns.
Tool Call Loop
User asks for fresh information.
The answer may not be stored in the model's parameters.
The model emits a tool call.
The surrounding app recognizes the format.
Result returns to context.
The model continues with new information.
<|tool_call|> search({ query: "weather in Paris today" })
<|tool_result|> 72°F and clear
<|assistant|> It is clear and about 72°F in Paris today.
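A sketch of the app side of this loop, assuming the exact call format shown in the transcript above. The regular expression, the `fake_search` stub, and the `TOOLS` registry are hypothetical. A real app would call an actual search or weather service.

```python
import re

# Matches the toy format: <|tool_call|> name({ query: "..." })
TOOL_CALL = re.compile(r'<\|tool_call\|>\s*(\w+)\(\{\s*query:\s*"([^"]*)"\s*\}\)')


def fake_search(query: str) -> str:
    """Hypothetical stub that returns a canned result."""
    return "72°F and clear"


TOOLS = {"search": fake_search}


def handle_model_output(context: list, model_output: str) -> list:
    """Append the model output, plus a tool result if the model requested one."""
    context = context + [model_output]
    match = TOOL_CALL.search(model_output)
    if match:
        name, query = match.groups()
        tool = TOOLS.get(name)
        result = tool(query) if tool else f"unknown tool: {name}"
        context.append(f"<|tool_result|> {result}")
    return context
```

The model never runs the tool itself. It only emits text in the agreed format, and the surrounding app does the work and writes the result back into the context.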
An SFT (supervised fine-tuning) model behaves like an assistant. It works from the "knowledge" encoded in its parameters and whatever is visible in the context window. Tools can add fresh information to that window.
Vague recollection
Like something you read a month ago: useful, compressed, and sometimes fuzzy.
Working memory
The text currently in front of the model: specific, fresh, and directly available.
Human Preference Loop
The oranges cost 2 × $2 = $4. The total is $13, so the apples cost $13 - $4 = $9. There are 3 apples, so each apple costs $9 ÷ 3 = $3.
Correct, clear, and works step by step toward the answer.
Each apple costs $3.
Correct, but skips the reasoning the user may need.
Each apple costs $4.
Confident, but wrong.
A good wedding toast should be personal, brief, and kind. Tell one true story, name one real emotion, and end by honoring the couple.
Warm, useful, specific, and aligned with the requested task.
A good wedding toast should be funny and memorable.
Reasonable, but generic and less actionable.
Just roast the couple and tell every embarrassing story you know.
Likely entertaining in theory, but unhelpful and socially risky.
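One common way to turn rankings like these into a training signal is a pairwise preference loss: for every pair, the preferred answer should score higher than the rejected one. The sketch below fits one scalar score per answer under that assumption. The function name and the answer labels are hypothetical, and in practice the scores come from a neural reward model rather than a lookup table.

```python
import math
from itertools import combinations


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_rewards(ranking, steps=200, lr=0.1):
    """Fit a score per answer from a ranking ordered best to worst."""
    rewards = {answer: 0.0 for answer in ranking}
    # combinations keeps input order, so each pair is (preferred, rejected).
    pairs = list(combinations(ranking, 2))
    for _ in range(steps):
        for win, lose in pairs:
            p = sigmoid(rewards[win] - rewards[lose])
            # Gradient step on -log(p): push the preferred score up, the other down.
            rewards[win] += lr * (1.0 - p)
            rewards[lose] -= lr * (1.0 - p)
    return rewards


scores = fit_rewards(["step-by-step answer", "short answer", "wrong answer"])
```

After fitting, the step-by-step answer has the highest score and the wrong answer the lowest, which is the ordering the human ranking expressed.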
LLMs start as text-prediction machines, then become useful assistants through data, training, context, tools, and feedback.
Data becomes tokens.
Web text is collected, cleaned, tokenized, and turned into model-readable IDs.
Training tunes parameters.
The neural net learns to produce better next-token probability maps.
Inference samples outputs.
Generation is repeated sampling: predict, choose, append, repeat.
SFT shapes conversation.
Supervised fine-tuning (SFT) on conversation examples tunes the model toward assistant-style replies.
Tools add fresh context.
Special tokens let the model ask the app to search, run code, calculate, or call APIs.
RLHF shapes preference.
In reinforcement learning from human feedback (RLHF), human rankings nudge the assistant toward answers people prefer.
Once the basics click, these are the next doors to open.