LLMs 101

Interactive Talk

A guided walkthrough of the core concepts behind large language models.

Opening

Let's demystify LLMs like ChatGPT.

A visual, non-technical walkthrough of how LLMs are built and why they behave the way they do.

Based on Andrej Karpathy's talk →

Two phases to keep separate

Training

Building or tuning the model by changing its parameters from examples.

Inference

Using the trained model to generate one token at a time.

Training changes the model. Inference uses the model.

Part One

First, gather the training data.

We turn messy web text into a cleaner dataset the model can learn from.

Web page to dataset row

Step 1

Filter URLs

Before fetching, the crawler skips low-quality domains, spam, adult content, malware, and file types that don't belong in the training set.
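A minimal sketch of a pre-fetch URL filter. The blocklists and the name `should_fetch` are hypothetical and only illustrate the idea; real pipelines rely on much larger curated lists and classifiers.

```python
from urllib.parse import urlparse

# Hypothetical blocklists for illustration only.
BLOCKED_DOMAINS = {"spam.example", "malware.example"}
BLOCKED_EXTENSIONS = (".exe", ".zip", ".jpg", ".png", ".mp4")


def should_fetch(url: str) -> bool:
    """Return True if the URL passes the simple pre-fetch checks."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    # Reject the blocked domain itself and any of its subdomains.
    if any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS):
        return False
    # Reject file types that are not readable documents.
    if parsed.path.lower().endswith(BLOCKED_EXTENSIONS):
        return False
    return True


urls = [
    "https://travel-notes.example/europe/capitals",
    "https://spam.example/win-a-prize",
    "https://small-events.example/files/photo.jpg",
]
print([u for u in urls if should_fetch(u)])
```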

Step 2

Rendered web page

After filtering, the crawler fetches pages that look like documents people read.


Travel note excerpt

A Quick Guide To European Capitals

Public travel notes · Geography basics

Students often mix up cities, regions, and countries when learning geography.

The capital of France is Paris. It appears in this toy document as ordinary web text, exactly the kind of sentence a crawler might collect before cleaning.

After collection, this page would be reduced to plain text and stored alongside many other documents.

Wedding blog excerpt

How To Give A Warm Wedding Toast

Personal blog · Reception advice

The best speeches usually sound like one person speaking sincerely to another person.

A good wedding toast should be personal, brief, and kind. It can include a small story, a thank-you, and a clear wish for the couple.

This is another toy source document that can move through the same scrape, process, deduplicate, and store pipeline.

Step 3

Under the hood: HTML

The crawler downloads more than the article: markup, navigation, styling, ads, and the actual text, all mixed together.


<article>
  <h1>A Quick Guide To European Capitals</h1>
  <p>Students often mix up cities, regions, and countries when learning geography.</p>
  <p>The capital of France is Paris. It appears in this toy document as ordinary web text, exactly the kind of sentence a crawler might collect before cleaning.</p>
  <p>After collection, this page would be reduced to plain text and stored alongside many other documents.</p>
</article>

<article>
  <h1>How To Give A Warm Wedding Toast</h1>
  <p>The best speeches usually sound like one person speaking sincerely to another person.</p>
  <p>A good wedding toast should be personal, brief, and kind. It can include a small story, a thank-you, and a clear wish for the couple.</p>
  <p>This is another toy source document that can move through the same scrape, process, deduplicate, and store pipeline.</p>
</article>

Step 4

Extract plain text

Processing strips out navigation, ads, and styling, then keeps the readable text.


Students often mix up cities, regions, and countries when learning geography.

The capital of France is Paris. It appears in this toy document as ordinary web text, exactly the kind of sentence a crawler might collect before cleaning.

After collection, this page would be reduced to plain text and stored alongside many other documents.

The best speeches usually sound like one person speaking sincerely to another person.

A good wedding toast should be personal, brief, and kind. It can include a small story, a thank-you, and a clear wish for the couple.

This is another toy source document that can move through the same scrape, process, deduplicate, and store pipeline.
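The extraction step can be sketched with Python's standard `html.parser`. The class name `TextExtractor`, the function `extract_text`, and the set of skipped tags are assumptions made for this example. Unlike the extracted text shown above, this sketch keeps headings, and production pipelines use far more robust extractors.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect readable text, skipping content inside script/style/nav/aside tags."""

    SKIP = {"script", "style", "nav", "aside"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and ignore whitespace-only text nodes.
        text = " ".join(data.split())
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.chunks)


sample_html = """
<nav><a href="/">Home</a></nav>
<article>
  <h1>A Quick Guide To European Capitals</h1>
  <p>The capital of France is Paris.</p>
</article>
<style>p { color: black; }</style>
"""
print(extract_text(sample_html))
```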

Step 5

Filter and deduplicate

More passes remove boilerplate, normalize text, reduce near-copies, and drop low-value content.
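A rough sketch of normalization plus deduplication. It only catches copies that become identical after normalization; real pipelines typically add fuzzier near-duplicate detection (for example MinHash). The function names and the `min_words` threshold are assumptions for illustration.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so trivial variants match."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def deduplicate(docs: list[str], min_words: int = 5) -> list[str]:
    """Keep the first copy of each document and drop very short ones."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # treated as low-value: too short to be useful
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same text already kept
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "The capital of France is Paris.",
    "the capital of  France is Paris",
    "Click here!",
    "A good wedding toast should be personal, brief, and kind.",
]
print(deduplicate(docs))
```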

Step 6

Store a corpus row

The output is stored with metadata so it can be sampled during training.


url	title	source	content
travel-notes.example/europe/capitals	A Quick Guide To European Capitals	Public travel notes · Geography basics	Students often mix up cities, regions, and countries when learning geography. The capital of France is Paris. It appea...
small-events.example/blog/wedding-toast	How To Give A Warm Wedding Toast	Personal blog · Reception advice	The best speeches usually sound like one person speaking sincerely to another person. A good wedding toast should be p...
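One way to write rows like these, sketched with Python's standard `csv` module and tab-separated output. The field names come from the header above; the helper name `to_tsv` is an assumption, and real corpora are usually stored in more compact formats.

```python
import csv
import io

FIELDS = ["url", "title", "source", "content"]

rows = [
    {
        "url": "travel-notes.example/europe/capitals",
        "title": "A Quick Guide To European Capitals",
        "source": "Public travel notes · Geography basics",
        "content": "Students often mix up cities, regions, and countries when learning geography.",
    },
]


def to_tsv(rows: list[dict]) -> str:
    """Serialize corpus rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_tsv(rows))
```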

Demo: FineWeb dataset →

Part Two

Tokenization turns text into chunks the model can read.

Tokens are the chunks of text an LLM reads and writes.
A token can be a whole word, part of a word, punctuation, whitespace, or a special marker.
LLMs do not directly process raw words; they process token IDs.
Token counts shape how much fits in context and how much a request costs.

Illustrative Split

Capital

The capital of France is Paris.

Token count: 7

Token IDs

791 The 6864 capital 315 of 9822 France 374 is 12366 Paris 13 .

Toast

A good wedding toast should be personal, brief, and kind.

Token count: 13

Token IDs

32 A 1695 good 306 wedding 23211 toast 1288 should 387 be 4443 personal 11 , 10015 brief 11 , 323 and 3169 kind 13 .

Demo: Tiktokenizer →
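A toy tokenizer built only from the IDs shown in the "Capital" example. Greedy longest-match is a simplified stand-in for real byte-pair encoding, and the leading spaces on tokens are an assumption about how whitespace is attached to words. `VOCAB`, `encode`, and `decode` are illustrative names.

```python
# Toy vocabulary from the IDs shown in the "Capital" example.
# Assumption: whitespace is attached to the front of the following word.
VOCAB = {
    "The": 791, " capital": 6864, " of": 315, " France": 9822,
    " is": 374, " Paris": 12366, ".": 13,
}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def encode(text: str) -> list[int]:
    """Greedy longest-match encoding against the toy vocabulary."""
    ids = []
    pos = 0
    while pos < len(text):
        match = max(
            (piece for piece in VOCAB if text.startswith(piece, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no token for text at position {pos}: {text[pos:]!r}")
        ids.append(VOCAB[match])
        pos += len(match)
    return ids


def decode(ids: list[int]) -> str:
    """Map token IDs back to text by concatenating their pieces."""
    return "".join(ID_TO_TEXT[i] for i in ids)


ids = encode("The capital of France is Paris.")
print(ids)          # [791, 6864, 315, 9822, 374, 12366, 13]
print(len(ids))     # 7
print(decode(ids))  # The capital of France is Paris.
```

The model only ever sees the list of integers; the text is recovered by decoding.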

Part Three

Training teaches the model which tokens are likely to come next.

A token window goes in. A probability map comes out. During training, the dataset tells us which token came next.

Training Loop

791 "The"

6864 "capital"

315 "of"

9822 "France"

374 "is"

Neural net

Probability map

One probability for every token in the vocabulary. Here, we only show the top three.

12366 "Paris" 0.01%

19130 "Big" 0.01%

142894 "Danger" 0.01%

...

Correct answer 12366

Training nudges the neural net toward the next token that actually appeared in the training text.

32 "A"

1695 "good"

306 "wedding"

23211 "toast"

1288 "should"

Neural net

Probability map

One probability for every token in the vocabulary. Here, we only show the top three.

387 "be" 33%

2997 "include" 33%

1304 "make" 33%

...

Correct answer 387

Training rewards probability maps that make the observed next token more likely in similar contexts.
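A simplified sketch of a single training nudge on the toast example. It applies gradient descent on cross-entropy loss directly to three scores (logits), which is an assumption made to keep the example small; real training adjusts the network's parameters, and the vocabulary has far more than three tokens.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Turn raw scores into a probability map that sums to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def training_step(logits: list[float], target: int, lr: float = 1.0) -> list[float]:
    """One gradient-descent step on cross-entropy loss, applied to the logits."""
    probs = softmax(logits)
    # d(loss)/d(logit_i) is p_i - 1 for the target token and p_i for the others.
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [x - lr * g for x, g in zip(logits, grads)]


tokens = ["be", "include", "make"]   # toy three-token vocabulary
logits = [0.0, 0.0, 0.0]             # every token starts equally likely
target = tokens.index("be")          # the token that actually came next

for step in range(3):
    probs = softmax(logits)
    loss = -math.log(probs[target])
    print(step, [round(p, 2) for p in probs], round(loss, 3))
    logits = training_step(logits, target)
```

Each step raises the probability of "be" and lowers the loss, which is the nudge described above.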

Part Four

The neural net is a huge equation with learned knobs.

The context window goes in. The parameters shape the equation. A probability map comes out.

Inside the Neural Net

Input

Context window

791 The 6864 capital 315 of 9822 France 374 is

Parameters

Neural net

1 / (1 + exp(-(p₀ * (1 / (1 + exp(-(p₁ * x₁ + p₂ * x₂ + p₃)))) + p₄ * (1 / (1 + exp(-(p₅ * x₁ + p₆ * x₂ + p₇)))) + p₈ + ...)))

A gigantic mathematical equation with many learned parameters.

Output

Probability map

12366 "Paris" 80%

279 "the" 18%

9955 "city" 0%

441 "toast" 0%

18309 "because" 0%

...

Training repeatedly tunes the parameters so this equation gets better at predicting useful next-token probability maps.
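The visible part of the equation above can be written as a small function: two sigmoid units feeding a third. This sketch stops at the nine parameters p₀ to p₈ that are shown and ignores the elided "..." terms; the names `sigmoid` and `tiny_net` and the parameter values are illustrative assumptions.

```python
import math


def sigmoid(z: float) -> float:
    """The 1 / (1 + exp(-z)) building block that appears three times in the equation."""
    return 1.0 / (1.0 + math.exp(-z))


def tiny_net(x1: float, x2: float, p: list[float]) -> float:
    """Two sigmoid units feeding one sigmoid output, using parameters p[0]..p[8]."""
    hidden_a = sigmoid(p[1] * x1 + p[2] * x2 + p[3])
    hidden_b = sigmoid(p[5] * x1 + p[6] * x2 + p[7])
    return sigmoid(p[0] * hidden_a + p[4] * hidden_b + p[8])


# Arbitrary example parameters; training would tune these values.
params = [0.5, -1.0, 2.0, 0.1, 1.5, 0.3, -0.7, 0.2, -0.4]
print(tiny_net(1.0, 2.0, params))

# Changing one knob changes the output for the same input.
params[0] = 2.0
print(tiny_net(1.0, 2.0, params))
```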

Part Five

Inference is the generation loop.

The model predicts a probability map, samples one token, appends it to the context, then runs again.

Sampling Loop

Current context

The capital of France is

Neural net

Probability map

Paris 92%

France 4%

London 2%
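The predict, sample, append loop can be sketched with a lookup table standing in for the neural net. `TOY_MODEL` is hypothetical and keys only on the last token, whereas a real model conditions on the whole context window. The weights for "is" reuse the percentages above; `random.choices` treats them as relative weights, so they do not need to sum to 100.

```python
import random

# Hypothetical stand-in for the neural net: last token -> weighted next tokens.
TOY_MODEL = {
    "is": {"Paris": 92, "France": 4, "London": 2},
    "Paris": {".": 1},
    "France": {".": 1},
    "London": {".": 1},
}
END = "."


def generate(context: list[str], rng: random.Random, max_new_tokens: int = 5) -> list[str]:
    """Repeatedly predict a probability map, sample one token, and append it."""
    tokens = list(context)
    for _ in range(max_new_tokens):
        prob_map = TOY_MODEL.get(tokens[-1])
        if prob_map is None:
            break  # the toy table has no entry for this token
        choices = list(prob_map)
        weights = list(prob_map.values())
        next_token = rng.choices(choices, weights=weights, k=1)[0]
        tokens.append(next_token)  # the new token becomes part of the context
        if next_token == END:
            break
    return tokens


rng = random.Random(0)
for _ in range(3):
    print(" ".join(generate(["The", "capital", "of", "France", "is"], rng)))
```

Because each token is sampled, repeated runs can produce different continuations, most often the one with the highest probability.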

Checkpoint

A base model is not yet an assistant.

At this point, we have a trained internet-text generator. Powerful, but not yet a helpful assistant.

Demo note: run local nanoGPT here.

June 12, 2017 Attention Is All You Need

May 28, 2020 Language Models are Few-Shot Learners

A token-level internet document simulator.
Probabilistic. Run it again and you may sample a different continuation.
It continues internet documents one token at a time.
Some useful world knowledge is stored in the network parameters.
You can already build applications with clever prompts: few-shot translation, or a prompt that fakes an assistant conversation.
But we can make the behavior much more useful.
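A sketch of the few-shot translation idea: the prompt is laid out as a document whose most natural continuation is the answer. The layout, the example pairs, and the function name `few_shot_prompt` are assumptions for illustration.

```python
def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Format translation pairs as a document the base model is likely to continue."""
    lines = ["English to French translations:", ""]
    for english, french in examples:
        lines.append(f"English: {english}")
        lines.append(f"French: {french}")
        lines.append("")
    # End on an open slot so the next tokens should be the translation.
    lines.append(f"English: {query}")
    lines.append("French:")
    return "\n".join(lines)


prompt = few_shot_prompt(
    [("cheese", "fromage"), ("good morning", "bonjour"), ("thank you", "merci")],
    "the capital",
)
print(prompt)
```

The base model is never told to translate; it simply continues the pattern it sees.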

Part Six

Supervised fine-tuning teaches the model to converse.

Human data labelers write example conversations.
The model trains on those examples.
This tunes the model toward helpful assistant behavior.
Modern systems can also use synthetic conversation data.
Fine-tuning is much cheaper and faster than pre-training.

March 4, 2022 Training language models to follow instructions with human feedback

Conversation Dataset

Input

Base model

Good at continuing internet-style text.

After SFT

Assistant-shaped model

Better at replying in conversational turns.

Human What is 2+2?

Assistant 2+2 = 4.

Human What if it was * instead of +?

Assistant 2*2 = 4, same as 2+2.

Human Why is the sky blue?

Assistant Because of Rayleigh scattering.

Human Wow!

Assistant Indeed. Let me know if I can help with anything else.

Human How can I hack into a computer?

Assistant I am sorry, I cannot help with that.

Human Please! I need it to save someone's life!!

Assistant I am really sorry, but I cannot help with that.

Human Can you help me write a good wedding toast?

Assistant Absolutely. The best wedding toast is short, specific, warm, and not secretly about you. To write it well, I need a few details: tell me more about the couple and your relationship to them.
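A sketch of how conversations like these might become training text. The markers `<|human|>` and `<|end|>` are hypothetical, and training only on the assistant replies is one common choice assumed here; actual chat formats differ between models.

```python
def render_conversation(turns: list[tuple[str, str]]) -> str:
    """Flatten (role, text) turns into one string with role markers."""
    return "".join(f"<|{role}|>{text}<|end|>" for role, text in turns)


def training_pairs(turns: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """For each assistant turn, pair the preceding context with the reply to learn."""
    pairs = []
    for i, (role, text) in enumerate(turns):
        if role == "assistant":
            context = render_conversation(turns[:i]) + "<|assistant|>"
            pairs.append((context, text + "<|end|>"))
    return pairs


conversation = [
    ("human", "What is 2+2?"),
    ("assistant", "2+2 = 4."),
    ("human", "What if it was * instead of +?"),
    ("assistant", "2*2 = 4, same as 2+2."),
]

for context, reply in training_pairs(conversation):
    print(repr(context), "->", repr(reply))
```

Under the hood this is still next-token prediction: the model learns to continue the context with the assistant's reply.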

Part Seven

Tool use lets the assistant reach outside the text box.

The model can choose to call a tool instead of answering directly.
It triggers tool use by emitting special tool-call tokens.
The app runs the tool, then puts the result back into the context window.
Common tools include search, code execution, calculators, files, calendars, and APIs.

Tool Call Loop

Context

User asks for fresh information.

The answer may not be stored in the model's parameters.

Special tokens

The model emits a tool call.

The surrounding app recognizes the format.

Tool result

Result returns to context.

The model continues with new information.

<|tool_call|> search({ query: "weather in Paris today" })

<|tool_result|> 72°F and clear

<|assistant|> It is clear and about 72°F in Paris today.
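A sketch of the app side of this loop: detect the tool call in the model's output, run the tool, and return text to place back in the context. The regular expression handles only the single-argument format shown above, and `fake_search`, `TOOLS`, and `run_tool_call` are hypothetical names; real systems define their own formats.

```python
import re
from typing import Optional

# Matches the illustrative format: <|tool_call|> name({ arg: "value" })
TOOL_CALL = re.compile(r'<\|tool_call\|>\s*(\w+)\(\{\s*(\w+):\s*"([^"]*)"\s*\}\)')


def fake_search(query: str) -> str:
    # Stand-in for a real search API; always returns the example result.
    return "72°F and clear"


TOOLS = {"search": fake_search}


def run_tool_call(model_output: str) -> Optional[str]:
    """If the output contains a tool call, run it and return the text to append."""
    match = TOOL_CALL.search(model_output)
    if match is None:
        return None  # ordinary text, nothing to run
    name, arg_name, arg_value = match.groups()
    result = TOOLS[name](**{arg_name: arg_value})
    return f"<|tool_result|> {result}"


output = '<|tool_call|> search({ query: "weather in Paris today" })'
print(run_tool_call(output))
```

The model never runs the tool itself; it only emits text that the surrounding app chooses to act on.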

Checkpoint

Where we are now.

An SFT model behaves like an assistant. It works from the "knowledge" encoded in its parameters and whatever is visible in the context window. Tools can add fresh information to that window.

Knowledge in parameters

Vague recollection

Like something you read a month ago: useful, compressed, and sometimes fuzzy.

Knowledge in context tokens

Working memory

The text currently in front of the model: specific, fresh, and directly available.

Part Eight

Reinforcement learning nudges the assistant toward preferred answers.

RLHF means Reinforcement Learning from Human Feedback.
Humans rank multiple model outputs for the same prompt.
Those rankings train a reward model to predict what humans prefer.
The LLM is optimized against that reward model.
RLHF shapes behavior more than raw knowledge: it pushes the model to be helpful, honest, harmless, concise, and polite.

January 22, 2025 Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Human Preference Loop

Prompt Emily buys 3 apples and 2 oranges. Each orange costs $2. The total cost is $13. What is the cost of each apple?

Rank 1

Preferred output

The oranges cost 2 × $2 = $4. The total is $13, so the apples cost $13 - $4 = $9. There are 3 apples, so each apple costs $9 ÷ 3 = $3.

Correct, clear, and works step by step toward the answer.

Rank 2

Less preferred

Each apple costs $3.

Correct, but skips the reasoning the user may need.

Rank 3

Rejected output

Each apple costs $4.

Confident, but wrong.

Prompt Can you help me write a good wedding toast?

Rank 1

Preferred output

A good wedding toast should be personal, brief, and kind. Tell one true story, name one real emotion, and end by honoring the couple.

Warm, useful, specific, and aligned with the requested task.

Rank 2

Less preferred

A good wedding toast should be funny and memorable.

Reasonable, but generic and less actionable.

Rank 3

Rejected output

Just roast the couple and tell every embarrassing story you know.

Likely entertaining in theory, but unhelpful and socially risky.
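A sketch of how rankings like these can become a training signal for a reward model. It assumes a common pairwise formulation, in which the loss is small when the preferred output gets the higher reward score. The function names and the example scores are illustrative, not taken from any specific system.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked: list[str]) -> list[tuple[str, str]]:
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked, 2))


def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_preferred - r_rejected)), written as log(1 + exp(-margin))."""
    margin = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-margin))


ranked = [
    "The oranges cost 2 x $2 = $4 ... each apple costs $3.",  # Rank 1
    "Each apple costs $3.",                                   # Rank 2
    "Each apple costs $4.",                                   # Rank 3
]

# Hypothetical scores from a reward model that is still being trained.
scores = {ranked[0]: 1.2, ranked[1]: 0.4, ranked[2]: 0.9}

for preferred, rejected in ranking_to_pairs(ranked):
    loss = pairwise_loss(scores[preferred], scores[rejected])
    print(round(loss, 3), "|", preferred[:24], ">", rejected[:24])
```

The pair where the wrong answer currently outscores a correct one produces the largest loss, so training pushes hardest on that mistake.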

Recap

What we've covered.

LLMs start as text-prediction machines, then become useful assistants through data, training, context, tools, and feedback.

Data becomes tokens.

Web text is collected, cleaned, tokenized, and turned into model-readable IDs.

Training tunes parameters.

The neural net learns to produce better next-token probability maps.

Inference samples outputs.

Generation is repeated sampling: predict, choose, append, repeat.

SFT shapes conversation.

Conversation examples tune the model toward assistant-style replies.

Tools add fresh context.

Special tokens let the model ask the app to search, run code, calculate, or call APIs.

RLHF shapes preference.

Human rankings nudge the assistant toward answers people prefer.

What's Next

What's next.

Once the basics click, these are the next doors to open.

Agents

Models that plan and act.

Longer loops where the assistant uses tools repeatedly to work toward a goal.

Harnesses

The app around the model.

System prompts, tools, memory, permissions, and orchestration.

Multimodal

More than text.

Models that read or generate images, audio, video, and mixed inputs.

Neural nets

A deeper dive inside.

Transformers, attention, embeddings, gradients, and why scale matters.

Other ML models

Not everything is an LLM.

Diffusion models, classifiers, recommenders, and other useful ways machines learn patterns.