Saket's Blog

How LLMs Actually Work Without the Hype

2026-01-08


Large language models create a strange first impression. You ask a question, and the reply can feel thoughtful, fluent, and even oddly confident. That makes it easy to project human qualities onto the system. It feels like it understands. It feels like it knows. It feels like it is "thinking."

That feeling is useful for product adoption and terrible for clear understanding.

If you want to build with LLMs, you need a more grounded mental model. Not because the math has to become your whole life, but because bad mental models lead to bad systems. If you think the model "knows what you mean," you will under-specify prompts. If you think it "understands like a person," you will trust it too much. If you think it is basically a search engine, you will miss what makes it powerful.

So let's strip the topic down to the parts that actually matter.

The Short Version

An LLM is a model trained to predict the next token in a sequence.

That sentence sounds too simple, given how rich the system's behavior appears. But the core mechanism really is prediction. Given a sequence of text, the model estimates which token is most likely to come next, then repeats that process again and again until it has produced a full response.

That is the engine underneath the magic.
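To make the loop concrete, here is a toy sketch of autoregressive generation. The "model" is just a hand-written bigram table mapping one token to a distribution over the next, which is nothing like a real LLM's billions of parameters, but the generate-one-token-then-repeat loop is the same shape:

```python
import random

# Toy "language model": maps the previous token to a distribution over
# the next token. A real LLM conditions on the whole sequence; this
# hand-written table only illustrates the generation *loop*.
BIGRAM = {
    "<start>": {"The": 0.6, "A": 0.4},
    "The": {"model": 0.7, "token": 0.3},
    "model": {"predicts": 1.0},
    "token": {"appears": 1.0},
    "predicts": {"tokens": 1.0},
    "tokens": {"<end>": 1.0},
}

def generate(max_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = ["<start>"]
    for _ in range(max_tokens):
        dist = BIGRAM.get(tokens[-1])
        if dist is None:
            break  # no continuation known for this token
        choices, weights = zip(*dist.items())
        next_tok = rng.choices(choices, weights=weights)[0]
        if next_tok == "<end>":
            break
        tokens.append(next_tok)
    return " ".join(tokens[1:])  # drop the <start> marker

print(generate())
```

Everything an LLM emits comes out of a loop like this: predict, append, repeat.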

The reason this leads to useful answers is that language contains a huge amount of structure. If a model sees enough text and enough examples during training, next-token prediction starts to absorb patterns about grammar, style, facts, code, reasoning traces, formats, and common relationships between concepts.

Still, "absorbing patterns" is not the same as human understanding. That distinction matters.

What Is a Token?

People often say LLMs work with words. Close, but not exactly.

LLMs work with tokens, which are chunks of text. A token might be:

  • A whole short word
  • Part of a longer word
  • Punctuation
  • A number
  • A piece of code
  • Even a leading space

For example, the phrase:

Artificial intelligence is useful.

is not necessarily processed as four neat words. It may be split into several subword pieces depending on the tokenizer.
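A toy greedy longest-match tokenizer makes this visible. The vocabulary below is made up for illustration; real tokenizers (BPE, SentencePiece, and similar) learn their vocabularies from data, but the splitting behavior is the same idea:

```python
# Made-up vocabulary for illustration only; real vocabularies are learned.
VOCAB = {"Artificial", "intell", "igence", " is", " useful", ".", " "}

def tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character falls back to itself
            i += 1
    return tokens

print(tokenize("Artificial intelligence is useful."))
```

The four-word sentence comes out as seven tokens, with "intelligence" split across two pieces and spaces attached to some tokens. This is why token counts rarely match word counts.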

Why does this matter?

Because many practical limits are token-based:

  • Input size is measured in tokens
  • Output size is measured in tokens
  • Pricing is often measured in tokens
  • Context windows are measured in tokens

When people say "this model supports a large context window," they mean it can handle a large number of tokens in one request, not an unlimited amount of meaning.

Training vs Inference

This is one of the most important distinctions in modern AI.

Training

Training is the expensive, large-scale process where the model learns from vast amounts of text and code. During training, the model adjusts its internal parameters so it gets better at predicting tokens.

This stage is where most of the heavy learning happens. It takes enormous compute, large datasets, and careful engineering. Most developers never do this part themselves.

Inference

Inference is what happens when you use the model. You send a prompt. The model processes it and generates an answer token by token.

This is the part you interact with in an API or chat product.

A simple way to think about it:

  • Training is when the model becomes what it is
  • Inference is when you ask that trained model to do work

If you are building applications, inference is your day-to-day concern. You usually consume a pre-trained model rather than train one from scratch.

What the Transformer Actually Changed

Most modern LLMs are built on the transformer architecture. You do not need the full math to understand why it mattered.

Before transformers, earlier approaches struggled to handle long-range relationships in language. Transformers improved this by letting the model weigh which parts of the input matter most when generating the next token.

That mechanism is often described through attention.

Here is the practical meaning of attention:

When the model is generating the next token, it can look across the input and learn which earlier tokens are relevant. If you mention "Paris" earlier and later ask about "that city," the model can connect those parts. If you define a JSON format at the top of the prompt, the model can condition on that structure while generating the answer.

That does not mean it has perfect memory or reasoning. It means it has a better way to use context.
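Under stated assumptions, the core of attention can be sketched in a few lines. This is single-query scaled dot-product attention over tiny hand-picked vectors, not a full transformer layer (no learned projections, no multiple heads), but it shows the key move: score each earlier position against the current query, turn the scores into weights, and mix the values accordingly:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Single-query scaled dot-product attention over toy vectors.

    Returns (weights, mixed): the relevance weight of each earlier
    position, and the weighted mix of their value vectors.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    mixed = [sum(w * v[i] for w, v in zip(weights, values))
             for i in range(len(values[0]))]
    return weights, mixed

# The query lines up with the second key, so most weight lands there.
weights, mixed = attention(
    query=[0.0, 1.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    values=[[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]],
)
print([round(w, 2) for w in weights])
```

In a real model these queries, keys, and values are learned projections of token representations, and there are many heads and layers, but the "weigh earlier tokens by relevance" mechanism is this one.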

Context Window: Powerful, but Not Magical

The context window is the amount of text the model can consider in a single request.

Think of it as the working memory for that interaction. Everything inside the current context can influence the response:

  • System instructions
  • Conversation history
  • Retrieved documents
  • Tool results
  • User message

Everything outside that context is invisible unless you send it again.

This is where many beginner misunderstandings come from. If a chat app remembers something from five turns ago, that usually means the earlier text is still in context, or the product has stored and re-injected a summary. The model is not "remembering" the way a person does across time unless the system explicitly gives it memory.

Large context windows are useful, but they do not solve everything:

  • More context can increase cost and latency
  • Irrelevant context can make answers worse
  • Important details can still get diluted in long prompts
  • The model can still misread or ignore crucial facts

More room is helpful. It is not the same as better judgment.
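Because context is finite, applications routinely trim conversation history to fit a budget. Here is a minimal sketch, assuming the simplest possible policy (keep the system message, then keep the newest messages that fit); the word-count token estimate is a stand-in, since real systems count with the model's own tokenizer:

```python
def trim_history(messages, budget,
                 count_tokens=lambda text: len(text.split())):
    """Keep the system message plus the newest messages that fit the budget.

    `count_tokens` is a crude stand-in; real systems use the model's
    actual tokenizer to count.
    """
    system, history = messages[0], messages[1:]
    used = count_tokens(system["content"])
    kept = []
    for msg in reversed(history):  # walk newest-first
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))

messages = [
    {"role": "system", "content": "You are a careful assistant."},
    {"role": "user", "content": "First question about Paris."},
    {"role": "assistant", "content": "An answer about Paris."},
    {"role": "user", "content": "And that city's population?"},
]
print(trim_history(messages, budget=12))
```

Notice what this implies: once the Paris messages are trimmed out, "that city" in the last message has nothing to resolve against. Anything outside the context the system sends is simply invisible to the model.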

Temperature, Sampling, and Why Outputs Vary

When a model generates text, it does not always pick the single most likely next token. Instead, it samples from the probability distribution over possible next tokens.

One common control is temperature.

  • Lower temperature generally makes outputs more predictable and conservative
  • Higher temperature generally makes outputs more varied and creative

This is useful because different tasks want different behavior.

For example:

  • If you want structured extraction, lower temperature is usually better
  • If you want creative brainstorming, a bit more variation can help

But temperature is not a truth knob. Turning it down does not make the model factual. It mostly changes how deterministic the generation feels.
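A small sketch makes the effect concrete. Dividing the raw scores (logits) by the temperature before the softmax sharpens or flattens the distribution; the logit values below are arbitrary toy numbers:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Sample a token index after temperature-scaling the logits.

    Lower temperature sharpens the distribution toward the top token;
    higher temperature flattens it, so less likely tokens appear more often.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.0, 0.1]  # token 0 is the model's favorite
rng = random.Random(42)
cold = [sample_with_temperature(logits, 0.1, rng) for _ in range(1000)]
hot = [sample_with_temperature(logits, 5.0, rng) for _ in range(1000)]
print(cold.count(0), hot.count(0))  # cold picks token 0 far more often
```

Note that token 0 stays the favorite in both runs. Temperature changes how often the runner-up tokens get picked; it does not change what the model believes is likely, and it certainly does not change what is true.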

System Prompts, User Prompts, and Roles

When you use an LLM through an API, you usually send messages with roles such as:

  • system
  • user
  • assistant

The system prompt is where you define the behavior, constraints, or identity of the assistant. The user prompt is the task or question. The prior assistant messages shape the ongoing conversation.

This is not just UI decoration. It affects how the model interprets the request.

For example, these two instructions create very different behavior:

[
  {
    "role": "system",
    "content": "You are a careful technical assistant. Be concise. If something is uncertain, say so."
  },
  {
    "role": "user",
    "content": "Explain what a vector database does."
  }
]

versus

[
  {
    "role": "system",
    "content": "You are a confident startup founder. Answer in a punchy, opinionated tone."
  },
  {
    "role": "user",
    "content": "Explain what a vector database does."
  }
]

Same user question, different system behavior.

This is one reason product builders should stop thinking only in terms of one input box. Behind a good AI application, the prompt is usually structured.
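As a hedged sketch of what "structured" means in practice: the function below assembles a message list from separate parts (behavior rules, retrieved context, the user's question). The message shape mirrors the JSON arrays above; the helper name and document format are made up for illustration:

```python
def build_messages(system_rules, retrieved_docs, user_question):
    """Assemble a structured chat request from separate parts.

    The single input box a user sees supplies only the last piece;
    the application composes the rest behind the scenes.
    """
    context = "\n\n".join(f"[doc {i + 1}] {d}"
                          for i, d in enumerate(retrieved_docs))
    return [
        {"role": "system", "content": system_rules},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {user_question}"},
    ]

messages = build_messages(
    system_rules=("You are a careful technical assistant. "
                  "If something is uncertain, say so."),
    retrieved_docs=["Vector databases index embeddings for similarity search."],
    user_question="Explain what a vector database does.",
)
print(messages[0]["role"], len(messages))
```

Everything the model sees is text the application chose to send; the role structure just keeps instructions, context, and the question from blurring together.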

What LLMs Are Good At

A grounded view is not a cynical view. LLMs are genuinely useful.

They are especially strong at:

  • Rewriting and transforming text
  • Summarizing large bodies of language
  • Extracting structure from messy input
  • Translating between formats
  • Generating drafts
  • Explaining concepts at different levels
  • Code generation and code completion
  • Pattern-matching across natural language and code

They are often valuable because they compress a lot of language skill into one interface. Instead of building separate systems for summarization, classification, extraction, rewriting, and drafting, you can often use one model with the right instructions.

That flexibility is real.

Where LLMs Fail

This part matters even more.

They Can Hallucinate

A model can produce statements that sound specific and polished but are simply wrong. Sometimes the error is a fabricated fact. Sometimes it is a subtle misreading. Sometimes it is the right structure with the wrong details inside it.

They Do Not Truly Understand in a Human Sense

They model patterns in language extremely well. That can look like understanding. But there is no guarantee that the internal process matches human reasoning, common sense, or grounded world models the way we intuitively imagine them.

They Are Sensitive to Framing

Small changes in prompt wording, context ordering, or examples can shift output quality a lot.

They Are Weak at Reliability Without Support

If you need guaranteed correctness, strict compliance, repeatability, or up-to-date private knowledge, the model alone is not enough. You need system design around it.

The Most Useful Mental Model

If I had to give one beginner-friendly description, it would be this:

An LLM is a highly compressed pattern engine for language and code.

It has seen enough examples to imitate many useful behaviors:

  • answering
  • summarizing
  • translating
  • drafting
  • formatting
  • explaining
  • classifying

But it does all of that through learned statistical patterns, not because it has a stable human-like understanding of the world.

That mental model helps in two ways:

  1. It explains why the model can be incredibly useful.
  2. It explains why you should not trust it blindly.

Why Prompting Matters, but Is Not Everything

Because the model is sensitive to context, prompt design matters a lot. Clear instructions often produce dramatically better results than vague ones.

Compare:

Explain RAG.

with:

Explain RAG for a software engineer who knows APIs but is new to AI.
Use plain English, give one concrete example, and end with three common implementation mistakes.

The second prompt gives the model more structure, audience context, and output constraints. That usually leads to a better answer.

Still, prompting is only part of the story. If your system needs company knowledge, you need retrieval. If it needs action-taking, you need tools. If it needs reliability, you need evaluation and guardrails. Prompting is powerful, but it is not a substitute for architecture.

Why These Models Feel Smarter Than Earlier Software

This is worth addressing directly.

Traditional software usually follows explicit rules. LLM-based systems can generalize across many tasks using the same interface. That makes them feel unusually flexible. You can ask for a summary, an email draft, a bug explanation, a SQL query, or a policy rewrite in the same session.

That flexibility is what changed the experience for most people. It felt less like using software and more like interacting with a capable assistant.

But the interface can hide the limits. Fluent language creates the illusion of stable competence. In reality, the system is best understood as a powerful but imperfect probabilistic engine.

That does not make it fake. It makes it something you should design around carefully.

What This Means for Builders

If you are building with LLMs, a few habits go a long way:

  • Be explicit in instructions
  • Ask for structured output when possible
  • Test edge cases, not just happy paths
  • Keep prompts focused
  • Use retrieval for domain knowledge
  • Add validation where correctness matters
  • Treat confidence and fluency as separate from truth

This is the difference between being impressed by the model and building responsibly with it.
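The "add validation" habit is worth one concrete sketch. Assuming a task where the model is asked to return JSON in a made-up schema (`{"name": str, "priority": int}`), the application should check the reply mechanically before trusting it downstream:

```python
import json

def validate_extraction(raw_output):
    """Validate a model's JSON reply before using it downstream.

    The schema here ({"name": str, "priority": int}) is an illustrative
    example; the point is that fluent output still needs mechanical checks.
    Returns the parsed dict, or None if the reply fails any check.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # model returned prose or broken JSON; retry or fall back
    if not isinstance(data, dict):
        return None
    if not isinstance(data.get("name"), str):
        return None
    if not isinstance(data.get("priority"), int):
        return None
    return data

print(validate_extraction('{"name": "fix login bug", "priority": 2}'))
print(validate_extraction("Sure! Here is the JSON you asked for..."))
```

A reply that fails validation can trigger a retry, a fallback, or a human review. The model's fluency never gets to stand in for the check.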

Closing Thoughts

The healthiest way to think about LLMs is neither worship nor dismissal. They are not fake, and they are not magic. They are powerful sequence models that can do surprisingly useful work because language contains structure, and modern training has captured a lot of that structure at scale.

In the next post, we will move from understanding the model to actually using one as a builder. That means APIs, message roles, stateless requests, and the mindset shift from "I use ChatGPT" to "I can build an AI-powered app."

Next in the series: From ChatGPT User to AI Builder: Your First API-Powered App.