Large language models create a strange first impression. You ask a question, and the reply can feel thoughtful, fluent, and even oddly confident. That makes it easy to project human qualities onto the system. It feels like it understands. It feels like it knows. It feels like it is "thinking."
That feeling is useful for product adoption and terrible for clear understanding.
If you want to build with LLMs, you need a more grounded mental model. Not because the math has to become your whole life, but because bad mental models lead to bad systems. If you think the model "knows what you mean," you will under-specify prompts. If you think it "understands like a person," you will trust it too much. If you think it is basically a search engine, you will miss what makes it powerful.
So let's strip the topic down to the parts that actually matter.
An LLM is a model trained to predict the next token in a sequence.
That sentence sounds too simple, because the system's behavior is much richer than the mechanism suggests. But the core mechanism really is prediction. Given a sequence of text, the model estimates which token is most likely to come next. Then it repeats that process again and again until it has produced a full response.
That is the engine underneath the magic.
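The predict-append-repeat loop can be sketched with a toy model. Everything here is illustrative: real LLMs condition on the whole context and have vocabularies of around 100k tokens, while this uses a tiny hand-written bigram table. Only the shape of the loop is the point.

```python
import random

# A toy "language model": given the last token, return a probability
# distribution over possible next tokens. Hand-written for illustration.
BIGRAMS = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.4, "ran": 0.6},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def generate(max_tokens=10, seed=0):
    """Generate text one token at a time: predict, sample, append, repeat."""
    rng = random.Random(seed)
    tokens = ["<start>"]
    for _ in range(max_tokens):
        dist = BIGRAMS[tokens[-1]]
        # Sample the next token from the predicted distribution.
        next_token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return " ".join(tokens[1:])

print(generate())
```

The loop never "decides what to say" up front; each token is chosen given everything generated so far, which is exactly how a real model produces a response.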
The reason this leads to useful answers is that language contains a huge amount of structure. If a model sees enough text and enough examples during training, next-token prediction starts to absorb patterns about grammar, style, facts, code, reasoning traces, formats, and common relationships between concepts.
Still, "absorbing patterns" is not the same as human understanding. That distinction matters.
People often say LLMs work with words. Close, but not exactly.
LLMs work with tokens, which are chunks of text. A token might be:

- a whole word
- a piece of a word
- a punctuation mark
- whitespace
For example, the phrase:
Artificial intelligence is useful.
is not necessarily processed as four neat words. It may be split into several subword pieces depending on the tokenizer.
Why does this matter?
Because many practical limits are token-based:

- the context window is measured in tokens, not words
- API pricing is usually per token
- maximum output length is a token count
When people say "this model supports a large context window," they mean it can handle a large number of tokens in one request, not an unlimited amount of meaning.
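A toy greedy longest-match tokenizer makes the word-versus-token gap concrete. The vocabulary below is made up for illustration; real tokenizers (BPE, SentencePiece) learn their vocabularies from data.

```python
# Made-up vocabulary for illustration only.
VOCAB = {"Artificial", "intell", "igence", " is", " useful", ".", " "}

def tokenize(text):
    """Greedy longest-match tokenization over a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Take the longest vocabulary entry that matches at position i.
        match = None
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                match = text[i:j]
                break
        if match is None:          # unknown character: emit it on its own
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("Artificial intelligence is useful."))
```

Under this toy vocabulary the four-word sentence becomes seven tokens, which is why token counts, not word counts, drive context limits and pricing.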
This is one of the most important distinctions in modern AI.
Training is the expensive, large-scale process where the model learns from vast amounts of text and code. During training, the model adjusts its internal parameters so it gets better at predicting tokens.
This stage is where most of the heavy learning happens. It takes enormous compute, large datasets, and careful engineering. Most developers never do this part themselves.
Inference is what happens when you use the model. You send a prompt. The model processes it and generates an answer token by token.
This is the part you interact with in an API or chat product.
A simple way to think about it: training is where the model learns, once, at enormous cost; inference is where the model is used, on every request.
If you are building applications, inference is your day-to-day concern. You usually consume a pre-trained model rather than train one from scratch.
Most modern LLMs are built on the transformer architecture. You do not need the full math to understand why it mattered.
Before transformers, earlier approaches struggled to handle long-range relationships in language. Transformers improved this by letting the model weigh which parts of the input matter most when generating the next token.
That mechanism is often described through attention.
Here is the practical meaning of attention:
When the model is generating the next token, it can look across the input and learn which earlier tokens are relevant. If you mention "Paris" earlier and later ask about "that city," the model can connect those parts. If you define a JSON format at the top of the prompt, the model can condition on that structure while generating the answer.
That does not mean it has perfect memory or reasoning. It means it has a better way to use context.
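The attention idea can be sketched in a few lines. This is a simplified single-query version of scaled dot-product attention with toy numbers; real transformers run it across many heads and layers with learned projections.

```python
import math

def softmax(xs):
    """Convert raw scores into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Scores each key against the query, turns the scores into weights
    with softmax, and returns the weighted average of the value vectors.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return weights, out

# Three earlier tokens, each with a 2-d key and value (toy numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, out = attention([1.0, 0.0], keys, values)
```

The first key is most similar to the query, so it gets the largest weight: "relevance" here is just a dot product, learned at scale.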
The context window is the amount of text the model can consider in a single request.
Think of it as the working memory for that interaction. Everything inside the current context can influence the response:

- the system prompt
- earlier messages in the conversation
- any documents or examples you included
- the current question
Everything outside that context is invisible unless you send it again.
This is where many beginner misunderstandings come from. If a chat app remembers something from five turns ago, that usually means the earlier text is still in context, or the product has stored and re-injected a summary. The model is not "remembering" the way a person does across time unless the system explicitly gives it memory.
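The "still in context" behavior can be sketched as a budget-trimming helper. The word-count token estimate below is a crude stand-in for a real tokenizer, and the whole function is illustrative of what chat products do behind the scenes: older turns silently fall out of the window once the budget is exceeded.

```python
def fit_context(messages, max_tokens,
                count_tokens=lambda text: len(text.split())):
    """Keep the most recent messages that fit in a token budget.

    `count_tokens` is a crude word-count approximation; a real system
    would use the model's actual tokenizer.
    """
    kept, used = [], 0
    for msg in reversed(messages):       # walk newest-first
        cost = count_tokens(msg["content"])
        if used + cost > max_tokens:
            break                        # older turns fall out of context
        kept.append(msg)
        used += cost
    return list(reversed(kept))          # restore chronological order

history = [
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six"},
]
trimmed = fit_context(history, max_tokens=3)
```

With a budget of three, the oldest message is dropped: the model never "forgets" it, it simply never sees it in that request.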
Large context windows are useful, but they do not solve everything:

- models can lose track of details buried in very long inputs
- more tokens mean more cost and latency
- a bigger window does not improve the quality of reasoning
More room is helpful. It is not the same as better judgment.
When a model generates text, it does not always pick the single most likely next token in a rigid way. There is a sampling process.
One common control is temperature.
This is useful because different tasks want different behavior.
For example:

- extraction, classification, and code generation usually benefit from low temperature
- brainstorming and creative writing often benefit from higher temperature
But temperature is not a truth knob. Turning it down does not make the model factual. It mostly changes how deterministic the generation feels.
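What temperature actually does can be shown in a few lines: it rescales the model's raw scores (logits) before they become probabilities. This is a toy sketch, not an API call.

```python
import math
import random

def sample_with_temperature(logits, temperature, seed=None):
    """Sample a token index from temperature-scaled logits.

    Lower temperature sharpens the distribution toward the top token;
    higher temperature flattens it toward uniform.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs)[0], probs

# Same logits, different temperatures:
logits = [2.0, 1.0, 0.1]
_, cold = sample_with_temperature(logits, temperature=0.2)
_, warm = sample_with_temperature(logits, temperature=2.0)
```

At low temperature nearly all probability mass lands on the top token, which is why the output feels deterministic; the underlying scores, and any errors in them, are unchanged.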
When you use an LLM through an API, you usually send messages with roles such as:

- `system`
- `user`
- `assistant`

The system prompt is where you define the behavior, constraints, or identity of the assistant. The user prompt is the task or question. The prior assistant messages shape the ongoing conversation.
This is not just UI decoration. It affects how the model interprets the request.
For example, these two instructions create very different behavior:
```json
[
  {
    "role": "system",
    "content": "You are a careful technical assistant. Be concise. If something is uncertain, say so."
  },
  {
    "role": "user",
    "content": "Explain what a vector database does."
  }
]
```
versus
```json
[
  {
    "role": "system",
    "content": "You are a confident startup founder. Answer in a punchy, opinionated tone."
  },
  {
    "role": "user",
    "content": "Explain what a vector database does."
  }
]
```
Same user question, different system behavior.
This is one reason product builders should stop thinking only in terms of one input box. Behind a good AI application, the prompt is usually structured.
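In application code, that structure is usually assembled programmatically rather than typed into one box. A minimal sketch, assuming the common `{"role", "content"}` message convention used by OpenAI-style chat APIs (adjust for your provider):

```python
def build_messages(system_prompt, user_input, history=None):
    """Assemble a role-structured message list for a chat-style API."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history or [])       # prior turns, if any
    messages.append({"role": "user", "content": user_input})
    return messages

msgs = build_messages(
    "You are a careful technical assistant. Be concise. "
    "If something is uncertain, say so.",
    "Explain what a vector database does.",
)
# A provider client would then receive this list, e.g. (hypothetical call):
# client.chat.completions.create(model="...", messages=msgs)
```

The user only ever typed the question; the system prompt and history are the application's responsibility.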
A grounded view is not a cynical view. LLMs are genuinely useful.
They are especially strong at:

- summarization
- classification
- extraction
- rewriting and editing
- drafting text and code
They are often valuable because they compress a lot of language skill into one interface. Instead of building separate systems for summarization, classification, extraction, rewriting, and drafting, you can often use one model with the right instructions.
That flexibility is real.
The limitations matter even more.
A model can produce statements that sound specific and polished but are simply wrong. Sometimes the error is a fabricated fact. Sometimes it is a subtle misreading. Sometimes it is the right structure with the wrong details inside it.
LLMs model patterns in language extremely well. That can look like understanding. But there is no guarantee that the internal process matches human reasoning, common sense, or grounded world models the way we intuitively imagine them.
Small changes in prompt wording, context ordering, or examples can shift output quality a lot.
If you need guaranteed correctness, strict compliance, repeatability, or up-to-date private knowledge, the model alone is not enough. You need system design around it.
If I had to give one beginner-friendly description, it would be this:
An LLM is a highly compressed pattern engine for language and code.
It has seen enough examples to imitate many useful behaviors:

- answering questions
- following formats and instructions
- writing and explaining code
- matching tone and style
But it does all of that through learned statistical patterns, not because it has a stable human-like understanding of the world.
That mental model helps in two ways:

- it explains why the model can do so much across so many tasks
- it explains why the model fails in ways a person would not
Because the model is sensitive to context, prompt design matters a lot. Clear instructions often produce dramatically better results than vague ones.
Compare:

```
Explain RAG.
```

with:

```
Explain RAG for a software engineer who knows APIs but is new to AI.
Use plain English, give one concrete example, and end with three common implementation mistakes.
```
The second prompt gives the model more structure, audience context, and output constraints. That usually leads to a better answer.
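In an application, that structure typically lives in code rather than in whatever the user happened to type. A hypothetical helper, purely illustrative:

```python
def build_prompt(task, audience=None, constraints=()):
    """Compose a structured prompt from a task, an audience, and constraints."""
    parts = [task]
    if audience:
        parts.append(f"Audience: {audience}.")
    for constraint in constraints:
        parts.append(f"Constraint: {constraint}.")
    return "\n".join(parts)

prompt = build_prompt(
    "Explain RAG.",
    audience="a software engineer who knows APIs but is new to AI",
    constraints=(
        "Use plain English",
        "Give one concrete example",
        "End with three common implementation mistakes",
    ),
)
```

The same task can now be re-targeted to a different audience or constraint set without rewriting the prompt by hand.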
Still, prompting is only part of the story. If your system needs company knowledge, you need retrieval. If it needs action-taking, you need tools. If it needs reliability, you need evaluation and guardrails. Prompting is powerful, but it is not a substitute for architecture.
This is worth addressing directly.
Traditional software usually follows explicit rules. LLM-based systems can generalize across many tasks using the same interface. That makes them feel unusually flexible. You can ask for a summary, an email draft, a bug explanation, a SQL query, or a policy rewrite in the same session.
That flexibility is what changed the experience for most people. It felt less like using software and more like interacting with a capable assistant.
But the interface can hide the limits. Fluent language creates the illusion of stable competence. In reality, the system is best understood as a powerful but imperfect probabilistic engine.
That does not make it fake. It makes it something you should design around carefully.
If you are building with LLMs, a few habits go a long way:

- validate outputs before trusting them downstream
- constrain formats when structure matters
- add retrieval when you need current or private knowledge
- evaluate on real examples, not vibes
- keep a human in the loop for high-stakes decisions
This is the difference between being impressed by the model and building responsibly with it.
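One such habit, validating structured output before trusting it, can be sketched in a few lines. The function and field names are hypothetical; the point is that the caller gets a clean "valid or not" signal and can retry or fall back instead of passing bad data along.

```python
import json

def parse_model_json(raw, required_keys):
    """Validate model output before trusting it downstream.

    Returns the parsed dict, or None if the output is not valid JSON,
    is not an object, or is missing required fields.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in required_keys):
        return None
    return data

good = parse_model_json('{"sentiment": "positive", "confidence": 0.9}',
                        ["sentiment", "confidence"])
bad = parse_model_json("Sure! Here is the JSON you asked for: {...}",
                       ["sentiment"])
```

The second call fails fast: models often wrap answers in chatty prose, and a guardrail like this keeps that prose out of your data pipeline.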
The healthiest way to think about LLMs is neither worship nor dismissal. They are not fake, and they are not magic. They are powerful sequence models that can do surprisingly useful work because language contains structure, and modern training has captured a lot of that structure at scale.
In the next post, we will move from understanding the model to actually using one as a builder. That means APIs, message roles, stateless requests, and the mindset shift from "I use ChatGPT" to "I can build an AI-powered app."
Next in the series: From ChatGPT User to AI Builder: Your First API-Powered App.