Large language model (LLM)

A large language model (LLM) is a deep learning model trained on vast amounts of text.

At a fundamental level, LLMs work by predicting the probability of each possible next token in a sequence, then sampling from that distribution. Under the hood, an LLM is a sophisticated autocomplete: you give it some initial text, and it keeps writing, one token at a time, based on what it calculates to be the most likely continuation. Feed it "I think, therefore I…" and it will almost always complete it with "am", because that continuation is overwhelmingly the most likely one across its training data. Feed it something more unusual and the completions vary from run to run.

But from this primitive process unexpected capabilities arise. At scale, LLMs demonstrate the ability to reason, summarize, translate, write code, and hold human-like conversations. These unexpected capabilities are known as emergence.

The frontier models of the 2020s have demonstrated abilities that appear to go beyond their training data — they can play chess, or show empathy, at a near-human level. The biggest emergent quality of all is the ability to solve problems.

No one is really sure why a token-prediction system has come to have such human-like capabilities, and we may never know. The neural network that underpins an LLM has hundreds of billions of connections between its artificial neurons, and some of these connections may be invoked many times over during the processing of a single piece of text. Exploring this internal behavior may simply be too complex for the human brain to ever fully understand.

Part of why the output feels so human is simply that the training data is human: models learn from our own conversations, writing, and creative works, and so come to mirror the way we communicate.

LLMs are a category of generative AI. They are increasingly deployed across a wide variety of contexts. AI assistants are perhaps the best-known application built on top of LLMs, but many other use cases are being found.

Transformer architecture

Modern LLMs are built on the transformer architecture, which was introduced in Google’s 2017 research paper "Attention Is All You Need".

The core innovation of the transformer architecture is the attention mechanism. This is what allows a model to weigh the relevance of every token against every other token in the current context when predicting the next token. This enables transformers to capture long-range dependencies in text far more effectively than earlier architectures. See transformer architecture for a full explanation.

Pre-training

In the development of LLMs, training is the computationally enormous process of adjusting a model’s weights by exposing the model to vast amounts of data.

Training a frontier model requires data centers full of specialized hardware running for weeks, or even months. Training demands maximum compute throughput, and so benefits enormously from features like high-bandwidth inter-GPU connections.

There are multiple stages of training. Pre-training is the process of building a model’s foundational knowledge by exposing it to a broad, general-purpose corpus of data. The model learns to predict the next token in a sequence, and in doing so it internalizes the statistical patterns, semantic associations, and structure of the training data.

Crucially, this is unsupervised (or, more precisely, self-supervised): unlike earlier forms of AI, pre-training needs no carefully labelled data. The model learns to recognize the patterns, structures, and context of human language directly from raw text, which is what makes it feasible to train on a corpus this large.

The result is a base model with broad capabilities but no particular alignment to how it should be used. In this raw state a model simply reflects back the material it was trained on — including the biases, errors, and falsehoods in that material — and it has no ethical boundaries of its own. Judgment is a layer added afterwards, during fine-tuning (see below), where AI labs work to build in ethical boundaries and design out biases.

It is worth being clear-eyed about the limits of this. A model is ultimately just a massive vector of numerical parameters, so there is no way to guarantee that a model will behave safely on its own. Fine-tuning shifts the odds but proves nothing. In practice, safety depends as much on the wrapping infrastructure around the model — the harness, guardrails, and monitoring — as on the model’s own training.

Post-training

Post-training refers to all the additional steps applied to a model after pre-training is complete, to make it more useful, safe, and aligned with particular use cases such as computer programming.

Key post-training techniques include fine-tuning and RLHF, both described below.

The resulting model will still be limited by the quality and breadth of its training data. A model’s knowledge of semantic associations is learnt from its training data. The context in which words are used in the training data shapes those semantic associations in the model. For example, the word "bank" means something different when preceded by "river" or "central". This is fundamental to how the transformer architecture works.

Larger LLMs can potentially make deeper contextual connections than smaller ones, because they have more parameters linked in more multidimensional ways, producing a richer map of semantic relationships.

But improving the quality and capabilities of models is not only about scaling up model size (to hold more semantic relationships). Much of the current research and development at the frontier is focused less on scaling the raw volume of training data and more on improving the quality of training data — carefully curated, filtered, and increasingly synthetic datasets that yield more capable models without a proportional increase in size or compute.

The training-data ceiling

A model’s capability is bounded by the quantity, quality, and diversity of the data it was trained on. This makes access to good training data one of the central constraints on how far LLMs can continue to improve, and the search for more, better, and more diverse sources has become a hot topic in its own right.

Two pressures are tightening this constraint. The first is supply: the frontier labs are already approaching the limits of the high-quality public text available to train on, so simply scraping more of the web yields diminishing returns. The second is legal: shifting copyright law and licensing norms may narrow what labs are permitted to train on at all, putting further pressure on the available pool.

These pressures are a large part of why the frontier has turned toward data quality over raw volume — heavier curation and filtering, licensed and proprietary datasets, and increasingly synthetic data generated by other models (see distillation, below).

Fine-tuning

Fine-tuning is the process of taking a pre-trained model and continuing to train it on a smaller, task-specific dataset to adapt its behavior for a particular domain or use case.

Fine-tuning is how general-purpose base models are turned into specialized assistants. A model fine-tuned on medical literature will perform better on clinical tasks than the same base model used out-of-the-box. It can also be done for an individual customer on their own data — for example, fine-tuning on a company’s customer-service transcripts to match its house style and handle its particular queries. The common thread is customizing a model for a specific use case, typically by continuing to train it on a set of high-quality examples.

The examples used for fine-tuning need not be hand-curated up front. They can also come from production usage — most simply, from the thumbs-up and thumbs-down reactions users give to a model’s responses. This feedback signals which answers were good and which were not, and can be fed back in to refine the model, closing the loop between how a model is used and how it is subsequently improved.

Fine-tuning is commonly used to create new models. Rather than training a new model from scratch — which requires enormous compute and data — fine-tuning starts from an existing model’s weights and adjusts them incrementally.

A specialized form of fine-tuning is instruction tuning, which is training a base model on examples of instructions paired with desired responses. Most frontier models also undergo reinforcement learning from human feedback (RLHF), which can be thought of as a highly labor-intensive form of instruction tuning — see below.

LoRA (Low-Rank Adaptation), introduced by Microsoft researchers in 2021, is the most common fine-tuning technique for open-weight models such as Llama and Qwen. Instead of updating a model’s billions of weights — which is expensive, slow, and produces huge files — LoRA freezes the original weights and trains small "adapter" matrices alongside them. The upshot is that a LoRA for a 7B model might be 10–200 MB instead of 14 GB, trainable in hours on a single consumer GPU, and swappable in and out at inference time.

Variants include QLoRA (Quantized LoRA) and merged LoRAs, where the changes are baked back into the base weights.

RLHF

Reinforcement learning from human feedback (RLHF) is a post-training technique that aligns a model’s behavior with human preferences, making it safer, more helpful, and less likely to produce harmful outputs.

This is the stage that brings humans directly into the loop. A workforce — ranging from highly-paid domain experts to large numbers of low-paid contractors — reads the model’s answers and rates them for qualities like helpfulness, accuracy, and safety. That feedback reinforces good answers and discourages bad ones, and is used to refine the model.

The process works in two stages. First, human evaluators compare pairs of model outputs and select which is better. This data is used to train a separate reward model that learns to predict human preferences. Second, the main model is trained via reinforcement learning to maximize scores from the reward model. Because the second stage optimizes against the reward model rather than humans directly, it scales without requiring continuous human input.

RLHF is how base models are transformed into helpful assistants that follow instructions. Most frontier models use RLHF or a close variant (such as constitutional AI, RLAIF, or DPO) as part of their post-training pipeline.

Prompt-based adaptation

A lightweight alternative to fine-tuning is prompt-based adaptation. This is the process of crafting system prompts to steer model behavior without changing its weights.

The system prompt is a set of instructions provided as input to a new model session. It is typically hard-coded by application developers, rather than being something the end user inputs. Ollama’s Modelfiles are an example of prompt-based adaptation.

Prompt-based adaptation sees the optimizations being baked into the application layer that wraps a model, rather than into the underlying models themselves. This is simpler but less powerful than fine-tuning.

Knowledge cutoff

Because training is a one-time process, a model’s knowledge is frozen at the date its training data was collected. The model has no awareness of events, publications, or other changes in the world that occurred after that date. This is known as the knowledge cutoff. It is one of the core limitations of LLMs, and the primary motivation for augmenting LLMs with techniques such as retrieval-augmented generation (RAG) — see below.

Tokens

LLMs do not process text character-by-character or word-by-word. Instead, their base unit is a token.

A token is a chunk of text, typically a word, part of a word, or a punctuation character. Common words usually fit into a single token. Rarer or longer words may be split into multiple tokens. As a rough guide, one token is approximately four characters, or three-quarters of the average word in English.

Tokens are the unit of measurement for both input (the prompt) and output (the response). LLM pricing is typically quoted per million tokens (eg. $1.50/1M input, $3.25/1M output). A model’s context window (see below) and other capabilities are also measured in tokens.

Embeddings

Before a model can process tokens, the tokens must first be converted into a numerical form. An embedding is this numerical representation of a token. An embedding is a vector — a list of floating-point numbers — where each value captures some aspect of the token’s meaning.

Tokens with similar meanings produce vectors that are close together in this multi-dimensional space. This is how a model captures semantic and syntactic relationships between words.

Weights

Weights are a model’s learned parameters — the billions of numbers that encode everything it learned during training, and that determine how the model transforms inputs into output. Critically, these weights are learned from the training data, not explicitly programmed by hand. It is the combination of a vast training corpus and billions of such weights that lets a model emulate human communication so closely.

An LLM with 100 billion parameters has roughly that many weights, stored in memory or on disk.

Small changes in weights, eg. through fine-tuning, can dramatically alter a model’s behavior.

Mixture of experts

Mixture-of-experts (MoE) is a model architecture in which, rather than activating the entire network for every token, each token is routed through only a small subset of specialized "expert" subnetworks. This lets a model carry a very large total parameter count while activating only a fraction of it per token — high capacity at a lower inference cost.

Quantization

Quantization is a technique for reducing the numerical precision of a model’s weights, for example representing values with 4-bit integers instead of 16-bit or 32-bit floats. This reduces memory footprint and speeds up inference (see below), at the cost of a small reduction in output quality.

Quantization is what makes it practical to run large models on consumer hardware with limited VRAM.

Distillation

Distillation, properly known as knowledge distillation, is a technique for transferring the capabilities of a large, capable model into a smaller, cheaper one.

The large model is the teacher and the smaller model is the student. Rather than training the student from scratch on raw data, it is trained to imitate the teacher.

There are two broad approaches to distillation:

White-box: The distiller has full access to the teacher’s internals, including its output probabilities. This gives the richest training signal, and is used by labs distilling their own models.
Black-box: The distiller has access only to the teacher’s text outputs, typically through an API. The student is trained on synthetic data and prompt–response pairs generated by the teacher.

The result is a model that is far smaller and faster to run, yet retains much of the teacher’s capability. This is how many light models are produced — Google’s Gemini Flash models, for example, are distilled from the larger Gemini Pro models.

Black-box distillation of a competitor’s proprietary model is contentious. It generally violates the terms of service of the model being copied, and has become a flashpoint in the industry. DeepSeek R1 — the first highly-capable open-weight frontier reasoning model — was widely suspected of distilling OpenAI’s models. During the Musk versus Altman trial in 2026, Elon Musk acknowledged that xAI had distilled OpenAI’s models while training Grok.

Inference

Inference is what an LLM does to generate output — running data through the model’s fixed weights to produce each next token.

Inference is distinct from training, fine-tuning, and quantization, which create a model in the first place. Inference is how a model, once made, works. The term comes from logic and statistics, where "inference" means drawing a conclusion from available information.

When you type a prompt and get a response back, the whole behind-the-scenes process — loading the model, tokenizing your input, generating each output token — is inference.

Compared to training, inference is relatively cheap, and depending on model size can run on a single workstation GPU. Available VRAM is the primary constraint on both the size of model you can run and its throughput, measured in tokens per second.

Context window

The context window is the maximum number of tokens (the context) an LLM can process in a single interaction. It is the combined total of the system prompt, conversation history, and the current input and output.

Text outside the context window is not visible to the model.

Context window size is one of the most impactful characteristics of models on the user experience. A larger context window allows a model to consider more of a document, a longer conversation history, or a larger codebase at once.

Modern frontier models support context windows of 128K tokens or more. Some now extend to 1M tokens and beyond.

The key-value cache (KV cache) stores the attention states of previously processed tokens so they don’t have to be recomputed for every new token. This trades memory for compute and is what makes long, interactive generation practical. See key-value cache (KV cache) for details.

Compaction

Compaction is the process of condensing a long conversation history into a shorter summary once it approaches the limits of the context window, so that a session can continue beyond what the context window would otherwise allow.

Rather than dropping the oldest messages outright, the model (or a separate summarization step) distills the prior exchanges into a compact representation of what matters — decisions made, facts established, work-in-progress — which then replaces the raw history.

The trade-off is fidelity for capacity. Compaction frees up space and keeps latency and cost bounded as a conversation grows, but it is inherently lossy. Detail that seemed unimportant at summarization time is discarded and cannot be recovered, and errors introduced into the summary propagate silently through everything that follows.

Compacting too aggressively risks losing context the model still needs. Compacting too rarely wastes tokens and eventually hits the hard context limit.

It also has a compute cost of its own, since generating the summary is itself an inference pass.

Well-designed systems therefore compact selectively — preserving verbatim what is cheap and load-bearing (such as the system prompt and recent turns) while summarizing the older, bulkier middle of the conversation.

Context rot

In the transformer architecture, every token can attend to every other token across the entire context, producing n² pairwise relationships for n tokens. As the context window fills, a model’s ability to manage these pairwise relationships gets stretched thin. Thus, as the number of tokens in the context window increases, the model’s ability to accurately recall information from that context decreases — a concept known as context rot.

Context rot is not a hard cut-off. A model’s performance does not suddenly drop off when context size exceeds the model’s context window. We observe a performance gradient rather than a hard cut-off. Models remain capable with large contexts, but show reduced precision for information retrieval and long-range reasoning compared to their performance on shorter contexts.

This phenomenon is accentuated by optimization techniques like compaction.

This emergent behavior, known as a model’s attention budget, is analogous to working memory capacity in humans. We too can lose focus and get confused once we reach a certain point of information overload. Like people, some models exhibit more gentle context rot than others, but this behavior is observed across all models.

Context engineering is fundamentally about the problem of optimizing the signal-to-noise ratio of a model’s context.

Temperature and sampling

When generating a response, an LLM does not simply pick the most probable next token at every step. That would produce repetitive, deterministic output. Instead, the model samples from a probability distribution of possible next tokens. In other words, some degree of randomness is deliberately designed in.

Temperature is the parameter that controls how sharp or flat this distribution is. At low temperatures (close to 0), the model almost always picks the highest-probability token, producing focused and predictable output. At higher temperatures, lower-probability tokens become more likely to be chosen, producing more varied and creative — but also less reliable — output.

For factual or code-generation tasks, a low temperature is usually preferable. For creative writing, a higher temperature produces more interesting results.

Many models expose temperature as a configurable parameter.

Hallucinations

LLMs generate text by predicting the most statistically plausible next token, not by reasoning from facts. As a result, they can produce outputs that are fluent and confident but factually incorrect — a phenomenon known as hallucination.

A model may fabricate citations, invent plausible-sounding names, or state incorrect information as though it were certain. These are all examples of hallucinations.

Hallucinations are an inherent limitation of how LLMs work, not a bug that can be fully fixed. The practical mitigations are:

Setting lower temperatures to sample only the highest-probability next tokens.
Grounding model outputs in retrieved facts (RAG — see below).
Verifying outputs independently.

Computational irreducibility

Some tasks like multi-digit arithmetic, tracing the execution of a novel algorithm, or simulating a complex system step-by-step, cannot be shortcut using LLMs due to their inherent limitations.

LLMs are trained to recognize and extrapolate statistical patterns, not to carry out arbitrary chains of exact computation. This is why they can fail at tasks that seem "simple" to a conventional computer, such as multiplying two large numbers, while succeeding at tasks that are traditionally hard for computers to do, such as writing a coherent essay. It turns out that essay writing is more statistically patterned (and computationally shallower) than exact step-by-step computation.

Stephen Wolfram calls this property computational irreducibility.

This is one reason agentic systems pair LLMs with external tools such as calculators, code interpreters, and REPLs. The idea is to offload irreducible computation to a traditional system which can actually execute the computation, rather than asking the model to predict the answer token by token.

Retrieval-augmented generation (RAG)

A core limitation of LLMs is that their knowledge is frozen at the point of training. By default, models cannot access new information or browse the web. In addition, LLMs are prone to hallucination — generating fluent, confident-sounding responses that are factually incorrect — particularly when asked about topics outside their training data or at the edges of their knowledge.

Retrieval-augmented generation (RAG) is a technique that addresses these issues. It combines an LLM with an external knowledge source, which is retrieved at inference time and loaded into the context window.

In a RAG system, a user’s query is used to search a knowledge base — a vector database, a search index, or live web results. The most relevant results are fetched and injected into the model’s context alongside the original query. The model then generates a response grounded in that retrieved context, rather than relying solely on what it learned during training.

RAG is widely used to build knowledge-aware applications — internal document Q&A systems, customer support bots, and AI search engines. Perplexity is a prominent consumer example. It uses live web search to answer questions with cited sources in real time. See AI assistants for other examples.

Agent development frameworks such as LangChain and LlamaIndex can be used to build RAG capabilities into agents, providing the retrieval, chunking, and orchestration plumbing so developers do not have to wire it together from scratch. See agents.

Evals

Evals (short for evaluations) are structured tests used to measure how well an LLM or LLM-powered application performs on specific tasks. They are the AI equivalent of regression tests.

An eval runs the model through a predefined set of inputs and compares its responses against expected outputs or criteria — such as factual correctness, safety, helpfulness, tone, or task completion. This allows developers to quantify progress, catch regressions when switching models or changing prompts, and guide improvements over time.

Evals are particularly important in production applications, where the quality of model outputs directly affects the user experience. They are often cited as the most critical, yet most overlooked, practice in building reliable AI products.

Open-source frameworks supply ready-made evaluation harnesses and shareable test suites. OpenAI Evals, for example, is a framework for evaluating LLMs paired with an open registry of reusable evals that anyone can contribute to.

The public leaderboards and benchmarks for models are created from standardized evals run across many models.

GPU compute

Both training and inference are computationally intensive and rely heavily on GPUs, which excel at the massively parallel matrix operations that underpin neural networks.

The dominant GPU programming frameworks are listed below. These are the software stacks that allow programs like llama.cpp or vLLM (see below) to access GPUs for general compute, rather than purely for graphics workloads.

CUDA: NVIDIA’s proprietary parallel computing platform, dominant in AI/ML workloads due to its maturity and deep integration with frameworks like PyTorch and TensorFlow. Requires NVIDIA hardware.
ROCm: AMD’s open-source GPU computing platform, their equivalent of CUDA for AMD hardware. Growing PyTorch and TensorFlow support, but still lags behind CUDA in ecosystem maturity.
OpenCL: An open standard for parallel computing across heterogeneous hardware (CPUs, GPUs, FPGAs) from any vendor. More portable than CUDA but less widely adopted in the AI ecosystem.

Inference engines

Inference engines are the low-level runtimes that actually execute a model — loading weights into memory and running the computations that produce each token. Model managers like Ollama and LM Studio (see below) are built on top of them.

The two most widely used inference engines for open-weight models are:

llama.cpp: A C/C++ library optimized for running quantized models on consumer hardware, including CPU-only machines, Apple Silicon, and devices with limited VRAM. This is the engine that powers Ollama and LM Studio (see below). The design prioritizes broad hardware compatibility over maximizing throughput.
vLLM: A Python-based inference engine designed for high-throughput via one or more GPUs. The key differentiating feature is that vLLM supports multi-GPU tensor parallelism, which means workloads are more evenly distributed across multiple GPUs. This efficiency shows up in better throughput metrics. The key innovation is "PagedAttention", which manages the KV cache very efficiently, which is what enabled the higher request concurrency.

llama.cpp and vLLM have different use cases. llama.cpp is optimized for single-GPU setups on consumer hardware, for a single user running models locally. vLLM is optimized for serving multiple users concurrently.

Llama.cpp can be made to use multiple GPUs, but it does not support tensor parallelism in the same way that vLLM does. In llama.cpp, model layers are assigned to GPUs rather than operations being distributed across them, which limits throughput gains from multi-GPU setups.

Model access layers

Although it is possible to interact with a model directly, for most use cases you reach it via an access layer, which handles authentication, response formatting, and other details specific to a particular provider or use case. See model access layer.

References

OpenAI Cookbook, OpenAI — Example code, recipes, and best practices for working with the OpenAI APIs.
LLM Engineer Handbook — Curated collection of links and resources for AI engineers.
What Is ChatGPT Doing … and Why Does It Work?, Stephen Wolfram — Accessible first-principles explanation of next-token prediction, embeddings, and the transformer’s attention mechanism, plus the concept of computational irreducibility.

Books

Build a Large Language Model (from Scratch), Sebastian Raschka — Builds a transformer in raw PyTorch, layer by layer.
Hands-On Large Language Models, Jay Alammar & Maarten Grootendorst — Visual, practical guide to LLM applications.
LLM Engineer’s Handbook, Paul Iusztin & Maxime Labonne — Production LLMOps: fine-tuning, quantization, serving.
The Hundred-Page Language Models Book, Andriy Burkov — Concise, math-grounded path from n-grams to transformers.