Large language model (LLM)
A large language model (LLM) is a deep learning model trained on vast amounts of text. At a fundamental level, LLMs work by predicting the probability of each possible next token in a sequence, then sampling from that distribution. But from this primitive process unexpected capabilities arise. At scale, LLMs demonstrate the ability to reason, summarize, translate, write code, and hold human-like conversations. These unexpected capabilities are known as emergence.
LLMs are a category of generative AI tools. LLMs are increasingly deployed in a wide variety of contexts. AI assistants are perhaps the best-known category of application software built on top of LLMs, but many other use cases are being found for LLMs.
This document covers how LLMs work, from training and inference to context windows and hallucinations. It also summarizes the current model landscape, with a focus on frontier models and open-weight alternatives. This document also covers the access layers through which we interact with models, which provide the front-ends for direct human interaction with models and for integrations with other automated processes.
How LLMs work
Transformer architecture
Modern LLMs are built on the transformer architecture, which was introduced in Google’s 2017 research paper "Attention Is All You Need".
The core innovation of the transformer architecture is the attention mechanism. This is what allows a model to weigh the relevance of every token against every other token in the current context when predicting the next token. This enables transformers to capture long-range dependencies in text far more effectively than earlier architectures. See transformer architecture for a full explanation.
Training
In the development of LLMs, training is the computationally enormous process of adjusting a model’s weights by exposing the model to vast amounts of data.
Training a frontier model requires data centers full of specialized hardware running for weeks, or even months. Training demands maximum compute throughput, and so benefits enormously from features like high-bandwidth inter-GPU connections.
Post-training
Post-training refers to all the additional steps applied to a model after pre-training is complete, to make it more useful, safe, and aligned with human intentions. Key post-training techniques include fine-tuning and RLHF, both described below.
Fine-tuning
Fine-tuning is the process of taking a pre-trained model and continuing to train it on a smaller, task-specific dataset to adapt its behavior for a particular domain or use case. Rather than training from scratch – which requires enormous compute and data – fine-tuning starts from an existing model’s weights and adjusts them incrementally.
Fine-tuning is how general-purpose base models are turned into specialized assistants. A model fine-tuned on medical literature will perform better on clinical tasks than the same base model used out-of-the-box.
A specialized form of fine-tuning is instruction tuning, which is training a base model on examples of instructions paired with desired responses. Most frontier models also undergo RLHF as part of their post-training pipeline – see below.
LoRA (Low-Rank Adaptation) is a fine-tuning technique for large neural networks introduced by Microsoft researchers in 2021. Instead of updating the billions of weights in a model – which is expensive, slow, produces huge files – LoRA freezes the original weights and trains small "adapter" matrices alongside them. The practical upshot is that a LoRA for a 7B-parameter model might be 10-200 MB instead of 14 GB, and can be trained in hours on a single consumer GPU rather than days on a cluster. The LoRA can also be swapped in and out at inference time. This technique is commonly used now to fine-tune open-weight models such as Llama and Qwen. Variants include QLoRA (Quantized LoRA) and merged LoRAs (where the changes are baked back into the base weights).
A lightweight alternative to fine-tuning is prompt-based adaptation. This is the process of crafting system prompts to steer model behavior without changing its weights. This is simpler but less powerful than fine-tuning. (The system prompt is a set of instructions provided to the model before a conversation begins. The system prompt is typically hard-coded by application developers, rather than being something the end user inputs. Ollama’s Modelfiles are an example of prompt-based adaptation.)
RLHF
Reinforcement learning from human feedback (RLHF) is a post-training technique that aligns a model’s behavior with human preferences — making it safer, more helpful, and less likely to produce harmful outputs.
The process works in two stages. First, human evaluators compare pairs of model outputs and select which is better. This data is used to train a separate reward model that learns to predict human preferences. Second, the main model is trained via reinforcement learning to maximize scores from the reward model. Because the second stage optimizes against the reward model rather than humans directly, it scales without requiring continuous human input.
RLHF is how base models — which simply predict the next token — are transformed into helpful assistants that follow instructions. Most frontier models use RLHF or a close variant (such as constitutional AI, RLAIF, or DPO) as part of their post-training pipeline.
Knowledge cutoff
Because training is a one-time process, a model’s knowledge is frozen at the date its training data was collected. The model has no awareness of events, publications, or other changes in the world that occurred after that date. This is known as the knowledge cutoff. It is one of the core limitations of LLMs, and the primary motivation for augmenting LLMs with techniques such as retrieval-augmented generation (RAG) – see below.
Tokens
LLMs do not process text character-by-character or word-by-word. Instead, their base unit is a token.
A token is a chunk of text, typically a word, part of a word, or a punctuation character. Common words usually fit into a single token. Rarer or longer words may be split into multiple tokens. As a rough guide, one token is approximately four characters, or three-quarters of the average word in English.
Tokens are the unit of measurement for both input (the prompt) and output (the response). LLM pricing is typically quoted per million tokens (eg. $1.50/1M input, $3.25/1M output). A model’s context window (see below) and other capabilities are also measured in tokens.
Embeddings
Before a model can process tokens, the tokens must first be converted into a numerical form. An embedding is this numerical representation of a token. An embedding is a vector – a list of floating-point numbers – where each value captures some aspect of the token’s meaning.
Tokens with similar meanings produce vectors that are close together in this multi-dimensional space. This is how a model captures semantic and syntactic relationships between words.
Weights
Weights refer to a model’s learned parameters – the billions of numbers that encode everything the model learned during training. Weights determine how a model transforms inputs into output.
Small changes in weights, eg. through fine-tuning, can dramatically alter the behavior of a model.
The embedding lookup table is one part of a model’s weights.
An LLM with 100 billion parameters has roughly that many weights, stored either in memory or on disk.
Inference
Inference is what an LLM does to generate output. This process is entirely distinct from training, fine-tuning, and quantization, which are processes to create a model in the first place. Inference is how a model, once made, works.
The term comes from logic and statistics, where "inference" means drawing a conclusion from available information. In LLMs, inference is the process of calculating the next token based on past learning.
When you type a prompt into an LLM and get a response back, the entire process that happens behind-the-scenes – loading the model, tokenizing your input, generating each token of the output – is inference.
The model’s weights – the billions of numerical parameters that encode everything it learned during training – are fixed. Inference is the process of running data through those fixed weights to produce an output.
Compared to training, inference is relatively cheap. Depending on the model size, it can be done on a single workstation GPU. Available VRAM is the primary constraint on the size of the model you can run, and the performance – the throughput, measured in tokens per second – of that model.
Quantization
Quantization is a technique for reducing the numerical precision of a model’s weights, for example representing values with 4-bit integers instead of 16-bit or 32-bit floats. This reduces memory footprint and speeds up inference, at the cost of a small reduction in output quality.
Quantization is what makes it practical to run large models on consumer hardware with limited VRAM.
Distillation
Distillation (or knowledge distillation) is a technique for transferring the capabilities of a large, capable model into a smaller, cheaper one. The large model is the teacher and the smaller model is the student. Rather than training the student from scratch on raw data, it is trained to imitate the teacher — either by matching the teacher’s full output probability distribution (the "soft labels", which carry more information than a single correct answer alone), or simply by training on large volumes of synthetic data generated by the teacher.
The result is a model that is far smaller and faster to run, yet retains much of the teacher’s capability. This is how many light models are produced (see Light models, below): Google’s Gemini Flash models, for example, are distilled from the larger Gemini Pro models.
Distillation can be done in two broad settings:
-
White-box — the distiller has full access to the teacher’s internals, including its output probabilities. This gives the richest training signal, and is used by labs distilling their own models.
-
Black-box — the distiller has access only to the teacher’s text outputs, typically through an API. The student is trained on prompt–response pairs generated by the teacher.
Black-box distillation of a competitor’s proprietary model is contentious. It generally violates the terms of service of the model being copied, and has become a flashpoint in the industry. DeepSeek was widely suspected of distilling OpenAI’s models; the US has characterized the practice as industrial espionage when done by Chinese labs; and in April 2026, during the Musk v. Altman trial, Elon Musk acknowledged that xAI had distilled OpenAI’s models while training Grok. Once "a common practice in the AI world", distilling others' models is becoming less and less accepted.
Context window
The context window – or simply the "context" – is the maximum number of tokens an LLM can process in a single interaction. It is the combined total of the system prompt, conversation history, and the current input and output. Text outside the context window is not visible to the model.
Context window size is one of the most impactful characteristics of models on the user experience. A larger context window allows a model to consider more of a document, a longer conversation history, or a larger codebase at once.
Modern frontier models support context windows of 128K tokens or more. Some extend to 1M tokens and beyond.
KV cache
During inference, the attention mechanism requires each token to be compared against every other token in the context. Doing this from scratch for every new token generated would be enormously wasteful. The keys and values for all previously processed tokens would have to be recomputed identically at each step.
A key-value cache solves this by storing the computed key and value vectors for every token already processed, so they only need to be calculated once and can be reused as each subsequent token is generated. This trades memory for compute, and is what makes token-by-token generation practical at interactive speeds.
The cost is that KV caches grow linearly with context length and model depth. A large context window requires proportionally more VRAM to hold its KV cache. This is one reason why large context inference is memory-intensive even when the model itself fits comfortably in VRAM.
This is also the problem that vLLM’s PagedAttention addresses (see the section on inference engines, below). It manages KV cache memory more efficiently to support higher request concurrency.
Temperature and sampling
When generating a response, an LLM does not simply pick the most probable next token at every step. That would produce repetitive, deterministic output. Instead, the model samples from a probability distribution of possible next tokens.
Temperature is the parameter that controls how sharp or flat this distribution is. At low temperatures (close to 0), the model almost always picks the highest-probability token, producing focused and predictable output. At higher temperatures, lower-probability tokens become more likely to be chosen, producing more varied and creative – but also less reliable – output.
For factual or code-generation tasks, a low temperature is usually preferable. For creative writing, a higher temperature produces more interesting results.
Most LLMs expose temperature as a configurable parameter.
Hallucinations
LLMs generate text by predicting the most statistically plausible next token, not by reasoning from facts. As a result, they can produce outputs that are fluent and confident but factually incorrect – a phenomenon known as hallucination.
A model may fabricate citations, invent plausible-sounding names, or state incorrect information as though it were certain. These are all examples of hallucinations.
Hallucinations are an inherent limitation of how LLMs work, not a bug that can be fully fixed. The practical mitigations are:
-
Grounding model outputs in retrieved facts (RAG – see below).
-
Verifying outputs independently.
Evals
Evals (short for evaluations) are structured tests used to measure how well an LLM or LLM-powered application performs on specific tasks. They are the AI equivalent of unit tests and regression tests.
An eval runs the model through a predefined set of inputs and compares its responses against expected outputs or criteria — such as factual correctness, safety, helpfulness, tone, or task completion. This allows developers to quantify progress, catch regressions when switching models or changing prompts, and guide improvements over time.
Evals are particularly important in production applications, where the quality of model outputs directly affects user experience. They are often cited as the most critical, yet most overlooked, practice in building reliable AI products.
Teams typically write their own evals tailored to their use case, but open-source frameworks supply ready-made harnesses and shareable test suites. OpenAI Evals, for example, is a framework for evaluating LLMs paired with an open registry of reusable evals that anyone can contribute to. The public benchmarks described under Leaderboards and benchmarks (below) are, in effect, standardized evals run across many models.
Retrieval-augmented generation (RAG)
A core limitation of LLMs is that their knowledge is frozen at the point of training. By default, models cannot access new information or browse the web. In addition, LLMs are prone to hallucination — generating fluent, confident-sounding responses that are factually incorrect — particularly when asked about topics outside their training data or at the edges of their knowledge.
Retrieval-augmented generation (RAG) is a technique that addresses these issues. It combines an LLM with an external knowledge source, which is retrieved at inference time and loaded into the context window.
In a RAG system, a user’s query is used to search a knowledge base (a vector database, a search index, or live web results). The most relevant results are fetched and injected into the model’s context alongside the original query. The model then generates a response grounded in that retrieved content, rather than relying solely on what it learned during training.
RAG is widely used to build knowledge-aware applications – internal document Q&A systems, customer support bots, and AI search engines. Perplexity is a prominent consumer example. It uses RAG over live web search to answer questions with cited sources in real time. See AI assistants for other examples.
Frameworks such as LangChain and LlamaIndex are commonly used to build RAG pipelines, providing the retrieval, chunking, and orchestration plumbing so developers do not have to wire it together from scratch.
GPU compute
Both training and inference are computationally intensive and rely heavily on GPUs, which excel at the massively parallel matrix operations that underpin neural networks.
The dominant frameworks for programming GPUs are listed below. These are the software stacks that allow programs like llama.cpp or vLLM (see below) to access GPUs for general compute, rather than graphics workloads.
-
CUDA – NVIDIA’s proprietary parallel computing platform, dominant in AI/ML workloads due to its maturity and deep integration with frameworks like PyTorch and TensorFlow. Requires NVIDIA hardware.
-
ROCm – AMD’s open-source GPU computing platform, their equivalent of CUDA for AMD hardware. Growing PyTorch and TensorFlow support, but still lags behind CUDA in ecosystem maturity.
-
OpenCL – An open standard for parallel computing across heterogeneous hardware (CPUs, GPUs, FPGAs) from any vendor. More portable than CUDA but less widely adopted in the AI ecosystem.
Inference engines
Inference engines are the low-level runtimes that actually execute a model – loading weights into memory and running the computations that produce each token. Model managers like Ollama and LM Studio (see below) are built on top of them.
The two most widely used inference engines for open-weight models are:
-
llama.cpp – A C/C++ inference engine optimized for running quantized models on consumer hardware, including CPU-only machines, Apple Silicon, and devices with limited VRAM. Uses the GGUF model format. It is the engine that powers Ollama and LM Studio. Prioritizes broad hardware compatibility over maximum throughput. Llama.cpp can be made to use multiple GPUs, but it does not support tensor parallelism in the same way that vLLM does — model layers are assigned to GPUs rather than operations being distributed across them, which limits throughput gains from multi-GPU setups.
-
vLLM – A Python-based inference engine designed for high-throughput via one or more GPUs. The key differentiating feature is that vLLM supports multi-GPU tensor parallelism, which means (in plain English) workloads are more evenly distributed across multiple GPUs. This efficiency shows up in better throughput metrics. vLLM is really intended for optimizing multi-GPU workloads that serve multiple users. Its key innovation is PagedAttention, which manages the KV cache more efficiently, enabling higher request concurrency. This is a good choice when serving many users concurrently, rather than running models locally for a single user – though you can use it for that purpose too.
Choosing LLMs
In choosing an LLM, you need to appreciate the tradeoffs to be made between capability, size, speed, and cost. The bigger the AI model, the more capable it is. But the trade-off is that the big, advanced "frontier" models are also slow and costly to run.
At the other end of the spectrum are small models that are optimized for speed rather than capability. Some of these small models may also specialize in particular domains, rather than being capable of lots of different tasks.
Somewhere in the middle are the general-purpose models that find a good balance between these factors, and can do most tasks well.
Leaderboards and benchmarks
Several independent sites aggregate model rankings, pricing, speed, and availability – useful for comparing options before committing to a model:
-
Artificial Analysis – Cross-provider leaderboards spanning intelligence, output speed (tokens/sec), latency, context window, and price. The source for many of the model stats on this page.
-
OpenRouter rankings – Rankings derived from real usage across the OpenRouter platform (a single API to many providers) – a useful signal of what developers actually deploy.
-
LMArena – Crowd-sourced, blind, head-to-head human voting (formerly LMSYS Chatbot Arena), reported as Elo ratings.
-
Vellum LLM Leaderboard – Side-by-side comparison of benchmark scores, context windows, speed, latency, and per-token pricing, with a separate open-weight leaderboard.
-
llm-stats.com – Aggregated benchmark scores, pricing, and context-window comparisons.
-
Epoch AI – Research-grade tracking of frontier benchmark results over time.
-
Open LLM Leaderboard (Hugging Face) – Standardized benchmarks for open-weight models.
Specific benchmark suites measure particular capabilities:
-
SWE-bench – Tasks models with resolving real GitHub issues against real repositories. The human-validated SWE-bench Verified subset is the de-facto standard for coding ability and is widely cited (including in the model tables on this page). Maintained by Princeton and Stanford.
-
Terminal-Bench – Benchmarks agentic front-ends on real terminal tasks, backed by various models.
-
Aider Polyglot – Code-editing benchmark over 225 Exercism exercises across six programming languages.
-
LiveBench – Contamination-resistant benchmark whose questions are refreshed monthly, covering maths, coding, reasoning, language, instruction-following, and data analysis.
The model tables on this page also cite academic benchmarks, including GPQA Diamond (graduate-level science reasoning), AIME (competition mathematics), MMLU / MMMU (broad multi-subject knowledge), ARC-AGI-2 (abstract reasoning), Humanity’s Last Exam (HLE) (expert-level frontier questions), and GDPval (economically-valuable knowledge work).
Matching tasks to models
The single most useful question when choosing a model is: what kind of task is this? Each task type stresses a different capability — language quality, step-by-step reasoning, code quality, factual grounding, raw speed — and that dominant capability, not the model’s headline benchmark score, is what should drive the choice. The aim is to match the model to the work, rather than defaulting to the biggest model for everything (wasteful) or the cheapest (unreliable).
| Task type | What matters most | Recommended model | Thinking level |
|---|---|---|---|
Prose & language — proofreading, copy-editing, summarizing, translation, tone |
Language quality, instruction-following |
Strong frontier or mid-tier chat model (Claude Sonnet/Opus, GPT-class) |
Low / none |
Complex reasoning — maths, logic, planning, multi-step analysis |
Step-by-step reasoning |
Reasoning model with extended thinking (OpenAI o3, DeepSeek-R1, Claude/Gemini thinking modes) |
High |
Coding |
Code quality, large context window |
Strong coding model (Claude Sonnet/Opus; Qwen2.5-Coder locally) |
Scale to task complexity |
Factual lookup, research, current events |
Grounding in retrieved facts |
RAG- or search-augmented model (Perplexity/Sonar) |
Low–medium |
High-volume, simple, latency-sensitive — classification, extraction, routing |
Speed and cost |
Light/flash model (Claude Haiku, Gemini Flash, GPT mini) |
None |
Multimodal — image, audio, or video as input or output |
Modality support |
Multimodal frontier model (Gemini 3 Pro, GPT-4o) |
Task-dependent |
Semantic search, RAG retrieval |
Quality of the vector representation |
Embedding model, plus a reranker for a second pass |
N/A |
Privacy-sensitive or offline |
Data locality |
Open-weight model run locally, sized to your hardware |
Task-dependent |
Two principles underpin the table:
-
Match the model to the dominant capability the task demands. A task that is easy along one axis may be hard along another — proofreading is linguistically demanding but reasoning-light, whereas a maths word problem is the reverse.
-
Thinking level is a separate dial from model tier. Extended thinking (see Reasoning models, below) helps on problems with a verifiable chain of logic — maths, code, planning — but adds latency and cost for little benefit on language tasks, and can even push a model to over-edit. Turn it down for prose, up for reasoning.
Capability types
Beyond size and cost, models differ significantly in what they are designed to do. The main capability types are:
Base models are the raw, pre-trained models before any instruction tuning or RLHF is applied. They predict the next token without any conversational or helpfulness alignment, and are rarely used directly. They serve as the starting point for fine-tuning. Examples: Llama 3 (base), Mistral 7B (base).
Chat models are instruction-tuned to follow natural-language instructions and hold multi-turn conversations. These are the general-purpose workhorses — the models behind ChatGPT, Claude, and Gemini. Almost all frontier and mid-tier models fall into this category. Examples: GPT-4o, Claude Sonnet 4.6, Gemini 2.5 Pro, Llama 3.3 70B Instruct.
Flash models: These are optimized for producing a response in a fraction of the time it takes other models. They often rely on shortcuts like caching or pre-computing certain results to speed up generation.
Code completion models are optimized for programming tasks and typically deployed as IDE autocomplete backends rather than chat interfaces. A distinctive capability is fill-in-the-middle (FIM): generating code that fits between an existing prefix and suffix, which is how IDE autocomplete works. Examples: Codestral, Code Llama, Qwen2.5-Coder.
Reasoning models spend extra compute working through a problem step-by-step — producing an internal chain of thought — before arriving at a final answer. They trade higher latency and token cost for substantially better performance on complex mathematical, logical, and multi-step problems. See the Reasoning models section below. Examples: OpenAI o3, DeepSeek-R1, Claude with extended thinking.
Embedding models convert text into dense numerical vectors that represent the semantic meaning of the input. Unlike generative models, they produce a fixed-length vector rather than text. These vectors are used to power semantic search, RAG pipelines, clustering, and classification. Examples: OpenAI text-embedding-3, Cohere Embed, Nomic Embed, BGE.
Reranker models are used as a second pass in RAG pipelines. After an initial retrieval step fetches candidate documents, a reranker scores each candidate’s relevance to the query more precisely, so the most useful results are surfaced before being passed to the generative model. Examples: Cohere Rerank, BGE Reranker, cross-encoder models.
Multimodal models accept inputs — or produce outputs — beyond text, such as images, audio, and video. Most modern frontier models are multimodal to some degree. Examples: GPT-4o (image input), Gemini 3 Pro (image, video, and audio input and output), Claude Opus 4.7 (image input).
Frontier models
A frontier model is a model – or, more commonly, a family of models – that represents the current state of the art.
Frontier models are typically the largest models, trained on the most data, with the best performance across a broad range of tasks. They are developed by well-resourced AI labs and accessed via paid APIs rather than downloaded and run locally.
Frontier models are usually proprietary – which means their weights are not publicly released. The companies that develop them invest heavily in safety evaluation and alignment – the process of steering model behavior to be helpful, honest, and to avoid harmful outputs.
Modern frontier models are increasingly multimodal – capable of processing not just text but also images, audio, and video as inputs, and in some cases generating them as outputs.
Examples of frontier model families include:
Other notable proprietary providers include Cohere, whose Command family targets enterprise use cases with strong retrieval-augmented generation support (Cohere also produces the widely-used Embed and Rerank models – see Capability types, below). The leading Qwen (Alibaba), DeepSeek, Kimi (Moonshot AI), and GLM (Z.ai) families also reach near-frontier performance while releasing open weights – covered under Open-weight models, below.
As of March 2026, OpenAI GPT 5.2 is the flagship model in OpenAI’s GPT lineup. It’s a very well-rounded model – it can do a bit of everything (multimodality) and is good at chaining together multiple actions. (The following stats are sourced from Artificial Analysis.)
| Category | Details |
|---|---|
Architecture |
Advanced transformer variant (proprietary) |
Model size |
~150B-300B parameters (estimate, exact size not publicly disclosed) |
Price |
$1.75/1M input tokens • $14/1M output tokens • $0.175/1M cached input |
Speed |
~150-300 TPS (tokens per second) |
Context window |
256K tokens |
Benchmarks |
GDPval (knowledge work) 70.9% • SWE-Bench Verified (real-world coding) 80% • GPQA Diamond (graduate-level science reasoning) 92.4% • ARC-AGI-2 (abstract reasoning) 52.9% |
Multimodality |
Text • Code • Image |
For Anthropic, as of May 2026 its Claude Opus 4.7 is its flagship model. It’s another all-rounder, but specializes in text and code generation, rather than aiming for multimodality. Unlike GPT 5.2, it cannot do image generation directly. It is also currently the most expensive and slowest of the frontier models, but it is highly regarded for its coding abilities.
| Category | Details |
|---|---|
Architecture |
Advanced transformer variant (proprietary) |
Model size |
Not publicly disclosed |
Price |
$5/1M input tokens • $25/1M output tokens • $0.5/1M cached input |
Speed |
~72 TPS |
Context window |
200K tokens (standard) • 1M tokens (beta) |
Benchmarks |
GDPval-AA (knowledge work) Elo 1606 • SWE-Bench Verified (real-world coding) 80.8% • GPQA Diamond (graduate-level science reasoning) 91.3% • ARC-AGI-2 (abstract reasoning) 68.8% |
Multimodality |
Text • Code |
But, as of May 2026, it is Grok 4.1 that is widely considered to be the most capable of the frontier models. It is fast, has a very large context window, and is cheap relative to other frontier models. It is noted for strong reasoning across diverse tasks and a highly permissive output style.
| Category | Details |
|---|---|
Architecture |
Advanced transformer variant (proprietary) |
Model size |
~1.5T-2T (estimate, exact size not publicly disclosed) |
Price |
$0.20/1M input tokens • $0.50/1M output tokens |
Speed |
~170 TPS |
Context window |
2M tokens |
Benchmarks |
LMArena Text Arena 1843 Elo (#1) • GPQA Diamond (graduate-level science reasoning) 85.7-89.4% • AIME 2025 (math) 92-94% • ARC-AGI-2 (abstract reasoning) 52.9% in thinking mode |
Multimodality |
Text • Code • Image |
As of May 2026, Gemini 3 Pro is the current flagship in Google’s Gemini series. Its performance is pretty on par with the rival flagship models, but it has two things going for it: a large context window, and (what really makes it stand out) multimodality capabilities. It can analyze and generate all kinds of media, not only text.
| Category | Details |
|---|---|
Architecture |
Advanced transformer variant (proprietary) |
Model size |
~2T (estimate, exact size not publicly disclosed) |
Price |
$2/1M input tokens • $12/1M output tokens (standard) |
Speed |
~90-120 TPS |
Context window |
1M-2M tokens |
Benchmarks |
LMArena Text Arena 1501 Elo • GPQA Diamond (graduate-level science reasoning) 91.9% • Humanity’s Last Exam (HLE) 37.5% • ARC-AGI-2 (abstract reasoning) 84.6% in deep thinking mode |
Multimodality |
Text • Code • Image • Video • Audio |
Mid-tier models
The mid-tier models are the workhorses that you will probably use 80% of the time for general tasks. They offer a good balance between capability, size, speed, and cost. These are the defaults for a lot of agentic workflows.
Examples include:
| Model | Developer | Cost | Comments |
|---|---|---|---|
Claude Sonnet 4.6 |
Anthropic |
$3 input / $15 output per 1M tokens |
Best-in-class coding performance (77.2% SWE-bench) • 1M context window • Extended thinking capabilities |
GPT-4.1 |
OpenAI |
$2 input / $8 output per 1M tokens |
Reliable generalist with strong instruction following • 1M context window • 103 TPS throughput |
Llama 3.3 70B |
Meta |
$0.10 input / $0.32 output per 1M tokens |
Open-weight workhorse approaching 405B performance for text tasks • 131K context • Wide provider availability |
Qwen3 8B |
Alibaba |
$0.05 input / $0.40 output per 1M tokens |
Compact, high-performance model • Excellent price-to-capability ratio • 32K context window |
Claude Sonnet 4.6 is a mid-tier model from Anthropic that is particularly strong at coding tasks. It’s cheaper than Opus 4.7, the flagship model, but still perfectly capable at advanced programming tasks, including building greenfield projects from scratch.
Light models
Light models are faster and cheaper, but less capable than the frontier/flagship models. Yet some of these are still incredibly capable. Gemini 3 Flash, for example, maintains most of the capabilities of the flagship Gemini 3 Pro model. This is achieved by knowledge distillation, whereby a smaller student model is trained to replicate the behaviour of a larger teacher model, producing a model that is smaller in size but retains much of the larger model’s capability.
| Model | Developer | Speed | Comments |
|---|---|---|---|
Gemini 3 Flash |
180-250 TPS |
Pro-grade reasoning at Flash latency • 3x faster than Gemini 2.5 Pro with 30% better token efficiency |
|
GPT-5 mini |
OpenAI |
80-130 TPS |
Latest mini model • Optimized for high-throughput tasks and API chaining |
Claude Haiku 4.5 |
Anthropic |
86-150 TPS |
Matches Sonnet 4 coding performance at 1/3 cost and 2x speed • 200K context window |
Gemini 2.5 Flash-Lite |
350-400 TPS |
Fastest proprietary model • 50% fewer output tokens for extreme cost efficiency |
|
Ministral 8B |
Mistral AI |
250-300 TPS |
Edge-optimized with 128K context • Outperforms Llama 3.1 8B at $0.10/1M tokens |
Liquid LFM 2.5 |
Liquid AI |
300-370 TPS |
Non-transformer architecture • Constant speed regardless of context length |
Reasoning models
Reasoning models are a category of LLM that spend additional compute "thinking" through a problem before producing a final response. Rather than generating an answer immediately, they produce an internal chain-of-thought – working through steps, checking logic, and reconsidering – before arriving at a conclusion. This makes them substantially better at complex mathematical, logical, and multi-step problems, at the cost of higher latency and token usage.
Notable reasoning models include OpenAI’s o1 and o3 series and the open-weight DeepSeek-R1.
Effort levels
Where reasoning models can think before answering, an effort level (sometimes "reasoning effort" or "thinking level") is the dial that controls how much they do so. It drives adaptive reasoning: rather than a fixed amount of thinking, the model decides whether and how much to reason on each step, based on the complexity of the task at hand. Lower effort is faster and cheaper for straightforward work; higher effort buys deeper reasoning on complex problems.
An effort level is a behavioral signal, not a strict token cap. It nudges the model toward more or less deliberation — trading token spend against capability — rather than hard-limiting the number of tokens it may use.
The dial is exposed differently by different tools — for example as a reasoning_effort parameter on some APIs, or as a configurable effort level in coding agents such as Claude Code. As a rough guide for picking a level:
-
Low — mechanical work: renames, greps, formatting fixes, simple lookups.
-
Medium / high — most everyday coding and analysis.
-
Top levels — reserve for genuinely hard problems where deep reasoning pays off, since they burn considerably more tokens.
This is the same idea as the thinking level column in the task-matching table above: effort is a separate dial from model tier. Turning it up helps on problems with a verifiable chain of logic (maths, code, planning), while adding latency and cost for little benefit on language tasks.
Specialist models
Specialist models are fine-tuned for a narrow domain or task, rather than general-purpose use. Perplexity’s Sonar model is an example.
| Category | Details |
|---|---|
Architecture |
Built on Llama 3.3 70B • Search-optimized with web retrieval and citation generation |
Model size |
70 billion parameters |
Price |
Sonar: ~$1/1M input tokens, ~$1/1M output tokens • Sonar Pro: ~$3/1M input tokens, ~$15/1M output tokens • Additional request fees based on search context size |
Speed |
Ultra-high throughput: 1,200 TPS on Cerebras infrastructure • ~10x faster than comparable models like Gemini 2.0 Flash |
Context window |
Sonar: 128K tokens • Sonar Pro: 200K tokens • Sonar Pro Reasoning: 128K tokens |
Benchmarks |
Simple QA: Sonar 77.3% Sonar Pro 85.8% • Outperforms Llama 3.3 70B Instruct and GPT-4o in answer factuality and readability |
Multimodality |
Text only • Specialized for web search with real-time information retrieval, grounded citations, and source customization |
Open-weight models
Open-weight models have publicly released weights, meaning anyone can download and run them. This is distinct from open-source, which would also require the training data and code to be public – most open-weight models do not meet that bar.
Licensing terms vary. Some permit commercial use, others restrict it.
The main consideration when choosing between open-weight models is the model size. Model size is commonly expressed as the number of trainable parameters, in billions – for example, a 7B model has 7 billion parameters. More parameters generally means greater capability but higher memory and compute requirements. A 7B model can run comfortably on a consumer GPU; a 70B model requires a high-end workstation or server; while a 405B model requires multiple high-end GPUs.
Open-weight models, which can be downloaded and run on your own computer or server, are a good choice for privacy-sensitive use cases, offline environments, or experimenting with open models without incurring API costs.
Open-weight models can also be customized through further fine-tuning and through RAG. Perplexity’s Sonar model is an example of an open-weight model that has been fine-tuned for specific use cases.
Hugging Face is the primary hub for discovering and downloading open-weight models, hosting over two million models contributed by individuals, research labs, and major organizations. The other major resource in this space is Ollama’s model database, which is a more curated library of models that are proven to be stable with the Ollama model manager.
Hardware requirements
It is useful to think of open-weight models as falling into a spectrum of size tiers, with each tier having different minimal hardware requirements:
-
Ultra-light (1–3B) — Run on almost any machine with 8 GB+ RAM. Capable of basic chat and simple tasks. Examples: Gemma 3 1B, Llama 3.2 1B/3B, Qwen3 1.7B, DeepSeek R1 1.5B.
-
Small (3–8B) — A good balance of quality and speed. Need ~8–16 GB RAM. The recommended starting point for most local setups. Examples: Phi-4 Mini (3.8B), Gemma 3 4B, SmallThinker 3B, Mistral 7B, Qwen3 8B, Llama 3.1 8B.
-
Medium (8–13B) — Well-rounded general-purpose models. Run well on 16–24 GB RAM with a mid-range GPU. Examples: Gemma 3 12B.
-
Large (30–70B) — High capability. Require at least 32 GB RAM and a dedicated GPU with 24+ GB VRAM. Examples: GPT-OSS 20B, Qwen 32B.
-
Very large (70B+) — Near-frontier capability. Require high-end workstations with at least 64 GB RAM and multi-GPU setups. Require at least 64 GB each of RAM and VRAM. Examples: Mistral Small 4, Llama 70B.
|
To estimate whether your own hardware can run a given model — accounting for quantization level (Q4–F16), context length, and expected token throughput (TPS) — try the Local AI VRAM Calculator & GPU Planner. It maps a hardware setup and use case to model recommendations, with the caveat that the figures are planning estimates, not performance guarantees. |
Custom hardware like the Nvidia DGX Spark give you the ability to run and train large language models on your own hardware.
Open-weight model families
As of 2026, some notable open-weight model families include:
| Several of these models use a mixture-of-experts (MoE) architecture. Rather than activating the entire network for every token, an MoE model routes each token through only a small subset of specialized "expert" subnetworks. This lets a model carry a very large total parameter count while activating only a fraction of it per token — high capacity at a lower inference cost. |
-
Llama models from Meta. The Llama family did much to kick-start the open-weight era, and its widespread adoption has made the underlying architecture a de facto reference point that many other open models build on. The Scout variant is known for its massive 10-million token context window. Released under Meta’s community license, which permits commercial use but imposes some restrictions (including for very large deployments).
-
Gemma models from Google. These are Google’s open-weight counterparts to its proprietary, frontier Gemini models. Gemma models are optimized to run on consumer hardware.
-
Qwen models from Alibaba Cloud are available in a wide range of sizes, so there’ll always be one variant that you can run on your hardware.
-
Mistral models from Mistral AI, a French company that produces both open-weight models and proprietary models accessible via web service APIs. Their Mixtral models helped popularize the mixture-of-experts (MoE) architecture in open models.
-
Phi models from Microsoft. A family of small language models (SLMs) that punch well above their size on reasoning and maths, achieved largely by training on high-quality, "textbook-quality" filtered and synthetic data rather than sheer scale. Optimized for on-device and edge use, and released under the permissive MIT license. Recent releases include Phi-4 (14B) and the compact Phi-4-mini, with multimodal variants.
-
gpt-oss models from OpenAI.
gpt-oss-120bandgpt-oss-20b, released under the permissive Apache 2.0 license in August 2025, were OpenAI’s first open-weight models since GPT-2. -
Kimi models from Moonshot AI. Kimi K2.5 (released January 2026) is an open-weight mixture-of-experts model with 1T total parameters (32B activated per token) and a 256K-token context window. It matches frontier models on several benchmarks and is multimodal across text, code, image, and video (understanding only, no generation). Kimi 2.5 leads in mathematical reasoning and is optimized for "agent swarming".
-
GLM models from Z.ai (formerly Zhipu AI). The GLM-4.5 and GLM-4.6 releases are strong open-weight MoE models, well-regarded for their "interleaved thinking" and demonstrated capabilities in both agentic and vibe coding workflows.
-
DeepSeek models from DeepSeek AI. This AI lab made waves in early 2025 when DeepSeek-R1 demonstrated near-frontier reasoning performance at a fraction of the training compute cost that had previously been assumed was necessary to develop frontier models. Its "engram" memory system has been proven to maintain high coding performance across 1M+ token contexts.
-
MiniMax models from MiniMax AI, which also develops closed-weight models. MiniMax M2.5 is well-regarded, posting near-frontier agentic and coding scores (~80% on SWE-Bench Verified) at an exceptionally low cost.
Nvidia continues to expand its open-weight offerings, most recently with the Nemotron 3 family (Nano, Super, and Ultra variants).
| "Open-weight" is not the same as "open-source". Most of these families release only their trained weights, not the training data or full reproduction recipe, and some ship under restrictive custom licenses. Llama and Gemma, for example, carry usage restrictions, whereas Qwen, DeepSeek, and Mistral’s open releases mostly use permissive Apache 2.0 or MIT terms. |
Open-weight coding models
The following open-weight models are highly regarded for programming tasks. Small models (≤ 8B) run comfortably on a laptop or CPU-only setup; larger models require a GPU with ≥ 16–24 GB VRAM.
-
Qwen2.5-Coder 32B Instruct — Alibaba’s dedicated coding model that consistently tops open-weight coding benchmarks. Strong across Python, JavaScript, Go, and SQL. Supports a 32K context window. Apache 2.0 license.
-
DeepSeek-Coder-V2 Instruct — A Mixture-of-Experts model (236B total, 21B active) with very strong multi-language coding performance and a 128K context window. One of the most capable open-weight options for complex, multi-file tasks. MIT license.
-
Codestral 22B — Mistral AI’s dedicated code model, optimized for low-latency code completion and fill-in-the-middle (FIM) tasks. Well suited to IDE integration. Supports 32K context. Mistral AI Non-Production License (commercial use requires a separate agreement).
-
StarCoder2 15B — The successor to StarCoder, trained on The Stack v2. Strong on Python, JavaScript, and C/C++, and well suited to code search and autocompletion tasks. Apache 2.0 license.
-
Code Llama 70B Instruct — Meta’s instruction-tuned coding variant of Llama 2, with a 16K context window and strong performance on multi-language tasks and large-scale refactoring. Meta’s non-commercial license restricts use to research.
Model access layers
Although it is possible to interact with models directly, for most use cases you will interact with models via an access layer. The access layer is responsible for handling authentication, response formatting, and other details that are specific to a particular model provider or use case.
There are two useful lenses on your options. The first is where the model runs, and who operates the infrastructure — the deployment patterns described next. The second is the tooling you use to reach the model: broadly, model gateways (for hosted models) and model managers (for running open-weight models locally), each covered in its own subsection below.
Deployment patterns
There are five broad patterns for running and accessing LLMs, spanning a spectrum from full local control to fully managed services:
-
Local — The model runs on your own machine, with direct access to your data. Model managers such as Ollama LM Studio, and Open Web UI (see Model managers, below) handle downloading, hardware management, and serving a local API. Best for privacy, offline use, and experimentation, with no per-token cost — but you are limited by your own hardware.
-
VPS + rented GPU — You run your own models on infrastructure you rent, then expose them over an API (for example, Ollama behind an Nginx reverse proxy). Hosting providers include DigitalOcean and RunPod. You keep full control over the model and the data path while offloading the hardware, but you own the operations — scaling, uptime, and security. Higher up-front costs, but cheaper in the long-term than managed cloud…
-
Managed cloud — Your model runs on a cloud provider’s managed ML platform, which handles provisioning, scaling, and serving. Examples include Amazon SageMaker, AWS Bedrock, Google Vertex AI, Azure ML, Modal, and Beam. Well suited to organizations already invested in a particular cloud, and to enterprise governance and data-residency requirements. Also good for unpredictable workloads, and for deploying fine-tune custom models with autoscaling. Trade-offs: complex setup, steep learning curve, and high costs and low scale.
-
Managed inference API — You call a provider’s hosted endpoint with just an API key, choosing from their model library and paying per use or by subscription. Examples include Groq, Together AI, Fireworks AI, Replicate, OpenRouter, and the Hugging Face Inference API. There is no infrastructure to manage, but you have the least control over where and how the model runs. Great for indie hackers and shipping fast. Free tiers are available, and you can experiment with lots of model variants. (Model gateways, below, are the tooling for this pattern.)
The first-party APIs of the frontier labs — Anthropic, OpenAI, and Google’s Gemini API — are a variant of this pattern. The difference is that you are calling the lab’s own proprietary model directly, rather than choosing an (often open-weight) model from a third party’s library. This is typically the only way to access the very latest frontier models, but it ties you to a single vendor.
-
Edge / on-device — The model is embedded directly into your application and ships to your users' devices, running entirely client-side with no server. Tooling includes WebLLM, Apple MLX, MLC LLM, llama.cpp, ONNX Runtime, and Transformers.js. Best for privacy, low latency, and offline support, but constrained by the capabilities of the user’s device – typically, you’e looking at ultra-light models, 1-3B parameters maximum.
Model gateways
Model gateways provide a single, unified API endpoint, and/or a GUI or other user interface, to access models from multiple hosted LLM providers. Rather than integrating directly with each provider’s SDK and managing separate authentication, response formats, and failover logic, developers integrate once with the gateway, which handles routing, fallback, cost optimization, and observability across providers.
-
OpenRouter – Unified API for 300+ models across multiple providers, with per-request routing, cost tracking, and usage rankings.
-
Anannas – OpenAI-compatible API gateway with intelligent routing, automatic failover, and real-time cost dashboards.
-
Hugging Face Inference API – Managed inference endpoints for models hosted on Hugging Face, with serverless and dedicated deployment options.
-
Nvidia’s Build site includes free endpoints to access open-weight models.
-
Perplexity – AI-native search engine and general-purpose AI assistant. Provides access to a large number of models, and you can even experiment with multiple models simultaneously, feeding them the same inputs.
-
LiteLLM – Open-source gateway that exposes 100+ providers (OpenAI, Anthropic, Gemini, Bedrock, Azure, and more) through a single OpenAI-format API. Available as a Python SDK for direct integration, or as a self-hosted proxy server (its "AI Gateway") for organization-wide access, with virtual API keys, per-project cost tracking, automatic fallbacks, and load balancing. Unlike the hosted services above, you run it yourself.
-
OpenCode Zen – Curated gateway from the OpenCode team offering a vetted set of coding-benchmarked models (OpenAI, Anthropic, Google, Qwen, DeepSeek, and others) with pay-as-you-go per-token pricing and bring-your-own-key support. Integrates as an optional provider inside the OpenCode coding agent.
|
A single subscription to a gateway service like Perplexity Pro or OpenRouter can be more economical than maintaining separate subscriptions with multiple AI labs. These services give you access to models from OpenAI, Anthropic, Google, and others through one account, letting you switch between them based on the task at hand. |
Model managers
Model managers are primarily designed for running open-weight models locally. They handle downloading of models, managing hardware resources, and serving a local inference endpoint – typically an OpenAI-compatible API on localhost. Essentially, model managers wrap an inference engine with all the additional tools needed to managing and running large-language models.
By contrast, model API gateways (above) are middleware for routing requests to remote, hosted models across multiple cloud providers.
The two most popular model managers are:
Ollama is designed to run LLMs as a background service (daemon) on your machine. It provides a clean CLI for managing models and a local API (compatible with OpenAI’s API) that allows other applications — like IDE plugins, web UIs, or custom scripts — to communicate with the models you’ve downloaded. Ollama is also a model hosting service. It maintains a curated library of models, many of which have been optimized using quantization, allowing large models to run on consumer hardware.
LM Studio is a desktop GUI application that provides an "all-in-one" visual interface for discovering, downloading, and chatting. While it also provides a local server for API access, like Ollama, its primary value is its user-friendly dashboard that allows you to tweak hardware settings (like GPU offloading), monitor system resources in real-time, and chat with models without needing a terminal or other front-end integration. LM Studio integrates with the Hugging Face model library, allowing easy access with every available open-weight model.
Both Ollama and LM Studio use llama.cpp as their underlying inference engine.
A related but distinct tool is Open WebUI — a self-hosted, extensible web front-end for LLMs rather than a model manager in its own right. It provides a polished, ChatGPT-like browser interface and connects to a backend (most commonly Ollama, or any OpenAI-compatible API) to serve the actual models. It is often run alongside Ollama to give a local model stack a clean, multi-user web UI.
Fine-tuning tools and services
A number of free and commercial tools and services are available for fine-tuning open-weight models. These tools roughly split into two categories: DIY toolkits you run yourself (free, but you bring the GPU), and managed services (you pay, they handle infra). The choice depends on whether you have the compute, how much you want to learn, and whether your data can leave your network.
Free and open-source toolkits
-
Unsloth Studio is the standout for getting started on consumer hardware. It rewrites the attention kernels and gradient check-pointing to roughly halve VRAM use and double training speed versus stock Hugging Face models. The end result is that a 7B model fits comfortably in 12 GB; a 70B with QLoRA fits in 48 GB. Free for personal use. Their commercial offering adds multi-GPU support.
-
Axolotl is the workhorse for serious workflows. Configuration-driven (YAML), supports virtually every architecture and training method (full FT, LoRA, QLoRA, DPO, ORPO, multi-GPU via DeepSpeed/FSDP). Steeper learning curve than Unsloth but far more flexible.
-
LLaMA-Factory sits between the two. It is UI-driven (Gradio web interface) so you can fine-tune without writing config files, while still supporting most popular architectures and methods. Good middle path for teams new to fine-tuning.
-
Hugging Face TRL + PEFT are the underlying libraries everything else builds on. PEFT implements LoRA/QLoRA/prefix tuning; TRL adds RLHF, DPO, SFT trainers. Use these directly when you want full programmatic control.
-
nanochat and Autoresearch are Python "experiment harnesses" for training LLMs. The basic idea is that you give an AI agent a small training setup and let it experiment autonomously overnight. In short cycles, it trains, modifies the code, checks if the result improved, keeps or disregards, and repeats. You wake up to a log of experiments and, hopefully, a better model. In effect, these projects replace human researchers with an AI agent to do the fine-tuning. Autoresearch is a simplified version of Nanochat designed to run on a single consumer Nvidia GPU. Both projects are maintained by Andrej Karpathy, former AI director at Tesla and a founding member of OpenAI.
-
Torchtune (deprecated) was PyTorch’s official, opinionated fine-tuning library.
For training data, Argilla (now part of Hugging Face) is an open-source data curation and labelling platform for AI engineers and domain experts to maintain high-quality data sets, while distilabel is for synthetic data generation.
Other supporting tools include Weights & Biases, a software-as-a-service for tracking experiments and monitoring model checkpoints during the iterative development of models. MLFlow is similar but is fully self-hostable.
Commercial toolkits and managed services
-
Together AI is probably the most developer-friendly. It has a clean API, 200+ open models, transparent per-token billing (training tokens × epochs). Good for production workloads where you don’t want to babysit GPUs.
-
Fireworks AI — similar positioning to Together, strong on inference performance after training. Per-token training pricing.
-
Hugging Face AutoTrain — codeless web UI. Easiest entry point if you’ve never fine-tuned anything, but less flexible than the other options.
-
Modal, Runpod, and Northflank are GPU-rental platforms (aka, serverless GPU clouds) rather than fine-tuning products specifically. You bring your own Axolotl/Unsloth scripts and rent A100s/H100s by the second. Cheapest option if you can write code.
-
Amazon SageMaker / Gemini Enterprise Agent Platform (formerly Google Vertex AI) / Azure ML — enterprise managed options, which can be good options if you’re already invested in those particular public cloud services.
-
Predibase and OpenPipe are workflow-focused commercial offerings that handle data prep, training, and serving as a single pipeline. Predibase originated from Ludwig (open source) and layers a commercial product on top.
-
Prem Studio keeps training data, weights, and inference on infrastructure you control — useful for regulated industries where managed cloud platforms that route data through shared infrastructure (eg. Together standard tier, HF AutoTrain, OpenPipe, Fireworks) aren’t suitable.
See also
-
Machine learning – The broader field of which LLMs are a part.
-
Transformer architecture – The neural network architecture that underpins all modern LLMs.
-
Generative AI – Covers broader AI product categories including diffusion models for image generation.
-
AI assistant – General-purpose conversational interfaces to LLMs, such as ChatGPT, Claude, and Perplexity.
-
AI agent orchestration – Frameworks and platforms for building and coordinating agents (LangChain, CrewAI, and others).
References
-
OpenAI Cookbook, OpenAI – Example code, recipes, and best practices for working with the OpenAI APIs.
-
LLM Engineer Handbook – Curated collection of links and resources for AI engineers.
Books
-
Build a Large Language Model (from Scratch), Sebastian Raschka – Builds a transformer in raw PyTorch, layer by layer.
-
Hands-On Large Language Models, Jay Alammar & Maarten Grootendorst – Visual, practical guide to LLM applications.
-
LLM Engineer’s Handbook, Paul Iusztin & Maxime Labonne – Production LLMOps: fine-tuning, quantization, serving.
-
The Hundred-Page Language Models Book, Andriy Burkov – Concise, math-grounded path from n-grams to transformers.