Ahmad on X: "LLMs 101: A Practical Guide (2026 Edition)"

May 22, 2026

Start with the loop. Text becomes tokens. Tokens move through a Transformer. Attention decides which earlier tokens matter. The runtime keeps a KV cache so the model does not recompute the whole conversation every time. Then the model picks the next token and does it again.

> A practical guide to how LLMs work, how models think 1 token at a time, and how to run them locally.

Once that loop clicks, the hardware and software choices become easier to reason about. VRAM, quantization, context length, chat templates, decoding, RAG, serving engines, and model selection all fall out of the same mechanics.

> Start with the loop: Tokens in, probabilities out, one next token at a time. Weights tell the model what patterns it learned. Context tells it what it is looking at now. The KV cache is the working memory that keeps the loop usable. Hardware, runtimes, and model selection only make sense after you understand the memory, context, and formatting rules the model is obeying.

The goal is to make local LLM mechanics intuitive first, then give you a practical path into hardware, runtimes, serving, and current LLM research as of May 21, 2026.

This is a model-first guide. It starts with the mechanics: inference, tokens, Transformers, attention, KV cache, prefill, decode, decoding controls, model packages, chat templates, model types, long context, RAG, agents, fine-tuning, and multimodal models.

After that, it moves into the local deployment layer: what local really means, quantization, VRAM math, hardware tiers, runtime choices, serving modes, licenses, model selection, privacy, troubleshooting, benchmarks, setup paths, and practical use cases.

That order matters. You should understand why a long prompt costs memory before choosing a GPU. You should understand why chat templates matter before judging a model. You should understand why decode is sequential before caring about tokens per second.

For the deeper hardware and software path, I have a three-part series teaching self-hosted LLMs / local AI:

- Part 1: . - Part 2: . - Part 3: .

The first two pieces explain the hardware capacity and bandwidth math. The third explains the software layer that turns that hardware into usable inference. This article gives you the model-side foundation first, then points back to those deployment layers once the mechanics are clear.

Running a model is called inference. For a standard decoder-only LLM, inference is the same loop repeated over and over:

1. Convert your text into tokens.
2. Feed those tokens into the model.
3. Compute scores for every possible next token.
4. Choose one token with a decoding policy.
5. Append that token to the sequence.
6. Repeat until the model stops, the user stops it, or a token limit is reached.

The model is not writing a whole answer in one shot. It is generating one token at a time. Every new token becomes part of the sequence that influences the next token.

Mathematically, the model is a learned function:

> f(theta, sequence) -> probability distribution over next_token

- theta means the model weights.
- sequence means the prompt plus generated tokens so far.
- Logits are the raw scores before softmax.
- Probabilities are the normalized scores after softmax.
- Decoding turns those probabilities into one selected token.
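The six steps and the logits-versus-probabilities distinction above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not a real model: `VOCAB` and `toy_logits` are hypothetical stand-ins for a tokenizer vocabulary and a Transformer forward pass, and the toy simply favors whichever vocabulary entry comes next.

```python
import math
import random

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat"]

def toy_logits(token_ids):
    """Stand-in for a forward pass: one raw score (logit) per vocabulary entry.

    The toy favors the entry that follows the last token in VOCAB order,
    wrapping around to <eos> after the final entry.
    """
    favored = (token_ids[-1] + 1) % len(VOCAB)
    return [5.0 if i == favored else 0.0 for i in range(len(VOCAB))]

def softmax(logits):
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_ids, max_new_tokens=8, greedy=True, seed=0):
    rng = random.Random(seed)
    sequence = list(prompt_ids)
    for _ in range(max_new_tokens):              # token limit
        probs = softmax(toy_logits(sequence))    # score every next token
        if greedy:
            next_id = max(range(len(probs)), key=probs.__getitem__)
        else:
            next_id = rng.choices(range(len(probs)), weights=probs)[0]
        sequence.append(next_id)                 # the new token joins the context
        if VOCAB[next_id] == "<eos>":            # the model stops
            break
    return sequence

ids = generate([VOCAB.index("the")])
print(" ".join(VOCAB[i] for i in ids))
```

A real runtime does the same thing, except the forward pass is the expensive part and the KV cache spares it from reprocessing the whole sequence on every step.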

This is why local generation speed is measured in tokens per second. Your system repeatedly runs a forward pass, picks or samples a token, updates the KV cache, and continues.

Perception matters here. A long prefill means a long pause before the first word appears. Slow decode means the answer streams slowly. Local builders often obsess over decode speed because it is what users feel, but prefill time is what hurts when you paste a 10K-token document.

LLMs do not see raw text as words. They see tokens: Small chunks of text represented internally as integer IDs.

A token can be:

- A whole word: "hello"
- A word fragment: "inter", "national", "ization"
- A punctuation mark
- A whitespace-prefixed string
- A byte-level fallback
- A special control marker such as <|user|> or <|assistant|>

The tokenizer maps text to token IDs and token IDs back to text. Common tokenizer families include BPE-style tokenizers and SentencePiece-style tokenizers. Different model families use different tokenizers, and that matters. A 4,000-word document may be 5,000 tokens in one tokenizer and 7,500 tokens in another.

Vocabulary size matters too. A tokenizer with a larger vocabulary can compress some text into fewer tokens, but it also changes embedding and output-projection size. This is one reason tokens per second is not perfectly comparable across model families.
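To see why the same text can cost a different number of tokens under different vocabularies, here is a toy sketch. It uses greedy longest-match splitting with a single-character fallback, which is a simplification chosen for illustration and is not how BPE or SentencePiece actually work. Both vocabularies are hypothetical.

```python
def tokenize(text, vocab):
    """Toy greedy longest-match tokenizer with single-character fallback."""
    tokens = []
    longest = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:  # size 1 is the fallback
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical vocabularies: the larger one has learned a few longer pieces.
small_vocab = {"inter", "national", "ization"}
large_vocab = small_vocab | {"internationalization", " of"}

text = "internationalization of internationalization"
small_tokens = tokenize(text, small_vocab)
large_tokens = tokenize(text, large_vocab)

print(len(small_tokens), small_tokens)
print(len(large_tokens), large_tokens)
```

The larger vocabulary covers the same string in fewer tokens, which is the effect described above. In a real model, each extra vocabulary entry also adds a row to the embedding and output-projection matrices.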

Tokens matter because they determine:

- How much text fits in the context window.
- How large the KV cache becomes.
- How much latency you pay during prompt processing.
- Whether multilingual or code-heavy text is efficient.
- Whether the model sees special chat markers correctly.

A model's context window is the maximum number of tokens it can attend to at once. In 2026, common local-capable models range from 8K and 32K contexts to 128K, 256K, and even 1M-token contexts in server-class systems.

But supported context length is not the same as cheap, fast, or equally accurate context. A model that can technically handle 128K tokens may slow to a crawl at 64K and lose coherence at 100K. Always test the context lengths you actually plan to use.

Tokens are the unit of work. Once you understand that, long context stops looking magical and starts looking like a bill you can estimate.

Try an interactive tokenizer viewer to see how text gets broken into tokens in real time.

Most modern LLMs are based on the Transformer architecture. Most local chat LLMs are decoder-only Transformers: They predict the next token while looking back at previous tokens.

Everything above this point, including tokens, weights, config, and chat templates, is setup for the real engine underneath. The Transformer is the skeleton that moves the numbers around.

A simplified Transformer layer contains:

1. Token embeddings: Token IDs become vectors.
2. Positional information: The model needs token order. Many modern LLMs use RoPE (Rotary Position Embeddings), which encodes position by rotating representations.
3. Self-attention: Each token representation looks back at prior token representations and decides what matters.
4. MLP / feed-forward block: A dense nonlinear computation that expands and compresses representations. A large fraction of parameters live here.
5. Layer normalization and residual connections: These stabilize deep networks and help information flow through many layers.
6. Output projection: The final hidden state becomes logits over the vocabulary.

Stack this recipe dozens or hundreds of times and you get a language model.

Transformer recap: Tokens become vectors, attention connects the sequence, MLPs reshape the representation, RoPE keeps position straight, and the final projection turns the last hidden state into next-token logits.

Attention is how a token decides which earlier tokens matter for the next prediction. It is also one of the reasons local inference is so memory-sensitive.
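A minimal sketch of that idea is single-head scaled dot-product attention with a causal mask, written in plain Python so every step is visible. The tiny 2-dimensional vectors are made up for illustration. Real models use many heads, learned projections, and fused GPU kernels.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Single-head attention where token t only sees tokens 0..t.

    Each argument is a list of per-token vectors (lists of floats).
    Returns the per-token outputs and the attention weights used.
    """
    dim = len(queries[0])
    outputs, weights = [], []
    for t, q in enumerate(queries):
        # Score the current token's query against every earlier key (and its own).
        scores = [
            sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        w = softmax(scores)
        # The output is a weighted mix of the value vectors seen so far.
        out = [
            sum(w[s] * values[s][d] for s in range(t + 1))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Made-up vectors for three tokens.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = causal_attention(queries, keys, values)
for t, w in enumerate(weights):
    print(t, [round(x, 3) for x in w])
```

The `keys` and `values` lists are exactly what a runtime keeps around between steps, so that generating a new token only requires computing one new query row.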

Classic MHA (multi-head attention) stores separate key/value state for many heads. It gives the model flexibility, but it makes the KV cache large.

Modern local models often use more efficient attention designs:

- MQA: Multiple query heads share one key/value head. It is memory-efficient, but can be less expressive.
- GQA: Groups of query heads share key/value heads. It is the common middle ground in many current local models.
- MHA: Full multi-head attention. It can be strong, but long context gets expensive quickly.
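The difference between the three designs comes down to how many key/value heads exist and which query heads share them. The sketch below assumes contiguous grouping (query head `h` uses KV head `h // group_size`), which is a common convention but an assumption here. The head counts are illustrative.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Return the KV head a given query head reads from (contiguous grouping)."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly across KV heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

def sharing_map(num_query_heads, num_kv_heads):
    return [
        kv_head_for(h, num_query_heads, num_kv_heads)
        for h in range(num_query_heads)
    ]

# Illustrative head counts: 8 query heads in every case.
mha = sharing_map(8, 8)  # every query head has its own KV head
gqa = sharing_map(8, 2)  # groups of 4 query heads share a KV head
mqa = sharing_map(8, 1)  # all query heads share one KV head

print("MHA", mha)
print("GQA", gqa)
print("MQA", mqa)
```

Only the distinct KV heads are stored in the cache, so going from 8 KV heads to 2 cuts that part of the memory bill by a factor of four.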

Modern kernels such as FlashAttention and SDPA-style implementations reduce attention memory traffic and keep the GPU busier. A runtime with good attention kernels can be dramatically faster than one without, even on the same model and hardware.

This is why two 7B models can behave very differently at long context. Parameter count is not the whole story. A 7B MHA model at 128K context can exhaust a 24 GB GPU, while a 7B GQA model with the same advertised context may fit with room to spare.

When comparing models, look at attention type, KV heads, context length, and runtime support, not just parameter count.

The KV cache is the model's working memory during generation. It stores key/value attention states for previous tokens so the model does not recompute the entire history from scratch on every generated token.

Without a KV cache, generation would be brutally inefficient. With a KV cache, generation is usable, but the cache consumes memory proportional to:

> tokens x layers x kv_heads x head_dim x precision x 2

The x 2 is for keys and values.

A useful rule of thumb for older Llama-like 7B MHA models is roughly 0.5 MiB per token in FP16 KV cache. That means 4K tokens can cost around 2 GiB just for KV cache. At 32K tokens, you may be looking at 16 GiB of KV cache alone.
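The rule of thumb can be checked directly against the formula. The shape used below (32 layers, 32 KV heads, head dimension 128) is an assumption matching an older Llama-style 7B MHA model. The GQA variant with 8 KV heads is hypothetical and only there for comparison.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """tokens x layers x kv_heads x head_dim x precision x 2 (keys and values)."""
    return tokens * layers * kv_heads * head_dim * bytes_per_value * 2

# Assumed shape of an older Llama-like 7B MHA model, FP16 cache (2 bytes per value).
mha = dict(layers=32, kv_heads=32, head_dim=128, bytes_per_value=2)

per_token_mib = kv_cache_bytes(1, **mha) / MIB
at_4k_gib = kv_cache_bytes(4096, **mha) / GIB
at_32k_gib = kv_cache_bytes(32768, **mha) / GIB

# Hypothetical GQA variant: same shape, but only 8 KV heads.
gqa = dict(mha, kv_heads=8)
gqa_32k_gib = kv_cache_bytes(32768, **gqa) / GIB

# Same GQA variant with an 8-bit cache (1 byte per value).
gqa_8bit_32k_gib = kv_cache_bytes(32768, **dict(gqa, bytes_per_value=1)) / GIB

print(per_token_mib, at_4k_gib, at_32k_gib, gqa_32k_gib, gqa_8bit_32k_gib)
```

Plug in the layer count, KV head count, and head dimension from a model's config to estimate its cache before you load a long document.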

Newer GQA/MQA models reduce this substantially. Some runtimes also support FP8 or INT8 KV cache. That is often the practical compression floor I would recommend for local users in 2026.

Do not treat sub-8-bit KV cache as a default. Research systems and newer compressed-cache kernels show that 2-bit to 4-bit KV can work with careful algorithms, calibration, and custom kernels. That is not the same as casually flipping a Q4 KV toggle in a desktop runtime. Below 8-bit, benchmark hard, especially for coding, tool calls, JSON, long-context retrieval, and tasks where exact earlier tokens matter.

Also do not confuse KV-cache quantization with speculative decoding. Speculative-decoding methods, one of which is often shortened informally to DTree, attack decode latency by drafting future tokens and verifying them. They can improve speed, but they do not erase the KV-cache memory bill.

This is why a model can fit at an empty prompt but crash when you load a long document. The weights fit. The working memory did not.

LLM inference has two different performance regimes: prefill and decode.

Prefill processes the prompt you gave the model. If you paste a 20,000-token document, the model must process those 20,000 tokens before it can produce the first answer token. Prefill is relatively parallelizable, so GPUs can handle it efficiently, but it can still be expensive.

The time you spend waiting for the first token to appear is usually prefill time.

Decode generates new tokens one at a time. Each generated token depends on the sequence so far, so decode is much more sequential. This is where the streaming typing effect comes from, and it is usually the phase that determines whether a model feels fast or slow.

Long prompts punish prefill. Long answers punish decode. Long conversations punish both because the KV cache grows.
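A rough way to reason about the two regimes is to treat each as its own throughput number. The sketch below is a back-of-the-envelope estimate only: the throughput figures are hypothetical placeholders, and it ignores the fact that decode tends to slow down as the KV cache grows.

```python
def estimate_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (time_to_first_token, decode_time, total) in seconds.

    prefill_tps and decode_tps are tokens per second for each phase;
    measure them on your own hardware rather than trusting these examples.
    """
    time_to_first_token = prompt_tokens / prefill_tps
    decode_time = output_tokens / decode_tps
    return time_to_first_token, decode_time, time_to_first_token + decode_time

# Hypothetical machine: 2,000 tok/s prefill, 25 tok/s decode.
long_prompt = estimate_latency(20_000, 500, prefill_tps=2000.0, decode_tps=25.0)
short_prompt = estimate_latency(200, 500, prefill_tps=2000.0, decode_tps=25.0)

print("20K-token prompt:", long_prompt)
print("200-token prompt:", short_prompt)
```

With these placeholder numbers, the long prompt adds a ten-second pause before the first token even though the streaming speed afterward is identical.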

In a chat session, every turn adds to the cache. If you let a conversation run to 16K tokens, you are paying the memory cost for all 16K tokens on every new token generated. This is why chat UIs that keep infinite history eventually slow down or crash.

After the model produces logits, it has not written anything yet. It has only scored every possible next token. Decoding is the policy that turns those scores into one actual token, appends that token to the context, and repeats the loop.

The sampler can choose tokens in several ways. It can pick the highest-probability token every time. It can sample from a narrowed set of likely tokens. It can penalize repetition. It can stop at a delimiter. It can use a fixed seed so the same prompt behaves reproducibly.

These choices do not change the model weights, but they change the model's voice, determinism, creativity, risk profile, and tendency to loop.

The important knobs answer three practical questions:

- Randomness: How much variation is allowed?
- Tail reach: How far into lower-probability tokens can the sampler go?
- Boundaries: What prevents loops, rambling, schema breaks, or runaway output?

For precise work, start narrow: low temperature, short max-token limits, explicit stop sequences, and constrained decoding when output must match JSON or a schema. For creative work, give the sampler more room with higher temperature, top-p, and multiple candidates ranked afterward. For coding, keep the first pass conservative, then sample alternatives only when you are intentionally exploring.
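Temperature and top-p are the two knobs most people touch first. The sketch below shows one plausible way to combine them in plain Python. Real runtimes differ in ordering and edge-case handling, so treat the function name and behavior as illustrative rather than as any particular engine's implementation.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick a token index from raw logits using temperature and top-p.

    temperature <= 0 is treated as greedy decoding (an assumption of this sketch).
    """
    if temperature <= 0:
        return max(range(len(logits)), key=logits.__getitem__)

    # Temperature: lower values sharpen the distribution, higher values flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the smallest set of most likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample among the survivors; random.choices renormalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept])[0]

logits = [2.0, 1.0, 0.1, -1.0]
rng = random.Random(0)  # fixed seed for reproducible draws
greedy_pick = sample_next(logits, temperature=0)
draws = [sample_next(logits, temperature=1.0, top_p=0.9, rng=rng) for _ in range(200)]
print(greedy_pick, sorted(set(draws)))
```

With these logits and `top_p=0.9`, the least likely token falls outside the kept set and is never drawn, which is what "tail reach" means in practice.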

Greedy decoding is not always more accurate. It is often brittle. A greedy decoder can get stuck in loops or produce generic answers because it never explores alternatives. For evals, use deterministic settings. For ideation, let the model breathe.

A runnable local LLM is more than one big weight file. A model package usually includes:

- Architecture/config: Layer count, hidden size, attention type, RoPE settings, vocabulary size, special tokens, and context length.
- Weights: The learned parameters, often stored as safetensors, GGUF, GPTQ, AWQ, EXL2, or another runtime-specific format.
- Tokenizer: The rules that turn text into token IDs and token IDs back into text.
- Chat template: The exact markup for system, user, assistant, tool, and reasoning messages.
- Generation config: Defaults for temperature, top-p, stop tokens, repetition penalties, and max tokens.
- License and model card: The legal and operational instructions for how the model can be used.

The weights are the largest file, but they are not the whole model. If the tokenizer, config, or chat template is wrong, the same weights can feel broken.

The package section tells you what has to travel together. The next section explains why the chat template is the part people most often break.

A chat model was trained with a specific conversation format. For example, it may expect something like:

> <|system|> You are a helpful assistant. <|user|> Explain KV cache. <|assistant|>

Another model may expect:

> [BOS] [INST] Explain KV cache. [/INST]

Another may use ChatML-style markers. Another may require special reasoning tokens. Another may need tool-call XML or JSON wrappers.

Using the wrong format can cause gibberish, role confusion, ignored system prompts, repeated prompts, refusal weirdness, broken tool calls, bad benchmark results, and conclusions that the model is dumb when the template is the actual bug.

- Use the tokenizer's apply_chat_template when using Transformers.
- Use model-specific templates in Harbor-backed frontends, llama.cpp, LM Studio, vLLM, or SGLang.
- Check whether the model is base, instruct, chat, reasoning, or tool-tuned.
- Ensure BOS/EOS tokens are correct.
- Keep system prompts short unless they need to be long.
- For tool use, follow the exact schema expected by the model/runtime.

If you are building an application that lets users switch models, you need template switching too. Hardcoding one template format and then loading a model that expects another is a common source of bad local-model evals.
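Template switching can be as simple as a lookup from template name to a render function. The sketch below reproduces the two example formats shown earlier in simplified form. The template names and functions are hypothetical, and real model templates carry more detail (exact whitespace, BOS/EOS handling, tool and reasoning roles), so prefer the template shipped with the model when one exists.

```python
def render_role_tags(messages):
    """Render messages as <|role|> content, then open the assistant turn."""
    parts = [f"<|{m['role']}|> {m['content']}" for m in messages]
    parts.append("<|assistant|>")
    return " ".join(parts)

def render_inst(messages):
    """Render user turns as [INST] ... [/INST].

    This simplified sketch supports user messages only and refuses
    anything else instead of silently dropping it.
    """
    for m in messages:
        if m["role"] != "user":
            raise ValueError(f"unsupported role for this template: {m['role']}")
    return "[BOS] " + " ".join(f"[INST] {m['content']} [/INST]" for m in messages)

# Hypothetical registry: in a real app the key would come from model metadata.
TEMPLATES = {
    "role_tags": render_role_tags,
    "inst": render_inst,
}

def build_prompt(template_name, messages):
    if template_name not in TEMPLATES:
        raise KeyError(f"no chat template registered for {template_name!r}")
    return TEMPLATES[template_name](messages)

chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain KV cache."},
]
print(build_prompt("role_tags", chat))
print(build_prompt("inst", [chat[1]]))
```

Failing loudly on an unknown template or unsupported role is a deliberate choice: a silent fallback to the wrong format is exactly how bad local-model evals happen.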

Treat the template like an API contract. If you get it wrong, you are not really testing the model you think you are testing.

Not all LLMs are tuned for the same behavior.

For most users, the default starting point should be a recent instruct/chat-tuned model in a size that fits comfortably in memory.

Do not start with a base model unless you know why. Base models complete your prompt rather than answer it. They are useful for researchers, fine-tuners, and people building custom pipelines. They are frustrating for everyone else.

If you ask a base model "What is the capital of France?", it might continue with "and what is the population of Paris?" instead of answering "Paris."

The practical split is simple:

- Base model: Good for pretraining research, fine-tuning, and custom pipelines.
- Instruct model: Good for direct instruction following.
- Chat model: Good for multi-turn dialogue with role formatting.
- Reasoning model: Good when the task benefits from extra thinking tokens and verification.
- Tool-tuned model: Good when structured calls, JSON, or function use matters.

A local LLM is a model whose weights and inference runtime are under your control. You decide what model runs, how it runs, what data it sees, and what happens to the outputs.

That freedom comes with work. You are now the ops team. You handle downloads, updates, compatibility, memory limits, and security. When something breaks, there is no support ticket to file. There is only you, the logs, and the documentation.

Local can mean any of the following:

- A 2B parameter model running on a phone.
- A 7B to 14B model running on a consumer GPU.
- A 30B to 70B model running on a high-end workstation.
- A sparse MoE model running on one or more datacenter GPUs.
- A private deployment using vLLM, SGLang, TensorRT-LLM, llama.cpp, Harbor, LM Studio, or a custom PyTorch stack.

The key point: local does not automatically mean offline, private, safe, cheap, or open source. It only means you are running the model yourself. A local app can still phone home. A model can be open-weight but not open source. A model can be local but unsafe to load. A quantized model can fit in memory but answer poorly.

The tradeoff is worth it when you need privacy, low latency, custom behavior, offline operation, or cost control at scale. It is not worth it when you need the absolute best model quality and do not have the hardware to match. In that case, a hosted API is the right tool.

Local LLMs are practical when you understand one equation:

> Local LLM success = model fit + correct prompt format + good runtime + realistic evals.

Everything else is details. The details matter.

Quantization stores weights in lower precision to reduce memory and sometimes improve throughput.

The 2026 rule of thumb for local users:

- FP16/BF16: Best quality when memory is abundant. Use it as a baseline for evaluation.
- Q8 / INT8: Near-lossless for many tasks, but still large. Good when you have VRAM and want minimal quality loss.
- Q6 / Q5: Excellent quality with moderate savings. This is a strong middle ground.
- Q4: The default consumer sweet spot for many chat and document workflows.
- Q3 / Q2: Only when you must fit a bigger model. Math, code, structured output, and tool use degrade first.
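For a first-pass size estimate, multiply parameter count by bits per weight. The sketch below uses nominal bit widths only, which is a simplifying assumption: real quantized formats store extra scale and metadata values, so actual files are somewhat larger, and this covers weights only, not the KV cache or runtime overhead.

```python
def weight_gb(params_billion, bits_per_weight):
    """Nominal weight size in decimal GB: parameters x bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8

# Nominal bit widths for the tiers listed above (real formats add overhead).
TIERS = {"FP16": 16, "Q8": 8, "Q6": 6, "Q4": 4, "Q2": 2}

sizes_7b = {name: weight_gb(7, bits) for name, bits in TIERS.items()}
for name, size in sizes_7b.items():
    print(f"7B at {name}: ~{size:.2f} GB of weights")
```

Whatever number comes out, leave headroom on top of it for the KV cache at the context length you actually plan to use.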

Weight quantization is not the same as KV-cache quantization. Weight quantization shrinks the model. KV-cache quantization shrinks the live context memory.

For KV cache, treat FP16/BF16 as the clean baseline and FP8/INT8 as the practical local compression floor. Below 8-bit is research-heavy and workload-sensitive. Use it only after measuring quality on your actual prompts.

Quantization failure shows up first in math, multi-step reasoning, code correctness, tool-use reliability, JSON/schema adherence, subtle instruction following, and long-context retrieval.

A smaller model at higher precision can beat a larger model crushed into too few bits. Do not worship parameter count. A 7B model at Q6 can beat a 13B model at Q2 on reasoning tasks, and once real quantization overhead is counted the two often land in a similar memory range.

safetensors is a safe tensor serialization format designed to store tensors without Python pickle behavior. Use safetensors when possible, especially for PyTorch/Transformers models.

Avoid random .bin files from untrusted sources. PyTorch pickle-based loading can execute arbitrary code during deserialization. Local AI security rule number one: do not let a stranger's model file become a stranger's code execution.
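The risk is easy to demonstrate with Python's standard `pickle` module and a deliberately harmless payload. The `Payload` class below is hypothetical and only evaluates a piece of arithmetic, but the same mechanism lets a malicious file call any function it likes the moment it is loaded.

```python
import pickle

class Payload:
    """Hypothetical object that tells pickle what to call when it is loaded."""

    def __reduce__(self):
        # pickle will call eval("6 * 7") during deserialization.
        # A hostile file could name any callable and any arguments here.
        return (eval, ("6 * 7",))

blob = pickle.dumps(Payload())

# Loading does not give back a Payload; it gives back whatever the call returned.
result = pickle.loads(blob)
print(type(result).__name__, result)
```

Nothing in the loading code asked for an expression to be evaluated; the file decided that. Formats that store only tensor data plus a plain metadata header do not have this code path, which is the reason to prefer them.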

GGUF is the llama.cpp ecosystem's binary model format. Use GGUF when you want llama.cpp, CPU inference, Apple Silicon inference, simple local servers, portable quantized models, or desktop tools like LM Studio.

ONNX is useful for standardized deployment and hardware-specific acceleration, especially outside the usual PyTorch stack. If you are deploying to Intel NPUs, ARM devices, or custom accelerators, ONNX is often the path of least resistance.

TensorRT-LLM is NVIDIA's high-performance i
