
🧠 Decoding Strategies in Language Models: How Do LLMs Pick the Next Word?

When you prompt a large language model (LLM) like ChatGPT, Claude, or Mistral, the model doesn’t just magically “know” what to say next.

Under the hood, it considers many possible next tokens—each with a probability—and uses a decoding strategy to decide which one to output.

This decision shapes everything from tone and creativity to factual accuracy and fluency.

In this post, we’ll break down the most popular decoding strategies and help you choose the right one for your use case.



Decoding Strategies in LLM


🧩 What is Decoding?

Decoding is the process by which a language model converts probabilities over tokens into actual output text.

After analyzing your prompt, the model generates a probability distribution for possible next tokens. A decoding strategy then samples or selects one of those tokens.

Repeat this token-by-token, and you get a full sentence, paragraph, or entire essay.
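
Every strategy below plugs into this same loop; only the selection rule changes. Here is a toy sketch (the four-token vocabulary and the `toy_next_token_probs` stand-in are invented for illustration; a real model computes this distribution from the full context):

```python
import numpy as np

def toy_next_token_probs(tokens):
    # Stand-in for a real model's forward pass: a deterministic,
    # made-up probability distribution over a 4-token vocabulary.
    seed = hash(tuple(tokens)) % (2**32)
    logits = np.random.default_rng(seed).normal(size=4)
    exp = np.exp(logits - logits.max())  # softmax
    return exp / exp.sum()

def decode(prompt_tokens, pick_token, steps=5):
    tokens = list(prompt_tokens)
    for _ in range(steps):
        probs = toy_next_token_probs(tokens)
        tokens.append(pick_token(probs))  # the decoding strategy lives here
    return tokens

# Greedy decoding is the simplest possible selection rule:
print(decode([0, 1], pick_token=lambda probs: int(np.argmax(probs))))
```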


⚙️ Common Decoding Strategies

1. Greedy Decoding

✅ Simple, fast, but not always smart.
  • How it works: Always picks the token with the highest probability at each step.

  • Pros: Fast and deterministic.

  • Cons: Can get stuck in loops or generate dull, repetitive text.

🧪 Example:
Prompt: "The cat sat on the"
Output: "mat and then the cat sat on the mat and then the cat..."
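
As a minimal sketch, here is the greedy rule applied to a made-up next-token distribution (the vocabulary and probabilities below are invented for illustration):

```python
import numpy as np

# Toy next-token distribution (illustrative values, not from a real model)
vocab = ["mat", "dog", "moon", "sofa"]
probs = np.array([0.55, 0.25, 0.15, 0.05])

# Greedy decoding: always take the single most probable token
next_token = vocab[int(np.argmax(probs))]
print(next_token)  # -> "mat", every single time
```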

2. Beam Search

🔎 Looks ahead to find the best sequence overall.
  • How it works: Keeps multiple candidate sequences (beams) and explores likely paths.

  • Beam Width: The number of sequences it keeps track of.

  • Pros: Produces more coherent and globally optimized outputs.

  • Cons: Still deterministic; can lack creativity.

🎯 Best for: Summarization, translation, legal documents
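
Here is a compact beam-search sketch over a toy scoring function (the vocabulary and the `next_logprobs` stand-in are invented; a real decoder would score candidates with the model):

```python
import numpy as np

vocab = ["the", "cat", "sat", "<eos>"]

def next_logprobs(seq):
    # Made-up log-probs derived from the sequence so far; a stand-in for
    # a real language model's forward pass (deterministic within one run).
    seed = hash(tuple(seq)) % (2**32)
    logits = np.random.default_rng(seed).normal(size=len(vocab))
    return logits - np.log(np.exp(logits).sum())  # log-softmax

def beam_search(steps=3, beam_width=2):
    beams = [([], 0.0)]  # each beam: (token list, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in zip(vocab, next_logprobs(seq)):
                candidates.append((seq + [tok], score + lp))
        # Keep only the beam_width highest-scoring sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for seq, score in beam_search():
    print(" ".join(seq), f"(log-prob {score:.2f})")
```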

3. Sampling (a.k.a. Multinomial Sampling)

🎲 Adds randomness for creative outputs.
  • How it works: Samples the next token from the probability distribution (not always the top one).

  • Controlled by: temperature (higher values flatten the distribution for more randomness; lower values sharpen it toward greedy behavior)

  • Pros: Diverse and creative outputs.

  • Cons: Can lead to incoherent or off-topic results at high temperatures.

🎨 Best for: Creative writing, brainstorming, poetry
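
A minimal temperature-scaled sampling step might look like this (the logits below are invented; real ones come from the model's final layer):

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample one token index from temperature-scaled logits."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                         # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()  # softmax
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.5, -1.0]  # illustrative model scores
print(sample_with_temperature(logits, temperature=0.2))  # near-greedy
print(sample_with_temperature(logits, temperature=1.5))  # much more random
```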

4. Top-k Sampling

🧺 Limit choices to the top k tokens, then sample.
  • How it works: From the top k most probable tokens, randomly pick one.

  • Pros: Avoids unlikely/garbage tokens; adds controlled diversity.

  • Cons: Picking a good k can be tricky.

⚙️ Tip: k = 40 is a common default.
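
A sketch of the top-k filter, assuming an already-computed probability distribution (the values below are invented):

```python
import numpy as np

def top_k_sample(probs, k=40, rng=None):
    """Zero out everything outside the k most probable tokens, renormalize, sample."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    k = min(k, len(probs))
    top = np.argsort(probs)[-k:]   # indices of the k most likely tokens
    filtered = np.zeros_like(probs)
    filtered[top] = probs[top]
    filtered /= filtered.sum()     # renormalize over the survivors
    return rng.choice(len(probs), p=filtered)

probs = [0.5, 0.3, 0.1, 0.06, 0.04]  # illustrative distribution
print(top_k_sample(probs, k=3))       # never returns index 3 or 4
```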

5. Top-p Sampling (a.k.a. Nucleus Sampling)

🧠 Smarter sampling that adapts to each context.
  • How it works: Instead of a fixed k, it selects the smallest set of top tokens whose cumulative probability is ≥ p (e.g., 0.9), and samples from that.

  • Pros: Dynamic, more fluent and context-aware.

  • Cons: Slightly harder to interpret and tune.

Top-p is the most commonly used decoding strategy in production systems today.
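
A sketch of the nucleus filter under the same assumptions (invented probabilities; note how the nucleus size adapts to the shape of the distribution):

```python
import numpy as np

def top_p_sample(probs, p=0.9, rng=None):
    """Sample from the smallest set of tokens whose cumulative probability >= p."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                   # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # size of the nucleus
    nucleus = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[nucleus] = probs[nucleus]
    filtered /= filtered.sum()
    return rng.choice(len(probs), p=filtered)

# A peaked distribution yields a tiny nucleus; a flat one keeps more options.
print(top_p_sample([0.85, 0.10, 0.03, 0.02], p=0.9))  # nucleus = 2 tokens
print(top_p_sample([0.30, 0.25, 0.25, 0.20], p=0.9))  # nucleus = 4 tokens
```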

6. Typical Sampling

📊 Picks tokens that are statistically “typical” — neither too predictable nor too rare.
  • How it works: Keeps tokens whose surprisal (negative log-probability) is close to the distribution's expected entropy, filtering out both the most predictable and the most surprising tokens.

  • Pros: Balances creativity and coherence.

  • Cons: Still experimental; not yet supported in every inference library, but promising.
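
The idea can be sketched in a few lines (this follows the locally typical sampling proposal in spirit; the distribution below is invented, and this is not a reference implementation):

```python
import numpy as np

def typical_sample(probs, p=0.95, rng=None):
    """Keep the tokens whose surprisal is closest to the expected entropy."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    surprisal = -np.log(probs)
    entropy = np.sum(probs * surprisal)              # expected surprisal
    order = np.argsort(np.abs(surprisal - entropy))  # most "typical" first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    filtered /= filtered.sum()
    return rng.choice(len(probs), p=filtered)

print(typical_sample([0.5, 0.2, 0.15, 0.1, 0.05], p=0.9))
```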


🧠 Strategy Comparison Table

| Strategy | Deterministic | Creative | Coherent | Use Case |
| --- | --- | --- | --- | --- |
| Greedy | ✅ Yes | ❌ No | ⚠️ Sometimes | Factual Q&A |
| Beam Search | ✅ Yes | ❌ No | ✅ Strong | Translation, Summarization |
| Sampling | ❌ No | ✅ Yes | ⚠️ Varies | Poetry, Brainstorming |
| Top-k Sampling | ❌ No | ✅ Yes | ✅ Better | Chat, Copywriting |
| Top-p Sampling | ❌ No | ✅ Yes | ✅ Great | General-purpose LLMs |
| Typical Sampling | ❌ No | ✅ Yes | ✅ Balanced | Experimental |


🛠️ Choosing the Right Strategy

Here’s how to pick based on your product (a code sketch mapping these presets onto real generation parameters follows the list):

  • 🤖 Chatbots / Assistants: Top-p (p = 0.8–0.95), temp = 0.7

  • ✍️ Creative Writing: Sampling or Top-p, temp = 0.9+

  • 📑 Summarization / Translation: Beam Search or Top-p, low temp

  • 📚 Educational / Factual: Greedy or Top-k with low temperature

  • 🧪 Research Experiments: Try Typical Sampling or hybrid strategies
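
As one concrete mapping, here is how these presets translate into Hugging Face transformers generate() arguments (a sketch: "gpt2" is just a small stand-in model, and the values are the rough starting points above, not tuned recommendations):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM; gpt2 is a small, convenient example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The cat sat on the", return_tensors="pt")

# Chatbot-style preset: nucleus sampling with a moderate temperature
output = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
    max_new_tokens=30,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Deterministic alternatives: do_sample=False for greedy decoding,
# or num_beams=4 (with do_sample=False) for beam search.
```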


🚀 Final Thoughts

The decoding strategy you choose can drastically change how your model behaves. Want safer and more reliable answers? Go low-temp + deterministic. Want surprise, humor, and novelty? Loosen up and let the model explore.

Your LLM’s “voice” depends not just on the prompt—but on how you decode.
