
Defining Stopping Criteria in Large Language Models: A Practical Guide

Large Language Models (LLMs) like GPT-4, Claude, and PaLM have transformed how we generate text, code, and even multi-turn conversations. But there’s one underrated piece of the generation puzzle that deeply affects how well these models perform: stopping criteria.

Whether you're building a chatbot, a code assistant, or a text generation pipeline, you eventually need the model to stop generating tokens — at the right time. If it stops too early, your output might be incomplete. Too late, and you get rambling, repetition, or hallucination.





In this post, we’ll explore the different ways you can define stopping criteria for LLMs, with real-world examples and trade-offs.


🚦 1. Token Count / Max Tokens

🔍 What is it?

Limit the number of tokens the model can generate.

✅ Pros

  • Simple and effective.

  • Prevents infinite or excessively long outputs.

⚠️ Cons

  • Might cut off meaningful output mid-sentence.

  • Doesn’t adapt based on context or content.

💡 Use case

import openai  # pre-1.0 openai package; set openai.api_key first

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=chat_history,  # your running list of chat messages
    max_tokens=512          # hard cap on the number of generated tokens
)
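A practical follow-up: the API reports why generation ended, so you can tell a natural finish from a hard cut. A minimal check, assuming response holds the result of the call above:

# finish_reason is "length" when max_tokens cut generation off,
# and "stop" when the model finished naturally or hit a stop sequence.
if response.choices[0].finish_reason == "length":
    print("Warning: output truncated; consider raising max_tokens.")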

✋ 2. Special Stop Sequences

🔍 What is it?

Define one or more string sequences that, when generated, immediately stop further output.

✅ Pros

  • Great for structured outputs like JSON, code blocks, or prompts with delimiters.

  • Works well with tools/functions integration.

⚠️ Cons

  • Must know or enforce these sequences in the prompt.

  • Fragile if the model never emits the exact string.

💡 Use case

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=chat_history,
    stop=["\nHuman:", "<END>"]  # halt as soon as either string is generated
)
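One detail worth knowing: the matched stop sequence itself is not included in the returned text, so you don't need to strip it downstream.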

🧠 3. Semantic Stopping

🔍 What is it?

Stop when the generated content semantically completes the task (e.g., completes a paragraph, finishes a function, etc.).

✅ Pros

  • Feels natural and human-like.

  • More flexible for open-ended generation.

⚠️ Cons

  • Harder to implement — requires post-processing or heuristics.

  • May need a second model or logic to evaluate “completion.”

💡 Use case

In multi-turn chat, detect whether the response ends in a complete sentence and no follow-up is needed:

# continuation_needed() is a placeholder for your own logic, e.g. the
# judge-model sketch below.
if response.endswith((".", "!", "?")) and not continuation_needed(response):
    stop_generation = True
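If the punctuation heuristic alone is too brittle, one option is to let a second, cheaper model judge completeness. A minimal sketch, assuming the same pre-1.0 openai client as the earlier examples (continuation_needed and its YES/NO prompt are illustrative, not a standard API):

import openai  # pre-1.0 openai package, as in the earlier examples

def continuation_needed(text: str) -> bool:
    # Hypothetical judge: ask a cheap model whether the text reads as a
    # finished answer. One possible heuristic among many.
    verdict = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Does the following response look complete? "
                       "Answer YES or NO.\n\n" + text,
        }],
        max_tokens=3,
    )
    return verdict.choices[0].message.content.strip().upper() == "NO"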

🧰 4. Function Calling & Tool Use (Structured Interface)

🔍 What is it?

LLMs like GPT-4 and Claude can be guided with a schema: they stop generating once a well-formed function or tool call has been produced.

✅ Pros

  • Great for building agentic systems.

  • Guarantees structured outputs.

⚠️ Cons

  • Overhead in defining function schemas.

  • Not suited to freeform text generation.

💡 Use case

{
  "functions": [
    {
      "name": "get_weather",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        }
      }
    }
  ]
}
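To see the stopping behavior end to end, here is a sketch using the same pre-1.0 openai client as above: once a well-formed call is produced, the model stops and reports finish_reason == "function_call" instead of free text.

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    functions=[{
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
        },
    }],
)
choice = response.choices[0]
if choice.finish_reason == "function_call":
    # Generation stopped because a structured call was completed.
    print(choice.message.function_call.arguments)  # JSON string of arguments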

🌀 5. Sampling-based Criteria (e.g., Repetition Penalties, Top-p, Temperature)

🔍 What is it?

Use sampling controls like temperature, top-p, and repetition penalties to nudge the model toward ending sooner and away from rambling, repetitive continuations.

✅ Pros

  • More natural termination in creative writing or poetry.

  • Encourages diverse but coherent outputs.

⚠️ Cons

  • Indirect — you’re guiding rather than explicitly stopping.

  • Needs fine-tuning and experimentation.

💡 Use case

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=chat_history,
    temperature=0.7,      # moderate randomness
    top_p=0.9,            # nucleus sampling: drop the unlikely tail
    presence_penalty=0.6  # discourage revisiting tokens already used
)
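One design note: presence_penalty applies a flat penalty to any token that has already appeared, while the related frequency_penalty scales with how often a token has been used; either can break the repetitive loops that make outputs drag on past their natural end.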

🧩 6. Custom Callbacks or Streaming Interruption

🔍 What is it?

If using token streaming, you can define your own logic to interrupt generation mid-way.

✅ Pros

  • Extremely flexible.

  • Real-time control (great for agents, UI, etc.).

⚠️ Cons

  • Requires more infrastructure.

  • Must handle partial outputs gracefully.

💡 Use case

class StopGeneration(Exception):
    """Raised to abort a streaming generation early."""

def stop_streaming_callback(token):
    # too_many_tokens_seen() is a placeholder for your own budget check.
    if token == "<END>" or too_many_tokens_seen():
        raise StopGeneration()

# Used in streaming pipelines or async frameworks.
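For local generation, Hugging Face transformers exposes the same idea through its StoppingCriteria interface. A minimal sketch (StopOnString and the gpt2 checkpoint are illustrative choices, not fixed requirements):

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          StoppingCriteria, StoppingCriteriaList)

class StopOnString(StoppingCriteria):
    """Stop as soon as the decoded output ends with a target string."""
    def __init__(self, tokenizer, stop_string):
        self.tokenizer = tokenizer
        self.stop_string = stop_string

    def __call__(self, input_ids, scores, **kwargs):
        text = self.tokenizer.decode(input_ids[0], skip_special_tokens=True)
        return text.endswith(self.stop_string)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    stopping_criteria=StoppingCriteriaList([StopOnString(tokenizer, "<END>")]),
)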

🧠 Bonus: Multi-Criteria Hybrid Approaches

In production, you often combine several of the above:

  • Set a max_tokens limit.

  • Define a stop sequence.

  • Monitor output for semantic completion.

  • Abort via a custom callback if needed.

This gives you both safety and flexibility.
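To make that concrete, here is a minimal hybrid sketch, again assuming the pre-1.0 openai streaming client (MAX_CHARS and STOP_MARKER are illustrative values, not standard parameters):

import openai  # pre-1.0 openai package, as in the earlier examples

MAX_CHARS = 4000       # custom hard safety cap on accumulated output
STOP_MARKER = "<END>"  # marker the prompt asks the model to emit when done

def generate_with_hybrid_stops(chat_history):
    buffer = ""
    stream = openai.ChatCompletion.create(
        model="gpt-4",
        messages=chat_history,
        max_tokens=512,        # criterion 1: token budget
        stop=["\nHuman:"],     # criterion 2: API-level stop sequence
        stream=True,
    )
    for chunk in stream:
        buffer += chunk.choices[0].delta.get("content", "")
        if STOP_MARKER in buffer:       # criterion 3: semantic marker
            return buffer.split(STOP_MARKER)[0]
        if len(buffer) > MAX_CHARS:     # criterion 4: custom abort
            break
    return buffer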


🧭 Wrapping Up

Stopping criteria aren’t just a technical detail: they shape the user experience, performance, and reliability of your LLM-powered product.

Whether you’re auto-generating code, writing product descriptions, or managing multi-turn dialogue, pick the criterion (or combination) that fits your use case. A thoughtful stopping strategy can elevate the quality and control of your AI systems by a mile.
