
Defining Stopping Criteria in Large Language Models: A Practical Guide

Large Language Models (LLMs) like GPT-4, Claude, and PaLM have transformed how we generate text, code, and even multi-turn conversations. But there’s one underrated piece of the generation puzzle that deeply affects how well these models perform: stopping criteria.

Whether you're building a chatbot, a code assistant, or a text generation pipeline, you eventually need the model to stop generating tokens — at the right time. If it stops too early, your output might be incomplete. Too late, and you get rambling, repetition, or hallucination.





In this post, we’ll explore the different ways you can define stopping criteria for LLMs, with real-world examples and trade-offs.


🚦 1. Token Count / Max Tokens

🔍 What is it?

Limit the number of tokens the model can generate.

✅ Pros

  • Simple and effective.

  • Prevents infinite or excessively long outputs.

⚠️ Cons

  • Might cut off meaningful output mid-sentence.

  • Doesn’t adapt based on context or content.

💡 Use case

import openai  # pre-1.0 openai package; set openai.api_key first

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=chat_history,  # your running list of chat messages
    max_tokens=512          # hard cap on the number of generated tokens
)
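A practical follow-up: the API reports why generation ended, so you can tell a natural finish from a hard cut. A minimal check, assuming response holds the result of the call above:

# finish_reason is "length" when max_tokens cut generation off,
# and "stop" when the model finished naturally or hit a stop sequence.
if response.choices[0].finish_reason == "length":
    print("Warning: output truncated; consider raising max_tokens.")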

✋ 2. Special Stop Sequences

🔍 What is it?

Define one or more string sequences that, when generated, immediately stop further output.

✅ Pros

  • Great for structured outputs like JSON, code blocks, or prompts with delimiters.

  • Works well with tools/functions integration.

⚠️ Cons

  • Must know or enforce these sequences in the prompt.

  • Fragile if the model never emits the exact string.

💡 Use case

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=chat_history,
    stop=["\nHuman:", "<END>"]  # halt as soon as either string is generated
)
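One detail worth knowing: the matched stop sequence itself is not included in the returned text, so you don't need to strip it downstream.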

🧠 3. Semantic Stopping

🔍 What is it?

Stop when the generated content semantically completes the task (e.g., completes a paragraph, finishes a function, etc.).

✅ Pros

  • Feels natural and human-like.

  • More flexible for open-ended generation.

⚠️ Cons

  • Harder to implement — requires post-processing or heuristics.

  • May need a second model or logic to evaluate “completion.”

💡 Use case

In multi-turn chat, detect whether the response ends in a complete sentence and no follow-up is needed:

# continuation_needed() is a placeholder for your own logic, e.g. the
# judge-model sketch below.
if response.endswith((".", "!", "?")) and not continuation_needed(response):
    stop_generation = True
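If the punctuation heuristic alone is too brittle, one option is to let a second, cheaper model judge completeness. A minimal sketch, assuming the same pre-1.0 openai client as the earlier examples (continuation_needed and its YES/NO prompt are illustrative, not a standard API):

import openai  # pre-1.0 openai package, as in the earlier examples

def continuation_needed(text: str) -> bool:
    # Hypothetical judge: ask a cheap model whether the text reads as a
    # finished answer. One possible heuristic among many.
    verdict = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Does the following response look complete? "
                       "Answer YES or NO.\n\n" + text,
        }],
        max_tokens=3,
    )
    return verdict.choices[0].message.content.strip().upper() == "NO"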

🧰 4. Function Calling & Tool Use (Structured Interface)

🔍 What is it?

LLMs like GPT-4 and Claude can be guided with a schema: they stop generating once a well-formed function or tool call has been produced.

✅ Pros

  • Great for building agentic systems.

  • Guarantees structured outputs.

⚠️ Cons

  • Overhead in defining function schemas.

  • Not suited to freeform text generation.

💡 Use case

{
  "functions": [
    {
      "name": "get_weather",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        }
      }
    }
  ]
}
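To see the stopping behavior end to end, here is a sketch using the same pre-1.0 openai client as above: once a well-formed call is produced, the model stops and reports finish_reason == "function_call" instead of free text.

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    functions=[{
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
        },
    }],
)
choice = response.choices[0]
if choice.finish_reason == "function_call":
    # Generation stopped because a structured call was completed.
    print(choice.message.function_call.arguments)  # JSON string of arguments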

🌀 5. Sampling-based Criteria (e.g., Repetition Penalties, Top-p, Temperature)

🔍 What is it?

Use sampling controls like temperature, top-p, and repetition penalties to nudge the model toward ending sooner and away from rambling, repetitive continuations.

✅ Pros

  • More natural termination in creative writing or poetry.

  • Encourages diverse but coherent outputs.

⚠️ Cons

  • Indirect — you’re guiding rather than explicitly stopping.

  • Needs fine-tuning and experimentation.

💡 Use case

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=chat_history,
    temperature=0.7,      # moderate randomness
    top_p=0.9,            # nucleus sampling: drop the unlikely tail
    presence_penalty=0.6  # discourage revisiting tokens already used
)
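One design note: presence_penalty applies a flat penalty to any token that has already appeared, while the related frequency_penalty scales with how often a token has been used; either can break the repetitive loops that make outputs drag on past their natural end.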

🧩 6. Custom Callbacks or Streaming Interruption

🔍 What is it?

If using token streaming, you can define your own logic to interrupt generation mid-way.

✅ Pros

  • Extremely flexible.

  • Real-time control (great for agents, UI, etc.).

⚠️ Cons

  • Requires more infrastructure.

  • Must handle partial outputs gracefully.

💡 Use case

class StopGeneration(Exception):
    """Raised to abort a streaming generation early."""

def stop_streaming_callback(token):
    # too_many_tokens_seen() is a placeholder for your own budget check.
    if token == "<END>" or too_many_tokens_seen():
        raise StopGeneration()

# Used in streaming pipelines or async frameworks.
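For local generation, Hugging Face transformers exposes the same idea through its StoppingCriteria interface. A minimal sketch (StopOnString and the gpt2 checkpoint are illustrative choices, not fixed requirements):

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          StoppingCriteria, StoppingCriteriaList)

class StopOnString(StoppingCriteria):
    """Stop as soon as the decoded output ends with a target string."""
    def __init__(self, tokenizer, stop_string):
        self.tokenizer = tokenizer
        self.stop_string = stop_string

    def __call__(self, input_ids, scores, **kwargs):
        text = self.tokenizer.decode(input_ids[0], skip_special_tokens=True)
        return text.endswith(self.stop_string)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    stopping_criteria=StoppingCriteriaList([StopOnString(tokenizer, "<END>")]),
)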

🧠 Bonus: Multi-Criteria Hybrid Approaches

In production, you often combine several of the above:

  • Set a max_tokens limit.

  • Define a stop sequence.

  • Monitor output for semantic completion.

  • Abort via a custom callback if needed.

This gives you both safety and flexibility.
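To make that concrete, here is a minimal hybrid sketch, again assuming the pre-1.0 openai streaming client (MAX_CHARS and STOP_MARKER are illustrative values, not standard parameters):

import openai  # pre-1.0 openai package, as in the earlier examples

MAX_CHARS = 4000       # custom hard safety cap on accumulated output
STOP_MARKER = "<END>"  # marker the prompt asks the model to emit when done

def generate_with_hybrid_stops(chat_history):
    buffer = ""
    stream = openai.ChatCompletion.create(
        model="gpt-4",
        messages=chat_history,
        max_tokens=512,        # criterion 1: token budget
        stop=["\nHuman:"],     # criterion 2: API-level stop sequence
        stream=True,
    )
    for chunk in stream:
        buffer += chunk.choices[0].delta.get("content", "")
        if STOP_MARKER in buffer:       # criterion 3: semantic marker
            return buffer.split(STOP_MARKER)[0]
        if len(buffer) > MAX_CHARS:     # criterion 4: custom abort
            break
    return buffer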


🧭 Wrapping Up

Stopping criteria aren’t just a technical detail: they shape the user experience, performance, and reliability of your LLM-powered product.

Whether you’re auto-generating code, writing product descriptions, or managing multi-turn dialogue, pick the criterion (or combination) that fits your use case. A thoughtful stopping strategy can elevate the quality and control of your AI systems by a mile.
