OpenAI Token Limits and Counting with Tiktoken

Tokens are the fundamental unit of the OpenAI API. Understanding how tokens work, how to count them, and how to stay within model limits is essential for building reliable applications and controlling costs. This guide covers counting tokens with tiktoken, context window and output limits, trimming and summarizing long conversations, and estimating cost from the usage data returned by the API.

What Is a Token?

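A token is a piece of text that the model processes as a single unit. Tokens do not map neatly to words or characters. As rules of thumb for English text, one token is roughly 4 characters or about three-quarters of a word, so 100 tokens correspond to about 75 words. Other languages and code often use more tokens for the same amount of text.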
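When a tokenizer is not available, a quick estimate can be sketched as follows. It assumes the widely quoted approximation of about 4 characters per token for English text; the function name `estimate_tokens` and its `chars_per_token` parameter are illustrative, not part of any library.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for English text.

    Assumption: about 4 characters per token. This is a heuristic only;
    use a real tokenizer when accuracy matters.
    """
    if not text:
        return 0
    # Round up so short, non-empty strings count as at least one token
    return max(1, math.ceil(len(text) / chars_per_token))


sample = "Tokens do not map neatly to words or characters."
print(f"Estimated tokens: {estimate_tokens(sample)}")
```

An estimate like this is best kept for rough budgeting, with exact counts coming from the tokenizer itself.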

Context Window Limits by Model

Maximum Output Tokens

Each model also has a maximum output limit that is separate from the context window:

Warning: The context window includes both input and output. If your input is 120,000 tokens on gpt-4o, you can only generate up to 8,000 output tokens (128K - 120K), even though the model's max output is 16,384.
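The arithmetic in this warning can be sketched as a small helper. The defaults below use the gpt-4o figures quoted above (a 128,000-token context window and a 16,384-token output cap); the function name is hypothetical, and the limits should be replaced with the values for whichever model you use.

```python
def available_output_tokens(input_tokens: int,
                            context_window: int = 128_000,
                            max_output: int = 16_384) -> int:
    """Return how many output tokens can still be generated.

    The result is bounded both by the space left in the context window
    and by the model's own output cap.
    """
    remaining = context_window - input_tokens
    return max(0, min(max_output, remaining))


print(available_output_tokens(120_000))  # 8000: limited by the context window
print(available_output_tokens(10_000))   # 16384: limited by the output cap
```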

Counting Tokens with Tiktoken (Python)

import tiktoken

# Get the encoder for a specific model
enc = tiktoken.encoding_for_model("gpt-4o")

text = "Hello, how are you doing today?"
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")
print(f"Decoded: {[enc.decode_single_token_bytes(t) for t in tokens]}")
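The same encode and decode calls can also cut a text down to a token budget. The sketch below is one possible approach, not an official recipe: `truncate_to_tokens` is a hypothetical helper that accepts any encoder object exposing `encode` and `decode`, such as the `enc` object created above.

```python
def truncate_to_tokens(text, max_tokens, enc):
    """Return `text` cut to at most `max_tokens` tokens.

    `enc` is assumed to be an encoder with encode(str) -> list of ints
    and decode(list of ints) -> str, e.g. a tiktoken encoding.
    """
    if max_tokens <= 0:
        return ""
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text  # already within budget, return unchanged
    # Keep the first max_tokens tokens and turn them back into text
    return enc.decode(tokens[:max_tokens])
```

With tiktoken this could be called as `truncate_to_tokens(long_text, 1000, tiktoken.encoding_for_model("gpt-4o"))`. A cut can land in the middle of a word or a multi-byte character, so the end of the truncated text may look slightly ragged.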

Counting Chat Message Tokens

Chat messages have overhead beyond just the content text:

import tiktoken

def count_message_tokens(messages, model="gpt-4o"):
    """Count tokens for a list of chat messages."""
    enc = tiktoken.encoding_for_model(model)

    # Approximate per-message framing overhead; exact values can vary by model
    tokens_per_message = 3
    tokens_per_name = 1

    total = 0
    for message in messages:
        total += tokens_per_message
        for key, value in message.items():
            if isinstance(value, str):  # skip None or structured content
                total += len(enc.encode(value))
            if key == "name":
                total += tokens_per_name

    total += 3  # every reply is primed with the assistant role
    return total

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]

token_count = count_message_tokens(messages)
print(f"Total tokens: {token_count}")

Node.js Token Counting

// Using the js-tiktoken package
import { encodingForModel } from 'js-tiktoken';

const enc = encodingForModel('gpt-4o');

const text = 'Hello, how are you doing today?';
const tokens = enc.encode(text);
console.log(`Token count: ${tokens.length}`);
console.log(`Tokens: ${Array.from(tokens)}`);

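// js-tiktoken is a pure-JavaScript port, so there is no WASM memory to free.
// (enc.free() applies to the WASM-based `tiktoken` npm package instead.)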

Managing Token Limits in Conversations

Long conversations will eventually exceed the context window. Implement a strategy to manage this:

import tiktoken
from openai import OpenAI

client = OpenAI(base_url="https://claude4u.com/v1")

MAX_CONTEXT = 120000  # Leave room for output
enc = tiktoken.encoding_for_model("gpt-4o")

def trim_conversation(messages, max_tokens=MAX_CONTEXT):
    """Keep system message and trim oldest user/assistant messages."""
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]

    total = sum(len(enc.encode(m["content"])) + 4 for m in system)

    # Add messages from newest to oldest
    kept = []
    for msg in reversed(history):
        msg_tokens = len(enc.encode(msg["content"])) + 4
        if total + msg_tokens > max_tokens:
            break
        kept.insert(0, msg)
        total += msg_tokens

    return system + kept

# Use in your chat loop
conversation = [
    {"role": "system", "content": "You are a helpful assistant."}
]

while True:
    user_input = input("You: ")
    conversation.append({"role": "user", "content": user_input})

    # Trim if needed
    conversation = trim_conversation(conversation)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=conversation,
        max_tokens=4096
    )

    reply = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": reply})
    print(f"Assistant: {reply}")

Summarization Strategy

Instead of simply truncating old messages, summarize them to retain context:

def summarize_old_messages(client, old_messages):
    """Compress old conversation into a summary."""
    old_text = "\n".join([f"{m['role']}: {m['content']}" for m in old_messages])

    summary_response = client.chat.completions.create(
        model="gpt-4o-mini",  # Use cheaper model for summarization
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation in 200 words:\n{old_text}"
        }],
        max_tokens=300
    )
    return summary_response.choices[0].message.content

Tip: Always set max_tokens in your API calls to prevent unexpectedly long responses. Without it, the model may generate tokens up to its output limit, resulting in higher costs and longer response times.
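One way to combine summarization with trimming is sketched below. It keeps the system messages and the most recent turns, and replaces everything older with a single summary message. The helper name `compress_history`, the `keep_recent` parameter, and the choice to store the summary as a system message are all assumptions made for illustration; the `summarize` argument is any callable that takes a list of messages and returns a string.

```python
def compress_history(messages, keep_recent, summarize):
    """Replace all but the last `keep_recent` non-system messages with a summary.

    `summarize` is a callable: list of messages -> summary string.
    """
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")

    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]

    if len(history) <= keep_recent:
        return list(messages)  # nothing old enough to compress

    split = len(history) - keep_recent
    old, recent = history[:split], history[split:]

    # Assumption: the summary is carried as an extra system message
    summary_msg = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summarize(old)}",
    }
    return system + [summary_msg] + recent
```

With the function defined earlier, a call could look like `compress_history(conversation, 6, lambda old: summarize_old_messages(client, old))`. On repeated calls, earlier summaries stay in place as system messages, so a long-running chat may want to fold them into the next summary as well.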

Checking Token Usage in Responses

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

usage = response.usage
print(f"Prompt tokens: {usage.prompt_tokens}")
print(f"Completion tokens: {usage.completion_tokens}")
print(f"Total tokens: {usage.total_tokens}")

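# Estimate cost in USD. The rates below assume gpt-4o pricing of $2.50 (input)
# and $10.00 (output) per 1M tokens; check current pricing before relying on them.
input_cost = usage.prompt_tokens * 2.50 / 1_000_000
output_cost = usage.completion_tokens * 10.00 / 1_000_000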
print(f"Estimated cost: ${input_cost + output_cost:.6f}")

Tip: claude4u.com provides detailed token usage tracking and cost breakdowns in its dashboard, making it easy to monitor token consumption across all your projects and models without building custom tracking infrastructure.
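For a quick local view across several calls, such as the chat loop shown earlier, the per-response usage figures can be accumulated in a small object. This is a minimal sketch: the `UsageTracker` class is hypothetical, and the prices passed in are assumptions that should be checked against current pricing.

```python
from dataclasses import dataclass


@dataclass
class UsageTracker:
    """Accumulates token usage and estimates total cost in USD."""
    input_price_per_m: float    # assumed price per 1M prompt tokens
    output_price_per_m: float   # assumed price per 1M completion tokens
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def add(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Record the usage reported by one API response."""
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens

    @property
    def cost(self) -> float:
        """Estimated total cost for everything recorded so far."""
        return (self.prompt_tokens * self.input_price_per_m
                + self.completion_tokens * self.output_price_per_m) / 1_000_000


tracker = UsageTracker(input_price_per_m=2.50, output_price_per_m=10.00)
tracker.add(1_000, 500)   # e.g. usage.prompt_tokens, usage.completion_tokens
tracker.add(2_000, 250)
print(f"Prompt: {tracker.prompt_tokens}, completion: {tracker.completion_tokens}")
print(f"Estimated cost: ${tracker.cost:.6f}")
```

In a real loop, `tracker.add(response.usage.prompt_tokens, response.usage.completion_tokens)` would be called after each request.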
