OpenAI Token Limits and Counting with Tiktoken
Tokens are the fundamental unit of the OpenAI API. Understanding how tokens work, how to count them, and how to stay within model limits is essential for building reliable applications and controlling costs. This guide covers what tokens are, the context and output limits of common models, how to count tokens with tiktoken, and how to keep long conversations within budget.
What Is a Token?
A token is a piece of text that the model processes as a single unit. Tokens do not map neatly to words or characters. Here are some rules of thumb for English text:
- 1 token is approximately 4 characters or 0.75 words
- "Hello" is 1 token
- "ChatGPT" is 2 tokens: "Chat" + "GPT"
- "Supercalifragilisticexpialidocious" is 7 tokens
- Punctuation is often its own token, while a leading space is usually merged into the word that follows it
- Numbers are split into short groups of digits rather than kept as whole values: "2024" is 1-2 tokens
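When a tokenizer is not available, the first rule of thumb above can be turned into a quick estimate. The following is a minimal sketch, not a replacement for tiktoken: the function name `estimate_tokens` is hypothetical, and the result is only a rough guide for English text.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough English-only estimate based on the rules of thumb above.

    Takes the larger of the two heuristics so the estimate leans
    toward over-counting rather than under-counting.
    """
    by_chars = len(text) / 4                # about 4 characters per token
    by_words = len(text.split()) / 0.75     # about 0.75 words per token
    return math.ceil(max(by_chars, by_words))


if __name__ == "__main__":
    sample = "Hello, how are you doing today?"
    print(f"Estimated tokens: {estimate_tokens(sample)}")
```

Use an estimate like this only for early budgeting; count with the real tokenizer before relying on a limit.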
Context Window Limits by Model
- gpt-4o — 128,000 tokens (input + output combined)
- gpt-4o-mini — 128,000 tokens
- gpt-4-turbo — 128,000 tokens
- gpt-4 — 8,192 tokens
- gpt-3.5-turbo — 16,385 tokens
- o1 — 200,000 tokens
- o3-mini — 200,000 tokens
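Because the window covers input and output together, a request only fits if the prompt plus the reserved output stays under the limit. A small sketch of that check follows; the `CONTEXT_WINDOW` table simply copies the figures listed above, and `fits_in_context` is a hypothetical helper name.

```python
# Context window sizes copied from the list above (tokens).
CONTEXT_WINDOW = {
    "gpt-4o": 128_000,
    "gpt-4o-mini": 128_000,
    "gpt-4-turbo": 128_000,
    "gpt-4": 8_192,
    "gpt-3.5-turbo": 16_385,
    "o1": 200_000,
    "o3-mini": 200_000,
}


def fits_in_context(model: str, prompt_tokens: int, max_output_tokens: int) -> bool:
    """Return True if the prompt plus the reserved output fits in the window."""
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOW[model]


if __name__ == "__main__":
    print(fits_in_context("gpt-4", prompt_tokens=8_000, max_output_tokens=500))
```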
Maximum Output Tokens
Each model also caps how many tokens it can generate in a single response. This cap is smaller than the context window, and generated tokens still count toward the window:
- gpt-4o — 16,384 output tokens
- gpt-4o-mini — 16,384 output tokens
- gpt-4-turbo — 4,096 output tokens
- o1 — 100,000 output tokens (including reasoning)
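In practice the usable output budget is the smallest of three numbers: what you ask for, the model's output cap, and whatever room the prompt leaves in the context window. The sketch below illustrates that arithmetic; `MAX_OUTPUT` copies the caps listed above, and `output_budget` and its parameters are hypothetical names.

```python
# Output caps copied from the list above (tokens).
MAX_OUTPUT = {
    "gpt-4o": 16_384,
    "gpt-4o-mini": 16_384,
    "gpt-4-turbo": 4_096,
    "o1": 100_000,
}


def output_budget(model: str, prompt_tokens: int, context_window: int, requested: int) -> int:
    """Clamp a requested output size to the model cap and the remaining window."""
    room_left = context_window - prompt_tokens
    return max(0, min(requested, MAX_OUTPUT[model], room_left))


if __name__ == "__main__":
    # A long prompt on gpt-4-turbo leaves less room than the 4,096-token cap.
    print(output_budget("gpt-4-turbo", 127_000, 128_000, 4_096))
```

A result of 0 means the prompt alone already fills the window and must be shortened before sending.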
Counting Tokens with Tiktoken (Python)
import tiktoken
# Get the encoder for a specific model
enc = tiktoken.encoding_for_model("gpt-4o")
text = "Hello, how are you doing today?"
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")
print(f"Decoded: {[enc.decode_single_token_bytes(t) for t in tokens]}")
Counting Chat Message Tokens
Chat messages have overhead beyond just the content text:
import tiktoken

def count_message_tokens(messages, model="gpt-4o"):
    """Count tokens for a list of chat messages."""
    enc = tiktoken.encoding_for_model(model)
    # Every message has overhead tokens
    tokens_per_message = 3  # role, content, separator
    tokens_per_name = 1
    total = 0
    for message in messages:
        total += tokens_per_message
        for key, value in message.items():
            total += len(enc.encode(value))
            if key == "name":
                total += tokens_per_name
    total += 3  # Every reply is primed with assistant
    return total

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]
token_count = count_message_tokens(messages)
print(f"Total tokens: {token_count}")
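To see how much of that total is pure overhead, the fixed part can be computed without a tokenizer. This sketch reuses the constants from the function above (3 tokens per message, 1 per name, 3 for reply priming) and treats them as assumptions that may differ between models; `chat_overhead` is a hypothetical name.

```python
def chat_overhead(
    num_messages: int,
    num_names: int = 0,
    tokens_per_message: int = 3,
    tokens_per_name: int = 1,
    reply_priming: int = 3,
) -> int:
    """Tokens added by the chat format itself, excluding role and content text."""
    return (
        num_messages * tokens_per_message
        + num_names * tokens_per_name
        + reply_priming
    )


if __name__ == "__main__":
    # Two messages, no names: 2 * 3 + 3 = 9 overhead tokens.
    print(chat_overhead(2))
```

The overhead grows linearly with the number of messages, so many short turns cost more than one long message with the same text.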
Node.js Token Counting
// Using the js-tiktoken package
import { encodingForModel } from 'js-tiktoken';
const enc = encodingForModel('gpt-4o');
const text = 'Hello, how are you doing today?';
const tokens = enc.encode(text);
console.log(`Token count: ${tokens.length}`);
console.log(`Tokens: ${Array.from(tokens)}`);
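// js-tiktoken is a pure JavaScript port, so there is no WASM memory to free.
// (The WASM-based `tiktoken` package is the one that requires enc.free().)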
Managing Token Limits in Conversations
Long conversations will eventually exceed the context window. Implement a strategy to manage this:
import tiktoken
from openai import OpenAI

client = OpenAI(base_url="https://claude4u.com/v1")
MAX_CONTEXT = 120000  # Leave room for output
enc = tiktoken.encoding_for_model("gpt-4o")

def trim_conversation(messages, max_tokens=MAX_CONTEXT):
    """Keep system message and trim oldest user/assistant messages."""
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]
    total = sum(len(enc.encode(m["content"])) + 4 for m in system)
    # Add messages from newest to oldest
    kept = []
    for msg in reversed(history):
        msg_tokens = len(enc.encode(msg["content"])) + 4
        if total + msg_tokens > max_tokens:
            break
        kept.insert(0, msg)
        total += msg_tokens
    return system + kept

# Use in your chat loop
conversation = [
    {"role": "system", "content": "You are a helpful assistant."}
]
while True:
    user_input = input("You: ")
    conversation.append({"role": "user", "content": user_input})
    # Trim if needed
    conversation = trim_conversation(conversation)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=conversation,
        max_tokens=4096
    )
    reply = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": reply})
    print(f"Assistant: {reply}")
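The trimming logic can be exercised offline by passing the token counter in as a parameter. The variant below is a sketch under two assumptions that are not in the original: `rough_count` is a crude stand-in for a real tokenizer, and the function also drops a leading assistant reply whose user question was trimmed away, so the kept history starts on a user turn.

```python
from typing import Callable


def rough_count(text: str) -> int:
    """Crude stand-in for a tokenizer: about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_to_budget(
    messages: list[dict],
    max_tokens: int,
    count: Callable[[str], int] = rough_count,
    per_message: int = 4,
) -> list[dict]:
    """Keep system messages plus the newest history that fits in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]
    total = sum(count(m["content"]) + per_message for m in system)

    kept = []
    for msg in reversed(history):          # walk from newest to oldest
        cost = count(msg["content"]) + per_message
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    kept.reverse()                         # restore chronological order

    # Assumption: avoid starting on an assistant reply with no question.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return system + kept


if __name__ == "__main__":
    chat = [
        {"role": "system", "content": "s" * 40},
        {"role": "user", "content": "a" * 40},
        {"role": "assistant", "content": "b" * 40},
        {"role": "user", "content": "c" * 40},
    ]
    for m in trim_to_budget(chat, max_tokens=42):
        print(m["role"], len(m["content"]))
```

In a real application, pass a tiktoken-based counter such as `lambda text: len(enc.encode(text))` as `count`.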
Summarization Strategy
Instead of simply truncating old messages, summarize them to retain context:
def summarize_old_messages(client, old_messages):
    """Compress old conversation into a summary."""
    old_text = "\n".join([f"{m['role']}: {m['content']}" for m in old_messages])
    summary_response = client.chat.completions.create(
        model="gpt-4o-mini",  # Use cheaper model for summarization
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation in 200 words:\n{old_text}"
        }],
        max_tokens=300
    )
    return summary_response.choices[0].message.content
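One way to wire a summary back into the conversation is to replace the older messages with a single summary message and keep the most recent turns verbatim. This is a sketch of that idea, not the only reasonable design: `compact_history`, `keep_recent`, and the wording of the summary message are all assumptions, and the summarizer is passed in as a callable so the logic can be tested without an API call.

```python
from typing import Callable


def compact_history(
    messages: list[dict],
    summarize: Callable[[list[dict]], str],
    keep_recent: int = 6,
) -> list[dict]:
    """Replace all but the newest keep_recent non-system messages with a summary."""
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]
    if len(history) <= keep_recent:
        return list(messages)              # nothing old enough to compress

    split = len(history) - keep_recent     # index of the first message to keep
    old, recent = history[:split], history[split:]
    note = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summarize(old)}",
    }
    return system + [note] + recent


if __name__ == "__main__":
    chat = [{"role": "system", "content": "You are a helpful assistant."}]
    for i in range(4):
        chat.append({"role": "user", "content": f"question {i}"})
        chat.append({"role": "assistant", "content": f"answer {i}"})
    # A stub summarizer stands in for the API call.
    compacted = compact_history(chat, lambda old: f"{len(old)} earlier messages", keep_recent=2)
    for m in compacted:
        print(m["role"], "-", m["content"])
```

With the function above, the real summarizer would be passed as `lambda old: summarize_old_messages(client, old)`.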
Always set max_tokens in your API calls to prevent unexpectedly long responses. Without it, the model may generate tokens up to its output limit, resulting in higher costs and longer response times.
Checking Token Usage in Responses
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)
usage = response.usage
print(f"Prompt tokens: {usage.prompt_tokens}")
print(f"Completion tokens: {usage.completion_tokens}")
print(f"Total tokens: {usage.total_tokens}")
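# Estimate cost in USD, assuming gpt-4o rates of $2.50 (input) and $10.00 (output)
# per 1M tokens; check current pricing before relying on these numbers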
input_cost = usage.prompt_tokens * 2.50 / 1_000_000
output_cost = usage.completion_tokens * 10.00 / 1_000_000
print(f"Estimated cost: ${input_cost + output_cost:.6f}")
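Across a whole session it is usually more useful to accumulate usage than to price a single call. The sketch below keeps running totals; the class name `UsageTracker` is hypothetical, and the default prices are the same per-million-token figures used above, which should be treated as assumptions rather than current rates.

```python
from dataclasses import dataclass


@dataclass
class UsageTracker:
    """Running token totals and an estimated cost in USD."""

    input_price_per_million: float = 2.50    # assumed rate, as in the example above
    output_price_per_million: float = 10.00  # assumed rate, as in the example above
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def add(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Record the usage reported by one API response."""
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens

    @property
    def cost(self) -> float:
        """Estimated total cost so far."""
        return (
            self.prompt_tokens * self.input_price_per_million
            + self.completion_tokens * self.output_price_per_million
        ) / 1_000_000


if __name__ == "__main__":
    tracker = UsageTracker()
    tracker.add(prompt_tokens=1_200, completion_tokens=300)
    tracker.add(prompt_tokens=1_800, completion_tokens=450)
    print(f"Estimated session cost: ${tracker.cost:.6f}")
```

After each request, call `tracker.add(response.usage.prompt_tokens, response.usage.completion_tokens)` to keep the totals current.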