Description
TL;DR
If you write the descriptions of structured output parameters in a non-Latin script, for example Cyrillic (Russian, Kazakh, etc.), Chinese, or Japanese, a lot of tokens are wasted, because non-ASCII characters are escaped to ASCII `\uXXXX` sequences when the JSON schema is serialized to a string. In other words, if I write “Имя пользователя - Том” (in English, "Username - Tom") during normal response generation, it consumes 5 tokens; but if I create a structured output with the parameter "username" and the description “Имя пользователя” and write “Том” in the prompt, it consumes more than 79 tokens, because the structured-output JSON is serialized to a string without `ensure_ascii=False` before being sent to the OpenAI server. This is very easy to fix, and the fix would be very useful for languages that do not use the Latin alphabet.
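A minimal illustration of the escaping behavior in the standard library (nothing here is specific to the openai client):

```python
import json

s = "Имя пользователя"
print(json.dumps(s))                      # "\u0418\u043c\u044f ..." — escaped by default
print(json.dumps(s, ensure_ascii=False))  # "Имя пользователя" — kept as UTF-8
```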
Confirm this is an issue with the Python library and not an underlying OpenAI API
- This is an issue with the Python library
Describe the bug
Problem Description
When using Pydantic models with Cyrillic text (or other non-ASCII characters) in field descriptions for structured output via `client.beta.chat.completions.parse()`, the token count becomes significantly higher than expected due to Unicode escaping during JSON serialization.
Expected vs Actual Token Usage
- Expected: Cyrillic text is tokenized as the raw UTF-8 text
- Actual: Cyrillic characters are converted to Unicode escape sequences (`\u0418\u043c\u044f...`), dramatically increasing the token count
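The blow-up is easy to see directly with tiktoken; the exact counts depend on the tokenizer, so the comments below are indicative only:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
print(len(enc.encode("Имя")))                  # a few tokens for the raw UTF-8 text
print(len(enc.encode(r"\u0418\u043c\u044f")))  # several times more for the escaped form
```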
Root Cause Analysis
The issue appears to stem from Python's `json.dumps()` default behavior of `ensure_ascii=True`, which converts non-ASCII characters to Unicode escape sequences. This happens during HTTP request serialization, when the Pydantic schema is converted to JSON for the API request.
```python
import json

import tiktoken
from pydantic import BaseModel, Field

from openai import OpenAI

enc = tiktoken.encoding_for_model("gpt-4o")

user_prompt = "My name John"
messages = [{"role": "user", "content": user_prompt}]


class Schema(BaseModel):
    user_name: str = Field(description="Имя пользователя, если он его называл. Если нет, то оставь пустую строку")


schema = Schema.model_json_schema()

# Token count of the schema serialized without ASCII escaping
print("--- SCHEMA (ensure_ascii=False) ---")
str_schema = json.dumps(schema, ensure_ascii=False)
num_tokens = len(enc.encode(str_schema))
print(str_schema)
print(f"Num tokens: {num_tokens}")

# Token count of the schema serialized with the json.dumps() default
print("\n--- SCHEMA (ensure_ascii=True) ---")
str_schema = json.dumps(schema, ensure_ascii=True)
num_tokens = len(enc.encode(str_schema))
print(str_schema)
print(f"Num tokens: {num_tokens}")

print("\n--- USER PROMPT ---")
num_tokens = len(enc.encode(user_prompt))
print(user_prompt)
print(f"Num tokens: {num_tokens}")

print("\n--- MESSAGES ---")
str_messages = str(messages)
num_tokens = len(enc.encode(str_messages))
print(str_messages)
print(f"Num tokens: {num_tokens}")

# Compare against the prompt token count reported by the API
with OpenAI() as client:  # reads OPENAI_API_KEY from the environment
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=messages,
        response_format=Schema,
    )

print("\n--- Prompt tokens (from response) ---")
print(f"Num tokens: {response.usage.prompt_tokens}")
```
Result:
```
--- SCHEMA (ensure_ascii=False) ---
{"properties": {"user_name": {"description": "Имя пользователя, если он его называл. Если нет, то оставь пустую строку", "title": "User Name", "type": "string"}}, "required": ["user_name"], "title": "Schema", "type": "object"}
Num tokens: 65

--- SCHEMA (ensure_ascii=True) ---
{"properties": {"user_name": {"description": "\u0418\u043c\u044f \u043f\u043e\u043b\u044c\u0437\u043e\u0432\u0430\u0442\u0435\u043b\u044f, \u0435\u0441\u043b\u0438 \u043e\u043d \u0435\u0433\u043e \u043d\u0430\u0437\u044b\u0432\u0430\u043b. \u0415\u0441\u043b\u0438 \u043d\u0435\u0442, \u0442\u043e \u043e\u0441\u0442\u0430\u0432\u044c \u043f\u0443\u0441\u0442\u0443\u044e \u0441\u0442\u0440\u043e\u043a\u0443", "title": "User Name", "type": "string"}}, "required": ["user_name"], "title": "Schema", "type": "object"}
Num tokens: 233

--- USER PROMPT ---
My name John
Num tokens: 3

--- MESSAGES ---
[{'role': 'user', 'content': 'My name John'}]
Num tokens: 16

--- Prompt tokens (from response) ---
Num tokens: 240
```
Impact
- Schema with Cyrillic description: 233 tokens (with Unicode escapes)
- Same schema without escaping would be: 65 tokens
- 3.6x token overhead for non-ASCII text in schema descriptions
Technical Details
The serialization path appears to be:
- Pydantic schema → Python dict
- OpenAI client → HTTP request body (JSON serialization)
- `json.dumps()` (default `ensure_ascii=True`) → Unicode escape sequences
- API server receives the escaped version → token counting
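If the body is indeed produced by `json.dumps()` with default arguments, the change being requested is small. A hypothetical sketch, not the client's actual serialization code:

```python
import json

# payload stands in for the already-built request dict
payload = {"description": "Имя пользователя"}

# Current behavior: escaped by default, ~6 bytes of JSON per Cyrillic character
body = json.dumps(payload).encode("utf-8")

# Proposed behavior: keep UTF-8 intact
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
```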
Environment
- openai: 1.91.0
- pydantic: 2.9.2
- Python: 3.12.3
This issue affects any usage of structured output with non-ASCII characters in schema descriptions, particularly impacting users working with languages using Cyrillic, Arabic, Chinese, or other non-ASCII scripts.
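To confirm what is actually sent over the wire, the client can be given an `httpx.Client` with a request event hook that prints the serialized body (a debugging sketch; `log_request` is just an illustrative helper):

```python
import httpx
from openai import OpenAI

def log_request(request: httpx.Request) -> None:
    # Print the raw request body to check whether non-ASCII
    # characters in the schema were escaped.
    print(request.content.decode("utf-8"))

client = OpenAI(http_client=httpx.Client(event_hooks={"request": [log_request]}))
```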
To Reproduce
Run the code snippet from the description above.
Code snippets
See the code snippet in the description above.
OS
Ubuntu 24.04 LTS
Python version
3.12.3
Library version
1.91.0