Excessive token usage with Cyrillic text in Pydantic schema descriptions due to JSON Unicode escaping #2428

Open
@tg-bomze

Description

TL;DR

If you write the descriptions of structured-output parameters in a non-Latin script, for example Cyrillic (Russian, Kazakh, and others), Chinese, or Japanese, a lot of tokens are wasted, because non-ASCII characters are escaped when the JSON is serialized to a string. In other words, “Имя пользователя - Том” (in English, "Username - Tom") consumes 5 tokens during normal response generation, but if I create a structured output with a "username" parameter whose description is “Имя пользователя” and write “Том” in the prompt, it consumes more than 79 tokens, because the structured-output JSON is serialized to a string without ensure_ascii=False before being sent to the OpenAI server. This is very easy to fix, and it would be very useful for languages that do not use the Latin alphabet.
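
A minimal illustration of the escaping itself, using only the standard library (independent of the SDK):

import json

# Python's default JSON serialization escapes every non-ASCII character:
print(json.dumps("Имя"))                      # "\u0418\u043c\u044f"
print(json.dumps("Имя", ensure_ascii=False))  # "Имя"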

Confirm this is an issue with the Python library and not an underlying OpenAI API

  • This is an issue with the Python library

Describe the bug

Problem Description

When using Pydantic models with Cyrillic text (or other non-ASCII characters) in field descriptions for structured output with client.beta.chat.completions.parse(), the token count becomes significantly higher than expected due to Unicode escaping in JSON serialization.

Expected vs Actual Token Usage

  • Expected: Cyrillic text is sent as UTF-8 and tokenized as the original characters
  • Actual: Cyrillic characters are converted to Unicode escape sequences (\u0418\u043c\u044f...), dramatically increasing the token count
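
The inflation can be seen directly with tiktoken. A quick sketch (exact counts depend on the tokenizer; unicode_escape is used here only to imitate what json.dumps produces for this input):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
raw = "Имя пользователя"
escaped = raw.encode("unicode_escape").decode("ascii")  # "\u0418\u043c\u044f ..."
print(len(enc.encode(raw)))      # a handful of tokens
print(len(enc.encode(escaped)))  # several times as many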

Root Cause Analysis

The issue appears to stem from Python's json.dumps() default behavior of using ensure_ascii=True, which converts non-ASCII characters to Unicode escape sequences. This happens during HTTP request serialization when the Pydantic schema is converted to JSON format for the API request.

import json
from pydantic import BaseModel, Field
from openai import OpenAI

import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")

user_prompt = "My name John"
messages = [{"role": "user", "content": user_prompt}]

class Schema(BaseModel):
    # The description translates to: "The user's name, if they mentioned it. If not, leave an empty string."
    user_name: str = Field(description="Имя пользователя, если он его называл. Если нет, то оставь пустую строку")

schema = Schema.model_json_schema()

print("--- SCHEMA (ensure_ascii=False) ---")
str_schema = json.dumps(schema, ensure_ascii=False)
num_tokens = len(enc.encode(str_schema))
print(str_schema)
print(f"Num tokens: {num_tokens}")

print("\n--- SCHEMA (ensure_ascii=True) ---")
str_schema = json.dumps(schema, ensure_ascii=True)
num_tokens = len(enc.encode(str_schema))
print(str_schema)
print(f"Num tokens: {num_tokens}")

print("\n--- USER PROMPT ---")
num_tokens = len(enc.encode(user_prompt))
print(user_prompt)
print(f"Num tokens: {num_tokens}")

print("\n--- MESSAGES ---")
str_messages = str(messages)  # rough estimate via the Python repr, not exact API accounting
num_tokens = len(enc.encode(str_messages))
print(str_messages)
print(f"Num tokens: {num_tokens}")

with OpenAI() as client:  # expects OPENAI_API_KEY in the environment
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=messages,
        response_format=Schema,
    )

print("\n--- Prompt tokens (from response) ---")
print(f"Num tokens: {response.usage.prompt_tokens}")

Result:

--- SCHEMA (ensure_ascii=False) ---
{"properties": {"user_name": {"description": "Имя пользователя, если он его называл. Если нет, то оставь пустую строку", "title": "User Name", "type": "string"}}, "required": ["user_name"], "title": "Schema", "type": "object"}
Num tokens: 65

--- SCHEMA (ensure_ascii=True) ---
{"properties": {"user_name": {"description": "\u0418\u043c\u044f \u043f\u043e\u043b\u044c\u0437\u043e\u0432\u0430\u0442\u0435\u043b\u044f, \u0435\u0441\u043b\u0438 \u043e\u043d \u0435\u0433\u043e \u043d\u0430\u0437\u044b\u0432\u0430\u043b. \u0415\u0441\u043b\u0438 \u043d\u0435\u0442, \u0442\u043e \u043e\u0441\u0442\u0430\u0432\u044c \u043f\u0443\u0441\u0442\u0443\u044e \u0441\u0442\u0440\u043e\u043a\u0443", "title": "User Name", "type": "string"}}, "required": ["user_name"], "title": "Schema", "type": "object"}
Num tokens: 233

--- USER PROMPT ---
My name John
Num tokens: 3

--- MESSAGES ---
[{'role': 'user', 'content': 'My name John'}]
Num tokens: 16

--- Prompt tokens (from response) ---
Num tokens: 240

Impact

  • Schema with Cyrillic description: 233 tokens (with Unicode escapes)
  • Same schema without escaping would be: 65 tokens
  • 3.6x token overhead for non-ASCII text in schema descriptions

Technical Details

The serialization path appears to be:

  1. Pydantic schema → Python dict
  2. OpenAI client → HTTP request body (JSON serialization)
  3. json.dumps() (default ensure_ascii=True) → Unicode escapes
  4. API server receives escaped version → token counting
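
Step 3 can be confirmed from the outside by inspecting the bytes the SDK actually puts on the wire. The sketch below uses httpx event hooks and the client's http_client parameter (both real APIs); the hook body itself is illustrative:

import httpx
from openai import OpenAI

def log_request(request: httpx.Request) -> None:
    # For JSON requests the body is already fully serialized (and buffered) here.
    body = request.content.decode("utf-8")
    print("contains \\u escapes:", "\\u0418" in body)

client = OpenAI(http_client=httpx.Client(event_hooks={"request": [log_request]}))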

Environment

  • openai: 1.91.0
  • pydantic: 2.9.2
  • Python: 3.12.3

This issue affects any usage of structured output with non-ASCII characters in schema descriptions, particularly impacting users working with languages using Cyrillic, Arabic, Chinese, or other non-ASCII scripts.
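
The fix the TL;DR alludes to would amount to passing ensure_ascii=False wherever the request body is serialized. A sketch of the before/after, not the SDK's actual code path:

import json

payload = {"description": "Имя пользователя"}

# Current behavior: ensure_ascii defaults to True, so non-ASCII is escaped.
current = json.dumps(payload).encode("utf-8")

# Proposed behavior: keep the raw UTF-8 bytes.
proposed = json.dumps(payload, ensure_ascii=False).encode("utf-8")

print(len(current), len(proposed))  # the escaped body is roughly twice as large here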

To Reproduce

Run the code snippet above.

Code snippets

(Same snippet as under "Describe the bug" above.)

OS

Ubuntu 24.04 LTS

Python version

3.12.3

Library version

1.91.0
