Excessive token usage with Cyrillic text in Pydantic schema descriptions due to JSON Unicode escaping #2428

Open
@tg-bomze

Description

TL;DR

If you write the descriptions of structured-output parameters in a non-Latin script, for example Cyrillic (Russian, Kazakh, and others), Chinese, or Japanese, a lot of tokens are wasted, because non-ASCII characters are escaped when the JSON is serialized to a string. In other words, “Имя пользователя - Том” (in English, "Username - Tom") consumes 5 tokens during normal response generation, but if I create a structured output with a "username" parameter whose description is “Имя пользователя” and write “Том” in the prompt, it consumes more than 79 tokens, because the structured-output JSON is serialized to a string without ensure_ascii=False before being sent to the OpenAI server. This is very easy to fix, and it would be very useful for languages that do not use the Latin alphabet.
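
A minimal illustration of the escaping itself, using only the standard library (independent of the SDK):

import json

# Python's default JSON serialization escapes every non-ASCII character:
print(json.dumps("Имя"))                      # "\u0418\u043c\u044f"
print(json.dumps("Имя", ensure_ascii=False))  # "Имя"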

Confirm this is an issue with the Python library and not an underlying OpenAI API

  • This is an issue with the Python library

Describe the bug

Problem Description

When using Pydantic models with Cyrillic text (or other non-ASCII characters) in field descriptions for structured output with client.beta.chat.completions.parse(), the token count becomes significantly higher than expected due to Unicode escaping in JSON serialization.

Expected vs Actual Token Usage

  • Expected: Cyrillic text is sent as UTF-8 and tokenized as the original characters
  • Actual: Cyrillic characters are converted to Unicode escape sequences (\u0418\u043c\u044f...), dramatically increasing the token count
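
The inflation can be seen directly with tiktoken. A quick sketch (exact counts depend on the tokenizer; unicode_escape is used here only to imitate what json.dumps produces for this input):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
raw = "Имя пользователя"
escaped = raw.encode("unicode_escape").decode("ascii")  # "\u0418\u043c\u044f ..."
print(len(enc.encode(raw)))      # a handful of tokens
print(len(enc.encode(escaped)))  # several times as many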

Root Cause Analysis

The issue appears to stem from Python's json.dumps() default behavior of using ensure_ascii=True, which converts non-ASCII characters to Unicode escape sequences. This happens during HTTP request serialization when the Pydantic schema is converted to JSON format for the API request.

import json
from pydantic import BaseModel, Field
from openai import OpenAI

import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")

user_prompt = "My name John"
messages = [{"role": "user", "content": user_prompt}]

class Schema(BaseModel):
    # The description translates to: "The user's name, if they mentioned it. If not, leave an empty string."
    user_name: str = Field(description="Имя пользователя, если он его называл. Если нет, то оставь пустую строку")

schema = Schema.model_json_schema()

print("--- SCHEMA (ensure_ascii=False) ---")
str_schema = json.dumps(schema, ensure_ascii=False)
num_tokens = len(enc.encode(str_schema))
print(str_schema)
print(f"Num tokens: {num_tokens}")

print("\n--- SCHEMA (ensure_ascii=True) ---")
str_schema = json.dumps(schema, ensure_ascii=True)
num_tokens = len(enc.encode(str_schema))
print(str_schema)
print(f"Num tokens: {num_tokens}")

print("\n--- USER PROMPT ---")
num_tokens = len(enc.encode(user_prompt))
print(user_prompt)
print(f"Num tokens: {num_tokens}")

print("\n--- MESSAGES ---")
str_messages = str(messages)  # rough estimate via the Python repr, not exact API accounting
num_tokens = len(enc.encode(str_messages))
print(str_messages)
print(f"Num tokens: {num_tokens}")

with OpenAI() as client:  # expects OPENAI_API_KEY in the environment
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=messages,
        response_format=Schema,
    )

print("\n--- Prompt tokens (from response) ---")
print(f"Num tokens: {response.usage.prompt_tokens}")

Result:

--- SCHEMA (ensure_ascii=False) ---
{"properties": {"user_name": {"description": "Имя пользователя, если он его называл. Если нет, то оставь пустую строку", "title": "User Name", "type": "string"}}, "required": ["user_name"], "title": "Schema", "type": "object"}
Num tokens: 65

--- SCHEMA (ensure_ascii=True) ---
{"properties": {"user_name": {"description": "\u0418\u043c\u044f \u043f\u043e\u043b\u044c\u0437\u043e\u0432\u0430\u0442\u0435\u043b\u044f, \u0435\u0441\u043b\u0438 \u043e\u043d \u0435\u0433\u043e \u043d\u0430\u0437\u044b\u0432\u0430\u043b. \u0415\u0441\u043b\u0438 \u043d\u0435\u0442, \u0442\u043e \u043e\u0441\u0442\u0430\u0432\u044c \u043f\u0443\u0441\u0442\u0443\u044e \u0441\u0442\u0440\u043e\u043a\u0443", "title": "User Name", "type": "string"}}, "required": ["user_name"], "title": "Schema", "type": "object"}
Num tokens: 233

--- USER PROMPT ---
My name John
Num tokens: 3

--- MESSAGES ---
[{'role': 'user', 'content': 'My name John'}]
Num tokens: 16

--- Prompt tokens (from response) ---
Num tokens: 240

Impact

  • Schema with Cyrillic description: 233 tokens (with Unicode escapes)
  • Same schema without escaping would be: 65 tokens
  • 3.6x token overhead for non-ASCII text in schema descriptions

Technical Details

The serialization path appears to be:

  1. Pydantic schema → Python dict
  2. OpenAI client → HTTP request body (JSON serialization)
  3. json.dumps() (default ensure_ascii=True) → Unicode escapes
  4. API server receives escaped version → token counting
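
Step 3 can be confirmed from the outside by inspecting the bytes the SDK actually puts on the wire. The sketch below uses httpx event hooks and the client's http_client parameter (both real APIs); the hook body itself is illustrative:

import httpx
from openai import OpenAI

def log_request(request: httpx.Request) -> None:
    # For JSON requests the body is already fully serialized (and buffered) here.
    body = request.content.decode("utf-8")
    print("contains \\u escapes:", "\\u0418" in body)

client = OpenAI(http_client=httpx.Client(event_hooks={"request": [log_request]}))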

Environment

  • openai: 1.91.0
  • pydantic: 2.9.2
  • Python: 3.12.3

This issue affects any usage of structured output with non-ASCII characters in schema descriptions, particularly impacting users working with languages using Cyrillic, Arabic, Chinese, or other non-ASCII scripts.
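
The fix the TL;DR alludes to would amount to passing ensure_ascii=False wherever the request body is serialized. A sketch of the before/after, not the SDK's actual code path:

import json

payload = {"description": "Имя пользователя"}

# Current behavior: ensure_ascii defaults to True, so non-ASCII is escaped.
current = json.dumps(payload).encode("utf-8")

# Proposed behavior: keep the raw UTF-8 bytes.
proposed = json.dumps(payload, ensure_ascii=False).encode("utf-8")

print(len(current), len(proposed))  # the escaped body is roughly twice as large here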

To Reproduce

Run the code snippet above.

Code snippets

(Same snippet as under "Describe the bug" above.)

OS

Ubuntu 24.04 LTS

Python version

3.12.3

Library version

1.91.0
