From b9c76ab2854f409aba06dbba5a6a92067f55b7a4 Mon Sep 17 00:00:00 2001
From: Victor Stinner
Date: Thu, 26 Sep 2024 16:23:59 +0200
Subject: [PATCH 1/3] PEP 756: Give up on copying memory

---
 peps/pep-0756.rst | 92 +++++++++++++++++++++++------------------------
 1 file changed, 46 insertions(+), 46 deletions(-)

diff --git a/peps/pep-0756.rst b/peps/pep-0756.rst
index be617468230..940b9b2cb7d 100644
--- a/peps/pep-0756.rst
+++ b/peps/pep-0756.rst
@@ -21,9 +21,8 @@ Add functions to the limited C API version 3.14:
   view.
 * ``PyUnicode_Import()``: import a Python str object.
 
-By default, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
-is copied. See the :ref:`specification ` for cases
-when a copy is needed.
+On CPython, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
+is copied and no conversion is done.
 
 
 Rationale
@@ -67,9 +66,10 @@ possible to write code specialized for UCS formats.
 A C extension using the limited C API can only use less efficient code
 paths and string formats.
 
-For example, the MarkupSafe project has a C extension specialized for
-UCS formats for best performance, and so cannot use the limited C
-API.
+For example, the `MarkupSafe project
+`_ has a C extension
+specialized for UCS formats for best performance, and so cannot use the
+limited C API.
 
 
 Specification
@@ -95,8 +95,6 @@ Add the following API to the limited C API version 3.14::
     #define PyUnicode_FORMAT_UTF8 0x08 // char*
     #define PyUnicode_FORMAT_ASCII 0x10 // char* (ASCII string)
 
-    #define PyUnicode_EXPORT_ALLOW_COPY 0x10000
-
 The ``int32_t`` type is used instead of ``int`` to have a well defined
 type size and not depend on the platform or the compiler.
 See `Avoid C-specific Types
@@ -152,22 +150,8 @@ flags.
 
 Note that future versions of Python may introduce additional formats.
 
-By default, no memory is copied and no conversion is done.
-
-If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set in
-*requested_formats*, the function can copy memory to provide the
-requested format and convert from a format to another.
-
-The ``PyUnicode_EXPORT_ALLOW_COPY`` flag is needed to export to
-``PyUnicode_FORMAT_UTF8`` a string containing surrogate characters.
-
-Available flags:
+No memory is copied and no conversion is done.
 
-=============================== =========== ===================================
-Flag                            Value       Description
-=============================== =========== ===================================
-``PyUnicode_EXPORT_ALLOW_COPY`` ``0x10000`` Allow memory copies and conversions
-=============================== =========== ===================================
 
 .. _export-complexity:
 
@@ -175,18 +159,8 @@ Flag                            Value       Description
 Export complexity
 -----------------
 
-By default, an export has a complexity of *O*\ (1): no memory is copied
-and no conversion is done. There is an exception: if only UTF-8 is
-requested and the UTF-8 cache is not filled, the string is encoded to
-UTF-8 to fill the cache.
-
-If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set, there are cases when a
-copy is needed, *O*\ (*n*) complexity:
-
-* If only UCS-2 is requested and the native format is UCS-1.
-* If only UCS-4 is requested and the native format is UCS-1 or UCS-2.
-* If only UTF-8 is requested and the string contains surrogate
-  characters.
+On CPython, an export has a complexity of *O*\ (1): no memory is copied
+and no conversion is done.
 
 To get the best performance on CPython and PyPy, it's recommended to
 support these 4 formats::
@@ -241,31 +215,29 @@ See ``PyUnicode_Export()`` for the available formats.
 UTF-8 format
 ------------
 
-CPython 3.14 doesn't use the UTF-8 format internally. The format is
-provided for compatibility with PyPy which uses UTF-8 natively for
-strings. However, in CPython, the encoded UTF-8 string is cached which
-makes it convenient to be exported.
+CPython 3.14 doesn't use the UTF-8 format internally and doesn't support
+exporting a string as UTF-8. The ``PyUnicode_AsUTF8AndSize()`` function
+can be used instead.
+
+The ``PyUnicode_FORMAT_UTF8`` format is provided for compatibility with
+PyPy which uses UTF-8 natively for strings.
 
-On CPython, the UTF-8 format has the lowest priority: ASCII and UCS
-formats are preferred.
 
 ASCII format
 ------------
 
 When the ``PyUnicode_FORMAT_ASCII`` format is request for export, the
-``PyUnicode_FORMAT_UCS1`` export format is used for ASCII and Latin-1
-strings.
+``PyUnicode_FORMAT_UCS1`` export format is used for ASCII strings.
 
 The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
-``PyUnicode_Import()`` to validate that the string only contains ASCII
+``PyUnicode_Import()`` to validate that a string only contains ASCII
 characters.
 
 
 Surrogate characters and embedded NUL characters
 ------------------------------------------------
 
-Surrogate characters are allowed: they can be imported and exported. For
-example, the UTF-8 format uses the ``surrogatepass`` error handler.
+Surrogate characters are allowed: they can be imported and exported.
 
 Embedded NUL characters are allowed: they can be imported and exported.
 
@@ -391,6 +363,34 @@ this issue. For example, the UTF-8 codec can be used with the
 characters.
 
 
+Conversions on demand
+---------------------
+
+It would be convenient to convert formats on demand. For example,
+convert UCS-1 and UCS-2 to UCS-4 if an export to only UCS-4 is
+requested.
+
+The problem is that most users expect an export to require no memory
+copy and no conversion: an *O*\ (1) complexity. It is better to have an
+API where all operations have an *O*\ (1) complexity.
+
+Export to UTF-8
+---------------
+
+CPython 3.14 has a cache to encode a string to UTF-8. It is tempting to
+allow exporting to UTF-8.
+
+The problem is that the UTF-8 cache doesn't support surrogate
+characters. An export is expected to provide the whole string content,
+including embedded NUL characters and surrogate characters. To export
+surrogate characters, a different code path using the ``surrogatepass``
+error handler is needed and each export operation has to allocate a
+temporary buffer: *O*\ (n) complexity.
+
+An export is expected to have an *O*\ (1) complexity, so the idea to
+export UTF-8 in CPython was abandoned.
+
+
 Discussions
 ===========
 
From 8909f03ccc562f6adcdcc122af5bc6c6e86ac41c Mon Sep 17 00:00:00 2001
From: Victor Stinner
Date: Thu, 26 Sep 2024 16:32:47 +0200
Subject: [PATCH 2/3] flags => formats

---
 peps/pep-0756.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/peps/pep-0756.rst b/peps/pep-0756.rst
index 940b9b2cb7d..6025cb7ee12 100644
--- a/peps/pep-0756.rst
+++ b/peps/pep-0756.rst
@@ -146,7 +146,7 @@ UCS-2 and UCS-4 use the native byte order.
 *requested_formats* can be a single format or a bitwise combination of the
 formats in the table above.
 On success, the returned format will be set to a single one of the requested
-flags.
+formats.
 
 Note that future versions of Python may introduce additional formats.
 
From f3cd9be32eacafe1cf9e70c3e1de2db7f86efd5d Mon Sep 17 00:00:00 2001
From: Victor Stinner
Date: Thu, 26 Sep 2024 16:53:06 +0200
Subject: [PATCH 3/3] Update peps/pep-0756.rst

Co-authored-by: Steve Dower
---
 peps/pep-0756.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/peps/pep-0756.rst b/peps/pep-0756.rst
index 6025cb7ee12..fd5447e860e 100644
--- a/peps/pep-0756.rst
+++ b/peps/pep-0756.rst
@@ -220,7 +220,7 @@ exporting a string as UTF-8. The ``PyUnicode_AsUTF8AndSize()`` function
 can be used instead.
 
 The ``PyUnicode_FORMAT_UTF8`` format is provided for compatibility with
-PyPy which uses UTF-8 natively for strings.
+alternate implementations which may use UTF-8 natively for strings.
 
 
 ASCII format