PEP 756: Give up on copying memory #3999

Merged: 3 commits, Sep 26, 2024
94 changes: 47 additions & 47 deletions peps/pep-0756.rst
@@ -21,9 +21,8 @@ Add functions to the limited C API version 3.14:
   view.
 * ``PyUnicode_Import()``: import a Python str object.

-By default, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
-is copied. See the :ref:`specification <export-complexity>` for cases
-when a copy is needed.
+On CPython, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
+is copied and no conversion is done.


 Rationale
@@ -67,9 +66,10 @@ possible to write code specialized for UCS formats. A C extension using
 the limited C API can only use less efficient code paths and string
 formats.

-For example, the MarkupSafe project has a C extension specialized for
-UCS formats for best performance, and so cannot use the limited C
-API.
+For example, the `MarkupSafe project
+<https://markupsafe.palletsprojects.com/>`_ has a C extension
+specialized for UCS formats for best performance, and so cannot use the
+limited C API.


 Specification
@@ -95,8 +95,6 @@ Add the following API to the limited C API version 3.14::
     #define PyUnicode_FORMAT_UTF8 0x08 // char*
     #define PyUnicode_FORMAT_ASCII 0x10 // char* (ASCII string)

-    #define PyUnicode_EXPORT_ALLOW_COPY 0x10000
-
 The ``int32_t`` type is used instead of ``int`` to have a well defined
 type size and not depend on the platform or the compiler.
 See `Avoid C-specific Types
@@ -148,45 +146,21 @@ UCS-2 and UCS-4 use the native byte order.
 *requested_formats* can be a single format or a bitwise combination of the
 formats in the table above.
 On success, the returned format will be set to a single one of the requested
-flags.
+formats.

 Note that future versions of Python may introduce additional formats.
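
Because later Python versions may add formats, a consumer should treat an unrecognized format value as unsupported rather than guess its layout. The following is a minimal sketch, not code from the PEP: the ``EX_FORMAT_*`` macros are local stand-ins (the UTF-8 and ASCII values mirror the ones shown in the diff above; the UCS values are assumed), and ``code_unit_size()`` is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the format constants (illustrative only).
   UTF8 and ASCII mirror the values quoted in the diff; the three
   UCS values are assumptions made for this sketch. */
#define EX_FORMAT_UCS1  0x01
#define EX_FORMAT_UCS2  0x02
#define EX_FORMAT_UCS4  0x04
#define EX_FORMAT_UTF8  0x08
#define EX_FORMAT_ASCII 0x10

/* Hypothetical helper: size in bytes of one code unit for a single
   returned format. Returns 0 for anything it does not recognize,
   including a combination of several format bits. */
static size_t
code_unit_size(int32_t format)
{
    switch (format) {
    case EX_FORMAT_ASCII:  /* char* */
    case EX_FORMAT_UTF8:   /* char* */
    case EX_FORMAT_UCS1:
        return 1;
    case EX_FORMAT_UCS2:
        return 2;
    case EX_FORMAT_UCS4:
        return 4;
    default:
        return 0;          /* unknown or future format */
    }
}
```

A caller would fall back to a slower, format-agnostic code path whenever the helper returns 0.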

-By default, no memory is copied and no conversion is done.
-
-If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set in
-*requested_formats*, the function can copy memory to provide the
-requested format and convert from a format to another.
-
-The ``PyUnicode_EXPORT_ALLOW_COPY`` flag is needed to export to
-``PyUnicode_FORMAT_UTF8`` a string containing surrogate characters.
+No memory is copied and no conversion is done.

-Available flags:
-
-===============================  ===========  ===================================
-Flag                             Value        Description
-===============================  ===========  ===================================
-``PyUnicode_EXPORT_ALLOW_COPY``  ``0x10000``  Allow memory copies and conversions
-===============================  ===========  ===================================


 .. _export-complexity:

 Export complexity
 -----------------

-By default, an export has a complexity of *O*\ (1): no memory is copied
-and no conversion is done. There is an exception: if only UTF-8 is
-requested and the UTF-8 cache is not filled, the string is encoded to
-UTF-8 to fill the cache.
-
-If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set, there are cases when a
-copy is needed, *O*\ (*n*) complexity:
-
-* If only UCS-2 is requested and the native format is UCS-1.
-* If only UCS-4 is requested and the native format is UCS-1 or UCS-2.
-* If only UTF-8 is requested and the string contains surrogate
-  characters.
+On CPython, an export has a complexity of *O*\ (1): no memory is copied
+and no conversion is done.

 To get the best performance on CPython and PyPy, it's recommended to
 support these 4 formats::
@@ -241,31 +215,29 @@ See ``PyUnicode_Export()`` for the available formats.
 UTF-8 format
 ------------

-CPython 3.14 doesn't use the UTF-8 format internally. The format is
-provided for compatibility with PyPy which uses UTF-8 natively for
-strings. However, in CPython, the encoded UTF-8 string is cached which
-makes it convenient to be exported.
+CPython 3.14 doesn't use the UTF-8 format internally and doesn't support
+exporting a string as UTF-8. The ``PyUnicode_AsUTF8AndSize()`` function
+can be used instead.
+
+The ``PyUnicode_FORMAT_UTF8`` format is provided for compatibility with
+alternate implementations which may use UTF-8 natively for strings.

-On CPython, the UTF-8 format has the lowest priority: ASCII and UCS
-formats are preferred.

 ASCII format
 ------------

 When the ``PyUnicode_FORMAT_ASCII`` format is requested for export, the
-``PyUnicode_FORMAT_UCS1`` export format is used for ASCII and Latin-1
-strings.
+``PyUnicode_FORMAT_UCS1`` export format is used for ASCII strings.

 The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
-``PyUnicode_Import()`` to validate that the string only contains ASCII
+``PyUnicode_Import()`` to validate that a string only contains ASCII
 characters.
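
The kind of validation an import in the ASCII format implies can be sketched as a scan for bytes above ``0x7F``. This is an illustration only: ``is_ascii_buffer()`` is a hypothetical helper, not a CPython function, and it makes no claim about how ``PyUnicode_Import()`` is implemented.

```c
#include <stddef.h>

/* Hypothetical helper: return 1 if every byte in buf[0..len) is in the
   ASCII range (0x00-0x7F), 0 otherwise. Embedded NUL bytes are accepted,
   which is why the length is passed explicitly instead of using strlen(). */
static int
is_ascii_buffer(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] > 0x7F) {
            return 0;
        }
    }
    return 1;
}
```

The scan is linear in the buffer length, which is acceptable on import because the data has to be read anyway to build the string.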


 Surrogate characters and embedded NUL characters
 ------------------------------------------------

-Surrogate characters are allowed: they can be imported and exported. For
-example, the UTF-8 format uses the ``surrogatepass`` error handler.
+Surrogate characters are allowed: they can be imported and exported.

 Embedded NUL characters are allowed: they can be imported and exported.

@@ -391,6 +363,34 @@ this issue. For example, the UTF-8 codec can be used with the
 characters.


+Conversions on demand
+---------------------
+
+It would be convenient to convert formats on demand. For example,
+convert UCS-1 and UCS-2 to UCS-4 if an export to only UCS-4 is
+requested.
+
+The problem is that most users expect an export to require no memory
+copy and no conversion: an *O*\ (1) complexity. It is better to have an
+API where all operations have an *O*\ (1) complexity.
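
With no conversion inside the export, a caller that needs UCS-4 has to widen narrower data itself, and the linear cost becomes visible in the caller's code. A minimal sketch, assuming the source buffer holds UCS-1 code points; ``widen_ucs1_to_ucs4()`` is a hypothetical helper, not part of the proposed API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy len UCS-1 code points from src into the
   caller-provided UCS-4 buffer dst (which must hold at least len items).
   Each element is visited exactly once, so the cost is O(n). */
static void
widen_ucs1_to_ucs4(const uint8_t *src, size_t len, uint32_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        /* UCS-1 (Latin-1) code points keep their value in UCS-4. */
        dst[i] = src[i];
    }
}
```

Keeping this loop outside the export call is the design choice described above: the export itself stays constant-time, and the caller decides whether the copy is worth paying for.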

+Export to UTF-8
+---------------
+
+CPython 3.14 has a cache to encode a string to UTF-8. It is tempting to
+allow exporting to UTF-8.
+
+The problem is that the UTF-8 cache doesn't support surrogate
+characters. An export is expected to provide the whole string content,
+including embedded NUL characters and surrogate characters. To export
+surrogate characters, a different code path using the ``surrogatepass``
+error handler is needed and each export operation has to allocate a
+temporary buffer: *O*\ (*n*) complexity.
+
+An export is expected to have an *O*\ (1) complexity, so the idea to
+export UTF-8 in CPython was abandoned.
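
What the ``surrogatepass`` handler produces can be illustrated with a generalized UTF-8 encoder that does not reject code points in the surrogate range. This is a self-contained sketch, not CPython's codec; ``encode_utf8_surrogatepass()`` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point into out[] using the UTF-8
   bit layout, without rejecting surrogates (U+D800-U+DFFF).
   Returns the number of bytes written, or 0 if cp is above U+10FFFF. */
static size_t
encode_utf8_surrogatepass(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        /* A strict encoder would refuse 0xD800-0xDFFF here. */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

A lone surrogate such as U+D800 comes out as the three bytes ``ED A0 80``, which strict UTF-8 decoders reject. Encoding a whole string this way means visiting every code point and writing into a fresh buffer, which is the linear cost the text refers to.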


Discussions
===========
