PEP 756: Add PyUnicode_EXPORT_ALLOW_COPY flag #3988

Merged
merged 5 commits into from
Sep 24, 2024
44 changes: 35 additions & 9 deletions peps/pep-0756.rst
@@ -21,9 +21,9 @@ Add functions to the limited C API version 3.14:
view.
* ``PyUnicode_Import()``: import a Python str object.

In general, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
copy is needed. See the :ref:`specification <export-complexity>` for
cases when a copy is needed.
By default, ``PyUnicode_Export()`` has *O*\ (1) complexity: no memory
is copied. See the :ref:`specification <export-complexity>` for cases
when a copy is needed.


Rationale
@@ -95,6 +95,8 @@ Add the following API to the limited C API version 3.14::
#define PyUnicode_FORMAT_UTF8 0x08 // char*
#define PyUnicode_FORMAT_ASCII 0x10 // char* (ASCII string)

#define PyUnicode_EXPORT_ALLOW_COPY 0x10000

The ``int32_t`` type is used instead of ``int`` to have a well defined
type size and not depend on the platform or the compiler.
See `Avoid C-specific Types
@@ -150,18 +152,41 @@ flags.

Note that future versions of Python may introduce additional formats.

By default, no memory is copied and no conversion is done.

If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set in
*requested_formats*, the function may copy memory to provide the
requested format and convert from one format to another.

The ``PyUnicode_EXPORT_ALLOW_COPY`` flag is required to export a string
containing surrogate characters to ``PyUnicode_FORMAT_UTF8``.

Available flags:

=============================== =========== ===================================
Flag                            Value       Description
=============================== =========== ===================================
``PyUnicode_EXPORT_ALLOW_COPY`` ``0x10000`` Allow memory copies and conversions
=============================== =========== ===================================
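
As an illustration of how the flag shares *requested_formats* with the
format bits, here is a minimal, self-contained sketch. The macro values
are the ones quoted in this change; the helper names ``make_request``,
``allows_copy`` and ``format_bits`` are hypothetical and are not part of
the proposed API.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted from the API listing in this change. */
#define PyUnicode_FORMAT_UTF8        0x08
#define PyUnicode_FORMAT_ASCII       0x10
#define PyUnicode_EXPORT_ALLOW_COPY  0x10000

/* Hypothetical helper: build a requested_formats value from format bits
 * and an "allow copy" choice. */
static int32_t
make_request(int32_t formats, int allow_copy)
{
    /* The format bits must not already contain the flag bit. */
    assert((formats & PyUnicode_EXPORT_ALLOW_COPY) == 0);
    return allow_copy ? (formats | PyUnicode_EXPORT_ALLOW_COPY) : formats;
}

/* Hypothetical helper: 1 if the caller opted in to copies/conversions. */
static int
allows_copy(int32_t requested_formats)
{
    return (requested_formats & PyUnicode_EXPORT_ALLOW_COPY) != 0;
}

/* Hypothetical helper: keep only the format bits, dropping the flag. */
static int32_t
format_bits(int32_t requested_formats)
{
    return requested_formats & ~(int32_t)PyUnicode_EXPORT_ALLOW_COPY;
}
```

Because the flag value (``0x10000``) sits above the format values shown
here, it can be OR-ed with any combination of formats without ambiguity.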


.. _export-complexity:

Export complexity
-----------------

In general, an export has a complexity of *O*\ (1): no memory copy is
needed. There are cases when a copy is needed, *O*\ (*n*) complexity:
By default, an export has a complexity of *O*\ (1): no memory is copied
and no conversion is done. There is an exception: if only UTF-8 is
requested and the UTF-8 cache is not filled, the string is encoded to
UTF-8 to fill the cache.

If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set, a copy is needed in
the following cases, with *O*\ (*n*) complexity:

* If only UCS-2 is requested and the native format is UCS-1.
* If only UCS-4 is requested and the native format is UCS-1 or UCS-2.
* If only UTF-8 is requested: the string is encoded to UTF-8 at the
first call, and then the encoded UTF-8 string is cached.
* If only UTF-8 is requested and the string contains surrogate
characters.
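
The UCS widening cases in the list above can be modelled with a short
sketch. This is not CPython code: ``widening_copy_needed`` and
``widen_ucs1_to_ucs2`` are hypothetical helpers, and the UCS format
values are assumed to match the full API listing of the PEP (they are
not shown in this diff). The UTF-8 cases are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values; the UCS formats are not shown in this diff. */
#define PyUnicode_FORMAT_UCS1 0x01
#define PyUnicode_FORMAT_UCS2 0x02
#define PyUnicode_FORMAT_UCS4 0x04

/* Hypothetical model: given the native format of a string and the
 * requested format bits (flag already stripped), return 1 if a widening
 * copy is needed, 0 otherwise. Only the UCS cases are covered. */
static int
widening_copy_needed(int32_t native, int32_t requested)
{
    if (requested == PyUnicode_FORMAT_UCS2) {
        return native == PyUnicode_FORMAT_UCS1;
    }
    if (requested == PyUnicode_FORMAT_UCS4) {
        return native == PyUnicode_FORMAT_UCS1
            || native == PyUnicode_FORMAT_UCS2;
    }
    return 0;
}

/* Why the copy is O(n): every code unit is read and written once. */
static void
widen_ucs1_to_ucs2(const uint8_t *src, uint16_t *dst, size_t n)
{
    assert(n == 0 || (src != NULL && dst != NULL));
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}
```

When several formats are requested at once, the model returns 0: the
cases listed above only apply when a single format is requested.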

To get the best performance on CPython and PyPy, it's recommended to
support these 4 formats::
@@ -236,8 +261,8 @@ The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
characters.


Surrogate characters and NUL characters
---------------------------------------
Surrogate characters and embedded NUL characters
------------------------------------------------

Surrogate characters are allowed: they can be imported and exported. For
example, the UTF-8 format uses the ``surrogatepass`` error handler.
@@ -347,6 +372,7 @@ to return NULL on embedded null characters
Rejecting embedded NUL characters requires scanning the string, which
has *O*\ (*n*) complexity.
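
To make the cost concrete, here is a minimal sketch of such a scan over
a byte buffer (for example, an exported UTF-8 view). The helper name
``has_embedded_nul`` is hypothetical; wider formats such as UCS-2 or
UCS-4 would need a loop over code units instead of ``memchr()``.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return 1 if any of the first nbytes bytes of
 * data is a NUL byte, 0 otherwise. memchr() may have to inspect every
 * byte, hence the O(n) cost. */
static int
has_embedded_nul(const char *data, size_t nbytes)
{
    if (nbytes == 0) {
        return 0;
    }
    assert(data != NULL);
    return memchr(data, '\0', nbytes) != NULL;
}
```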


Reject surrogate characters
---------------------------
