PEP 756: Add PyUnicode_EXPORT_ALLOW_COPY flag (#3988)
vstinner authored Sep 24, 2024
1 parent 680c8b1 commit f085d19
Showing 1 changed file with 35 additions and 9 deletions.
44 changes: 35 additions & 9 deletions peps/pep-0756.rst
@@ -21,9 +21,9 @@ Add functions to the limited C API version 3.14:
view.
* ``PyUnicode_Import()``: import a Python str object.

-In general, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
-copy is needed. See the :ref:`specification <export-complexity>` for
-cases when a copy is needed.
+By default, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
+is copied. See the :ref:`specification <export-complexity>` for cases
+when a copy is needed.


Rationale
@@ -95,6 +95,8 @@ Add the following API to the limited C API version 3.14::
    #define PyUnicode_FORMAT_UTF8   0x08  // char*
    #define PyUnicode_FORMAT_ASCII  0x10  // char* (ASCII string)

    #define PyUnicode_EXPORT_ALLOW_COPY  0x10000

The ``int32_t`` type is used instead of ``int`` to have a well defined
type size and not depend on the platform or the compiler.
See `Avoid C-specific Types
@@ -150,18 +152,41 @@ flags.

Note that future versions of Python may introduce additional formats.

By default, no memory is copied and no conversion is done.

If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set in
*requested_formats*, the function can copy memory to provide the
requested format and convert from one format to another.

The ``PyUnicode_EXPORT_ALLOW_COPY`` flag is needed to export a string
containing surrogate characters to ``PyUnicode_FORMAT_UTF8``.

Available flags:

=============================== =========== ===================================
Flag                            Value       Description
=============================== =========== ===================================
``PyUnicode_EXPORT_ALLOW_COPY`` ``0x10000`` Allow memory copies and conversions
=============================== =========== ===================================
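
As a rough illustration of how the flag combines with the format bits,
the following self-contained C sketch redefines the three constants
quoted in this diff (it deliberately does not include ``Python.h``) and
splits a *requested_formats* value into its two parts. The helper names
``sketch_copy_allowed()``, ``sketch_format_bits()`` and
``sketch_demo()`` are hypothetical and are not part of the proposed
API; the sketch only assumes that the flag occupies a bit distinct from
the format bits, as the values above suggest.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the PEP text in this diff; redefined locally so
   the sketch compiles without Python.h. */
#define PyUnicode_FORMAT_UTF8        0x08
#define PyUnicode_FORMAT_ASCII       0x10
#define PyUnicode_EXPORT_ALLOW_COPY  0x10000

/* Hypothetical helper: 1 if the caller allows copies and conversions. */
int
sketch_copy_allowed(int32_t requested_formats)
{
    return (requested_formats & PyUnicode_EXPORT_ALLOW_COPY) != 0;
}

/* Hypothetical helper: the format bits with the flag bit cleared. */
int32_t
sketch_format_bits(int32_t requested_formats)
{
    return requested_formats & ~(int32_t)PyUnicode_EXPORT_ALLOW_COPY;
}

/* Hypothetical usage: request UTF-8 and opt in to copies. */
void
sketch_demo(void)
{
    int32_t requested = PyUnicode_FORMAT_UTF8 | PyUnicode_EXPORT_ALLOW_COPY;

    assert(sketch_copy_allowed(requested));
    assert(sketch_format_bits(requested) == PyUnicode_FORMAT_UTF8);
    /* Without the flag, only zero-copy exports are requested. */
    assert(!sketch_copy_allowed(PyUnicode_FORMAT_UTF8));
}
```

A caller that can tolerate an *O*\ (*n*) export would OR the flag into
the formats it accepts; a caller that needs a zero-copy view would leave
it out.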


.. _export-complexity:

Export complexity
-----------------

-In general, an export has a complexity of *O*\ (1): no memory copy is
-needed. There are cases when a copy is needed, *O*\ (*n*) complexity:
+By default, an export has a complexity of *O*\ (1): no memory is copied
+and no conversion is done. There is an exception: if only UTF-8 is
+requested and the UTF-8 cache is not filled, the string is encoded to
+UTF-8 to fill the cache.
+
+If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set, there are cases when a
+copy is needed, *O*\ (*n*) complexity:

* If only UCS-2 is requested and the native format is UCS-1.
* If only UCS-4 is requested and the native format is UCS-1 or UCS-2.
-* If only UTF-8 is requested: the string is encoded to UTF-8 at the
-  first call, and then the encoded UTF-8 string is cached.
+* If only UTF-8 is requested and the string contains surrogate
+  characters.
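
The rules listed above can be modelled as a small decision function.
This is only a sketch of the text in this section, not the reference
implementation: every ``sketch_`` and ``SKETCH_`` name is hypothetical,
the UCS format values (``0x01``, ``0x02``, ``0x04``) are assumed since
they are not shown in this diff, and treating a needed conversion
without the flag as a failure is an inference from "no memory is copied
and no conversion is done". Combinations that the text does not
describe are reported as unspecified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for the native formats (not shown in this diff). */
#define SKETCH_UCS1  0x01
#define SKETCH_UCS2  0x02
#define SKETCH_UCS4  0x04
/* Values quoted in this diff. */
#define SKETCH_UTF8        0x08
#define SKETCH_ALLOW_COPY  0x10000

typedef enum {
    SKETCH_O1,           /* no copy, no conversion */
    SKETCH_ON,           /* copy or conversion, O(n) */
    SKETCH_FAIL,         /* would need a copy, but the flag is not set */
    SKETCH_UNSPECIFIED   /* not covered by the rules in this section */
} sketch_cost;

/* native: one of SKETCH_UCS1, SKETCH_UCS2, SKETCH_UCS4. */
sketch_cost
sketch_export_cost(int32_t native, int32_t requested,
                   bool utf8_cached, bool has_surrogates)
{
    bool allow_copy = (requested & SKETCH_ALLOW_COPY) != 0;
    int32_t formats = requested & ~(int32_t)SKETCH_ALLOW_COPY;

    /* The native format is accepted: export the existing buffer. */
    if (formats & native) {
        return SKETCH_O1;
    }
    /* Only UTF-8 is requested. */
    if (formats == SKETCH_UTF8) {
        if (has_surrogates) {
            return allow_copy ? SKETCH_ON : SKETCH_FAIL;
        }
        /* Filling the UTF-8 cache is the documented default exception. */
        return utf8_cached ? SKETCH_O1 : SKETCH_ON;
    }
    /* Only a wider UCS format is requested. */
    bool widening =
        (formats == SKETCH_UCS2 && native == SKETCH_UCS1) ||
        (formats == SKETCH_UCS4 &&
         (native == SKETCH_UCS1 || native == SKETCH_UCS2));
    if (widening) {
        return allow_copy ? SKETCH_ON : SKETCH_FAIL;
    }
    return SKETCH_UNSPECIFIED;
}

/* Hypothetical usage: a UCS-1 string exported as UCS-2. */
void
sketch_cost_demo(void)
{
    assert(sketch_export_cost(SKETCH_UCS1, SKETCH_UCS2,
                              false, false) == SKETCH_FAIL);
    assert(sketch_export_cost(SKETCH_UCS1, SKETCH_UCS2 | SKETCH_ALLOW_COPY,
                              false, false) == SKETCH_ON);
}
```

The model makes the trade-off visible: a consumer that accepts the
native format never pays for a copy, whether or not the flag is set.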

To get the best performance on CPython and PyPy, it's recommended to
support these 4 formats::
@@ -236,8 +261,8 @@ The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
characters.


-Surrogate characters and NUL characters
----------------------------------------
+Surrogate characters and embedded NUL characters
+------------------------------------------------

Surrogate characters are allowed: they can be imported and exported. For
example, the UTF-8 format uses the ``surrogatepass`` error handler.
@@ -347,6 +372,7 @@ to return NULL on embedded null characters
Rejecting embedded NUL characters requires scanning the string, which
has an *O*\ (*n*) complexity.


Reject surrogate characters
---------------------------
