Smart string usage

Abel Cheung edited this page Mar 26, 2023 · 4 revisions

Smart string intro

A smart string is a private str subclass that lxml documents among the return types of XPath evaluation. Quoting the lxml documentation directly:

XPath string results are 'smart' in that they provide a getparent() method that knows their origin:

  • for attribute values, result.getparent() returns the Element that carries them. An example is //foo/@attribute, where the parent would be a foo Element.
  • for the text() function (as in //text()), it returns the Element that contains the text or tail that was returned.

The actual class is named _ElementUnicodeResult in the lxml source code. On Python 2.x and PyPy the smart string is implemented by other concrete classes, but these can be ignored as far as type checking is concerned.
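As a minimal sketch of the behaviour quoted above, assuming lxml is installed: the sample XML below is made up for illustration, and the is_attribute, is_text and is_tail flags are the ones lxml attaches to smart strings.

```python
from lxml import etree

# Made-up sample document: one attribute, one text node and one tail
root = etree.fromstring('<root><foo attribute="value">hello</foo> world</root>')

attr = root.xpath('//foo/@attribute')[0]  # attribute value
text = root.xpath('//foo/text()')[0]      # text inside <foo>
tail = root.xpath('/root/text()')[0]      # text following </foo>, stored as foo.tail

for s in (attr, text, tail):
    # Each result is still a str, but also remembers where it came from
    parent = s.getparent()
    print(repr(str(s)), parent.tag, s.is_attribute, s.is_text, s.is_tail)
```

In all three cases getparent() should return the foo element, since the tail text is stored on foo rather than on root.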

Important notice

The following breaking changes apply to releases after 2023.02.11.

Class rename

Historically, the class was named SmartStr in the annotation package. That name is more user friendly, but it had to be imported manually for typing. Because it was underused, compatibility was broken and the name reverted to that of the concrete class, _ElementUnicodeResult.

Class specialization

Because the getparent() method needs to know the type of the originating element, the smart string is now a generic class that takes the element type as its subscript, as in _ElementUnicodeResult[_Element].

Version                 Usage
2023.02.11 or earlier   SmartStr
Afterwards              _ElementUnicodeResult[_Element]
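A short sketch of how the subscripted form might appear in a signature, assuming the annotation package is installed for type checking. The helper name parent_tag is hypothetical. The annotation is quoted because the runtime lxml class is assumed not to support subscripting, so only the type checker ever evaluates it.

```python
from typing import Optional

from lxml import etree


def parent_tag(s: "etree._ElementUnicodeResult[etree._Element]") -> Optional[str]:
    """Hypothetical helper: tag name of the element a smart string came from."""
    parent = s.getparent()  # a type checker sees Optional[_Element] here
    return None if parent is None else str(parent.tag)


# Made-up sample document for illustration
root = etree.fromstring('<foo attribute="value"/>')
tag = parent_tag(root.xpath('/foo/@attribute')[0])
print(tag)
```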

How to use

This class is primarily useful on two occasions. Examples of both follow further down.

  1. XPath selection result
  2. HtmlElement.text_content() result (which uses XPath internally)

However, this class is almost never used directly in type annotations, because an XPath result can take too many forms to annotate precisely: str, float, bool, lists of them, as well as lists of _Element and namespace tuples.
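To illustrate how varied the results can be, here is a rough sketch that evaluates a few XPath expressions against a made-up document and classifies each result at runtime. The describe helper is hypothetical and covers only the common result shapes, not namespace tuples.

```python
from lxml import etree

# Made-up sample document for illustration
root = etree.fromstring("<div><span>a</span><span>b</span></div>")


def describe(result: object) -> str:
    """Hypothetical helper: name the broad category of an XPath result."""
    if isinstance(result, bool):
        return "bool"
    if isinstance(result, float):
        return "float"
    if isinstance(result, str):  # also matches smart strings
        return "str"
    if isinstance(result, etree._Element):
        return "element"
    if isinstance(result, list):
        inner = sorted({describe(item) for item in result})
        return "list of " + "/".join(inner) if inner else "empty list"
    return type(result).__name__


expressions = (
    "boolean(//span)",
    "count(//span)",
    "string(//span)",
    "//span",
    "//span/text()",
)
kinds = {expr: describe(root.xpath(expr)) for expr in expressions}
for expr, kind in kinds.items():
    print(f"{expr:18} {kind}")
```

The same xpath() call yields a different Python type for each expression, which is why narrowing is left to the caller.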

Users are therefore expected to narrow the XPath selection result themselves. The first example below shows how to handle smart strings in a selection result.


XPath selection result

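# Defer annotation evaluation: the runtime lxml class is not
# subscriptable, so the generic form is for the type checker only
from __future__ import annotations

from typing import TypeGuard  # (or from typing_extensions)

from lxml import etree

def is_smart_str(s: str) -> TypeGuard[etree._ElementUnicodeResult[etree._Element]]:
    # Only smart strings carry a getparent() method; plain str does not
    return hasattr(s, 'getparent')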

tree = etree.parse(<...some html file...>)

for result in tree.xpath('//div/span/text()'):
    if is_smart_str(result):
        # At this point,
        # result -> _ElementUnicodeResult[_Element],
        # parent -> Optional[_Element]
        parent = result.getparent()
        if parent is not None:
            print(parent.tag)  # 'span'

HtmlElement.text_content() result

from lxml import html

tree = html.parse('index.html')  # _ElementTree[HtmlElement]
form = tree.getroot().forms[0]  # FormElement
form_content = form.text_content()  # _ElementUnicodeResult[FormElement]
# The type checker infers parent as Optional[FormElement], but at
# runtime it is always None due to an implementation detail
parent = form_content.getparent()