Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert/ extend Signature field from a string to structured data. #249

Open
Carreau opened this issue Apr 19, 2023 · 10 comments
Open

Convert/ extend Signature field from a string to structured data. #249

Carreau opened this issue Apr 19, 2023 · 10 comments
Assignees

Comments

@Carreau
Copy link
Member

Carreau commented Apr 19, 2023

It would be nice to store the signature in a more structured form in order to be able to do some operation after the fact.

We probably need to iterate a bit on it, and maybe some part of the signature need should be str anyway, but we might want to store:

  • async or not.
  • Generator function or not.
  • each parameter name, wether they are positional or key-word.
  • The default value (maybe as a string), for each.
  • The type annotation (as a string or structured?)

Even if the rendering in myst, for now we convert all back to string, it's ok, but it would allow us to dynamically modify the rendering to for example not display default values, or not display deprecated parameter, or do search like "is async" or not.

For the default values/annotation, I'm assuming it might be ok to store as string for now, but in the long run I'm hopping we can also compress/expand those and link to relevant object.

For example

DataFrame.aggegate(column:Union[str, List[str]], method=sum,....)

I think in the end we should be able to make tokens like Union, str List link to the relevant object, and maybe
even expand it to if the type is structured

I'm not sure how to do that yet, but I think it's something we should keep in mind.

@Carreau
Copy link
Member Author

Carreau commented Apr 19, 2023

For how to store signature we may want to look into https://pypi.org/project/griffe/

@aktech
Copy link
Collaborator

aktech commented Apr 22, 2023

griffe looks promising so far:

In [1]: import griffe
   ...:
   ...: papyri = griffe.load("papyri")

In [2]: papyri.functions
Out[2]:
{'_intro': <Function('_intro', 228, 235)>,
 'ingest': <Function('ingest', 238, 266)>,
 'install': <Function('install', 272, 401)>,
 'relink': <Function('relink', 404, 414)>,
 'pack': <Function('pack', 477, 481)>,
 'bootstrap': <Function('bootstrap', 484, 492)>,
 'ascii': <Function('ascii', 512, 519)>,
 'serve': <Function('serve', 522, 527)>,
 'serve_static': <Function('serve_static', 530, 544)>,
 'browse': <Function('browse', 547, 551)>,
 'build_parser': <Function('build_parser', 554, 574)>,
 'open': <Function('open', 577, 590)>,
 'sample_async_function': <Function('sample_async_function', 593, 594)>}

In [3]: sample_async_function = papyri.functions["sample_async_function"]

In [4]: vars(sample_async_function)
Out[4]:
{'name': 'sample_async_function',
 'lineno': 593,
 'endlineno': 594,
 'docstring': None,
 'parent': <Module(PosixPath('papyri/__init__.py'))>,
 'members': {},
 'labels': {'async'},
 'imports': {},
 'exports': None,
 'aliases': {},
 'runtime': True,
 '_lines_collection': None,
 '_modules_collection': None,
 'parameters': <griffe.dataclasses.Parameters at 0x1075b75e0>,
 'returns': None,
 'decorators': [],
 'setter': None,
 'deleter': None,
 'overloads': None}

In [5]: ingest_function = papyri.functions["ingest"]

In [6]: ingest_function.parameters['check'].as_dict()
Out[6]:
{'name': 'check',
 'annotation': Name(source='bool', full='bool'),
 'kind': <ParameterKind.positional_or_keyword: 'positional or keyword'>,
 'default': 'False'}

In [7]: ingest_function.parameters._parameters_dict
Out[7]:
{'paths': <griffe.dataclasses.Parameter at 0x107599400>,
 'check': <griffe.dataclasses.Parameter at 0x1075994c0>,
 'relink': <griffe.dataclasses.Parameter at 0x10759b1c0>,
 'dummy_progress': <griffe.dataclasses.Parameter at 0x10759b2b0>}

In [8]: ingest_function.parameters['check'].as_dict()
Out[8]:
{'name': 'check',
 'annotation': Name(source='bool', full='bool'),
 'kind': <ParameterKind.positional_or_keyword: 'positional or keyword'>,
 'default': 'False'}

I think the first step could be to return the griffe's Function object instead of the string signature, what do you think?

@Carreau
Copy link
Member Author

Carreau commented Apr 24, 2023

Yes, I think that should work.

I think you can also use

from griffe.encoders import JSONEncoder
import json
print(json.dumps(ingest_function, cls=JSONEncoder, indent=2))

And that should give you a JSON representation of the function.

Though it seem the type annotations are still strings, but we can take care of this later.

"default": "typer.Option(False, help='Disable rich progress bar')"

Note that griffe is based on ast-maniputation and papyri does actually import the objects. So there might be some
mangling to do.

@pawamoy
Copy link

pawamoy commented May 14, 2023

Hello, chiming in :)

For values (actual values of attributes or default values of parameters), Griffe indeed simply unparses the AST into a string.

For type annotatons however, it builds complete "expressions", which are recursive lists of strings (delimiters like [](){}.,), "names" (which reference other objects and can be resolved to their full path) and expressions (the recursive part).

Storing them like this allows to render signatures with links for each part of each type annotation.

In the future we might store values the same way.

@pawamoy
Copy link

pawamoy commented Aug 25, 2023

Update: Griffe changed its way of storing expressions. They're now a tree of proper objects representing Python code (basically ast with a specialized class for names, to resolve them easily). Also, we now parse attribute values, decorators, and class bases as expressions too! Previously it was just type annotations 🙂

@asmeurer
Copy link
Collaborator

Here's how griffe represents the annotation list[int] | int:

            "annotation": {
              "left": {
                "left": {
                  "name": "list",
                  "cls": "ExprName"
                },
                "slice": {
                  "name": "int",
                  "cls": "ExprName"
                },
                "cls": "ExprSubscript"
              },
              "operator": "|",
              "right": {
                "name": "int",
                "cls": "ExprName"
              },
              "cls": "ExprBinOp"
            },

Reconstructing this would be harder work. It would be simpler to just store the annotation as a string, but this would break compatibility if we wanted to add it in the future. Maybe we can store the string form as annotation_string and leave annotation out for now.

But I'm not clear just how much we want to be compatible with griffe. Also, it would probably be useful if griffe also stored the string form of the annotation in addition to the ast form.

@pawamoy
Copy link

pawamoy commented Sep 21, 2023

It's a bit more readable if we put the class name at the top of each dict:

"annotation": {
  "cls": "ExprBinOp",
  "left": {
    "cls": "ExprSubscript",
    "left": {
      "cls": "ExprName",
      "name": "list",
    },
    "slice": {
      "cls": "ExprName",
      "name": "int",
    },
  },
  "operator": "|",
  "right": {
    "cls": "ExprName",
    "name": "int",
  }
}

I'm not familiar with the Papyri project so I'll suggest things that might not be acceptable to you, please forgive me 🙂

If you actually depend on Griffe, you can directly work with expressions. Calling str on an expression will "unparse" it into a string.

You can iterate on expressions, recursively, in depth-first order, to process specific types of sub-expressions (subscripts, attributes, function calls, etc.) if you need to. With the previous example, your (recursive) iteration would yield:

[
    ExprBinOp(...),
    ExprSubscript(...)
    ExprName("list"),
    "[",
    ExprName("int"),
    "]",
    "|",
    ExprName("int"),
]

If you iterate with flat=True, the recursion is handled for you and the iteration yields names and punctuation directly:

[
    ExprName("list"),
    "[",
    ExprName("int"),
    "]",
    "|",
    ExprName("int"),
]

And that's exactly how Griffe outputs strings from expressions:

def __str__(self):
    return "".join(elem if isinstance(elem, str) else elem.name for elem in self.iterate(flat=True))

You can dump and reload expression to/from JSON:

import json
from griffe.encoders import json_decoder

annotation = json.dumps({
  "cls": "ExprBinOp",
  "left": {
    "cls": "ExprSubscript",
    "left": {
      "cls": "ExprName",
      "name": "list"
    },
    "slice": {
      "cls": "ExprName",
      "name": "int",
    }
  },
  "operator": "|",
  "right": {
    "cls": "ExprName",
    "name": "int"
  }
})

json.loads(annotation, object_hook=json_decoder)
ExprBinOp(left=ExprSubscript(left=ExprName(name='list', parent=None), slice=ExprName(name='int', parent=None)), operator='|', right=ExprName(name='int', parent=None))

...though in this example we don't have the parent information allowing to resolve names to their full path (but it's possible to have it).


If you're only looking for how to store annotations, without having to depend on Griffe, then I'll just share my experience with you: previously I stored them as simple lists of punctuation and names (the flat example above). Well, it is not enough once you start having more demanding rendering options, such as:

  • rendering only the last part of attributes (Sequence instead of typing.Sequence as written in the source)
  • or the opposite, rendering full paths (typing.Mapping instead of just Mapping)
  • unwrapping subscripts (int instead of Annotated[int, Gt(10)])
  • etc.

Ultimately, instead of special casing names, then attributes, then subscripts, then ..., I went all in and now store an actual AST, with just the necessary, additional info (parent) that allows me to resolve names to their full paths (in turn allowing interlinks and cross-references).

It sure is a bit more... chunky, but it's far more robust, powerful, and flexible 😄 Getting the N-th item of an optional tuple before was super prone to errors, now it's a breeze.

@asmeurer
Copy link
Collaborator

asmeurer commented Sep 21, 2023

The griffe JSON is already pretty similar to the JSON generated by the myst_serialize. Here's Griffe for

def def f(x1, /, x2, x3=None, *, x4, x5=None): pass
        "parameters": [
          {
            "name": "x1",
            "annotation": null,
            "kind": "positional-only",
            "default": null
          },
          {
            "name": "x2",
            "annotation": null,
            "kind": "positional or keyword",
            "default": null
          },
          {
            "name": "x3",
            "annotation": null,
            "kind": "positional or keyword",
            "default": "None"
          },
          {
            "name": "x4",
            "annotation": null,
            "kind": "keyword-only",
            "default": null
          },
          {
            "name": "x5",
            "annotation": null,
            "kind": "keyword-only",
            "default": "None"
          }
        ],

And here's what papyri already gives

  "parameters": [
    {
      "annotation": null,
      "default": null,
      "kind": "POSITIONAL_ONLY",
      "name": "x1",
      "type": "ParameterNode"
    },
    {
      "annotation": null,
      "default": null,
      "kind": "POSITIONAL_OR_KEYWORD",
      "name": "x2",
      "type": "ParameterNode"
    },
    {
      "annotation": null,
      "default": "None",
      "kind": "POSITIONAL_OR_KEYWORD",
      "name": "x3",
      "type": "ParameterNode"
    },
    {
      "annotation": null,
      "default": null,
      "kind": "KEYWORD_ONLY",
      "name": "x4",
      "type": "ParameterNode"
    },
    {
      "annotation": null,
      "default": "None",
      "kind": "KEYWORD_ONLY",
      "name": "x5",
      "type": "ParameterNode"
    }
  ],

The biggest difference is that papyri is using the variable name of the inspect._ParameterKind enum, and Griffe is using the human readable form https://github.com/python/cpython/blob/b3af888342db12915f0cdaaacbdc61aadfb62eff/Lib/inspect.py#L2630-L2635

(side note: can someone explain to me why an "int enum" has variables assigned to strings, and how you can actually access those?)

@pawamoy
Copy link

pawamoy commented Sep 21, 2023

(wow)

Looks like each new value gets replaced by an incremented int, while attaching the string to its description attribute:

    def __new__(cls, description):
        value = len(cls.__members__)   # when first value (POSITIONAL_ONLY) is declared, value is 0
        member = int.__new__(cls, value)   # so member is IntEnum(0)  (or something like that??)
        member._value_ = value  # its value is 0
        member.description = description  # its description is 'positional-only'
        return member  # return and proceed to next declared value

Mind blown.

You then get descriptions with _ParameterKind.POSITIONAL_ONLY.description.

@asmeurer
Copy link
Collaborator

Work in progress here #289. Not sure if we should try to store the annotations in a way that is forward compatible in case we want to store an ast like griffe later. I still think it's a good idea to store the str in the JSON even if it can be technically reconstructed later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants