Should we implement a rst parser? #42

JulienPalard · 2022-09-13T08:30:53Z

Instead of working with regexes.

On one hand I don't think so, because:

Some cases detected by sphinx-lint are valid syntax, but invalid usage.
Some cases are invalid syntax.

But on the other hand, having a proper AST to match invalid usage could be better than relying on regexes, and a good parser may also be good at reporting invalid syntax.

ezio-melotti · 2022-09-16T05:32:33Z

By "implement", do you mean write one from scratch, or adopt an existing parser?
IMHO a parser and regexes can coexist. For example, you can use the parser to parse the elements of a list, and then a regex to find broken roles within each individual element. Depending on how you implement it, though, it might become complex.
Also, speed was one of your initial goals IIRC. Do you know how much a parser is going to affect performance?
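
The coexistence idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: split_list_items is a hypothetical, very naive stand-in for a real structural parser, and BROKEN_ROLE is an illustrative pattern (a role missing its leading colon), not the regex sphinx-lint actually uses.

```python
import re

# Illustrative pattern only (not sphinx-lint's actual regex): a role name
# followed by ":`...`" but lacking the leading colon, e.g. "func:`len`".
BROKEN_ROLE = re.compile(r"(?<![:\w])([a-z]+):`[^`]+`")


def split_list_items(text):
    """Naive structural pass: yield (lineno, item_text) for each "- " bullet.

    Non-blank continuation lines indented by two spaces are joined to the item.
    """
    current, start = None, None
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.startswith("- "):
            if current is not None:
                yield start, " ".join(current)
            current, start = [line[2:].strip()], lineno
        elif current is not None and line.startswith("  ") and line.strip():
            current.append(line.strip())
        else:
            if current is not None:
                yield start, " ".join(current)
            current, start = None, None
    if current is not None:
        yield start, " ".join(current)


def find_broken_roles(text):
    """Run the regex only inside the items found by the structural pass."""
    return [
        (lineno, match.group(0))
        for lineno, item in split_list_items(text)
        for match in BROKEN_ROLE.finditer(item)
    ]


sample = """\
- Use :func:`len` to get the size.
- Call func:`print` to display
  the meth:`str.join` result.

Outside a list, func:`ignored` is not examined here.
"""

for lineno, found in find_broken_roles(sample):
    print(f"item starting at line {lineno}: suspicious role {found}")
```

The structural pass decides where to look and the regex decides what to flag, which is also where the complexity mentioned above would start to accumulate.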

A parser would also be useful for the following issues:

JulienPalard · 2022-09-26T13:05:20Z

Notes on the parser idea:

Writing a reStructuredText parser is not trivial, as some directives let their body be parsed (what they call nested_parse) and some don't. Projects are also allowed to extend the rst syntax by adding their own directives (which may or may not allow rst in their body).

This makes for some ambiguous situations, like:

.. danger::

    This is an error*


.. testsetup::

    from this_is_not_an_error import *

A dumb parser can't tell whether the content of a directive should be parsed or not. Stopping at "it's a directive with a name, arguments, and text content" is not OK, as some directives hold tons of valid rst lines: think of the Python class directive containing an attribute directive containing an impl-detail directive containing a versionchanged directive, like:

.. class:: Parameter(name, kind, *, default=Parameter.empty, annotation=Parameter.empty)

   Parameter objects are *immutable*.  Instead of modifying a Parameter object,
   you can use :meth:`Parameter.replace` to create a modified copy.

   .. attribute:: Parameter.name

      The name of the parameter as a string.  The name must be a valid
      Python identifier.

      .. impl-detail::

         CPython generates implicit parameter names of the form ``.0`` on the
         code objects used to implement comprehensions and generator
         expressions.

         .. versionchanged:: 3.6
            These parameter names are exposed by this module as names like
            ``implicit0``.

and trying to parse the content of all directives is not OK either, as it would quickly lead to rst syntax errors being reported in directives expected to contain something that is not rst.
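
One possible middle ground, sketched below, is a per-directive policy table: recurse into bodies known to hold rst, skip bodies known to be opaque, and report everything else as unknown. The RST_BODY and OPAQUE_BODY sets and the iter_directives helper are hypothetical names for this sketch, and the line-based scanning is deliberately simplistic; it is not how docutils or sphinx-lint work.

```python
import re
import textwrap

DIRECTIVE = re.compile(r"^\.\. ([\w-]+)::")

# Hypothetical, hand-maintained policy tables; a real project would have to
# extend them for every custom directive it defines.
RST_BODY = {"danger", "class", "attribute", "impl-detail", "versionchanged"}
OPAQUE_BODY = {"testsetup", "code-block"}


def iter_directives(lines, depth=0):
    """Yield (depth, name, policy) per directive, recursing only into rst bodies."""
    i = 0
    while i < len(lines):
        match = DIRECTIVE.match(lines[i])
        i += 1
        if not match:
            continue
        # The body is every following line that is blank or indented.
        body = []
        while i < len(lines) and (not lines[i].strip() or lines[i][0] == " "):
            body.append(lines[i])
            i += 1
        name = match.group(1)
        if name in RST_BODY:
            policy = "parse"
        elif name in OPAQUE_BODY:
            policy = "skip"
        else:
            policy = "unknown"
        yield depth, name, policy
        if policy == "parse":
            dedented = textwrap.dedent("\n".join(body)).splitlines()
            yield from iter_directives(dedented, depth + 1)


sample = """\
.. danger::

    This is an error*


.. testsetup::

    from this_is_not_an_error import *

.. class:: Parameter

   .. attribute:: Parameter.name

      .. mydirective::

         Who knows?
"""

for depth, name, policy in iter_directives(sample.splitlines()):
    print("  " * depth + f"{name}: {policy}")
```

The "unknown" case is exactly the ambiguity described above: without project-specific knowledge, the table cannot say whether mydirective expects rst in its body.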

JulienPalard · 2022-09-26T20:16:23Z

Also (on my machine) sphinx-lint on cpython takes 1.6s while parsing using docutils/Sphinx takes 40s.

ezio-melotti · 2022-09-26T20:48:35Z

If speed is a concern, the use of the parser could be optional, even though that might lead to some duplication between regex-based checkers and parser-based ones.
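
A rough sketch of what "optional" could look like: checkers declare whether they need a parsed tree, and the slow parse only happens when the user opts in. Everything here (CHECKERS, the checker decorator, toy_parse, run, and its use_parser flag) is hypothetical and is not sphinx-lint's actual API; toy_parse merely stands in for a real, slower parser.

```python
import re

CHECKERS = []  # hypothetical registry of (function, needs_parser) pairs


def checker(needs_parser=False):
    """Register a checker and record whether it requires the parsed tree."""
    def register(func):
        CHECKERS.append((func, needs_parser))
        return func
    return register


@checker()
def trailing_whitespace(lines, tree=None):
    # Line-based checker: works without any parsing.
    for lineno, line in enumerate(lines, start=1):
        if line != line.rstrip():
            yield lineno, "trailing whitespace"


@checker(needs_parser=True)
def empty_directive(lines, tree=None):
    # Tree-based checker: relies on the (lineno, name, body) entries.
    for lineno, name, body in tree:
        if not body:
            yield lineno, f"directive {name!r} has no content"


def toy_parse(lines):
    """Stand-in for a real parser: list (lineno, name, body) per directive."""
    tree = []
    for index, line in enumerate(lines):
        match = re.match(r"\.\. ([\w-]+)::", line)
        if not match:
            continue
        body = []
        for following in lines[index + 1:]:
            if following.strip() and not following.startswith(" "):
                break  # first non-indented, non-blank line ends the body
            if following.strip():
                body.append(following.strip())
        tree.append((index + 1, match.group(1), body))
    return tree


def run(lines, use_parser=False):
    """Run all checkers, skipping tree-based ones unless use_parser is set."""
    tree = toy_parse(lines) if use_parser else None
    errors = []
    for func, needs_parser in CHECKERS:
        if needs_parser and not use_parser:
            continue  # fast path: line-based checkers only
        errors.extend(func(lines, tree))
    return sorted(errors)


lines = ["Title ", "", ".. note::", "", "Plain text."]
print(run(lines))
print(run(lines, use_parser=True))
```

The duplication risk is visible even in this toy: a check that exists in both a regex form and a tree form would have to be kept in sync by hand.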

AA-Turner · 2022-09-27T09:43:43Z

> Also (on my machine) sphinx-lint on cpython takes 1.6s while parsing using docutils/Sphinx takes 40s.

@JulienPalard can you share a test script? A medium/long-term goal of mine is speeding up Docutils (and, as a result, Sphinx), so benchmarks would be useful.

A

JulienPalard · 2022-09-27T12:20:16Z

Don't hope for something pretty, I was playing with the parser using this:

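import json
import sys
from collections import defaultdict
from pathlib import Path

# Import the submodules used below explicitly instead of relying on
# docutils.parsers.rst pulling them in as a side effect.
import docutils.frontend
import docutils.nodes
import docutils.parsers.rst
import docutils.utils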


class LyingDefaultDict(defaultdict):
    """A defaultdict that claims to contain every key and ignores missing deletes."""

    def __contains__(self, key):
        return True

    def __delitem__(self, key):
        try:
            super().__delitem__(key)
        except KeyError:
            pass


class DummyDirective(docutils.parsers.rst.Directive):
    has_content = True

    def run(self):
        node = docutils.nodes.raw(
            "",
            json.dumps(
                {
                    "type": "directive",
                    "name": self.name,
                    "arguments": self.arguments,
                    "options": self.options,
                    "content": list(self.content),
                    "lineno": self.lineno,
                    "content_offset": self.content_offset,
                    "block_text": self.block_text,
                }, indent=4
            ),
        )
        try:
            self.state.nested_parse(self.content, self.content_offset, node)
        except docutils.utils.SystemMessage:
            pass
        return [node]


def dummy_role(name, rawtext, text, lineno, inliner, options=None, content=None):
    return [
        docutils.nodes.raw(
            "",
            json.dumps(
                {
                    "type": "role",
                    "name": name,
                    "rawtext": rawtext,
                    "text": text,
                    "lineno": lineno,
                    "options": options,
                    "content": content,
                }, indent=4
            ),
        )
    ], []



# Tell docutils that all roles are my dummy role
docutils.parsers.rst.roles._roles = LyingDefaultDict(lambda: dummy_role)

# Tell docutils that all directives are my dummy directive
docutils.parsers.rst.directives._directives = LyingDefaultDict(lambda: DummyDirective)

parser = docutils.parsers.rst.Parser()
settings = docutils.frontend.OptionParser(
    components=(docutils.parsers.rst.Parser,)
).get_default_values()

source = sys.argv[1]
text = Path(source).read_text(encoding="UTF-8")  # avoid shadowing the input() builtin
document = docutils.utils.new_document(source, settings=settings)
parser.parse(text, document)
print(document.pformat())

(This is not what I used to measure performance. To measure read time, I just ran sphinx-build under time and killed it after the reading phase, before writing...)

JulienPalard · 2022-10-07T08:36:08Z

I'm closing this issue for the moment.

JulienPalard closed this as completed Oct 7, 2022
