Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should we implement a rst parser? #42

Closed
JulienPalard opened this issue Sep 13, 2022 · 7 comments
Closed

Should we implement a rst parser? #42

JulienPalard opened this issue Sep 13, 2022 · 7 comments

Comments

@JulienPalard
Copy link
Collaborator

Instead of working with regexes.

In one hand I don't think so, because:

  • Some cases detected by sphinx-lint are valid syntax, but invalid usage.
  • Some cases are invalid syntax.

But in the other hand, having a proper AST to match invalid usage could be better than relying on regex, and a good parser may also be good at reporting invalid syntax.

@ezio-melotti
Copy link
Collaborator

ezio-melotti commented Sep 16, 2022

With "implement" do you mean write from scratch, or adopt an existing parser?
IMHO both a parser and regex can coexist. For example, you can use the parser to parse the elements of a list, and then a regex to find broken roles within each individual element. Depending on how you implement it though, it might become complex.
Also speed was one of your initial goals IIRC, do you know how much a parser is going to affect performances?

A parser would also be useful for the following issues:

@JulienPalard
Copy link
Collaborator Author

Notes on the parser idea:

Writing a reStructuredText parser is not trivial as the directive sometime let them body be parsed (what they call nested_parse), some don't. And projects are allowed to extend the rst syntax by adding their own directive (that may, or may not, allow rst as their body).

This make for some ambiguous situations, like:

.. danger::

    This is an error*


.. testsetup::

    from this_is_not_an_error import *

A dumb parser can't tell if the content of directives should be parsed or not. Stopping at "it's a directive with a name, arguments, and text content" is not OK as some directives hold tons of valid rst lines, think of the class Python directive containing an attribute directive containing an impl-detail directive containing a verisonchanged directive like:

.. class:: Parameter(name, kind, *, default=Parameter.empty, annotation=Parameter.empty)

   Parameter objects are *immutable*.  Instead of modifying a Parameter object,
   you can use :meth:`Parameter.replace` to create a modified copy.

   .. attribute:: Parameter.name

      The name of the parameter as a string.  The name must be a valid
      Python identifier.

      .. impl-detail::

         CPython generates implicit parameter names of the form ``.0`` on the
         code objects used to implement comprehensions and generator
         expressions.

         .. versionchanged:: 3.6
            These parameter names are exposed by this module as names like
            ``implicit0``.

and trying to parse the content of all directives is not OK neither as it would quickly lead to rst syntax errors being reported in directives expected to contain something that is not rst.

@JulienPalard
Copy link
Collaborator Author

Also (on my machine) sphinx-lint on cpython takes 1.6s while parsing using docutils/Sphinx takes 40s.

@ezio-melotti
Copy link
Collaborator

If speed is a concern, the use of the parser could be optional, even though that might lead to some duplication between regex-based checkers and parser-based ones.

@AA-Turner
Copy link
Member

Also (on my machine) sphinx-lint on cpython takes 1.6s while parsing using docutils/Sphinx takes 40s.

@JulienPalard can you share a test script? A medium/long term goal of mine is speeding up Docutils (and Sphinx resultingly) so benchmarks would be useful.

A

@JulienPalard
Copy link
Collaborator Author

Don't hope for something pretty, I was playing with the parser using this:
import sys
from collections import defaultdict
from pathlib import Path
import json
import docutils.parsers.rst


class LyingDefaultDict(defaultdict):
    def __contains__(self, key):
        return True

    def __delitem__(self, key):
        try:
            super().__delitem__(key)
        except KeyError:
            pass


class DummyDirective(docutils.parsers.rst.Directive):
    has_content = True

    def run(self):
        node = docutils.nodes.raw(
            "",
            json.dumps(
                {
                    "type": "directive",
                    "name": self.name,
                    "arguments": self.arguments,
                    "options": self.options,
                    "content": list(self.content),
                    "lineno": self.lineno,
                    "content_offset": self.content_offset,
                    "block_text": self.block_text,
                }, indent=4
            ),
        )
        try:
            self.state.nested_parse(self.content, self.content_offset, node)
        except docutils.utils.SystemMessage:
            pass
        return [node]


def dummy_role(name, rawtext, text, lineno, inliner, options=None, content=None):
    return [
        docutils.nodes.raw(
            "",
            json.dumps(
                {
                    "type": "role",
                    "name": name,
                    "rawtext": rawtext,
                    "text": text,
                    "lineno": lineno,
                    "options": options,
                    "content": content,
                }, indent=4
            ),
        )
    ], []



# Tell docutils that all roles are my dummy role
docutils.parsers.rst.roles._roles = LyingDefaultDict(lambda: dummy_role)

# Tell docutils that all directives are my dummy directive
docutils.parsers.rst.directives._directives = LyingDefaultDict(lambda: DummyDirective)

parser = docutils.parsers.rst.Parser()
settings = docutils.frontend.OptionParser(
    components=(docutils.parsers.rst.Parser,)
).get_default_values()

source = sys.argv[1]
input = Path(source).read_text(encoding="UTF-8")
document = docutils.utils.new_document(source, settings=settings)
parser.parse(input, document)
print(document.pformat())

(This is not what I used to measure performance, to measure read time I just started a time sphinx-build and killed it after reading, before writing...)

@JulienPalard
Copy link
Collaborator Author

I'm closing this issue for the moment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants