-
-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Should we implement a rst parser? #42
Comments
With "implement" do you mean write from scratch, or adopt an existing parser? A parser would also be useful for the following issues: |
Notes on the parser idea: Writing a reStructuredText parser is not trivial as the directive sometime let them body be parsed (what they call This make for some ambiguous situations, like: .. danger::
This is an error*
.. testsetup::
from this_is_not_an_error import *
A dumb parser can't tell if the content of directives should be parsed or not. Stopping at "it's a directive with a name, arguments, and text content" is not OK as some directives hold tons of valid rst lines, think of the .. class:: Parameter(name, kind, *, default=Parameter.empty, annotation=Parameter.empty)
Parameter objects are *immutable*. Instead of modifying a Parameter object,
you can use :meth:`Parameter.replace` to create a modified copy.
.. attribute:: Parameter.name
The name of the parameter as a string. The name must be a valid
Python identifier.
.. impl-detail::
CPython generates implicit parameter names of the form ``.0`` on the
code objects used to implement comprehensions and generator
expressions.
.. versionchanged:: 3.6
These parameter names are exposed by this module as names like
``implicit0``. and trying to parse the content of all directives is not OK neither as it would quickly lead to rst syntax errors being reported in directives expected to contain something that is not rst. |
Also (on my machine) sphinx-lint on cpython takes 1.6s while parsing using docutils/Sphinx takes 40s. |
If speed is a concern, the use of the parser could be optional, even though that might lead to some duplication between regex-based checkers and parser-based ones. |
@JulienPalard can you share a test script? A medium/long term goal of mine is speeding up Docutils (and Sphinx resultingly) so benchmarks would be useful. A |
Don't hope for something pretty, I was playing with the parser using this:import sys
from collections import defaultdict
from pathlib import Path
import json
import docutils.parsers.rst
class LyingDefaultDict(defaultdict):
def __contains__(self, key):
return True
def __delitem__(self, key):
try:
super().__delitem__(key)
except KeyError:
pass
class DummyDirective(docutils.parsers.rst.Directive):
has_content = True
def run(self):
node = docutils.nodes.raw(
"",
json.dumps(
{
"type": "directive",
"name": self.name,
"arguments": self.arguments,
"options": self.options,
"content": list(self.content),
"lineno": self.lineno,
"content_offset": self.content_offset,
"block_text": self.block_text,
}, indent=4
),
)
try:
self.state.nested_parse(self.content, self.content_offset, node)
except docutils.utils.SystemMessage:
pass
return [node]
def dummy_role(name, rawtext, text, lineno, inliner, options=None, content=None):
return [
docutils.nodes.raw(
"",
json.dumps(
{
"type": "role",
"name": name,
"rawtext": rawtext,
"text": text,
"lineno": lineno,
"options": options,
"content": content,
}, indent=4
),
)
], []
# Tell docutils that all roles are my dummy role
docutils.parsers.rst.roles._roles = LyingDefaultDict(lambda: dummy_role)
# Tell docutils that all directives are my dummy directive
docutils.parsers.rst.directives._directives = LyingDefaultDict(lambda: DummyDirective)
parser = docutils.parsers.rst.Parser()
settings = docutils.frontend.OptionParser(
components=(docutils.parsers.rst.Parser,)
).get_default_values()
source = sys.argv[1]
input = Path(source).read_text(encoding="UTF-8")
document = docutils.utils.new_document(source, settings=settings)
parser.parse(input, document)
print(document.pformat()) (This is not what I used to measure performance, to measure read time I just started a |
I'm closing this issue for the moment. |
Instead of working with regexes.
In one hand I don't think so, because:
But in the other hand, having a proper AST to match invalid usage could be better than relying on regex, and a good parser may also be good at reporting invalid syntax.
The text was updated successfully, but these errors were encountered: