feat!: enhance multiline (expr) parsing #35

Open: wants to merge 1 commit into main


milisims
Owner

This PR will flatten (expr) in paragraph, contents, description & add (nl) nodes.

(expr) nodes, which previously contained a sequence of anonymous "str", "num", and "sym" nodes, are replaced with a corresponding sequence of named (str), (num), and (sym) nodes. Where a single (expr) is expected (block names, properties, directive names, etc.), the (expr) node still exists but now contains named nodes instead of anonymous ones. For example, a block starting with #+begin_ab3 is parsed as (expr (str) (num)).
In (paragraph), (item), and (fndef (description)), there is now just a flat sequence of (str), (num), (sym), and (nl) nodes, except that (item) never contains (nl).
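
As a rough illustration of what the flattened sequence looks like, here is a minimal Python sketch that splits text into (str), (num), (sym), and (nl) tokens. The token classes (letters, digits, any other non-whitespace character, newline) are an assumption made for this example; the grammar's actual rules may differ, and `flatten` is a hypothetical helper, not part of the parser.

```python
import re

# Assumed token classes (not taken from the grammar itself):
#   nl  -> a newline
#   str -> a run of letters
#   num -> a run of digits
#   sym -> any other single non-whitespace character
TOKEN_RE = re.compile(
    r"(?P<nl>\n)|(?P<str>[^\W\d_]+)|(?P<num>\d+)|(?P<sym>\S)"
)


def flatten(text):
    """Return a flat list of (kind, text) tokens; spaces and tabs are skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    print(flatten("ab3"))         # the name part of #+begin_ab3
    print(flatten("Test :tag:"))  # a headline item
```

Under these assumptions, `ab3` yields one str token followed by one num token, which matches the (expr (str) (num)) shape described above.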

Note that (sym ":") still works for ASCII symbols just as (expr) previously did, so we don't need to check for ASCII symbols explicitly in a predicate. This makes querying for those symbols quite fast, since they're part of the AST and don't require a predicate to check.

In (paragraph), (fndef (description)), and (contents) (which appears in drawer, block, dynamic_block, and latex_env), each newline now gets its own node: (nl).

This will resolve #31 and #26 by enabling queries for single-line items:
Fixed width area:

(paragraph . (sym ":") @fixed_width_start [(str)(num)(sym)]* @fixed_width_text (nl)) ; matches first line
(paragraph (nl) (sym ":") @fixed_width_start [(str)(num)(sym)]* @fixed_width_text (nl)) ; matches every other line

I tried several combinations of anchors but couldn't reduce this to a single pattern that matches both the first line and every subsequent line.
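
For comparison, the same "line starts with (sym ":")" check can be sketched outside the query language. The Python below is only an illustration over a hand-built token list; the (kind, text) tuple format and the `fixed_width_lines` helper are assumptions for this example and not part of the parser or of any tree-sitter API.

```python
def fixed_width_lines(tokens):
    """Return indices of lines whose first token is the symbol ':'.

    tokens: flat list of (kind, text) pairs for one paragraph, where kind is
    one of "str", "num", "sym", "nl" (an assumed representation).
    """
    lines, current = [], []
    for kind, text in tokens:
        if kind == "nl":
            # A newline token closes the current line.
            lines.append(current)
            current = []
        else:
            current.append((kind, text))
    if current:
        # Paragraph did not end with a newline token.
        lines.append(current)
    return [i for i, line in enumerate(lines) if line and line[0] == ("sym", ":")]


if __name__ == "__main__":
    paragraph = [
        ("sym", ":"), ("str", "a"), ("nl", "\n"),
        ("str", "b"), ("nl", "\n"),
        ("sym", ":"), ("str", "c"), ("num", "1"), ("nl", "\n"),
    ]
    print(fixed_width_lines(paragraph))
```

Splitting on (nl) first means the first line and later lines are handled by one rule, whereas the queries above need one pattern anchored at the paragraph start and one preceded by (nl).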

For #31, sexp diary entries will require a predicate and some effort if you want to support multiline expressions as Emacs' Org mode does, but single-line support is straightforward, as above. For the multiline version, lua-match? with something like %b() would be helpful if you're using Neovim.
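
To show what a %b()-style match would do for a multiline sexp, here is a small Python sketch of balanced-parenthesis matching. It is a rough analogue written for illustration, not Lua's pattern engine or a Neovim predicate: `balanced_span` is a hypothetical helper, it gives up if the first "(" is never closed, and it does not treat parentheses inside strings specially.

```python
def balanced_span(text, start=0):
    """Return (begin, end) of the first balanced (...) group at or after
    `start`, or None if there is no "(" or it is never closed.

    `end` is exclusive, so text[begin:end] is the matched group.
    """
    begin = text.find("(", start)
    if begin == -1:
        return None
    depth = 0
    for i in range(begin, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return begin, i + 1
    return None  # ran out of text before the group was closed


if __name__ == "__main__":
    entry = "%%(diary-float\n  t 4 2) Monthly meeting"
    span = balanced_span(entry)
    if span is not None:
        print(repr(entry[span[0]:span[1]]))
```

Because the scan counts depth rather than stopping at the first ")", the match can cross line breaks and nested groups, which is the property the multiline case needs.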

@kristijanhusak
Contributor

@milisims what's the state of this PR? Should I give it a test, or is it still WIP?

@milisims
Owner Author

milisims commented Apr 15, 2023

The flattening of (expr) into (sym), (str), and (num) is definitely concrete and staying, and (sym) still has the anonymous nodes for ASCII symbols, so those changes will be stable.

For the other change, my goal is to facilitate querying ambiguous markup. What I did here was add the "pre" "mid" and "post" signals (empty nodes) as a part of (sym). Basically, if the symbol is immediately before, in between, or after alnum characters, the relevant anonymous node is shown. So a bold query can be simplified to (paragraph (sym "*" "pre") @start (sym "*" "post") @stop).

However, right now /*a*/ is parsed as (sym "/").(sym "*" "pre").(str).(sym "*" "post").(sym "/"), which has pros and cons. This lets the user easily be explicit about whether to allow double markup like that, but if they do want double markup, it renders the "pre" and "post" signals fairly useless. Additionally, it already doesn't help with parsing objects like links, because of the double symbols.
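
The current (non-propagating) behaviour described above can be sketched in a few lines of Python. This is an illustration of the rule as stated in this comment, not the scanner's code: `signal` and `annotate` are hypothetical helpers, and using `str.isalnum` as the definition of "alnum" is an assumption.

```python
def signal(text, i):
    """Return "pre", "mid", "post", or None for the symbol at text[i],
    looking only at the characters immediately before and after it."""
    before = i > 0 and text[i - 1].isalnum()
    after = i + 1 < len(text) and text[i + 1].isalnum()
    if before and after:
        return "mid"
    if after:
        return "pre"   # symbol sits directly before alnum text
    if before:
        return "post"  # symbol sits directly after alnum text
    return None


def annotate(text):
    """List (symbol, signal) for every non-alnum, non-space character."""
    return [
        (ch, signal(text, i))
        for i, ch in enumerate(text)
        if not ch.isalnum() and not ch.isspace()
    ]


if __name__ == "__main__":
    print(annotate("/*a*/"))
```

With this rule the outer "/" characters in /*a*/ get no signal, because their only neighbour is another symbol.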

So, there are three options.

  1. Drop the pre/mid/post signals, or leave it as is
  2. Propagate the signals: /*a*/ is parsed as (sym "/" "pre").(sym "*" "pre").(str).(sym "*" "post").(sym "/" "post")
  3. Add a pre/post field (sym "char" pre: "str/sym/num", post: "str/sym/num")

Since it's optional and very cheap to have the signals, I'm not really interested in dropping them. The second is barely more expensive, and I think it's the simplest to use and understand. Option 3 gives the most control, but take for example: a]/*a*/. Option 2 says the symbols in the middle are "mid", while option 3 requires additional processing to figure that out. I like the idea of some combination of propagating signals and using a pre/post field, because they can be negated usefully in a query.
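
One way to picture option 2 is to assign a single signal to each run of adjacent symbols, based on what borders the run. The Python below is a sketch under that assumption only; `propagate` is a hypothetical helper, and the real implementation could propagate signals differently.

```python
def is_sym(ch):
    """Assumed symbol test: anything that is neither alnum nor whitespace."""
    return not ch.isalnum() and not ch.isspace()


def propagate(text):
    """Return (symbol, signal) pairs where every symbol in a contiguous run
    shares the signal derived from the characters bordering the run."""
    out = []
    i = 0
    while i < len(text):
        if not is_sym(text[i]):
            i += 1
            continue
        # Find the end of this run of symbols.
        j = i
        while j < len(text) and is_sym(text[j]):
            j += 1
        before = i > 0 and text[i - 1].isalnum()
        after = j < len(text) and text[j].isalnum()
        if before and after:
            sig = "mid"
        elif after:
            sig = "pre"
        elif before:
            sig = "post"
        else:
            sig = None
        out.extend((ch, sig) for ch in text[i:j])
        i = j
    return out


if __name__ == "__main__":
    print(propagate("/*a*/"))
    print(propagate("a]/*a*/"))
```

Under this sketch, /*a*/ gives "pre" to both leading symbols and "post" to both trailing ones, and in a]/*a*/ the three symbols between the two letters all come out as "mid", consistent with the description of option 2.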

Currently I'm just writing some play queries to figure out a scheme that makes the most sense, if there is one. Have any thoughts?

@kristijanhusak
Contributor

From the given options, I think I would prefer option 2, but option 3 would also be helpful. My current implementation works, but it's far from perfect and has a bunch of bugs and edge cases. I'm looking for a way to simplify it, and both of these options should help, but with option 2 it seems I might be able to drop custom parsing completely.

Whichever way you choose, even the currently implemented one, should be helpful to some extent.

@kristijanhusak
Contributor

I started preparing a branch with these changes, and I ran into one issue with tag parsing.
This content:

* Test :tag:
  - Test

Is parsed like this:

(document [0, 0] - [2, 0]
  subsection: (section [0, 0] - [2, 0]
    headline: (headline [0, 0] - [1, 0]
      stars: (stars [0, 0] - [0, 1])
      item: (item [0, 2] - [0, 12]
        (str [0, 2] - [0, 6])
        (sym [0, 7] - [0, 8])
        (str [0, 8] - [0, 11])
        (sym [0, 11] - [0, 12])))
    body: (body [1, 0] - [2, 0]
      (list [1, 0] - [2, 0]
        (listitem [1, 2] - [2, 0]
          bullet: (bullet [1, 2] - [1, 3])
          contents: (paragraph [1, 4] - [2, 0]
            (str [1, 4] - [1, 8])
            (nl [1, 8] - [2, 0])))))))

But if it's a single line:

* Test :tag:

It parses it properly:

(document [0, 0] - [1, 0]
  subsection: (section [0, 0] - [1, 0]
    headline: (headline [0, 0] - [1, 0]
      stars: (stars [0, 0] - [0, 1])
      item: (item [0, 2] - [0, 6]
        (str [0, 2] - [0, 6]))
      tags: (tag_list [0, 6] - [0, 12]
        tag: (tag [0, 8] - [0, 11])))))

Generally, it seems that adding a second line with any type of content immediately breaks parsing the tags.

Previously, (expr) had anonymous "str" "num" and "sym" nodes. Those are
now exposed. (sym) nodes retain the anonymous symbols, like (sym "*").
Additionally, (sym next: "str") indicates the symbol is before an immediate
(str), and (sym prev: "num") indicates the symbol is after a number.

Add (nl) in multiline text:
  - (paragraph)
  - (fndef (description))
  - (contents), in drawers, blocks, dynamic blocks, and latex_envs

Add "sub" and "final" fields to (stars)
@kristijanhusak
Contributor

Tags issue is fixed, thanks!
I ran into another one with checkboxes and links:

This content:

- [[Test]]

Generates this tree:

(document [0, 0] - [1, 0]
  body: (body [0, 0] - [1, 0]
    (list [0, 0] - [1, 0]
      (listitem [0, 0] - [1, 0]
        bullet: (bullet [0, 0] - [0, 1])
        checkbox: (checkbox [0, 1] - [0, 9]
          status: (status [0, 3] - [0, 8]))
        contents: (paragraph [0, 9] - [1, 0]
          (sym [0, 9] - [0, 10])
          (nl [0, 10] - [1, 0]))))))

It treats the link as a checkbox
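
For reference, Org's own list syntax only recognises "[ ]", "[X]", and "[-]" as checkboxes, so a link such as [[Test]] should not qualify. The Python sketch below shows that stricter check with a regular expression; it is an illustration of the expected behaviour, not how this grammar recognises checkboxes, and `CHECKBOX_RE` and `checkbox_status` are hypothetical names.

```python
import re

# Bullet ("-" or "+", or a number followed by "." or ")"), whitespace, then a
# bracket pair holding exactly one status character, followed by whitespace
# or end of line. Anything else after the bullet is not a checkbox.
CHECKBOX_RE = re.compile(
    r"^\s*(?:[-+]|\d+[.)])\s+\[(?P<status>[ X-])\](?=\s|$)"
)


def checkbox_status(line):
    """Return the checkbox status character, or None if the list item
    does not start with a checkbox."""
    m = CHECKBOX_RE.match(line)
    return m.group("status") if m else None


if __name__ == "__main__":
    for line in ["- [X] done", "- [ ] todo", "- [-] partial", "- [[Test]]"]:
        print(repr(line), "->", repr(checkbox_status(line)))
```

Requiring a single status character followed by "]" is what rules out the link: in "- [[Test]]" the character after the first "[" is another "[".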


Successfully merging this pull request may close these issues.

Sexp diary entries support