feat!: enhance multiline (expr) parsing #35

milisims · 2022-08-16T15:28:08Z

This PR flattens (expr) in (paragraph), (contents), and (description), and adds (nl) nodes.

(expr) nodes, which previously contained a sequence of anonymous "str", "num", and "sym" nodes, are replaced with a corresponding sequence of (str), (num), and (sym) nodes. Where only a single (expr) can appear (block names, properties, directive names, etc.), the (expr) node still exists but now contains named nodes instead of anonymous ones. For example, a block starting with #+begin_ab3 is parsed as (expr (str) (num)).
In (paragraph), (item), and (fndef (description)), there's now just a sequence of (str), (num), (sym), and (nl) nodes, except that (item) never contains (nl).

Note that (sym ":") still works for ASCII symbols as (expr) previously did, so we don't need to check for ASCII symbols explicitly in a predicate. This makes querying for those symbols quite fast, since they're part of the AST and don't require a predicate to check.

In (paragraph), (fndef (description)), and (contents) (which appears in drawer, block, dynamic_block, and latex_env), each newline now gets its own node: (nl).

This will resolve #31 and #26 by enabling queries for single line items:
Fixed width area:

(paragraph . (sym ":") @fixed_width_start [(str)(num)(sym)]* @fixed_width_text (nl)) ; matches first line
(paragraph (nl) (sym ":") @fixed_width_start [(str)(num)(sym)]* @fixed_width_text (nl)) ; matches every other line

I tried several combinations of anchors but couldn't get this down to a single pattern that matches both the first line and every subsequent line.
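As a rough illustration of what the two patterns above select, here is a small Python model of the flattened node stream. It is only a sketch: `tokenize` and `fixed_width_lines` are hypothetical helpers, not part of the grammar, and the real scanner's tokenisation rules may differ. Because the model splits on (nl) first, one rule covers both the first line and the later ones, which is exactly what the two query patterns have to do separately.

```python
import re

# Hypothetical model of the flattened paragraph children. The real grammar's
# tokenisation may differ in details (for example, what counts as a (sym)).
TOKEN = re.compile(
    r"(?P<nl>\n)|(?P<num>\d+)|(?P<str>[^\W\d_]+)|(?P<sym>[^\w\s]|_)"
)


def tokenize(text):
    """Return a list of (kind, text) pairs, skipping spaces and tabs."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]


def fixed_width_lines(text):
    """Collect the tokens of each line whose first token is (sym ":")."""
    lines, current = [], []
    for kind, value in tokenize(text):
        if kind == "nl":
            # An (nl) node closes the current line.
            lines.append(current)
            current = []
        else:
            current.append((kind, value))
    if current:
        lines.append(current)
    # Drop the leading ":" and keep the rest of the line as its text tokens.
    return [line[1:] for line in lines if line and line[0] == ("sym", ":")]
```

For instance, `fixed_width_lines(": foo 12\nbar\n: baz\n")` keeps the first and third lines and skips the plain `bar` line.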

For #31, sexp diary entries will require a predicate and some effort if you want to support multiline expressions as Emacs' Org mode does, but single-line support is straightforward, as above. For the multiline version, lua-match? with something like %b() would be helpful if you're using Neovim.
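For readers not using Lua, the balanced-match idea behind %b() can be sketched in a few lines of Python. This is an illustration only: `match_balanced` is a hypothetical helper, and like %b() it just counts delimiters, so it knows nothing about quoted strings inside the expression.

```python
def match_balanced(text, start=0, open_ch="(", close_ch=")"):
    """Return the balanced substring starting at text[start], or None.

    Like Lua's %b(), this only counts delimiters; a ")" inside a quoted
    string still closes a level.
    """
    if start >= len(text) or text[start] != open_ch:
        return None
    depth = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth == 0:
                # Found the delimiter that closes the opening one.
                return text[start:i + 1]
    # Ran out of text before the expression was closed.
    return None
```

Since it scans characters rather than lines, an expression that continues on the next line is matched the same way as a single-line one.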

kristijanhusak · 2023-04-15T18:08:08Z

@milisims what's the state of this PR? Should I give it a test, or is it still WIP?

milisims · 2023-04-15T18:58:00Z

The flattening of (expr) into (sym) (str) and (num) is definitely concrete and staying, and (sym) has the anonymous nodes for ascii symbols still, so those changes will be stable.

For the other change, my goal is to facilitate querying ambiguous markup. What I did here was add the "pre" "mid" and "post" signals (empty nodes) as a part of (sym). Basically, if the symbol is immediately before, in between, or after alnum characters, the relevant anonymous node is shown. So a bold query can be simplified to (paragraph (sym "*" "pre") @start (sym "*" "post") @stop).
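To make the signal rule concrete, here is a minimal Python sketch of how "pre", "mid", and "post" could be derived from a symbol's immediate neighbours, plus a naive pairing that mimics the bold query. `sym_signal` and `bold_spans` are hypothetical names, and the pairing (first "pre" star with the next "post" star) is an assumption for illustration, not how the query engine matches.

```python
def sym_signal(text, i):
    """Signal for the symbol at text[i]: 'pre', 'mid', 'post', or None."""
    before = i > 0 and text[i - 1].isalnum()
    after = i + 1 < len(text) and text[i + 1].isalnum()
    if before and after:
        return "mid"
    if after:
        return "pre"
    if before:
        return "post"
    return None


def bold_spans(text):
    """Pair each '*' tagged 'pre' with the next '*' tagged 'post'."""
    spans, start = [], None
    for i, ch in enumerate(text):
        if ch != "*":
            continue
        signal = sym_signal(text, i)
        if signal == "pre" and start is None:
            start = i
        elif signal == "post" and start is not None:
            spans.append(text[start:i + 1])
            start = None
    return spans
```

With this rule a star between two alphanumerics, as in `2*3`, is "mid" and never opens or closes a span.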

However, right now /*a*/ is parsed as (sym "/").(sym "*" "pre").(str).(sym "*" "post").(sym "/"), which has pros and cons. This lets the user be explicit about whether to allow double markup like that, but if they do want double markup, it renders the "pre" and "post" kind of useless. Additionally, it's already not helping with parsing objects like links, because of the double symbols.

So, there are three options:

1. Drop the pre/mid/post signals, or leave it as is.
2. Propagate the signals: /*a*/ is parsed as (sym "/" "pre").(sym "*" "pre").(str).(sym "*" "post").(sym "/" "post")
3. Add a pre/post field: (sym "char" pre: "str/sym/num", post: "str/sym/num")

Since it's optional and very cheap to have the signals, I'm not really interested in dropping them. The second is barely more expensive, but I think it's the simplest to use and understand. Option 3 gives the most control, but take for example: a]/*a*/. Option 2 says the symbols in the middle are "mid", while option 3 requires additional processing to figure that out. I like the idea of some combination of propagating signals and using a pre/post field, because they can be negated usefully in a query.
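A small Python sketch of the propagation idea may help when comparing the options. It assumes that a run of adjacent symbols shares one signal, decided by the characters on either side of the whole run. `propagated_signals` is a hypothetical helper, not the scanner's implementation.

```python
def propagated_signals(text):
    """Return (char, signal) for every non-alphanumeric, non-space character."""

    def is_symbol(ch):
        return not (ch.isalnum() or ch.isspace())

    out = []
    i = 0
    while i < len(text):
        if not is_symbol(text[i]):
            i += 1
            continue
        # Find the end of this run of adjacent symbols.
        j = i
        while j < len(text) and is_symbol(text[j]):
            j += 1
        before = i > 0 and text[i - 1].isalnum()
        after = j < len(text) and text[j].isalnum()
        if before and after:
            signal = "mid"
        elif after:
            signal = "pre"
        elif before:
            signal = "post"
        else:
            signal = None
        # Every symbol in the run gets the same signal.
        out.extend((ch, signal) for ch in text[i:j])
        i = j
    return out
```

On `/*a*/` this gives "pre" to both leading symbols and "post" to both trailing ones, and on `a]/*a*/` the whole `]/*` run comes out as "mid".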

Currently I'm just writing some play queries to figure out a scheme that makes the most sense, if there is one. Have any thoughts?

kristijanhusak · 2023-04-15T20:23:27Z

From the given options, I think I would prefer option 2, but option 3 would also be helpful. My current implementation works, but it's far from perfect and has a bunch of bugs and edge cases. I'm looking for a way to simplify it, and both of these options should help, but with option 2 it seems I might be able to drop custom parsing completely.

Whichever way you choose, even the currently implemented one, should be helpful to some extent.

kristijanhusak · 2023-05-28T18:40:08Z

I started preparing a branch with these changes, and I ran into one issue with tags parsing.
This content:

* Test :tag:
  - Test

Is parsed like this:

(document [0, 0] - [2, 0]
  subsection: (section [0, 0] - [2, 0]
    headline: (headline [0, 0] - [1, 0]
      stars: (stars [0, 0] - [0, 1])
      item: (item [0, 2] - [0, 12]
        (str [0, 2] - [0, 6])
        (sym [0, 7] - [0, 8])
        (str [0, 8] - [0, 11])
        (sym [0, 11] - [0, 12])))
    body: (body [1, 0] - [2, 0]
      (list [1, 0] - [2, 0]
        (listitem [1, 2] - [2, 0]
          bullet: (bullet [1, 2] - [1, 3])
          contents: (paragraph [1, 4] - [2, 0]
            (str [1, 4] - [1, 8])
            (nl [1, 8] - [2, 0])))))))

But if it's a single line:

* Test :tag:

It parses it properly:

(document [0, 0] - [1, 0]
  subsection: (section [0, 0] - [1, 0]
    headline: (headline [0, 0] - [1, 0]
      stars: (stars [0, 0] - [0, 1])
      item: (item [0, 2] - [0, 6]
        (str [0, 2] - [0, 6]))
      tags: (tag_list [0, 6] - [0, 12]
        tag: (tag [0, 8] - [0, 11])))))

Generally, it seems that adding a second line with any type of content immediately breaks parsing the tags.

Previously, (expr) had anonymous "str", "num", and "sym" nodes. Those are now exposed. (sym) nodes retain the anonymous symbols, like (sym "*"). Additionally, (sym next: "str") indicates the symbol is immediately before a (str), and (sym prev: "num") indicates the symbol is immediately after a number.

Add (nl) in multiline text:
- (paragraph)
- (fndef (description))
- (contents), in drawers, blocks, dynamic blocks, and latex_envs

Add "sub" and "final" fields to (stars)
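The prev/next fields can be modelled in Python as follows. This is a sketch under assumptions: the regex tokenizer and the `sym_fields` helper are hypothetical, and whether the real fields can also take the value "sym" is not stated here, although the model allows it.

```python
import re

# Hypothetical tokenizer: digits become (num), letters (str), and any other
# non-space character a one-character (sym).
TOKEN = re.compile(r"(?P<num>\d+)|(?P<str>[^\W\d_]+)|(?P<sym>[^\w\s]|_)")


def sym_fields(text):
    """Return (symbol, prev, next) for each symbol in text.

    prev/next name the kind of the token touching the symbol on that side,
    or None when there is whitespace or nothing there.
    """
    tokens = list(TOKEN.finditer(text))
    result = []
    for k, m in enumerate(tokens):
        if m.lastgroup != "sym":
            continue
        prev = nxt = None
        if k > 0 and tokens[k - 1].end() == m.start():
            prev = tokens[k - 1].lastgroup
        if k + 1 < len(tokens) and tokens[k + 1].start() == m.end():
            nxt = tokens[k + 1].lastgroup
        result.append((m.group(), prev, nxt))
    return result
```

A query that wants an opening bold marker would then look for a `*` whose next is "str" and whose prev is empty.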

kristijanhusak · 2023-07-20T12:49:26Z

Tags issue is fixed, thanks!
I ran into another one with checkboxes and links:

This content:

- [[Test]]

Generates this tree:

(document [0, 0] - [1, 0]
  body: (body [0, 0] - [1, 0]
    (list [0, 0] - [1, 0]
      (listitem [0, 0] - [1, 0]
        bullet: (bullet [0, 0] - [0, 1])
        checkbox: (checkbox [0, 1] - [0, 9]
          status: (status [0, 3] - [0, 8]))
        contents: (paragraph [0, 9] - [1, 0]
          (sym [0, 9] - [0, 10])
          (nl [0, 10] - [1, 0]))))))

It treats the link as a checkbox
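For comparison, here is a minimal Python sketch of the distinction I'd expect. It assumes only Org's three standard checkbox states ([ ], [X], [-]) count as a checkbox and that anything else after the bullet is paragraph content. `split_list_item` is a hypothetical helper, it only handles `-`, `+`, and numbered bullets, and the grammar's actual rule may well be looser than this.

```python
import re

# Assumption: a checkbox is exactly "[ ]", "[X]" or "[-]", followed by
# whitespace or the end of the line.
CHECKBOX = re.compile(r"\[([ X-])\](?=\s|$)")
BULLET = re.compile(r"\s*([-+]|\d+[.)])\s+")


def split_list_item(line):
    """Return (bullet, checkbox_status, rest) for a one-line list item."""
    m = BULLET.match(line)
    if not m:
        return None
    bullet = m.group(1)
    rest = line[m.end():]
    box = CHECKBOX.match(rest)
    if box:
        return bullet, box.group(1), rest[box.end():].lstrip()
    # No checkbox: everything after the bullet is content, links included.
    return bullet, None, rest
```

With this rule, `- [[Test]]` has no checkbox and its content is the whole `[[Test]]` link, while `- [X] done` still yields the status `X`.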

milisims force-pushed the flatten-expr branch from 9e401aa to 5b16421 on August 16, 2022 15:57

milisims force-pushed the flatten-expr branch from 5b16421 to 9f22148 on September 3, 2022 21:20

milisims force-pushed the flatten-expr branch from 88771a0 to 2fdfbb7 on September 12, 2022 16:19

milisims force-pushed the flatten-expr branch from 2fdfbb7 to 146d75c on October 27, 2022 20:24

milisims force-pushed the flatten-expr branch from 146d75c to 5bc1f6e on December 3, 2022 22:44

milisims force-pushed the flatten-expr branch from 5bc1f6e to fa0db33 on April 15, 2023 18:04

milisims force-pushed the flatten-expr branch from fa0db33 to e538c2b on June 19, 2023 22:13

kristijanhusak mentioned this pull request Feb 11, 2024

fix: Link treated as checkbox #45

Open
