Add prefix of the `part-end` delimiter to the `part` #387

Leonidas-from-XIV · 2022-07-27T15:22:09Z

#374 reports that the last line gets cut off when using (* $MDX part-end *). This is generally true because Part.Parse_parts.parse_line can only parse a line into a Part_end or a Normal, but a line with an end-marker is sort of both.

This PR adds the prefix of the line to the Part_end constructor so if there is any code that is not whitespace, it will get added to the constructor from where the code can attach the missing line to the part.

I've been thinking about the design of this solution because attaching the value to the Part_end constructor feels wrong since Ocaml_delimiter only deals with delimiters and not the payload of the block and this is changing it. I feel that maybe the parsing ought to be done differently and signal that there is something else on the line that still needs to be processed so it would get processed into something more like [Normal prefix; Part_end].

As such, I welcome other suggestions how to solve this in a "nice" way, without breaking the separation between the parsing and the delimiters in ugly ways.

Fixes #374.

NathanReb · 2022-08-01T13:33:47Z

Ocaml_delimiter is only used by Parts so removing it entirely would not be too bad if we test the latter properly.

Parts is in a pretty sad state from the look of it so we could cease this as an occasion to improve it a little bit.

Another question I'm asking myself is whether this is the right way to fix this issue. It seems that enforcing delimiters sit on their own line is sensible. Maybe we should start a conversation with the ocamlformat team to try to see how we could come to an agreement on this.

If there is nothing to be done on their side we can revert to allow inline part, or at least end delimiters I guess. The reason I'd no be too kin to support it is that I think it can lead to pretty ambiguous cases, especially if in the example from the linked issue, the end delimiter was directly followed by a begin one. Inline begin delimiters can be tedious to deal with, especially regarding indentation.

Leonidas-from-XIV · 2022-08-01T13:57:23Z

I agree that the current implementation is not great, but I think it is good as a starting point for discussion. I think we agree that the current way, where it silently drops the last line is bad - even a version where it would error out would be better because it wouldn't lead to potentially subtle errors where the inlined code is simply missing lines but it is only noticed way too late. Throwing an error in that case would be pretty simple and require fewer changes than what's in this branch already.

As a matter of fact, we weren't aware of the issue until the bugreport either, so it shows how fragile the current baheviour is.

Making OCamlformat not put the delimiters in-line and emitting an error message if people do that by hand could be a decent compromise, because as you say, the semantics of in-line delimiters are a bit complicated.

If we do decide to go forward with allowing inline end-delimiters, I'll probably drop Ocaml_delimiter because the more I think about it the less sense it makes to split the code this way.

NathanReb · 2022-08-02T08:09:31Z

Oh no don't get me wrong, there's no particular problem with the implementation. I'm merely wondering what the best fix would be here!

NathanReb

Looks good to me!

I have a few minor comments but hen we should be good. I think some follow up refactoring of Parts could help improve the code overall but we can do this a bit later, in particular, we could do it once we drop the attributes based delimiters support!

lib/part.mli

NathanReb · 2022-08-04T09:56:54Z

lib/part.ml

-        let msg = Printf.sprintf "File ended before part %s." p in
-        Error (msg, nline)
-    | File_end, None -> Ok [ make_part ~is_begin_end_part:true part_lines ]
+    match input_line_err input with


I think I find the new implementation, with nested pattern matching, a bit harder to follow.

Maybe we could extract a separate, mutually recursive function to handle the Ok _ case.

What do you think?

I find it a bit easier to follow because it really had two concerns:

EOF handling

Parsing

File_end was not a result of parsing, it is just a signal to the parser to stop. But admittedly, the function is quite complex and nested (and it took me a while to understand), so you correctly point out that that it still conflates parsing and IO. It's a good point, let me try to improve it.

I rewrote parse_parts further:

Instead of reading from the channel, I read into a list, thus the pattern match on the End_of_file marker became a pattern match on [], that is, the end of the input. This of course means it still has to deal with these two cases.

Then I realized that the function doesn't need to be recursive. What was making the function hard to read for me was the recursive calls with >>| to cons things to the list. I noted that fold_left can do the same thing:

Supply an element to work on and terminate when done

Deal with the state in the accumulator
Thus thus the function could be split into "deal with the inputs" via fold_left and "deal with the tail" based on the result of the accumulator. Given our minimum version is 4.08, let* helps to unwrap the arguments.

Finally, given I changed from a version that lazily reads out of an in_channel to a version that eagerly reads the input into a list, I thought I can put the laziness back in by switching from List.fold_left to Seq.fold_left which was indeed a drop-in replacement.

I believe the resulting function is a bit easier to understand given each piece of code only deals with one aspect and the fold is a bit simpler to follow through than the recursion. Fortunately the tests are pretty comprehensive so I immediately got feedback when some part of the logic was implemented incorrectly.

Leonidas-from-XIV · 2022-08-10T14:39:42Z

I've been iterating on and off on this, thinking of smaller improvements that lead to other improvements, that again lead to other refactorings. As such the removal of the File_end constructor from Parts_part.t and the addition of Content in Ocaml_delimiter.t has shown that these types are identical, just spelled somewhat differently. Thus this made me unify the types, at which point it seemed to me that it would be most sensible to just join these two modules.

I've reworked the test a bit: the test cases of Parts_part and of Ocaml_delimiter were mostly similar except the latter tested the error cases as well. I've exposed a function that returns the result type and added these test cases to the Parts test as well.

I think this is overall an improvement, because the tests tested very similar functionality, if the Ocaml_delimiter test would fail, the Parts test with the same input would fail in the same way which did not add much value to the tests. Unifying this is in my opinion a bit easier to understand and also solves the design problem that lines previously could only be a separator or content.

This is a test case for #374 to show that there is an issue with parts of code missing.

Fixes #374

While there could be a case for modelling it this way, it makes the code tricker to understand and `File_end` is, after all, not a valid parsing result either way. Having the parsing separated makes the function a bit more nested but it becomes easier to understand.

Having them separately makes no sense since one is just a reformulation of the other.

gpetiot · 2022-09-21T08:45:00Z

test/lib/test_part.ml

+        (Error
+           (`Msg
+             "'part-end' delimiter does not accept a value. Please write '(* \
+              $MDX part-end *)' instead."));


Not important but since this part of the API is now internal, wouldn't it better to test it with ppx_expect instead ? So we can not export this module anymore.

That's well possible, but it would introduce an additional dependency since we don't depend on ppx_expect at the moment. While this could certainly be done in the future, by refactoring (all the relevant) tests, I wouldn't add it to this PR because that would be scope-creep.

gpetiot

LGTM, this is a clear improvement indeed. Maybe in the future it would be neat to support multiline comments such as

(* $MDX
       part-end *)

I tend to prefer ocamllex over regexps so I'm a bit biased. Maybe not having to maintain the token flow line-by-line would make it easier to parse as well? But mdx is already designed this way in its current state so this can wait.

Leonidas-from-XIV · 2022-09-21T09:04:46Z

I would generally agree on ocamllex. However I kept the parsers as regexp for now since one of them is deprecated (and we should try to remove it for MDX 3, #396), at which point it should be easier to switch the other one to ocamllex instead of porting both now and deleting one later or having a mix of two different ways of parsing.

@NathanReb

CHANGES: #### Added - Report all parsing errors in Markdown files (realworldocaml/mdx#389, @NathanReb) #### Changed - Preserve indentation in multiline OCaml blocks in .mli files (realworldocaml/mdx#395, @panglesd) #### Fixed - Fixed compatibility with Cmdliner 1.1.0 (realworldocaml/mdx#371, @Leonidas-from-XIV) - Report errors and exit codes of toplevel directives (realworldocaml/mdx#382, @talex5, @Leonidas-from-XIV) - Fix block locations in error reporting (realworldocaml/mdx#389, @NathanReb) - Include the content of the line that features the `part-end` MDX directive in the output, before that line would've been dropped (realworldocaml/mdx#374, realworldocaml/mdx#387, @Leonidas-from-XIV) - Handle EINTR signal on waitpid call by restarting the syscall. (realworldocaml/mdx#409, @tmcgilchrist) - Fix parsing of multiline toplevel phrases in .mli files (realworldocaml/mdx#394, realworldocaml/mdx#397, @Leonidas-from-XIV) #### Removed - Removed warning about missing semicolons added in MDX 1.11.0 and the automatic insertion of semicolons in the corrected files introduced in MDX 2.0.0. (realworldocaml/mdx#398, @Leonidas-from-XIV)

Leonidas-from-XIV requested a review from NathanReb July 27, 2022 15:22

NathanReb reviewed Aug 4, 2022

View reviewed changes

Leonidas-from-XIV requested a review from NathanReb August 10, 2022 14:30

Leonidas-from-XIV added 19 commits September 1, 2022 16:56

Add test for missing line when including file parts

3a8eab9

This is a test case for #374 to show that there is an issue with parts of code missing.

Use same ws definition on both branches

a32356c

Simplify parsing of lines

8596a54

Add the prefix of a Cmt part-end directive to the part

5b12de0

Fixes #374

Add to Changelog

c72f316

Simplify type names and testables

f057db3

Rename constructor

6509d4b

Push parsing into reading from IO

7e3b133

Allow parsing more than one item per line

bb744a8

Rework recursive function into a fold

680c912

Switch from list to a lazy seq

471d8ee

Add implementation of Seq.append for OCaml < 4.11

ebbba92

Use named struct in internal functions

deb30e2

Unify Part.t and Ocaml_delimiter.t types

ddac131

Having them separately makes no sense since one is just a reformulation of the other.

Remove type duplication

a9ebfca

Merge Ocaml_delimiter into Parts

4f7c2e6

Use Testable.errormsg

f453fc9

Remove unused testing function

fc7bf20

Leonidas-from-XIV requested a review from gpetiot September 20, 2022 13:51

gpetiot reviewed Sep 21, 2022

View reviewed changes

gpetiot approved these changes Sep 21, 2022

View reviewed changes

Leonidas-from-XIV merged commit cb9c545 into realworldocaml:main Sep 21, 2022

Leonidas-from-XIV deleted the mdx-markers branch September 21, 2022 09:37

dwang20151005 mentioned this pull request Oct 25, 2022

support parts from more file formats #402

Open

Leonidas-from-XIV mentioned this pull request Jan 6, 2023

[new release] mdx (2.2.0) ocaml/opam-repository#22867

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add prefix of the `part-end` delimiter to the `part` #387

Add prefix of the `part-end` delimiter to the `part` #387

Leonidas-from-XIV commented Jul 27, 2022

NathanReb commented Aug 1, 2022

Leonidas-from-XIV commented Aug 1, 2022

NathanReb commented Aug 2, 2022

NathanReb left a comment

NathanReb Aug 4, 2022

Leonidas-from-XIV Aug 4, 2022

Leonidas-from-XIV Aug 5, 2022

Leonidas-from-XIV commented Aug 10, 2022

gpetiot Sep 21, 2022

Leonidas-from-XIV Sep 21, 2022

gpetiot left a comment

Leonidas-from-XIV commented Sep 21, 2022

Add prefix of the part-end delimiter to the part #387

Add prefix of the part-end delimiter to the part #387

Conversation

Leonidas-from-XIV commented Jul 27, 2022

NathanReb commented Aug 1, 2022

Leonidas-from-XIV commented Aug 1, 2022

NathanReb commented Aug 2, 2022

NathanReb left a comment

Choose a reason for hiding this comment

NathanReb Aug 4, 2022

Choose a reason for hiding this comment

Leonidas-from-XIV Aug 4, 2022

Choose a reason for hiding this comment

Leonidas-from-XIV Aug 5, 2022

Choose a reason for hiding this comment

Leonidas-from-XIV commented Aug 10, 2022

gpetiot Sep 21, 2022

Choose a reason for hiding this comment

Leonidas-from-XIV Sep 21, 2022

Choose a reason for hiding this comment

gpetiot left a comment

Choose a reason for hiding this comment

Leonidas-from-XIV commented Sep 21, 2022

Add prefix of the `part-end` delimiter to the `part` #387

Add prefix of the `part-end` delimiter to the `part` #387