Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCaml formatting #16

Closed
torhovland opened this issue Aug 31, 2022 · 28 comments · Fixed by #53
Closed

OCaml formatting #16

torhovland opened this issue Aug 31, 2022 · 28 comments · Fixed by #53
Assignees

Comments

@torhovland
Copy link
Member

Look at how these do it:

https://github.com/OCamlPro/ocp-indent
https://github.com/ocaml-ppx/ocamlformat

@aspiwack
Copy link
Member

Cc @Niols

@aspiwack
Copy link
Member

I want to clarify that it is not an objective to match the result of either of these tools (not that it's a bad idea to look at these tools, I just wanted to prevent ambiguities).

@Niols
Copy link
Member

Niols commented Aug 31, 2022

I believe it would be a great and exciting idea to target OCaml in this project:

  • It is fairly popular within Tweag. I have no idea about actual statistics but there seem to be quite a few people using it personally or professionally.

  • In my opinion, it does not enjoy yet a good formatter (cf below for my opinion of OCamlFormat).

  • Compared to Haskell, it is, I think, easier to parse and format. In particular, the language extensions are much more limited in what they can do to OCaml's syntax.

  • In its entirety, it is quite big, but it is possible to only support the “core” language in a first step and extend later to PPXes,
    object-oriented programming, etc.

I had started writing a few considerations about formatters and about what I'd expect from an OCaml formatter in particular (I wanted to start writing an “Ormolu-for-OCaml”, but this pretty much gets superseeded by tree-sitter-formatter). This is not exactly the place where to discuss this, but it might help define what we want to expect (or not) from tree-sitter-formatter's OCaml query file. I'll first share an opinion on what makes a “good” formatter for a rich language like OCaml (but this applies, in my opinion, to many others). After that, I will share a few random notes on things to pay attention to.

I can't wait to see how this project unfolds!

OCamlFormat vs. Ormolu

In my (very personal) opinion, formatters should be opinionated but they should take into account the choices the user made in their input. Basically, I am saying that Ormolu is getting right what OCamlFormat is getting wrong. I personally have never managed to use OCamlFormat in a satisfactory way and I truly believe that it comes from the two following points:

  • OCamlFormat produces one output, no matter what the input is, and
  • OCamlFormat is highly parameterizable.

My tests of OCamlFormat were therefore always in the lines of:

  • deciding to give OCamlFormat another try,
  • tweaking the format to get something closer to my style of OCaml,
  • realizing that I could not achieve my usual style, in particular because the rules for my style were hard to determine,
  • deciding to give it a go anyways, and
  • eventually giving up, because I could not get used to the format and always wanted to tweak something else.

I haven't done a survey of the OCaml ecosystem, but I know I share this kind of experiences with at least some OCaml programmer friends. In my opinion, Ormolu makes two points that are crucial for a formatter to be actually used (and that OCamlFormat gets wrong):

  1. Ormolu is extremely opinionated, “implementing one ‘true’ formatting style which admits no configuration.” I believe this makes it much easier to decide whether to use it or not, which (maybe counter-intuitively) makes it simpler for users to adopt it. It also avoid phases of “negotiations” where multiple users argue for their style to be chosen.

  2. Ormolu lets “the layout of the input [influence] the layout choices in the output.” I believe that this is also crucial: some rules are simply too hard to write correctly and “get right” every time. Taking the user's input into consideration makes for simpler code in the formatter itself while letting the user decide on a case-by-case basis what to do.

I am a bit afraid that those might be difficult to achieve with tree-sitter-formatter, though:

  • For point 1., tree-sitter-formatter will inevitably be configurable by design. That being said, it also means that maintaining a fork of the formatter (eg. Fourmolu for Ormolu) would only be a matter of maintaining a different query, increasing the chances of widespread use of tree-sitter-formatter. It doesn't mean that tree-sitter-formatter shouldn't ship with only one query file for OCaml.

  • For point 2., this sounds pretty hard to implement in tree-sitter-formatter. It either means to design a generic way of detecting a users' input (Ormolu distinguishes between vertical and horizontal layouts, I believe) or to make the queries richer to the point that they allow expressing that kind of detection, which is probably not worth the trouble.

Random Considerations

Some of these considerations are specific to OCaml formatting while some others are actually quite generic to formatting in general. I'll also add that they also come from my opinion and not an absolute truth.

  • It is my understanding that Tuareg and ocp-indent are two widely-used indenters (in the case of Tuareg, it is also much more) for OCaml. A new formatter for OCaml should probably be compatible with those, that is it should strive to produce an output that is correctly indented for Tuareg and ocp-indent. (The two tools are parameterizable, so maybe just something close to their default configuration.)

  • In general, the formatter should be made accessible easily in all modern editors. In the specific case of OCaml, the formatter should also be made available as a backend to Dune's automatic formatting's functionality.

  • Typical OCaml indenters and formatters handle OCamlLex, OCamlYacc and Menhir poorly. Considering how important these three languages are in the OCaml ecosystem, I believe it is necessary to treat them as well. It could be the “killer feature” that makes this formatter being used instead of others. I have no idea whether Tree Sitter supports those other languages nicely or not.

  • It might be worth considering a formatter for ReasonML although, because the syntax is so radically different, it might not make sense to support both OCaml and ReasonML in the same tool. I also have no experience with ReasonML's syntax and I have no idea whether there already exists good formatters for it.

  • Note that Ormolu follows more principles than the two discussed above, namely:

    • Using the language's parser directly for full compatibility. This is obviously not applicable here because of Tree Sitter.

    • Writing code that is easy to modify and maintain. This should probably apply to any project.

    • Having a formatting style that results in minimal diffs. I am unsure what to think about this point but it is good to keep in mind.

    • Choosing a style that is compatible with modern dialects of the language. This is less applicable to OCaml than to Haskell, but the formatter should handle modern (eg. PPXes) and less widely-used parts of the language (eg. objects).

    • Being idempotent. This only makes sense in conjunction with point 2. above and it is crucial to make it work.

    • Being well-tested and robust. This should probably apply to any project.

@torhovland
Copy link
Member Author

Great input!

We are going to follow Ormolu's principle of letting the user input decide whether a block should be single-line or multi-line, and also of staying opinionated.

It remains to be seen if tree-sitter-formatter will be able to give you satisfactory OCaml formatting, but hopefully the query language is powerful enough to let us.

I'm curious, though, when you say "taking the user's input into consideration", are you just referring to line breaks, or more? Because detecting if the user wants to break a block into multiple lines is indeed within the scope of this project. See #14.

There doesn't seem to be any tree-sitter grammar for lex and yacc, unfortunately, but maybe you should just use tree-sitter instead of those 😛

There is some support for ReasonML, though.

@Niols
Copy link
Member

Niols commented Sep 1, 2022

I'm curious, though, when you say "taking the user's input into consideration", are you just referring to line breaks, or more? Because detecting if the user wants to break a block into multiple lines is indeed within the scope of this project. See #14.

Very nice! I mostly meant that, although I've always been wondering if it wouldn't be worth also detecting users that prefer operators at the beginning or the end of a line, eg.

do_something
|> do_something_else

do_something |>
do_something_else

or stuff like

do_something @@ fun x ->
do_something_else

do_something
@@ fun x -> do_something_else

or in types:

val foo : int ->
          float ->
          bool

val foo : int
       -> float
       -> bool

etc. I am not sure if it is possible to just get it right. Besides, it might depend a lot on the user and even on the operator (I tend to use |> at the beginning on the line, but @@ fun x -> at the end of the line, for instance). I don't know how Ormolu does it. At some point, there has to be a limit where we actually are opinionated, we can't just have a formatter keep the user's input as-is :p)

@Niols
Copy link
Member

Niols commented Sep 1, 2022

There doesn't seem to be any tree-sitter grammar for lex and yacc, unfortunately, but maybe you should just use tree-sitter instead of those stuck_out_tongue

I suppose it would not be the worst! But a lot of people are using those, so I suppose it would make sense to support them (at least in Tree Sitter-based syntax highlighting). I know there's been some work on supporting OCamlLex, OCamlYacc and Menhir in Tree Sitter, but I don't know how stable that is.

@torhovland torhovland self-assigned this Sep 8, 2022
@torhovland
Copy link
Member Author

@Niols How would you expect a formatter to treat this:

type t =
 {mutable buffer : bytes;
  position : int}

Specifically, would this be a good output?

type t = {
    mutable buffer : bytes;
    position : int;
}

I don't even know if that last semicolon is allowed.

@Niols
Copy link
Member

Niols commented Sep 8, 2022

The last semicolon is allowed and I suppose it makes sense to add it as per the rule to avoid diffs. I tend to go for:

type t = {
    mutable buffer : bytes;
    position : int;
  }

and that's what Tuareg does. Out of the box, ocp-indent seems to go for:

type t = {
  mutable buffer : bytes;
  position : int;
}

In any case, I wouldn't align position with buffer.

@Niols
Copy link
Member

Niols commented Sep 8, 2022

Let me make it clearer that what Tuareg and ocp-indent do is only the indentation. I tend to break record definitions (type or value) just like you did. I've also seen:

type t =
  { mutable buffer : bytes;
    position : int }

but I suppose we can rule that out as per the diff rule.

I also suppose that, for short records, a user might want to go for:

type t = { mutable buffer : bytes; position : int }

so maybe there is a vertical vs. horizontal layout consideration here.

@Niols
Copy link
Member

Niols commented Sep 8, 2022

(Should have answered to this in a clearer way)

Specifically, would this be a good output?

Yes!

@torhovland
Copy link
Member Author

torhovland commented Sep 8, 2022

That last one is allowed, because the user decides between single/multi-line.

The ocp-indent is the easiest to get working, because it's a simple indent start/end on each brace. The Tuareg format may be possible, but it would mean a pair of indent starts on opening brace, and then prepending and appending indent ends to the closing brace. Might be easy, but the query gets a little more complex. I'll stick with the simplest solution for now, then.

@Niols
Copy link
Member

Niols commented Sep 8, 2022

That's definitely a good reason to go for ocp-indent's style. I suppose this calls for a side-project in the future that provides a configuration for Tuareg compatible with tree-sitter-formatter's output.

@torhovland
Copy link
Member Author

Yes, either that, or we simply extend our query to support Tuareg.

Anyway, let's see what other challenges I run into first. Such as this kind of indenting:

let to_seq b =
  let rec aux i () =
    if i >= b.position then Seq.Nil
    else
      let x = Bytes.unsafe_get b.buffer i in
      Seq.Cons (x, aux (i+1))
  in
  aux 0

@aspiwack
Copy link
Member

aspiwack commented Sep 8, 2022

I'm under the – possibly naive – belief that ocp-indent is used much more than Tuareg's native indenter these days. At any rate, it's the style I've been using for many years. So I'm all for it.

Let's normalise the final semicolon, please.

@aspiwack
Copy link
Member

aspiwack commented Sep 8, 2022

Anyway, let's see what other challenges I run into first. Such as this kind of indenting:

let to_seq b =
  let rec aux i () =
    if i >= b.position then Seq.Nil
    else
      let x = Bytes.unsafe_get b.buffer i in
      Seq.Cons (x, aux (i+1))
  in
  aux 0

This should format to

let to_seq b =
  let rec aux i () =
    if i >= b.position then
      Seq.Nil
    else
      let x = Bytes.unsafe_get b.buffer i in
      Seq.Cons (x, aux (i+1))
  in
  aux 0

@Niols
Copy link
Member

Niols commented Sep 8, 2022

I'd go exactly like Arnaud on this one.

@Niols
Copy link
Member

Niols commented Sep 8, 2022

I'm under the – possibly naive – belief that ocp-indent is used much more than Tuareg's native indenter these days. At any rate, it's the style I've been using for many years. So I'm all for it.

I think so. And OCaml is definitely used outside of Emacs a lot nowadays.

Let's normalise the final semicolon, please.

Yes!

@torhovland
Copy link
Member Author

I added #42.

@torhovland
Copy link
Member Author

torhovland commented Sep 8, 2022

Do you want to put a hard line break after then in:

if i >= b.position then Seq.Nil

or do you consider this a soft line that the user has decided to be a single line?

@aspiwack
Copy link
Member

aspiwack commented Sep 8, 2022

I think that if u then v else w is a single form. If it's single line, it should be printed just like I did, if it's multi-line, it should be

if u then
  v
else
  w

@torhovland
Copy link
Member Author

torhovland commented Sep 8, 2022

Thing is that in the CST it is structured like this:

if_expression (multi-line)
  if infix_expression then_clause (siblings on the same line)
  else_clause (multi-line sibling)

By the multi-line rule everything before else needs to be allowed to stay single-line, then. Agreed?

If you really want a line break there, it is up to you as the programmer to put it there, isn't it?

@torhovland
Copy link
Member Author

Another thing: ">=" and "+" are both infix_operator. It would be easiest if they had consistent spacing around them.

@Niols
Copy link
Member

Niols commented Sep 8, 2022

I'm not sure I understand the CST considerations. But does that mean that both

if u then let x = v1 in v2
else let y = w1 in w2
if u then let x = v1 in v2
else
  let y = w1 in
  w2
if u then
  let x = v1 in
  v2
else let y = w1 in w2
if u then
  let x = v1 in
  v2
else
  let y = w1 in
  w2

are possible depending on the user's choice? It sounds reasonable, especially if it makes for a simpler query, but 2) and 3) look quite odd to me.

@torhovland
Copy link
Member Author

Let me check as soon as I am done with the example further up 😄

@torhovland
Copy link
Member Author

Currently getting this:

let to_seq b =
  let rec aux i () =
    (* Note that b.position is not a constant and cannot be lifted out of aux *)
    if i >= b.position then Seq.Nil
    else
      let x = Bytes.unsafe_get b.buffer i in
      Seq.Cons (x, aux (i + 1))
    in
  aux 0

I think that's pretty close 🎉

@Niols
Copy link
Member

Niols commented Sep 8, 2022

Pretty good indeed!

@torhovland torhovland mentioned this issue Sep 8, 2022
@aspiwack
Copy link
Member

aspiwack commented Sep 8, 2022

I guess that if we want to allow both (1) and (4) from @Niols (which is a reasonable choice) we also need to allow (2) and (3). Fair enough.

This was referenced Sep 12, 2022
@torhovland
Copy link
Member Author

I'm now able to format the entire OCaml sample file reasonably well. Not perfectly. But we can start logging specific formatting issues now, I think.

@Niols Your four examples are formatted as expected, see https://github.com/tweag/tree-sitter-formatter/pull/53/files#diff-4b9fd3e2640276ee705e5851f570dca4d2cf44e2884a6c3038ed788cb3a84a85

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants