Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Miden IR lexer, parser and AST #43

Closed
wants to merge 12 commits into from
Closed

Miden IR lexer, parser and AST #43

wants to merge 12 commits into from

Conversation

jjcnn
Copy link
Contributor

@jjcnn jjcnn commented Oct 19, 2023

The lexer, parser and AST are ready for review.

The structure of the files is stolen from the AirScript parser. It is possible that I have kept some things from AirScript that we don't need here, and similarly that I have removed some things we do need, so keep an eye out for that during the review. This is especially true for hir-parser/src/parser/mod.rs and hir-parser/src/lexer/mod.rs.

The grammar is almost precisely the one described here: #14 (comment), but with the following adjustments:

  • All questions regarding the size of integer representations (i8, u32, etc.) have been disregarded. The parser parses all numbers as u128 s, and the transformation to other integer size is left to later stages.
  • The keywords false and true have not been added. Booleans are represented using 0 and 1.
  • Function body delimiters are single braces ({ and }) rather than double braces ({{ and }}), because I misunderstood how Rust formatting strings deal with braces.
  • Operands of the ret operation must be contained in parentheses (( and )). This is to avoid an annoying shift/reduce conflict that could otherwise only be handled by adding some sort of end token for instructions (e.g., ';' or '\n') or by adding some really counter-intuitive non-terminals to the grammar. The hir formatter has been updated to reflect this change.
  • Comments of the // type are allowed.

Apart from the tokenization and grammar structure no validation has been implemented. In particular:

  • No typecheck.
  • No check that the number of operands to a call match the number of arguments of the called function.
  • No check that the number of branch operands match the number of arguments of the target block.
  • No check that all read symbols are defined.
  • No check for SSA.
  • No check for multiple definitions of the same name.

The transformation from AST to hir has not been implemented yet, and is not intended to be part of this PR.

I have not yet added any tests. I intend to add a few tomorrow before this is merged, but I believe the code is ready for review as it is.

@jjcnn
Copy link
Contributor Author

jjcnn commented Oct 19, 2023

The CI check is failing due to fmt, mem and num::IntErrorKind not existing in lexer/mod.rs, but they are imported in line 2. @bitwalker or @greenhat could you take at look at it - I have no idea why it isn't going through.

@bitwalker
Copy link
Contributor

@jjcnn The build error is because of the #[cfg(test)] declaration on the use statement, compilation fails when not compiling for the test profile, but it isn't clear to me why the cfg is there to begin with, it might have been accidental.

@jjcnn
Copy link
Contributor Author

jjcnn commented Oct 19, 2023

@jjcnn The build error is because of the #[cfg(test)] declaration on the use statement, compilation fails when not compiling for the test profile, but it isn't clear to me why the cfg is there to begin with, it might have been accidental.

It's a remnant from what I copied from AirScript. I'll remove it.

@jjcnn
Copy link
Contributor Author

jjcnn commented Oct 19, 2023

The CI now fails because of some failing tests in the wasm frontend. I don't believe these have anything to do with my changes.

@bitwalker
Copy link
Contributor

The CI now fails because of some failing tests in the wasm frontend. I don't believe these have anything to do with my changes.

The changes to the formatter for the IR broke those tests because they are asserting that Wasm lowered to the IR produces particular output. Specifically, it seems braces were added to blocks, and that seems to be tripping up those tests.

@greenhat
Copy link
Contributor

The CI now fails because of some failing tests in the wasm frontend. I don't believe these have anything to do with my changes.

The changes to the formatter for the IR broke those tests because they are asserting that Wasm lowered to the IR produces particular output. Specifically, it seems braces were added to blocks, and that seems to be tripping up those tests.

Yes, added braces seem to be the reason. You can regenerate the expected results in all tests with env UPDATE_EXPECT=1 cargo test

@greenhat
Copy link
Contributor

@jjcnn Don't bother and feel free to mark signed_arith test ignored. If you set the exact rust toolchain as on CI it should work though. I'm reworking it in #41 in anyway.

Copy link
Contributor

@bitwalker bitwalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @jjcnn! I haven't fully completed my review yet, but there are a handful of changes I know we'll need to make already. I will put together a PR against your branch later today/tonight that implements most/all the described changes, so you don't need to tackle them unless you'd like to do so yourself. Here's the list of things I'd like to change:

  • The hir-parser crate should be merged into the hir crate, i.e. move hir-parser/src to hir/src/parser, move hir-parser/build.rs to hir/build.rs, and add the lalrpop/lalrpop-util crates to the dependencies in hir/Cargo.toml
  • After that, remove symbols.rs, as that is provided by hir-symbol which is already re-exported in the hir crate
  • The ast/types.rs file can also go away, since that is provided by hir-types already
  • A number of AST nodes can go away since their actual HIR definitions are available after the changes above, e.g. CallConvention, Parameter(Extension|Purpose), Identifier, FunctionIdentifier, etc.
  • The Program AST type appears to be a leftover from AirScript, and can be removed, files will only be allowed to contain a single Module. Additionally, GlobalVariable declarations are at the module level
  • The FunctionIdentifier grammar rule should be handled by the lexer instead, i.e. the lexer should parse two different types of identifiers based on whether a :: string is found while lexing an identifier (i.e. if it is observed, lex it as a FunctionIdent, if not, then as a regular Ident). The lexer can handle computing the span which covers the module name(space) vs the function name, and intern those strings to produce the Ident/FunctionIdent we want. This will simplify the grammar a bit, while also giving it more contextual information to reduce the chance of ambiguities.
  • There are a couple of rules which can be simplified (namely those that handle parenthesized parameter lists, see CommaOpt below). I'll do a separate review covering those in more detail when I have a bit more time.

I also noticed that there are a number of clippy warnings unhandled, you'll want to make sure to run cargo clippy -p <crate> and pick those off.

// Rule to parse comma-delimited values with zero or more elements
CommaOpt<T>: Vec<T> = {
    <vals:(<T> ",")*> <last: T?> => {
        let mut vals = vals;
        vals.extend(last);
        vals
    },
};

@jjcnn
Copy link
Contributor Author

jjcnn commented Oct 20, 2023

The Program AST type appears to be a leftover from AirScript, and can be removed, files will only be allowed to contain a single Module.

It's not, actually. I took the program definition from hir/src/program/mod.rs because the Module struct doesn't define an entry point. Is there a convention (e.g., a special name) for a module's entry point?

@jjcnn
Copy link
Contributor Author

jjcnn commented Oct 20, 2023

// Rule to parse comma-delimited values with zero or more elements
CommaOpt<T>: Vec<T> = {
    <vals:(<T> ",")*> <last: T?> => {
      ...
    },
};

Doesn't this rule allow for a sequence that ends with a comma?

@bitwalker
Copy link
Contributor

It's not, actually. I took the program definition from hir/src/program/mod.rs because the Module struct doesn't define an entry point. Is there a convention (e.g., a special name) for a module's entry point?

We don't currently indicate an entry point at all in the output format, and I don't think we ever need to do so. If we do decide to do so at some point, it would probably be by adding support for function attributes, and then decorating the entry function with an #[entrypoint] attribute. For now though, we assume that specifying the entrypoint is done after a Module is parsed into the IR. For ease in tests, we could use a conventional name, e.g. main to automate that step, but in general I'd prefer making that a manual process. Putting that aside though, a Program has to be linked from a set of modules, which isn't something that should be done from the parser, and testing on small units of IR is the primary use case for which the parser is intended, so we only really require formatting/parsing single modules rather than multi-module programs (which we can still construct simply by parsing multiple modules and then linking them together).

Doesn't this rule allow for a sequence that ends with a comma?

Yes, but that's fine IMO. In most cases where comma-separated items are parsed, allowing trailing commas simplifies things and doesn't really have any downside.

@jjcnn jjcnn force-pushed the jjcnn-parser branch 4 times, most recently from 5acb983 to 23fb33c Compare October 24, 2023 22:03
@jjcnn
Copy link
Contributor Author

jjcnn commented Oct 24, 2023

I have now addressed all the review comments, added a test of the parser, and rebased the PR against main.

@bitwalker
Copy link
Contributor

Closing this in favor of #48, which builds on the work done by @jjcnn in this PR.

@bitwalker bitwalker closed this Oct 26, 2023
@bitwalker bitwalker deleted the jjcnn-parser branch May 31, 2024 02:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants