Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nix parse: parse a nix expr or nix file to aterm or json syntax tree (AST) #5512

Closed
wants to merge 28 commits into from

Conversation

milahu
Copy link
Contributor

@milahu milahu commented Nov 7, 2021

fix #4726
rebase of #4731

the nix parse CLI looks like this

nix parse input-file.nix
nix parse --expr '(if true then true else false)'
nix parse input-file.nix --output-format json
nix parse input-file.nix --output-format aterm # default format

stress test: parse nixpkgs/pkgs/top-level/all-packages.nix → success, jq says the json is valid

json output sample
nix parse --expr '(if true then true else false)' --output-format json | jq
{
  "type": "ExprIf",
  "line": 1,
  "column": 2,
  "cond": {
    "type": "ExprVar",
    "line": 1,
    "column": 5,
    "name": "true"
  },
  "then": {
    "type": "ExprVar",
    "line": 1,
    "column": 15,
    "name": "true"
  },
  "else": {
    "type": "ExprVar",
    "line": 1,
    "column": 25,
    "name": "false"
  }
}


void Expr::showAsJson(std::ostream & str) const
{
abort();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably produce a descriptive error where someone has to look to fix the situation. I could imagine that with the next subclass of Expr it might be easily forgotten and then this will only yield an obscure runtime error.

Copy link
Contributor Author

@milahu milahu Nov 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uhm ... this is just copy-paste from Expr::show which i renamed to Expr::showAsAterm

void Expr::showAsAterm(std::ostream & str) const
{
    abort();
}

i guess this means "end of input"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh, this sin't "end of input" but instead terminates the process.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was added in d4f0b0fc6c by @edolstra - maybe he knows : )
my second guess is: this means "empty input"


void ExprInt::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprInt << "\"";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance we could use a proper JSON serialization library instead? Nix already uses nlohmann JSON and has a home-grown JSON serialiser in src/libutil/json.hh.

Could the interface could be void ExprType::showAsJson(json& list) instead and an example implementation could look like this:

void ExprType::showAsJson(json& list) {
  json elem = {
    { "type", NodeTypeName::ExprType },
    { "value", n },
  };
  list.push_back(elem);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use a proper JSON serialization library

define proper ... this is working, and its fast
i did not use nlohmann, cos it has a different internal datastructure, iirc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSON serialiser in src/libutil/json.hh

would make an interesting benchmark ... someone, anyone? ; )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... i would rather add a machine-readable json format, as i described in #4726 (comment)
benefit: json is smaller, can be parsed faster

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would make an interesting benchmark ... someone, anyone? ; )
My argument here isn't about performance at all. It is about correctness. It avoid programming errors that we can rule out by not doing the stringy stuff on our own. The nlohmann example also makes the code less verbose IMO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you care about performance, maybe JSON is not the best format you can aim for. And if you really care about performance, you should aim for a lazy event-driven API (the XML parser folks pioneered this and called it "SAX").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use a serializer that is already done.

just merge my patch. : P

the whole point of my patch is to provide a FAST nix-to-json parser
and your "pretty" solution via some other high-level json printer is (probably) slower

"probably" = i did no benchmark, but its an educated guess

compare my 30 lines in src/libexpr/nixexpr-as-json.cc
(no need to print ascii, no need to validate input)

with 250 lines in nlohmann's dump_escaped
https://github.com/nlohmann/json/blob/develop/include/nlohmann/detail/output/serializer.hpp#L238
https://github.com/nlohmann/json/blob/develop/include/nlohmann/detail/output/serializer.hpp#L382

it requires special care whenever another Expr type is being added

how often does that happen? once in five years?

the only "special care" i can think of is keeping backward compatibility
by appending new types at end-of-file in src/libexpr/nixexpr-node-types.def

when someone adds a character that you aren't escaping already

what do you mean?
the json spec is constant
when bugs appear, someone will fix them

Copy link
Contributor Author

@milahu milahu Nov 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utf8 test

#!/bin/sh

nix="./outputs/out/bin/nix --extra-experimental-features nix-command"

echo "1 test null byte"
printf '"a \0 z"' | $nix parse --output-format json /dev/stdin | jq
# error: syntax error, unexpected end of file, expecting '"'

echo "2 test valid utf8"
inputString=$'one \1 two \2 three \3 four \4 skull \xE2\x98\xA0 lol \U0001f602'
printf '"%s"' "$inputString" | $nix parse --output-format json /dev/stdin | jq

echo "3 test all bytes except null"
inputString=''
for i in $(seq 1 255)
do
  inputString+="$i=$(printf "0x%02i" $i | xxd -r | sed 's/"/\\"/') "
  # must escape " for nix
done
printf '"%s"' "$inputString" | $nix parse --output-format json /dev/stdin | jq

echo "4 test invalid utf8"
# invalid utf8 is ignored by nix parser, so its simply passed through
# https://stackoverflow.com/questions/1301402/example-invalid-utf8-string
inputString=$'\xc3\x28 \xa0\xa1 \xe2\x28\xa1'
printf '"%s"' "$inputString" | $nix parse --output-format json /dev/stdin | jq

echo "5 test random utf8 in 100 byte blocks ... this will loop forever"
while true; do
# https://unix.stackexchange.com/questions/245623/how-do-i-create-a-text-file-1-gigabyte-containing-random-characters-with-utf-8
inputString="$(
  dd if=/dev/urandom bs=100 count=1 status=none | perl -CO -ne '
    BEGIN{$/=\4}
    no warnings "utf8";
    print chr(unpack("L>",$_) & 0x7fffffff)
  '
)"
printf '"%s"' "$inputString" | $nix parse --output-format json /dev/stdin | jq >/dev/null || {
  echo "json error in inputString:"
  echo "$inputString" | hexdump -C
}
sleep 0.1 # make the loop easier to kill
done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you care about performance, maybe JSON is not the best format you can aim for

protobuf

And if you really care about performance, you should aim for a lazy event-driven API

yepp, a stream parser, let me put this on my neverending todo list : D

Copy link
Contributor Author

@milahu milahu Nov 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when someone adds a character that you aren't escaping already

im starting to understand your concern

so, what im currently doing:
escape bytes 0 to 92 = ascii control chars (\n \r \b ...), doublequotes, backslash
for example, null byte → \u0000

echo '"\u0000"' | jq >/dev/null && echo valid
valid

everything else (bytes 93 to 255) is simply passed through as is

does this work with unicode?
as for the json spec, it is allowed to have raw unicode in json strings
when unicode is used in json, it must be utf16, for example \u0001 (nope, utf16 in javascript)

can the escaping break valid unicode?
lets look at the valid unicode byte ranges
ascii bytes are from 0 to 127 = 7 bit
so the first unicode byte must be 128 to 255

what about the following unicode bytes?

https://lemire.me/blog/2018/05/09/how-quickly-can-you-check-that-a-string-is-valid-unicode-utf-8/
converted with hex2dec.py

First Byte Second Byte Third Byte Fourth Byte
[0,127]      
[194,223] [128,191]    
224 [160,191] [128,191]  
[225,236] [128,191] [128,191]  
237 [128,159] [128,191]  
[238,239] [128,191] [128,191]  
240 [144,191] [128,191] [128,191]
[241,243] [128,191] [128,191] [128,191]
244 [128,143] [128,191] [128,191]

so ... ALL unicode bytes are in the range from 128 to 255
and since i escape only bytes 0 to 92, this works : )

one rare edgecase, where this could break: non-unicode input, for example latin1 encoding.
some ascii control chars, like \1 are encoded as unicode \u0001,
so the result string can be mixed unicode and latin1. simple solution: blame the user.
either for throwing non-unicode strings at nix, or for throwing ascii control chars at nix.
(ascii control chars are worse than non-unicode strings)

src/nix/parse.cc Outdated Show resolved Hide resolved
@gilligan
Copy link
Contributor

gilligan commented Nov 9, 2021

I absolutely don’t understand who or what should benefit from re-inventing the wheel here with problems that have already been solved.

@Ericson2314 had created #3884 long ago to consolidate things and use only nlohman/json (not directly related but same principle) and this PR now is another example of fiddling with JSON and dealing with problems the nix code should not have to deal with.

Please just don’t. Optimize for ease of use, for stability and clean, fewer code instead of trying to make something faster just because you can.

@milahu
Copy link
Contributor Author

milahu commented Nov 9, 2021

Optimize for ease of use, for stability and clean, fewer code
instead of trying to make something faster just because you can.

optimize for correct results and optimal performance
instead of trying to make something "pretty" just because you can

you have not one example where my code is failing
except in the unrealistic edgecases (ascii control chars + latin1) i described here

@Ericson2314 had created #3884 long ago to consolidate things and use only nlohman/json

this nlohman/json library must be the greatest thing ever invented ...

did you read the code for the nlohman/json string serializer?
why should i waste my cpu on this nonsense?

maybe i should start the issue Get rid of nlohmann/json and just use json.hh

@piegamesde
Copy link
Member

@milahu Please take a step back and reflect on what you are writing here. If you'd like to know how to convince maintainers to merge your patch, well, this is how you don't. (For a start, try dropping the sarcasm and the personal attacks.)

Don't forget that somebody will have to maintain what you are writing. Nix is already a large enough pile of hacks that could fancy some cleanup, and you are just dumping more work onto that pile. Of course this will be met with contempt.

On another note, I still think JSON is not what you want if you really care that much about performance. It's like you're trying to make your go-kart a bit faster but actually you want a racing car.

@milahu
Copy link
Contributor Author

milahu commented Nov 9, 2021

convince maintainers to merge your patch

rich hearts wear poor coats
i know who i am, no need to fit your false expects

somebody will have to maintain what you are writing

what exactly does that mean?
a new nix syntax? how often does that happen? daily? once every 1000 years?

all i hear is lazy/stupid excuses, and youre surprised im pissed?

nix is a low level tool, of course performance matters here
when "some people" dont get that
mabye they should stay away from the "dirty" backend work

I still think JSON is not what you want if you really care that much about performance

off topic. this patch aims for a fast nix-to-json parser

@milahu
Copy link
Contributor Author

milahu commented Nov 18, 2021

this patch aims for a fast nix-to-json parser

better reasons to NOT use nix as parser:

nix parse will produce abstract syntax trees,
not concrete syntax trees as with rnix-parser

for example, nix parse will

also, no incremental parsing, as with tree-sitter-nix or lezer-parser-nix

@stale
Copy link

stale bot commented Jun 12, 2022

I marked this as stale due to inactivity. → More info

@stale stale bot added the stale label Jun 12, 2022
@Ericson2314
Copy link
Member

Ericson2314 commented Jun 22, 2023

It would be indeed nice to be able to dump the AST to JSON, but it should use nlohmann JSON. We've indeed made the other stuff all nlohmann JSON now.

@stale stale bot removed the stale label Jun 22, 2023
@piegamesde
Copy link
Member

OP has been banned from the organisation, I think this can safely be closed

@zimbatm zimbatm closed this Mar 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feature: parse nix expression to json ast
6 participants