nix parse: parse a nix expr or nix file to aterm or json syntax tree (AST) #5512

Closed
wants to merge 28 commits into from
Changes from 24 commits
7211edd
nixexpr.cc: rename show to showAsAterm
milahu Apr 22, 2021
f57c978
nixexpr.hh: rename show to showAsAterm
milahu Apr 22, 2021
7329c19
nixexpr.cc: implement showAsJson methods
milahu Apr 22, 2021
137df01
nixexpr.hh: add showAsJson methods
milahu Apr 22, 2021
a9a0e19
nix-instantiate.cc: parse: add format switch aterm/json
milahu Apr 22, 2021
36581e6
make it work ... almost
milahu Apr 22, 2021
658d93b
use numeric node types. fix string escapes
milahu Apr 23, 2021
d1a5880
refactor: move showAsJson to separate file
milahu Apr 29, 2021
abfe441
refactor enum NodeTypeId, add NodeTypeName
milahu May 1, 2021
cc92fab
add format json-arrays. only 10% faster
milahu May 1, 2021
7a02477
add format json-numtypes
milahu May 2, 2021
76ab163
add format json-arrays-fmt
milahu May 2, 2021
bee4f91
minifix in nixexpr-as-json-arrays.cc
milahu May 3, 2021
4b19384
json-arrays-fmt: use github.com/fmtlib/fmt
milahu May 3, 2021
0e42d01
showAsJson: show line + column
milahu May 3, 2021
1c09c50
add command: nix parse
milahu May 3, 2021
35219ca
rollback: remove experimental json formats
milahu May 4, 2021
0de2923
nix parse: cleanup
milahu May 4, 2021
87fcaa6
cleanup comments
milahu May 4, 2021
8197c5f
fix comments
milahu May 4, 2021
bd55eed
fix rebase: move showAsJson impl to nixexpr-as-json.cc
milahu Nov 6, 2021
182f20e
fix rebase: nixexpr-as-json.cc: matchAttrs -> hasFormals()
milahu Nov 6, 2021
ed18b49
fix rebase
milahu Nov 7, 2021
6adba4a
nix parse: fix CLI
milahu Nov 7, 2021
d50ba1a
parse.cc: dont set state->repair default
milahu Nov 7, 2021
834f883
nixexpr-as-json.cc: fix json syntax
milahu Nov 8, 2021
8ca3912
nixexpr-as-json.cc: char -> const char
milahu Nov 8, 2021
b5f8967
nixexpr-as-json.cc: String_showAsJson: sparse -> dense array
milahu Nov 8, 2021
326 changes: 326 additions & 0 deletions src/libexpr/nixexpr-as-json.cc
@@ -0,0 +1,326 @@
#include "nixexpr.hh"
#include "derivations.hh"
#include "util.hh"

#include <cstdint> // std::uint8_t
#include <cstdlib>

namespace nix {

// binary operators are implemented in nixexpr.hh MakeBinOp

// https://stackoverflow.com/questions/7724448
// note: we use full jump table to make this as fast as possible
// note: we assume valid input. errors should be handled by the nix parser
// 93 * 7 = 651 byte
const char String_showAsJson_replace_array[93][7] = {
"\\u0000", "\\u0001", "\\u0002", "\\u0003", "\\u0004", // 0 - 4
"\\u0005", "\\u0006", "\\u0007", "\\b", "\\t", // 5 - 9
"\\n", "\\u000b", "\\f", "\\r", "\\u000e", // 10 - 14
"\\u000f", "\\u0010", "\\u0011", "\\u0012", "\\u0013", // 15 - 19
"\\u0014", "\\u0015", "\\u0016", "\\u0017", "\\u0018", // 20 - 24
"\\u0019", "\\u001a", "\\u001b", "\\u001c", "\\u001d", // 25 - 29
"\\u001e", "\\u001f", " ", "!", "\\\"", // 30 - 34
"#", "$", "%", "&", "'", // 35 - 39
"(", ")", "*", "+", ",", // 40 - 44
"-", ".", "/", "0", "1", // 45 - 49
"2", "3", "4", "5", "6", // 50 - 54
"7", "8", "9", ":", ";", // 55 - 59
"<", "=", ">", "?", "@", // 60 - 64
"A", "B", "C", "D", "E", // 65 - 69
"F", "G", "H", "I", "J", // 70 - 74
"K", "L", "M", "N", "O", // 75 - 79
"P", "Q", "R", "S", "T", // 80 - 84
"U", "V", "W", "X", "Y", // 85 - 89
"Z", "[", "\\\\", // 90 - 92
};

void String_showAsJson(std::ostream & o, const std::string & s) {
for (auto c = s.cbegin(); c != s.cend(); c++) {
if ((std::uint8_t) *c <= 92)
o << String_showAsJson_replace_array[(std::uint8_t) *c];
else
o << *c;
}
}

void Expr::showAsJson(std::ostream & str) const
{
abort();
Member

This should probably produce a descriptive error that tells someone where to look to fix the situation. I can imagine that the override is easily forgotten when the next subclass of Expr is added, and then this only yields an obscure runtime error.
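
A minimal sketch of what such a descriptive error could look like. The `Expr`, `ExprInt` and `ExprNew` types below are simplified stand-ins written for this illustration, not the real classes from nixexpr.hh, and the error message wording is an assumption.

```cpp
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <typeinfo>

// Hypothetical stand-in for nix::Expr; the real class lives in nixexpr.hh.
struct Expr {
    virtual ~Expr() = default;

    // Default implementation: name the subclass that lacks an override
    // instead of calling abort() and ending the process without a message.
    virtual void showAsJson(std::ostream & str) const {
        throw std::logic_error(
            std::string("showAsJson is not implemented for node type ")
            + typeid(*this).name()
            + "; add an override next to the other showAsJson methods");
    }
};

// A subclass that provides the override.
struct ExprInt : Expr {
    long n = 0;
    void showAsJson(std::ostream & str) const override {
        str << "{\"type\":\"ExprInt\",\"value\":" << n << '}';
    }
};

// A newly added subclass whose author forgot the override (hypothetical).
struct ExprNew : Expr {};
```

Calling `showAsJson` on an `ExprNew` now throws an exception whose message names the offending type (in the compiler's mangled form). Declaring the base method pure virtual would go one step further and reject the missing override at compile time.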

Contributor Author

@milahu milahu Nov 7, 2021

uhm ... this is just copy-paste from Expr::show which i renamed to Expr::showAsAterm

void Expr::showAsAterm(std::ostream & str) const
{
    abort();
}

i guess this means "end of input"

Member

ugh, this isn't "end of input" but instead terminates the process.

Contributor

Contributor Author

was added in d4f0b0fc6c by @edolstra - maybe he knows : )
my second guess is: this means "empty input"

}

void ExprInt::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprInt << "\"";
Member

Any chance we could use a proper JSON serialization library instead? Nix already uses nlohmann JSON and has a home-grown JSON serialiser in src/libutil/json.hh.

Could the interface be void ExprType::showAsJson(json& list) instead? An example implementation could look like this:

void ExprType::showAsJson(json& list) {
  json elem = {
    { "type", NodeTypeName::ExprType },
    { "value", n },
  };
  list.push_back(elem);
}

Contributor Author

use a proper JSON serialization library

define proper ... this is working, and its fast
i did not use nlohmann, cos it has a different internal datastructure, iirc

Contributor Author

JSON serialiser in src/libutil/json.hh

would make an interesting benchmark ... someone, anyone? ; )

Contributor Author

... i would rather add a machine-readable json format, as i described in #4726 (comment)
benefit: json is smaller, can be parsed faster

Member

would make an interesting benchmark ... someone, anyone? ; )

My argument here isn't about performance at all. It is about correctness. A library avoids programming errors that we can rule out by not doing the stringy stuff on our own. The nlohmann example also makes the code less verbose IMO.

Member

If you care about performance, maybe JSON is not the best format you can aim for. And if you really care about performance, you should aim for a lazy event-driven API (the XML parser folks pioneered this and called it "SAX").
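
To make the event-driven idea concrete, here is a rough sketch of what a SAX-style interface over a syntax tree could look like. Everything in it (`AstHandler`, `Node`, `IntNode`, `ListNode` and the two handlers) is hypothetical and exists only for this illustration; none of it is part of Nix or of this PR.

```cpp
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical event sink: the tree walker reports events and the
// consumer decides what to do with them (print, count, filter, ...).
struct AstHandler {
    virtual ~AstHandler() = default;
    virtual void beginNode(const std::string & type) = 0;
    virtual void intValue(long n) = 0;
    virtual void endNode() = 0;
};

// Minimal stand-in AST with two node kinds.
struct Node {
    virtual ~Node() = default;
    virtual void walk(AstHandler & h) const = 0;
};

struct IntNode : Node {
    long n;
    explicit IntNode(long value) : n(value) {}
    void walk(AstHandler & h) const override {
        h.beginNode("ExprInt");
        h.intValue(n);
        h.endNode();
    }
};

struct ListNode : Node {
    std::vector<std::unique_ptr<Node>> elems;
    void walk(AstHandler & h) const override {
        h.beginNode("ExprList");
        for (auto & e : elems) e->walk(h);
        h.endNode();
    }
};

// Consumer 1: counts nodes without producing any output.
struct CountingHandler : AstHandler {
    int nodes = 0;
    void beginNode(const std::string &) override { ++nodes; }
    void intValue(long) override {}
    void endNode() override {}
};

// Consumer 2: writes a parenthesised trace of the events to a stream.
struct TraceHandler : AstHandler {
    std::ostream & out;
    explicit TraceHandler(std::ostream & o) : out(o) {}
    void beginNode(const std::string & type) override { out << '(' << type; }
    void intValue(long n) override { out << ' ' << n; }
    void endNode() override { out << ')'; }
};
```

The walker never builds an intermediate document, so a consumer that only needs part of the tree (for example a node count) pays nothing for serialisation. A JSON printer would be just one more handler.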

Contributor Author

Just use a serializer that is already done.

just merge my patch. : P

the whole point of my patch is to provide a FAST nix-to-json parser
and your "pretty" solution via some other high-level json printer is (probably) slower

"probably" = i did no benchmark, but its an educated guess

compare my 30 lines in src/libexpr/nixexpr-as-json.cc
(no need to print ascii, no need to validate input)

with 250 lines in nlohmann's dump_escaped
https://github.com/nlohmann/json/blob/develop/include/nlohmann/detail/output/serializer.hpp#L238
https://github.com/nlohmann/json/blob/develop/include/nlohmann/detail/output/serializer.hpp#L382

it requires special care whenever another Expr type is being added

how often does that happen? once in five years?

the only "special care" i can think of is keeping backward compatibility
by appending new types at end-of-file in src/libexpr/nixexpr-node-types.def

when someone adds a character that you aren't escaping already

what do you mean?
the json spec is constant
when bugs appear, someone will fix them

Contributor Author

@milahu milahu Nov 8, 2021

utf8 test

#!/usr/bin/env bash
# note: needs bash, because $'...' strings and += are not POSIX sh

nix="./outputs/out/bin/nix --extra-experimental-features nix-command"

echo "1 test null byte"
printf '"a \0 z"' | $nix parse --output-format json /dev/stdin | jq
# error: syntax error, unexpected end of file, expecting '"'

echo "2 test valid utf8"
inputString=$'one \1 two \2 three \3 four \4 skull \xE2\x98\xA0 lol \U0001f602'
printf '"%s"' "$inputString" | $nix parse --output-format json /dev/stdin | jq

echo "3 test all bytes except null"
inputString=''
for i in $(seq 1 255)
do
  inputString+="$i=$(printf "\\$(printf '%03o' "$i")" | sed 's/"/\\"/') "
  # must escape " for nix
done
printf '"%s"' "$inputString" | $nix parse --output-format json /dev/stdin | jq

echo "4 test invalid utf8"
# invalid utf8 is ignored by nix parser, so its simply passed through
# https://stackoverflow.com/questions/1301402/example-invalid-utf8-string
inputString=$'\xc3\x28 \xa0\xa1 \xe2\x28\xa1'
printf '"%s"' "$inputString" | $nix parse --output-format json /dev/stdin | jq

echo "5 test random utf8 in 100 byte blocks ... this will loop forever"
while true; do
# https://unix.stackexchange.com/questions/245623/how-do-i-create-a-text-file-1-gigabyte-containing-random-characters-with-utf-8
inputString="$(
  dd if=/dev/urandom bs=100 count=1 status=none | perl -CO -ne '
    BEGIN{$/=\4}
    no warnings "utf8";
    print chr(unpack("L>",$_) & 0x7fffffff)
  '
)"
printf '"%s"' "$inputString" | $nix parse --output-format json /dev/stdin | jq >/dev/null || {
  echo "json error in inputString:"
  echo "$inputString" | hexdump -C
}
sleep 0.1 # make the loop easier to kill
done

Contributor Author

If you care about performance, maybe JSON is not the best format you can aim for

protobuf

And if you really care about performance, you should aim for a lazy event-driven API

yepp, a stream parser, let me put this on my neverending todo list : D

Contributor Author

@milahu milahu Nov 9, 2021

when someone adds a character that you aren't escaping already

im starting to understand your concern

so, what im currently doing:
bytes 0 to 92 go through the lookup table, which escapes ascii control chars (\n \r \b ...), doublequote and backslash, and maps all other bytes to themselves
for example, null byte → \u0000

echo '"\u0000"' | jq >/dev/null && echo valid
valid

everything else (bytes 93 to 255) is simply passed through as is
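
The same rule can be sketched without the lookup table. This is an illustration only, not the code from the PR: `escapeJsonString` is a hypothetical name, and it uses a `switch` where the PR uses a jump table.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: escapes exactly what JSON requires inside a string
// (doublequote, backslash, control chars 0 to 31) and passes every other
// byte through unchanged, including the bytes of multi-byte UTF-8 sequences.
void escapeJsonString(std::ostream & out, const std::string & s) {
    static const char hex[] = "0123456789abcdef";
    for (unsigned char c : s) {
        switch (c) {
        case '"':  out << "\\\""; break;
        case '\\': out << "\\\\"; break;
        case '\b': out << "\\b"; break;
        case '\f': out << "\\f"; break;
        case '\n': out << "\\n"; break;
        case '\r': out << "\\r"; break;
        case '\t': out << "\\t"; break;
        default:
            if (c < 0x20)
                // remaining control chars become \u00XX
                out << "\\u00" << hex[c >> 4] << hex[c & 0x0f];
            else
                out << static_cast<char>(c);
        }
    }
}
```

With this sketch a null byte comes out as `\u0000`, while the three UTF-8 bytes of the skull character (E2 98 A0) are written unchanged.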

does this work with unicode?
as for the json spec, it is allowed to have raw unicode (utf8) in json strings. only doublequote, backslash and the control chars 0 to 31 must be escaped
\uXXXX escapes like \u0001 are written as utf16 code units, but using them for non-ascii chars is optional

can the escaping break valid unicode?
lets look at the valid unicode byte ranges
ascii bytes are from 0 to 127 = 7 bit
so the first byte of a multi-byte utf8 sequence must be in 128 to 255

what about the following unicode bytes?

https://lemire.me/blog/2018/05/09/how-quickly-can-you-check-that-a-string-is-valid-unicode-utf-8/
converted with hex2dec.py

First Byte   Second Byte   Third Byte   Fourth Byte
[0,127]
[194,223]    [128,191]
224          [160,191]     [128,191]
[225,236]    [128,191]     [128,191]
237          [128,159]     [128,191]
[238,239]    [128,191]     [128,191]
240          [144,191]     [128,191]    [128,191]
[241,243]    [128,191]     [128,191]    [128,191]
244          [128,143]     [128,191]    [128,191]

so ... ALL bytes of multi-byte utf8 sequences are in the range from 128 to 255
and since i escape only bytes 0 to 92, this works : )
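
The byte ranges in the table above translate fairly directly into a validity check. The following is a sketch under that assumption: `isValidUtf8` is a hypothetical helper for illustration and not something Nix provides or this PR adds.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Returns true if s is well-formed UTF-8 according to the byte-range table.
bool isValidUtf8(const std::string & s) {
    const std::size_t n = s.size();
    auto byte = [&](std::size_t k) { return static_cast<std::uint8_t>(s[k]); };
    // true if byte k exists and lies in [lo, hi]
    auto in = [&](std::size_t k, int lo, int hi) {
        return k < n && byte(k) >= lo && byte(k) <= hi;
    };
    std::size_t i = 0;
    while (i < n) {
        const std::uint8_t b = byte(i);
        if (b <= 127) {                       // ascii, one byte
            i += 1;
        } else if (b >= 194 && b <= 223) {    // two-byte sequence
            if (!in(i + 1, 128, 191)) return false;
            i += 2;
        } else if (b >= 224 && b <= 239) {    // three-byte sequence
            const int lo = (b == 224) ? 160 : 128;
            const int hi = (b == 237) ? 159 : 191;
            if (!in(i + 1, lo, hi) || !in(i + 2, 128, 191)) return false;
            i += 3;
        } else if (b >= 240 && b <= 244) {    // four-byte sequence
            const int lo = (b == 240) ? 144 : 128;
            const int hi = (b == 244) ? 143 : 191;
            if (!in(i + 1, lo, hi) || !in(i + 2, 128, 191) || !in(i + 3, 128, 191))
                return false;
            i += 4;
        } else {
            return false;  // 128..193 and 245..255 never start a sequence
        }
    }
    return true;
}
```

Every accepted non-ascii byte is 128 or above, so a serializer that rewrites only bytes below 128 cannot split a valid multi-byte sequence.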

one rare edgecase, where this could break: non-unicode input, for example latin1 encoding.
some ascii control chars, like \1 are encoded as unicode \u0001,
so the result string can be mixed unicode and latin1. simple solution: blame the user.
either for throwing non-unicode strings at nix, or for throwing ascii control chars at nix.
(ascii control chars are worse than non-unicode strings)

str << ",\"value\":" << n;
str << '}';
}

void ExprFloat::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprFloat << "\"";
str << ",\"value\":" << nf;
str << '}';
}

void ExprString::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprString << "\"";
str << ",\"value\":\""; String_showAsJson(str, s); str << "\"";
str << '}';
}

void ExprPath::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprPath << "\"";
str << ",\"value\":\""; String_showAsJson(str, s); str << "\"";
str << '}';
}

void ExprVar::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprVar << "\"";
if (pos.line > 0) {
str << ",\"line\":" << pos.line;
str << ",\"column\":" << pos.column;
}
str << ",\"name\":\""; String_showAsJson(str, name); str << "\"";
str << '}';
}

void ExprSelect::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprSelect << "\"";
if (pos.line > 0) {
str << ",\"line\":" << pos.line;
str << ",\"column\":" << pos.column;
}
str << ",\"set\":"; e->showAsJson(str);
str << ",\"attr\":"; AttrPath_showAsJson(str, attrPath);
if (def) {
str << ",\"default\":"; def->showAsJson(str);
}
str << "}";
}

void ExprOpHasAttr::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprOpHasAttr << "\"";
str << ",\"set\":"; e->showAsJson(str);
str << ",\"attr\":"; AttrPath_showAsJson(str, attrPath);
str << '}';
}

void ExprAttrs::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprAttrs << "\"";
str << ",\"recursive\":" << (recursive ? "true" : "false");
str << ",\"attrs\":[";
bool first = true;
for (auto & i : attrs) {
if (first) first = false; else str << ",";
if (i.second.pos.line > 0) {
str << "{\"line\":" << i.second.pos.line;
str << ",\"column\":" << i.second.pos.column << ',';
}
else {
str << '{';
}
str << "\"inherited\":" << (i.second.inherited ? "true" : "false"); // NOTE inherited is always false. { inherit (scope) attr; } -> { attr = scope.attr; }
str << ",\"name\":\""; String_showAsJson(str, i.first); str << "\"";
if (!i.second.inherited) {
str << ",\"value\":"; i.second.e->showAsJson(str);
}
str << '}';
}
str << "]";
str << ",\"dynamicAttrs\":[";
first = true;
for (auto & i : dynamicAttrs) {
if (first) first = false; else str << ",";
if (i.pos.line > 0) {
str << "{\"line\":" << i.pos.line;
str << ",\"column\":" << i.pos.column << ',';
}
else {
str << '{';
}
str << "\"name\":"; i.nameExpr->showAsJson(str); // nameExpr prints a JSON object, so no opening quote here
str << ",\"value\":"; i.valueExpr->showAsJson(str);
str << '}';
}
str << "]}";
}

void ExprList::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprList << "\"";
str << ",\"items\":[";
bool first = true;
for (auto & i : elems) {
if (first) first = false; else str << ",";
i->showAsJson(str);
}
str << "]}";
}

void ExprLambda::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprLambda << "\"";
if (pos.line > 0) {
str << ",\"line\":" << pos.line;
str << ",\"column\":" << pos.column;
}
str << ",\"hasFormals\":" << (hasFormals() ? "true" : "false");
if (hasFormals()) {
str << ",\"formals\":[";
bool first = true;
for (auto & i : formals->formals) {
if (first) first = false; else str << ",";
if (i.pos.line > 0) {
str << "{\"line\":" << i.pos.line;
str << ",\"column\":" << i.pos.column << ',';
}
else {
str << '{';
}
str << "\"name\":\""; String_showAsJson(str, i.name); str << "\"";
if (i.def) {
str << ",\"default\":"; i.def->showAsJson(str);
}
str << '}';
}
str << "]";
str << ",\"ellipsis\":" << (formals->ellipsis ? "true" : "false");
}
if (!arg.empty()) {
str << ",\"arg\":\""; String_showAsJson(str, arg); str << "\""; // escape like the other names
}
str << ",\"body\":"; body->showAsJson(str);
str << '}';
}

void ExprCall::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprCall << "\"";
if (pos.line > 0) {
str << ",\"line\":" << pos.line;
str << ",\"column\":" << pos.column;
}
str << ",\"function\":";
fun->showAsJson(str);
str << ",\"args\":[";
bool first = true;
for (auto & e : args) {
if (first) first = false; else str << ",";
e->showAsJson(str);
}
str << "]}";
}

void ExprLet::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprLet << "\"";
str << ",\"attrs\":[";
bool first = true;
for (auto & i : attrs->attrs) {
if (first) first = false; else str << ",";
str << "{\"inherited\":" << (i.second.inherited ? "true" : "false");
str << ",\"name\":\""; String_showAsJson(str, i.first); str << "\"";
if (!i.second.inherited) {
str << ",\"value\":"; i.second.e->showAsJson(str);
}
str << '}';
}
str << "]";
str << ",\"body\":"; body->showAsJson(str);
str << '}';
}

void ExprWith::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprWith << "\"";
if (pos.line > 0) {
str << ",\"line\":" << pos.line;
str << ",\"column\":" << pos.column;
}
str << ",\"set\":"; attrs->showAsJson(str);
str << ",\"body\":"; body->showAsJson(str);
str << '}';
}

void ExprIf::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprIf << "\"";
if (pos.line > 0) {
str << ",\"line\":" << pos.line;
str << ",\"column\":" << pos.column;
}
str << ",\"cond\":"; cond->showAsJson(str);
str << ",\"then\":"; then->showAsJson(str);
str << ",\"else\":"; else_->showAsJson(str);
str << '}';
}

void ExprAssert::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprAssert << "\"";
if (pos.line > 0) {
str << ",\"line\":" << pos.line;
str << ",\"column\":" << pos.column;
}
str << ",\"cond\":"; cond->showAsJson(str);
str << ",\"body\":"; body->showAsJson(str);
str << '}';
}

void ExprOpNot::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprOpNot << "\"";
str << ",\"expr\":"; e->showAsJson(str);
str << '}';
}

void ExprConcatStrings::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprConcatStrings << "\"";
if (pos.line > 0) {
str << ",\"line\":" << pos.line;
str << ",\"column\":" << pos.column;
}
str << ",\"strings\":[";
bool first = true;
for (auto & i : *es) {
if (first) first = false; else str << ",";
i->showAsJson(str);
}
str << "]}";
}

void ExprPos::showAsJson(std::ostream & str) const
{
str << "{\"type\":\"" << NodeTypeName::ExprPos << "\"";
if (pos.line > 0) {
str << ",\"line\":" << pos.line;
str << ",\"column\":" << pos.column;
}
str << "}";
}

void AttrPath_showAsJson(std::ostream & out, const AttrPath & attrPath)
{
out << "[";
bool first = true;
for (auto & i : attrPath) {
if (!first) out << ','; else first = false;
out << "{";
if (i.symbol.set()) {
out << "\"symbol\":\""; String_showAsJson(out, i.symbol); out << "\"";
}
else {
out << "\"expr\":"; i.expr->showAsJson(out);
}
out << "}";
}
out << "]";
}

}
39 changes: 39 additions & 0 deletions src/libexpr/nixexpr-node-types.def
@@ -0,0 +1,39 @@
ADD_TYPE(ExprLambda) // aka Function
ADD_TYPE(ExprSet) // complex values ...
ADD_TYPE(ExprList)
ADD_TYPE(ExprAttrs) // NOTE inherited is always false. { inherit (scope) attr; } -> { attr = scope.attr; }
ADD_TYPE(ExprString) // scalar values ...
ADD_TYPE(ExprInt)
ADD_TYPE(ExprFloat)
ADD_TYPE(ExprPath)
ADD_TYPE(ExprBoolean)
ADD_TYPE(ExprNull)
ADD_TYPE(ExprCall)
ADD_TYPE(ExprLet)
ADD_TYPE(ExprWith)
ADD_TYPE(ExprIf)
ADD_TYPE(ExprAssert)
ADD_TYPE(ExprVar)
ADD_TYPE(ExprSelect) // operators ...
ADD_TYPE(ExprApp)
ADD_TYPE(ExprOpNeg) // not used -> __sub 0 e
ADD_TYPE(ExprOpHasAttr)
ADD_TYPE(ExprOpConcatLists)
ADD_TYPE(ExprOpMul) // not used -> __mul e1 e2
ADD_TYPE(ExprOpDiv) // not used -> __div e1 e2
ADD_TYPE(ExprConcatStrings) // or ExprOpAdd [ambiguous]
ADD_TYPE(ExprOpAdd) // not used -> ExprConcatStrings e1 e2 [ambiguous]
ADD_TYPE(ExprOpSub) // not used -> __sub e1 e2
ADD_TYPE(ExprOpNot)
ADD_TYPE(ExprOpUpdate)
ADD_TYPE(ExprOpLt) // not used
ADD_TYPE(ExprOpLte) // not used
ADD_TYPE(ExprOpGt) // not used
ADD_TYPE(ExprOpGte) // not used
ADD_TYPE(ExprOpEq)
ADD_TYPE(ExprOpNEq)
ADD_TYPE(ExprOpAnd)
ADD_TYPE(ExprOpOr)
ADD_TYPE(ExprOpImpl)
ADD_TYPE(ExprPos) // TODO what is Pos?
ADD_TYPE(Comment) // not used
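
A list of `ADD_TYPE(...)` entries like this is typically consumed with the X-macro technique: the same list is expanded several times under different definitions of the macro. How nixexpr.hh actually includes the .def file is not shown in this diff, so the sketch below is an assumption. It uses an in-file macro list (`NODE_TYPES`, hypothetical) with only four entries instead of `#include`-ing the .def file, and the names `nodeTypeNames`, `nodeTypeCount` and `nodeTypeName` are made up for the example.

```cpp
#include <cstddef>
#include <string>

// Stand-in for the .def file, shortened to four entries so the sketch
// is self-contained.
#define NODE_TYPES(ADD_TYPE) \
    ADD_TYPE(ExprLambda) \
    ADD_TYPE(ExprSet) \
    ADD_TYPE(ExprList) \
    ADD_TYPE(ExprAttrs)

// Expansion 1: numeric ids, assigned in list order starting at 0.
enum class NodeTypeId : int {
#define AS_ENUM(name) name,
    NODE_TYPES(AS_ENUM)
#undef AS_ENUM
};

// Expansion 2: the matching names, in the same order.
static const char * const nodeTypeNames[] = {
#define AS_NAME(name) #name,
    NODE_TYPES(AS_NAME)
#undef AS_NAME
};

constexpr std::size_t nodeTypeCount =
    sizeof(nodeTypeNames) / sizeof(nodeTypeNames[0]);

// Look up the name that belongs to an id.
inline const char * nodeTypeName(NodeTypeId id) {
    return nodeTypeNames[static_cast<std::size_t>(id)];
}
```

Because ids follow list order, appending new entries at the end keeps existing numeric ids stable, which is the backward-compatibility concern raised in the review thread.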