
feat: Allow in_array expression to fetch all available items - related to #68 #71


Open: wants to merge 2 commits into base: master

das-peter

This is a quick POC related to #68 and #71.

The reason I'd like to see this addition is that some JSON data schemas normalize the data in order to reduce transport size.
This means the referenced objects are collected in a meta-data space, while the original data structure only contains references into that meta-data space.

A good example is https://jsonapi.org/:

{
  "links": {
    "self": "http://example.com/articles",
    "next": "http://example.com/articles?page[offset]=2",
    "last": "http://example.com/articles?page[offset]=10"
  },
  "data": [{
    "type": "articles",
    "id": "1",
    "attributes": {
      "title": "JSON:API paints my bikeshed!"
    },
    "relationships": {
      "author": {
        "links": {
          "self": "http://example.com/articles/1/relationships/author",
          "related": "http://example.com/articles/1/author"
        },
        "data": { "type": "people", "id": "9" }
      },
      "comments": {
        "links": {
          "self": "http://example.com/articles/1/relationships/comments",
          "related": "http://example.com/articles/1/comments"
        },
        "data": [
          { "type": "comments", "id": "5" },
          { "type": "comments", "id": "12" }
        ]
      }
    },
    "links": {
      "self": "http://example.com/articles/1"
    }
  }],
  "included": [{
    "type": "people",
    "id": "9",
    "attributes": {
      "firstName": "Dan",
      "lastName": "Gebhardt",
      "twitter": "dgeb"
    },
    "links": {
      "self": "http://example.com/people/9"
    }
  }, {
    "type": "comments",
    "id": "5",
    "attributes": {
      "body": "First!"
    },
    "relationships": {
      "author": {
        "data": { "type": "people", "id": "2" }
      }
    },
    "links": {
      "self": "http://example.com/comments/5"
    }
  }, {
    "type": "comments",
    "id": "12",
    "attributes": {
      "body": "I like XML better"
    },
    "relationships": {
      "author": {
        "data": { "type": "people", "id": "9" }
      }
    },
    "links": {
      "self": "http://example.com/comments/12"
    }
  }]
}

With the extension, resolving the detail data of an entry's comments becomes much easier:
$.included[?(@.id in [$.data[0].relationships.comments.data[*].id])]
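To make the intended result concrete, here is a minimal Python sketch of what the expression is meant to select. It is plain Python over a trimmed-down copy of the document above, not this library's PHP API, and `resolve_comments` is a hypothetical helper name used only for illustration.

```python
# Trimmed-down version of the JSON:API document above
# (only the fields the lookup needs).
doc = {
    "data": [{
        "type": "articles",
        "id": "1",
        "relationships": {
            "comments": {
                "data": [
                    {"type": "comments", "id": "5"},
                    {"type": "comments", "id": "12"},
                ]
            }
        },
    }],
    "included": [
        {"type": "people", "id": "9", "attributes": {"firstName": "Dan"}},
        {"type": "comments", "id": "5", "attributes": {"body": "First!"}},
        {"type": "comments", "id": "12", "attributes": {"body": "I like XML better"}},
    ],
}


def resolve_comments(document, article_index=0):
    """Hypothetical helper: mimic $.included[?(@.id in <all comment ids>)]."""
    refs = document["data"][article_index]["relationships"]["comments"]["data"]
    wanted_ids = {ref["id"] for ref in refs}  # every referenced id, not just the first
    return [item for item in document["included"] if item["id"] in wanted_ids]


bodies = [c["attributes"]["body"] for c in resolve_comments(doc)]
```

Like the expression itself, the sketch matches on id alone, so an included object of another type with the same id would also be selected.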


There are still some open things in my mind:

  • I had to use the $hasDiverged return value to decide whether the result needs to be unwrapped, in order to make the scenario outlined above and the unit tests work. I'm currently not sure what it does or why I have to do this.
    I also think the unit tests lack a test case, because they currently seem to hit only scenarios where $hasDiverged is set to true.
  • The spec adjustment seems wrong / incomplete: in_array = value 'in' '[' value (',' value)* | jsonpath ']' Directly allowing jsonpath conflicts with the "known" limitation of value: "Jsonpaths in value return the first element of the set or false if no result."
    This most likely also means it is an API break, because the limitation essentially no longer exists for in_array, which can change the behavior of existing expressions, which is bad.
    One idea would be to give in_array a more polymorphic syntax; the spec could be: in_array = value 'in' ('[' value (',' value)* ']' | jsonpath)
  • Traditional: $..book[?(@.author in [$.authors[0], $.authors[2]])]
  • Extended: $..book[?(@.author in $.authors)]

Anyhow, it's late here and I'm fighting with falling asleep :)
Feedback would be great but I'll revisit later in any case.

@das-peter
Author

I've decided to go with some of the thoughts from last night.
To use a multi-value jsonpath expression with 'in', the expression must not be wrapped in [].
This keeps a clear separation between how things used to work and the extension.

So an expression like $..book[?(@.author in [$.authors[*]])] still yields the same results as before: because it is wrapped in [], the jsonpath is treated as a value, which has the "first item only" limitation.
To use a multi-value expression, one can write: $..book[?(@.author in $.authors[*])]

This is also much more intuitive, because a [] wrap indicates a manually constructed array, so single-value handling makes sense from that perspective.
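The difference between the two forms can be modeled in a few lines of Python. This is only a sketch of the semantics described above, not the library's implementation; the function names and the author names are made up for illustration.

```python
def as_value(results):
    """A jsonpath used as a value: first result, or False if there is none."""
    return results[0] if results else False


def in_array(needle, items):
    """`needle in [a, b, ...]`: each bracketed item is a single value."""
    return needle in items


def in_path(needle, results):
    """`needle in $.path`: the container is every result of the path."""
    return needle in results


# Made-up results of $.authors[*]
authors = ["Nigel Rees", "Evelyn Waugh", "Herman Melville"]

# $..book[?(@.author in [$.authors[*]])] -> the path collapses to its first result
bracketed = in_array("Herman Melville", [as_value(authors)])

# $..book[?(@.author in $.authors[*])]   -> all results are considered
bare = in_path("Herman Melville", authors)
```

In this model the bracketed form only ever sees the first author, which is the backward-compatible behavior, while the bare form sees all of them.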


I had to adjust the regexp for BINOP_IN_ARRAY in order to get this capture going, which also meant I had to add some more complex unwrapping code in the in_array handling itself.

The updated code has tests to check the backward compatibility with $..book[?(@.author in [$.authors[*]])]

I haven't done any performance tests as of now - honestly I'd be hard pressed to see where to squeeze time. Maybe $expression[strlen($expression) - 1] could be replaced with substr($expression, -1) if that's faster.

@das-peter
Author

Did a quick test of the string handling options. I don't think there's a relevant difference, but offset access was more often ever so slightly faster:
[screenshot of the benchmark results]

Not sure if there's interest in such a performance test in the test folder - let me know and I can commit it if desired.

@Galbar
Owner

Galbar commented Feb 17, 2025

Hello @das-peter. First, thanks for the time and effort put into this feature. I like the idea but, before committing, I'd like to do some thinking; hopefully we can find the answers together.

Before that, though, you expressed confusion about $hasDiverged, so let me explain: $hasDiverged is true if the json path queries for more than one potential value. For $.a.b.c, $hasDiverged will be false; for $.a[0,4].c, it will be true. This is primarily used for the SmartGet feature this library implements. By definition, JsonPath queries return an array of results, or false if there are no matches, even when pointing at a single value; the SmartGet feature is not standard.
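As a rough illustration of that explanation, the following Python sketch models the flag and the unwrapping. It is not the library's code: the regular expression is a guess at what makes a path divergent (only the two example paths above come from the explanation), and returning False from `smart_get` when nothing matches is likewise an assumption.

```python
import re

# Assumed markers of a path that may select more than one value:
# wildcard, recursive descent, or a bracket containing a union, slice or filter.
_DIVERGING = re.compile(r"\*|\.\.|\[[^\]]*[,:?][^\]]*\]")


def has_diverged(path):
    """Return True if the path could match more than one value (simplified model)."""
    return bool(_DIVERGING.search(path))


def smart_get(results, diverged):
    """Model of SmartGet: unwrap the result list only for non-divergent paths."""
    if not results:
        return False  # assumption: no match stays False
    return results if diverged else results[0]
```

With this model, `$.a.b.c` yields the bare value, while `$.a[0,4].c` always yields a list, even if only one of the two indices exists. Quoted keys that contain a comma or colon would be misclassified by the regular expression; a real implementation would decide this while parsing.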

I think this comment is a good starting point to the conversation: #68 (comment)

[...]
k in [x, y, z] literally means is k inside of the "container" containing x, y and z. It is intended for k, x, y and z to be specific values (hence the use of Value). If they happen to be a path to multiple values, the first is taken (it could have been a random one, but we chose the first one for consistency) as Value represents a single value.

The syntax you'd hypothetically want is something like value 'in' (jsonpath | childpath) (i.e. 3 in $.path.to.numbers). Notice the lack of [] around it. Meaning that the "container" is the results of the path. This is currently not implemented as part of this library, though. If you want to implement it, I am open to reviewing a PR for it.
[...]

What I'm thinking of right now is:

  1. value 'in' (jsonpath | childpath) should be its own separate expression in the spec and in the language implementation (not reusing "in array"). Maybe with a name like "in query result"? If you can think of a better name, let me know :D
  2. This new expression creates an inconsistency in what a jsonpath means when it appears in an expression: before, they were all consistently Values. Now it will be a proper query in this case and thus not a Value. $.a in $.b would translate to "first result of $.a in all results of $.b". If $.b in the object is an array, then the expression will be <some_value> in [[...<results of $.b>]]. Is that the right thing to do? Are there alternatives?
  3. Should SmartGet affect this sub-query? I'm thinking yes, for consistency. It would enable $..book[?(@.author in $.authors[*])] to be equivalent to $..book[?(@.author in $.authors)], if SmartGet is enabled. On the other hand, so far the language implementation was agnostic to SmartGet except for the outer-most layer. Is it worth changing this? Which would be the right way of doing it?
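Points 2 and 3 can be made concrete with a small Python model. This is a sketch of the open question, not a proposal for the actual implementation; `in_query_result` and its parameters are hypothetical names, and the author values are made up.

```python
def in_query_result(needle, results, diverged, smart_get=False):
    """Hypothetical `value in jsonpath` check over a sub-query's result list."""
    if smart_get and not diverged and len(results) == 1 and isinstance(results[0], list):
        # Propagated SmartGet: a non-divergent path pointing at an array
        # is unwrapped, so the array's elements become the container.
        results = results[0]
    return needle in results


authors = ["A", "B", "C"]      # made-up content of $.authors

star_results = list(authors)   # $.authors[*] -> one result per element (divergent)
plain_results = [authors]      # $.authors    -> a single result: the array itself

with_star = in_query_result("B", star_results, diverged=True)
without_smart_get = in_query_result("B", plain_results, diverged=False)
with_smart_get = in_query_result("B", plain_results, diverged=False, smart_get=True)
```

Without propagation, `$.authors` behaves like `<some_value> in [[...]]` and never matches a plain author value; with propagation, it matches the same values as `$.authors[*]`.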

There seems to be no consensus on what the answer here should be, even for a subset of this feature.

I'd like to hear thoughts on this. Let's make sure we implement this the right way!

@das-peter
Author

@Galbar Thank you so much for getting back on this. It looks like I need to do some more reading.
I also need to study https://datatracker.ietf.org/doc/rfc9535/ more closely, because if that standard truly lacks any kind of similar functionality, I really think it should be part of either a next iteration or a well-defined extension.
Data normalization just seems to be in the "spirit" of JSON, to minimize the data volume, so we should have the tools to use normalized data.

  1. That makes sense to me. It felt convoluted packing it into in_array; it started to feel like too many pathways.

I'll come back regarding the other things once I feel like I've more qualified ideas.

@Galbar
Owner

Galbar commented Feb 18, 2025

For context, this library predates that RFC by many years. If it defines anything that makes this library not-compliant it is most likely that I will ignore it. There were many things in the original spec that were not properly defined and I had to sit down and define them. I stand behind (most of) those decisions.

I have not read that RFC fully nor very attentively. While I was writing my previous comment I tried to find in it any reference to the in operator, I found none. What I did find is a whole section on functions (i.e. length(@.foo)). I can see how the argument could be made that in should be a function. I have no active plans to implement functions, although, as always, if someone comes with a PR I'll review it.

Nevertheless, I think the focus should be on making the feature make sense within the feature set of this library, which is why my comments revolve around it.

The more I think about it, the more I think the answer to 3. should be that SmartGet gets propagated. I just have the feeling that propagating it properly is going to be a PITA 😅

I'm still unsure about 2. 🤔
