Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--json output is absurdly huge #562

Open
MajorDallas opened this issue Sep 8, 2023 · 0 comments
Open

--json output is absurdly huge #562

MajorDallas opened this issue Sep 8, 2023 · 0 comments

Comments

@MajorDallas
Copy link

I wrote this on #215 day before yesterday:

... the [json] file size was over 10,000x that of the html and md files. No joke: the same diff produced a 113KB html file and a 1.1GB json file 🤯 I've had some difficulty doing anything with it to see what data it contains...

The "difficulty" was that all tools I had on hand for analyzing json were getting OOMKilled before they could finish parsing the file. They all ate up a full 9.8GB of RAM within about 45 seconds and died after a few minutes.

Luckily I learned about the --stream option for jq. It still ate up every free byte of memory that it could and took about five minutes to do anything, but it didn't crash and it got me started.

This SO answer helped me figure out what fields to look up using a generalization of this other answer. Between the two of these, I managed to extract a single endpoint of interest from changedOperations into a file of 729KB--which is almost 2x as large as the entire original spec.

I specifically was trying to find why the html diff was telling me an endpoint had a breaking change but showed identical schema. Turned out one field's maxLength property was reduced, but the html (and md) output doesn't include that level of detail. But more importantly: while investigating that, I noticed a number of objects and structures were being repeated in various places.

One in particular jumped out at me, so I wrote a little more jq to count up the number and location of occurrences:

[ path(.. | select(type == "object" and has("context"))) as $p
  | { "path": ( $p | join(".") ), "context": ( getpath($p + ["context"]) ) }
]
| group_by(.context)
| [.[0][].path] as $c0paths
| [.[1][].path] as $c1paths
| [.[2][].path] as $c2paths
| [ {"context": (.[0] | first | .context)
	 , "no_paths": ($c0paths | length)
	 , "paths": ($c0paths)}
	, {"context": (.[1] | first | .context)
	   , "no_paths": ($c1paths | length)
	   , "paths": ($c1paths)}
	, {"context": (.[2] | first | .context)
	   , "no_paths": ($c2paths | length)
	   , "paths": ($c2paths)
	}
]
[
  {
    "context": {
      "url": "/authenticate",
      "parameters": {},
      "method": "POST",
      "response": false,
      "request": true,
      "required": true
    },
    "no_paths": 640
  },
  {
    "context": {
      "url": "/authenticate",
      "parameters": {},
      "method": "POST",
      "response": true,
      "request": false
    },
    "no_paths": 14
  },
  {
    "context": {
      "url": "/authenticate",
      "parameters": {},
      "method": "POST",
      "response": false,
      "request": true
    },
    "no_paths": 14
  }
]

(paths arrays ommitted for brevity.)

All told, that accounts for something like 74KB of duplicated data.

I also like this one:

{
    "compatible": false,
    "incompatible": true,
    "unchanged": false,
    "different": true
}

These four fields appear in well over 1,000 places and account for around 82KB. I see an easy way to cut that down to around 41KB...

The operation objects from the original specs are included in their entirety under two top level keys. Then, they're each duplicated under two other keys, accounting for 12KB altogether. Since someone making a diff would presumably have both specs on hand, it's unnecessary to include these full models at all, nevermind four times (that I've found so far).

There are a number of other structures that pop up all over the place, hundreds of times. I don't know how many are exact duplicates of each other, but I'm guessing it's a lot more than is strictly necessary.

I suspect the cause to be that Java objects may be getting serialized into JSON verbatim, including their references to related objects. I'm sure those references are necessary in the Java objects themselves and that they make the library very easy to work with. However, JSON output isn't for a library consumer: the Java object model doesn't make sense in other contexts.

The json output could just use references itself, but, honestly, please don't. Json refs are about as hard to parse as a 1.1GB file, so getting a 1,000-fold disk savings would be just about offset by the need to get tools that can handle refs correctly and also deal with large file sizes.

Instead, I would suggest a better model for the output.

  • Since this is a diff tool, there's no reason to include entire models from the original specs. We only need the things that are actually different.
  • As little duplication as possible. For example, there is no conceivable reason to include that context object over 600 times when there were only three unique contexts--and the three of them share half of their fields. It makes sense in Java; it makes no sense in a file that's to be used by humans (give or take some automated parsing).
  • Put interesting data closer to the top of the object tree. Here's one jq path (of several) to get at the changed maxLength property: .changedOperations[0].requestBody.changedElements[1].changedElements[0].changedElements[0].maxLength. This is almost impossible to review by eyeball (even with pretty-print) and it's a pain to parse out even with tools like jq or nushell's tables and dataframes.

In addition, I would suggest that such a new model be unified across all output formats. Why should html have less information than md, which has less information than stdout, which has less information than json? I just want to get back to writing my API client code knowing that the diff I'm using for guidance is comprehensive and correct, not worry if I missed an obscure property like maxLegth because it was omitted from the output, nor spend a day learning an FP-paradigm DSL just to figure out that's what was omitted!

And as a last thought, which probably deserves its own issue: --yaml option. Compliant YAML parsers should be stream-oriented and capable of parsing in chunks, meaning the issue of running out of memory can be side-stepped even for files that are several GB in size-- unlike json, which must be parsed in full to be understood by the application.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant