Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Theory of JSON parsing #1403

Merged
merged 16 commits into from
Mar 1, 2025
Merged

Theory of JSON parsing #1403

merged 16 commits into from
Mar 1, 2025

Conversation

didriklundberg
Copy link
Contributor

This small JSON parsing theory, written by @arolle and myself, has been validated against the "Parsing JSON is a Minefield" test suite. @myreen suggested I make a PR in some previous discussion - please give feedback on whether there is additional work or cleanup needed.

@didriklundberg didriklundberg marked this pull request as draft February 19, 2025 15:56
@didriklundberg
Copy link
Contributor Author

Converted to draft: I'll add another commit with some minor fixes and cleanup.

@mn200
Copy link
Member

mn200 commented Feb 21, 2025

I'd be happy to take delivery of this more or less as is, but do take your time polishing to your satisfaction. One thing, I'd particularly like to see, but can do myself if you prefer is to really use :int and not :num.

@didriklundberg
Copy link
Contributor Author

I'd be happy to take delivery of this more or less as is, but do take your time polishing to your satisfaction. One thing, I'd particularly like to see, but can do myself if you prefer is to really use :int and not :num.

Thanks for your feedback. I would also hope for a better number representation than the current one, and I tried rewriting the theory to use int instead of sign # num. However, this results in failing validation tests due to not keeping the distinction between 0 and -0, which are indeed distinct e.g. in the IEEE 754 standard. Even worse, since the fractional component is still separate, -0.3 gets wrongly represented as 0.3.

Should I try using rat and ignore the 0 vs. -0 issue, or keep things as they are and just add a comment based on this discussion?

@mn200
Copy link
Member

mn200 commented Feb 21, 2025

This is an interesting question. Of course, if you're committed to being that accurate, you might use HOL's floating point type.

@didriklundberg
Copy link
Contributor Author

didriklundberg commented Feb 24, 2025

This is an interesting question. Of course, if you're committed to being that accurate, you might use HOL's floating point type.

Thanks for the tip. I've looked through floating-point/binary_ieeeScript.sml and float/ieeeScript.sml, with the following conclusions:

  • binary_ieeeScript defines binary floats in terms of words (for which you commit to a specific precision), so this would not be usable for JSON numbers, which require an arbitrary-precision decimal floating point representation.
  • ieeeScript does not define a float type as such. Rather, the arithmetic defined therein operates on a tuple of nums and also assumes a binary floating-point representation as opposed to a decimal one. Representing decimal fractional numbers using binary floating-point numbers necessarily incurs rounding errors, which we can't have. So there seems to be no clear benefit in using the conventions of ieeeScript, since the definitions would not be usable anyhow.

A brief search reveals there was an extension of Harrison's formalisation (which ieeeScript is based on?) to arbitrary radix in 2015 (paper, relevant code). Although their generalized representation seems a bit weird (e.g. "zero is not a floating point number"), it might be useful as inspiration for a true arbitrary-precision arbitrary-radix floating point representation. But this is probably too much for this PR.

Therefore, I suggest I document the JSON number representation further, define a separate type for it, then provide mechanisms for extracting rats and ints. This would allow accurate serialisation+deserialisation, and some compatibility with existing HOL4 formalizations. What do you think?

…e_jsonTheory. Small correction to treatment of trailing zeroes
@mn200
Copy link
Member

mn200 commented Feb 24, 2025

JSON RFC says

This specification allows implementations to set limits on the range
and precision of numbers accepted. Since software that implements
IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is
generally available and widely used, good interoperability can be
achieved by implementations that expect no more precision or range
than these provide, in the sense that implementations will
approximate JSON numbers within the expected precision. A JSON
number such as 1E400 or 3.141592653589793238462643383279 may indicate
potential interoperability problems, since it suggests that the
software that created it expects receiving software to have greater
capabilities for numeric magnitude and precision than is widely
available.

Note that when such software is used, numbers that are integers and
are in the range [-(253)+1, (253)-1] are interoperable in the
sense that implementations will agree exactly on their numeric
values.

which suggests to me that you could use src/floating-point’s 64 bit float instantiation.

@didriklundberg
Copy link
Contributor Author

JSON RFC says

This specification allows implementations to set limits on the range
and precision of numbers accepted. Since software that implements
IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is
generally available and widely used, good interoperability can be
achieved by implementations that expect no more precision or range
than these provide, in the sense that implementations will
approximate JSON numbers within the expected precision. A JSON
number such as 1E400 or 3.141592653589793238462643383279 may indicate
potential interoperability problems, since it suggests that the
software that created it expects receiving software to have greater
capabilities for numeric magnitude and precision than is widely
available.

Note that when such software is used, numbers that are integers and
are in the range [-(253)+1, (253)-1] are interoperable in the
sense that implementations will agree exactly on their numeric
values.

which suggests to me that you could use src/floating-point’s 64 bit float instantiation.

Yes, the RFC says that 64-bit implementations have "good interoperability", but for full interoperability you must support arbitrary precision. You can have a RFC-compliant 64-bit implementation, but it wouldn't be interoperable with all other RFC-compliant implementations as far as numbers are concerned. This was the main priority I had in mind when writing this (specifically, permitting as many cases prefixed by "i" in this list as possible).

Of course, there could also be a point in having an implementation that is limited to 64-bit floats, if you want to exclude the existence of more precise floats. It would be easy to just create another file and have, for example, json_bigfloatScript.sml and json_ieee754Script.sml with separate number types. In practice, restriction to 64-bit floats could of course also be accomplished by just using a preprocessor and postprocessor when serialising and deserialising arbitrary-precision numbers.

It's your call in the end and if you only want json_ieee754Script.sml, and/or any of the above alternatives, then I'll add another commit to fix that.

@mn200
Copy link
Member

mn200 commented Feb 28, 2025

Thanks for taking the time over this. I agree with you that the current approach is probably the best match for the full generality of what the RFC requires.

@didriklundberg didriklundberg marked this pull request as ready for review February 28, 2025 13:02
@didriklundberg
Copy link
Contributor Author

Thanks again for your feedback. I have now marked the PR as ready for review and merge.

@mn200 mn200 merged commit dae42fb into HOL-Theorem-Prover:develop Mar 1, 2025
4 checks passed
@mn200
Copy link
Member

mn200 commented Mar 1, 2025

Thanks again! May move within the examples directory, perhaps to be under examples/formal-languages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants