Theory of JSON parsing #1403

didriklundberg · 2025-02-17T14:24:29Z

This small JSON parsing theory, written by @arolle and myself, has been validated against the "Parsing JSON is a Minefield" test suite. @myreen suggested I make a PR in some previous discussion - please give feedback on whether there is additional work or cleanup needed.

didriklundberg · 2025-02-19T15:57:35Z

Converted to draft: I'll add another commit with some minor fixes and cleanup.

…eory

mn200 · 2025-02-21T00:03:19Z

I'd be happy to take delivery of this more or less as is, but do take your time polishing to your satisfaction. One thing, I'd particularly like to see, but can do myself if you prefer is to really use :int and not :num.

didriklundberg · 2025-02-21T11:13:30Z

I'd be happy to take delivery of this more or less as is, but do take your time polishing to your satisfaction. One thing, I'd particularly like to see, but can do myself if you prefer is to really use :int and not :num.

Thanks for your feedback. I would also hope for a better number representation than the current one, and I tried rewriting the theory to use int instead of sign # num. However, this results in failing validation tests due to not keeping the distinction between 0 and -0, which are indeed distinct e.g. in the IEEE 754 standard. Even worse, since the fractional component is still separate, -0.3 gets wrongly represented as 0.3.

Should I try using rat and ignore the 0 vs. -0 issue, or keep things as they are and just add a comment based on this discussion?

mn200 · 2025-02-21T23:14:33Z

This is an interesting question. Of course, if you're committed to being that accurate, you might use HOL's floating point type.

didriklundberg · 2025-02-24T14:12:19Z

This is an interesting question. Of course, if you're committed to being that accurate, you might use HOL's floating point type.

Thanks for the tip. I've looked through floating-point/binary_ieeeScript.sml and float/ieeeScript.sml, with the following conclusions:

binary_ieeeScript defines binary floats in terms of words (for which you commit to a specific precision), so this would not be usable for JSON numbers, which require an arbitrary-precision decimal floating point representation.
ieeeScript does not define a float type as such. Rather, the arithmetic defined therein operates on a tuple of nums and also assumes a binary floating-point representation as opposed to a decimal one. Representing decimal fractional numbers using binary floating-point numbers necessarily incurs rounding errors, which we can't have. So there seems to be no clear benefit in using the conventions of ieeeScript, since the definitions would not be usable anyhow.

A brief search reveals there was an extension of Harrison's formalisation (which ieeeScript is based on?) to arbitrary radix in 2015 (paper, relevant code). Although their generalized representation seems a bit weird (e.g. "zero is not a floating point number"), it might be useful as inspiration for a true arbitrary-precision arbitrary-radix floating point representation. But this is probably too much for this PR.

Therefore, I suggest I document the JSON number representation further, define a separate type for it, then provide mechanisms for extracting rats and ints. This would allow accurate serialisation+deserialisation, and some compatibility with existing HOL4 formalizations. What do you think?

…e_jsonTheory. Small correction to treatment of trailing zeroes

mn200 · 2025-02-24T23:03:19Z

JSON RFC says

This specification allows implementations to set limits on the range
and precision of numbers accepted. Since software that implements
IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is
generally available and widely used, good interoperability can be
achieved by implementations that expect no more precision or range
than these provide, in the sense that implementations will
approximate JSON numbers within the expected precision. A JSON
number such as 1E400 or 3.141592653589793238462643383279 may indicate
potential interoperability problems, since it suggests that the
software that created it expects receiving software to have greater
capabilities for numeric magnitude and precision than is widely
available.

Note that when such software is used, numbers that are integers and
are in the range [-(253)+1, (253)-1] are interoperable in the
sense that implementations will agree exactly on their numeric
values.

which suggests to me that you could use src/floating-point’s 64 bit float instantiation.

didriklundberg · 2025-02-25T09:16:40Z

JSON RFC says

This specification allows implementations to set limits on the range
and precision of numbers accepted. Since software that implements
IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is
generally available and widely used, good interoperability can be
achieved by implementations that expect no more precision or range
than these provide, in the sense that implementations will
approximate JSON numbers within the expected precision. A JSON
number such as 1E400 or 3.141592653589793238462643383279 may indicate
potential interoperability problems, since it suggests that the
software that created it expects receiving software to have greater
capabilities for numeric magnitude and precision than is widely
available.

Note that when such software is used, numbers that are integers and
are in the range [-(253)+1, (253)-1] are interoperable in the
sense that implementations will agree exactly on their numeric
values.

which suggests to me that you could use src/floating-point’s 64 bit float instantiation.

Yes, the RFC says that 64-bit implementations have "good interoperability", but for full interoperability you must support arbitrary precision. You can have a RFC-compliant 64-bit implementation, but it wouldn't be interoperable with all other RFC-compliant implementations as far as numbers are concerned. This was the main priority I had in mind when writing this (specifically, permitting as many cases prefixed by "i" in this list as possible).

Of course, there could also be a point in having an implementation that is limited to 64-bit floats, if you want to exclude the existence of more precise floats. It would be easy to just create another file and have, for example, json_bigfloatScript.sml and json_ieee754Script.sml with separate number types. In practice, restriction to 64-bit floats could of course also be accomplished by just using a preprocessor and postprocessor when serialising and deserialising arbitrary-precision numbers.

It's your call in the end and if you only want json_ieee754Script.sml, and/or any of the above alternatives, then I'll add another commit to fix that.

mn200 · 2025-02-28T03:31:41Z

Thanks for taking the time over this. I agree with you that the current approach is probably the best match for the full generality of what the RFC requires.

didriklundberg · 2025-02-28T13:06:58Z

Thanks again for your feedback. I have now marked the PR as ready for review and merge.

mn200 · 2025-03-01T01:43:30Z

Thanks again! May move within the examples directory, perhaps to be under examples/formal-languages.

didriklundberg and others added 14 commits January 13, 2025 14:25

Initial commit of JSON parser

d9ede42

json: revise parsing approach, and add serialiser

e258438

json: correct theory name

0f2631d

revise termination proofs

876bd42

make proof more resilient wrt renaming

599a778

minor

d25b00c

reverse tokens in json_to_tok

25f8110

add statements of interesting theorems

3c96c37

json: recurse into nested arrays

a3a7acc

json: merge some parser cases

ca1c5d8

json: add a few more examples

8c49e26

Extended number format from natural numbers to integers

f530937

Parsing of numbers in JSON-style scientific notation

58ff0b3

Comment out cheated theorems in parse_jsonTheory

44a9130

didriklundberg marked this pull request as draft February 19, 2025 15:56

Cleanup and fixes to half-baked number representation in parse_jsonTh…

0cf54e3

…eory

Add discussion on number representation, conversion functions to pars…

7775d53

…e_jsonTheory. Small correction to treatment of trailing zeroes

didriklundberg marked this pull request as ready for review February 28, 2025 13:02

mn200 merged commit dae42fb into HOL-Theorem-Prover:develop Mar 1, 2025
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Theory of JSON parsing #1403

Theory of JSON parsing #1403

didriklundberg commented Feb 17, 2025

didriklundberg commented Feb 19, 2025

mn200 commented Feb 21, 2025

didriklundberg commented Feb 21, 2025

mn200 commented Feb 21, 2025

didriklundberg commented Feb 24, 2025 •

edited

Loading

mn200 commented Feb 24, 2025

didriklundberg commented Feb 25, 2025

mn200 commented Feb 28, 2025

didriklundberg commented Feb 28, 2025

mn200 commented Mar 1, 2025

Theory of JSON parsing #1403

Theory of JSON parsing #1403

Conversation

didriklundberg commented Feb 17, 2025

didriklundberg commented Feb 19, 2025

mn200 commented Feb 21, 2025

didriklundberg commented Feb 21, 2025

mn200 commented Feb 21, 2025

didriklundberg commented Feb 24, 2025 • edited Loading

mn200 commented Feb 24, 2025

didriklundberg commented Feb 25, 2025

mn200 commented Feb 28, 2025

didriklundberg commented Feb 28, 2025

mn200 commented Mar 1, 2025

didriklundberg commented Feb 24, 2025 •

edited

Loading