-
Notifications
You must be signed in to change notification settings - Fork 666
GH-3404: Canonicalize decimals during inlining for TDB2. #3405
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hm, I see that this PR causes TestNormalizationTDB2 to fail. The alternative approach to make round-tripping work - hopefully without breaking tests - would be to canonicalize decimals during inlining. Then the NodeId that becomes stored would be that of the canonical decimal. |
9500a18
to
890c10a
Compare
I added a pre-canonicalization step and both the existing tests and the new ones now pass. |
890c10a
to
caa44c1
Compare
} | ||
|
||
/** Return a canonical decimal with a trailing ".0". */ | ||
public static BigDecimal canonicalDecimalWithDot(BigDecimal decimal) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this differ from canonicalDecimalStrWithDot
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new method is used on input after parsing the lexical form into a decimal - just before inlining. The Str Version ist used on output, after converting the decimal to string. I think that the Str Version could be removed in favor of Decimal.toPlainString() since the binary representation is now canonicalized on input.
Inserts/Deletes/Lookups of quads with non-canonical decimals now work reliably 🥳 As a side note, with the PR applied, our (streaming) sameAs reasoner now also works as expected - no more exceptions about lookups failing to detect the lexicographically least physical quad in the store upon which to enrich the graph.find stream with all the sameAs inferences. There used to be another similar issue with non-canonical lang tags such as SameAs inferencing using |
{ test("18.0", NodeFactory.createLiteralDT("18.0", XSDDatatype.XSDdecimal)); } | ||
|
||
@Test public void nodeId_decimal_20() | ||
{ testNodeIdRoundtrip(NodeValue.makeDecimal("18").asNode()); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use NodeFactory.createLiteralDT
, possibly via a new testNodeIdRoundtripDecimal(String)
.
NodeValue.makeDecimal
may have other effects; in fact, adding it into the round triple steps might be a good thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test cases are now updated to use testNodeIdRoundtripDecimal(String)
.
Is this after reloading the data? Or new old - existing TDB2 database?
Is that a typo for |
I needed to reload the offending graphs (I used 5.6.0-SNAPSHOT) - the TDB2 version used to load the data initially was around 4.8.0 or 4.9.0.
Right, my issue was not about the order of the components - but the use of lower case only. I was able to reproduce my issue with an older version of Jena:
On my local 4.7.0 the query returns |
aa4dfbb
to
44b7e22
Compare
Jena 5.4.0 introduced The RDF 1.1 spec says "language tags MAY be converted to lower case. The value space of language tags is always in lower case." This leads to differences across systems (when are we in "value space" for a language tag?) but also differences between Jena dataset implementations. Now RDF 1.2 Concepts says "MUST be treated accordingly, that is, in a case-insensitive manner." Jena went with the algorithm in RFC 5646 section 2.1.1, without regitry access, based on some previous user feedback. Language tag are now parsed or created as case-normalized following RFC 5646 form (lang is lowercase, region is uppercase) then comparison. The original language tag is not preserved. The WG did a survey of systems: it found systems providing lower case and systems providing the algorithm in RFC 5646 section 2.1.1. Jena went with the algorithm in RFC 5646 based on some previous user feedback. There is an entry in the RDF 1.2 change log. For initial text direction, the strings are CHANGES for jena 5.4.0:
|
44b7e22
to
1376cb7
Compare
I just updated the commit message for the PR - other than that I think its done(?) |
From your side - yes. I want to investigate the consequences, such as range which seems to be changed e.g. 15000000000000 (15e12), and conversely, very small numbers. |
Just my two cents: The exponential notation of decimals (15e12) in BigDecimal seems to be a mere feature of certain "toString" functions. https://www.w3.org/TR/xmlschema-2/#decimal-lexical-representation
As a side note, perhaps it is desired in TDB2 to expose non-canonical decimals directly with Without the change the legacy non-canonical decimals will be exposed as canonical ones and thus a delete will never match, thus eventually necessitating a data reload. |
The |
BGP matching is by comparing the 64bit binary The Jena v6 is, hopefully, the release after next. That would be a good time to make the change systematically and have a data reload or a conversion SPARQL update. The final jena 5 release could have "current" and "improved" (for you, mainly!) with "improved" needing to be explicitly enabled. (The user is responsible for not mixing them.) |
I don't have need for separate modes as I have upgraded my data to the canonical form already. I was just thinking about how other users could fix their data in case they run into this issue. But perhaps just recommending a reload with jena 6 is the easiest way. |
OK - I'll put this on a branch in the Jena GH repo. It can stay on the branch until jena6. It's not an area that is likely to cause conflicts up to jena6. (I have some code tidy to apply as well) |
Pulled and rebased to the current main on my working repo (a safer place than apache/jena ... just in case of a misstep). https://github.com/afs/jena/pull/new/decimal56 Continued on PR #3428 |
Thanks! Closjng this and continuing on #3428. |
Pull request Description:
Proposal to canonicalize decimals during inlining as TDB2 NodeIds. This way, the canonical form is consistently used for storing and matching.
By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.
See the Apache Jena "Contributing" guide.