GH-3404: Canonicalize decimals during inlining for TDB2. #3405

Aklakan · 2025-08-29T17:14:32Z

Pull request Description:

Proposal to canonicalize decimals during inlining as TDB2 NodeIds. This way, the canonical form is consistently used for storing and matching.

Tests are included.
Commits have been squashed to remove intermediate development commit messages.
Key commit messages start with the issue number (GH-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.

See the Apache Jena "Contributing" guide.

Aklakan · 2025-08-29T17:35:22Z

Hm, I see that this PR causes TestNormalizationTDB2 to fail.
I am not sure whether non-canonical decimals should be returned in their canonical form by TDB2 (the test cases suggest: yes) or whether this should only be the case when e.g. canonicalize literals is enabled on the parser.

The alternative approach to make round-tripping work - hopefully without breaking tests - would be to canonicalize decimals during inlining. Then the NodeId that becomes stored would be that of the canonical decimal.

Aklakan · 2025-08-29T20:45:37Z

I added a pre-canonicalization step and both the existing tests and the new ones now pass.

afs · 2025-08-30T08:15:38Z

jena-arq/src/main/java/org/apache/jena/sparql/util/XSDNumUtils.java

    }

+    /** Return a canonical decimal with a trailing ".0". */
+    public static BigDecimal canonicalDecimalWithDot(BigDecimal decimal) {


How does this differ from canonicalDecimalStrWithDot?

This new method is used on input after parsing the lexical form into a decimal - just before inlining. The Str version is used on output, after converting the decimal to a string. I think that the Str version could be removed in favor of BigDecimal.toPlainString() since the binary representation is now canonicalized on input.
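For readers following this review thread, canonicalizing on the value side before inlining can be sketched with plain java.math.BigDecimal. This is a minimal illustration and not the code from the PR; the class name CanonicalDecimalSketch and the method canonicalWithDot are hypothetical.

```java
import java.math.BigDecimal;

final class CanonicalDecimalSketch {
    /** Hypothetical helper: drop trailing zeros but keep at least one fractional digit. */
    static BigDecimal canonicalWithDot(BigDecimal decimal) {
        BigDecimal stripped = decimal.stripTrailingZeros();
        // A scale <= 0 means no fractional digit is left (e.g. 18 or 1.5E+3),
        // so raise the scale to 1 to obtain the ".0" form.
        return stripped.scale() <= 0 ? stripped.setScale(1) : stripped;
    }

    public static void main(String[] args) {
        for (String lex : new String[] {"18", "18.00", "+1.50", "1500", "0.00"}) {
            BigDecimal canonical = canonicalWithDot(new BigDecimal(lex));
            System.out.println(lex + " -> " + canonical.toPlainString());
        }
    }
}
```

With the value normalized this way, toPlainString() already yields a lexical form with a decimal point, which is the reasoning behind dropping the string-level helper.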

Aklakan · 2025-08-30T13:22:07Z

Inserts/Deletes/Lookups of quads with non-canonical decimals now work reliably 🥳

As a side note, with the PR applied, our (streaming) sameAs reasoner now also works as expected - no more exceptions about lookups failing to detect the lexicographically least physical quad in the store upon which to enrich the graph.find stream with all the sameAs inferences. There used to be another similar issue with non-canonical lang tags such as "foo"^^us-EN not matching in their retrieved form, but this has already been solved - most likely as part of the RDF1.2 work. So at least for now there are no more issues here for me.
We created this reasoner plugin because the performance of the OWL reasoner and the rule engine was too poor for our use case, and this approach lets one selectively enable reasoning for specific parts of a query.

SameAs inferencing using SERVICE <sameAs:> demo: https://api.triplydb.com/s/OI_O18TlJ

afs · 2025-08-30T16:03:13Z

jena-tdb2/src/test/java/org/apache/jena/tdb2/store/value/TestNodeIdInline.java

+    { test("18.0", NodeFactory.createLiteralDT("18.0", XSDDatatype.XSDdecimal)); }
+
+    @Test public void nodeId_decimal_20()
+    { testNodeIdRoundtrip(NodeValue.makeDecimal("18").asNode()); }


Better to use NodeFactory.createLiteralDT, possibly via a new testNodeIdRoundtripDecimal(String).

NodeValue.makeDecimal may have other effects; in fact, adding it into the round-trip steps might be a good thing.

The test cases are now updated to use testNodeIdRoundtripDecimal(String).

afs · 2025-08-30T17:46:59Z

also works as expected

Is this after reloading the data? Or with the new code on an old, existing TDB2 database?

"foo"^^us-EN

Is that a typo for "foo"@en-US (which is canonical)? Jena keeps language tags in RFC case format, not lower case.

Aklakan · 2025-08-30T19:00:40Z

Is this after reloading the data?

I needed to reload the offending graphs (I used 5.6.0-SNAPSHOT) - the TDB2 version used to load the data initially was around 4.8.0 or 4.9.0.

Is that a typo for "foo"@en-us

Right, my issue was not about the order of the components - but the use of lower case only. I was able to reproduce my issue with an older version of Jena:

./tdb2.tdbupdate --loc /tmp/tdb 'INSERT DATA { <urn:s> <urn:p> "foo"@en-gb }'
./tdb2.tdbquery --loc /tmp/tdb 'SELECT * { { BIND(1 AS ?id) <urn:s> <urn:p> "foo"@en-gb } UNION { BIND(2 AS ?id) <urn:s> <urn:p> "foo"@en-GB } }'

On my local 4.7.0 the query returns {(?id=1)} whereas on 5.6.0-SNAPSHOT I get {(?id=1), (?id=2)}.
Also, SELECT ?o { ?s ?p ?o } returns "foo"@en-gb on 4.7.0 and "foo"@en-GB on 5.6.0-SNAPSHOT - but the upper-case language tag would not match on the legacy tdb2 version.

afs · 2025-08-31T10:29:14Z

Jena 5.4.0 introduced jena-langtag.

The RDF 1.1 spec says "language tags MAY be converted to lower case. The value space of language tags is always in lower case." This leads to differences across systems (when are we in "value space" for a language tag?) but also differences between Jena dataset implementations.

Now RDF 1.2 Concepts says "MUST be treated accordingly, that is, in a case-insensitive manner."

Jena went with the algorithm in RFC 5646 section 2.1.1, without registry access, based on some previous user feedback. Language tags are now parsed or created in case-normalized RFC 5646 form (language is lowercase, region is uppercase), and comparison happens on that form. The original language tag is not preserved.
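The case convention of RFC 5646 section 2.1.1 can be sketched as follows: every subtag is lowercase, except that two-letter subtags become uppercase and four-letter subtags become title case when they are neither first nor after a singleton. This is a minimal illustration with no validation and no registry lookup; it is not the jena-langtag implementation, and the class name LangTagCaseSketch is hypothetical.

```java
import java.util.Locale;

final class LangTagCaseSketch {
    /** Case-format a language tag following the convention of RFC 5646 section 2.1.1. */
    static String format(String tag) {
        String[] subtags = tag.split("-");
        StringBuilder out = new StringBuilder();
        boolean afterSingleton = false;
        for (int i = 0; i < subtags.length; i++) {
            String s = subtags[i].toLowerCase(Locale.ROOT);
            if (i > 0 && !afterSingleton) {
                if (s.length() == 2) {
                    s = s.toUpperCase(Locale.ROOT); // region, e.g. "GB"
                } else if (s.length() == 4) {
                    s = Character.toUpperCase(s.charAt(0)) + s.substring(1); // script, e.g. "Latn"
                }
            }
            if (s.length() == 1) {
                afterSingleton = true; // extension or private-use marker such as "x"
            }
            if (i > 0) {
                out.append('-');
            }
            out.append(s);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("en-gb"));          // en-GB
        System.out.println(format("az-latn-x-LATN")); // az-Latn-x-latn
    }
}
```

Under this convention the stored "foo"@en-gb from the tdbupdate example above and a query for "foo"@en-GB end up with the same tag.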

The WG did a survey of systems: it found systems providing lower case and systems providing the algorithm in RFC 5646 section 2.1.1.

There is an entry in the RDF 1.2 change log.

For initial text direction, the strings are ltr or rtl and lower case is required.

CHANGES for jena 5.4.0:

== New artifact

Jena 5.4.0 introduces a new module jena-langtag for language tag parsing
in compliance with RFC 5646. Language tag validation is strengthened.
When parsing, language tag violations are still treated as warnings.

Aklakan · 2025-08-31T12:17:34Z

I just updated the commit message for the PR - other than that I think it's done(?)

afs · 2025-09-02T16:14:03Z

I just updated the commit message for the PR - other than that I think it's done(?)

From your side - yes.

I want to investigate the consequences, such as the range, which seems to be changed, e.g. for 15000000000000 (15e12) and, conversely, for very small numbers.
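One way to see why the range could be affected: a BigDecimal is an unscaled integer plus a scale, and forcing a trailing ".0" onto a whole number multiplies the unscaled integer by ten. The sketch below only inspects those two components. That an inline encoding reserves a fixed number of bits for the unscaled value is an assumption made for illustration and is not stated in this thread; the class name DecimalRangeSketch is hypothetical.

```java
import java.math.BigDecimal;

final class DecimalRangeSketch {
    /** Number of bits needed for the magnitude of the unscaled value. */
    static int unscaledBits(BigDecimal d) {
        return d.unscaledValue().bitLength();
    }

    public static void main(String[] args) {
        BigDecimal plain = new BigDecimal("15000000000000"); // scale 0
        BigDecimal withDot = plain.setScale(1);               // 15000000000000.0, scale 1
        System.out.println(plain.unscaledValue() + " scale=" + plain.scale()
                + " bits=" + unscaledBits(plain));
        System.out.println(withDot.unscaledValue() + " scale=" + withDot.scale()
                + " bits=" + unscaledBits(withDot));
    }
}
```

For this value the unscaled magnitude grows from 44 to 48 bits, so a number that fits a fixed-width field in its plain form may no longer fit once the ".0" is added.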

Aklakan · 2025-09-02T16:52:43Z

Just my two cents: The exponential notation of decimals (15e12) in BigDecimal seems to be a mere feature of certain "toString" functions. BigDecimal.toPlainString does not generate exponent fields.

https://www.w3.org/TR/xmlschema-2/#decimal-lexical-representation

3.2.3.1 Lexical representation
decimal has a lexical representation consisting of a finite-length sequence of decimal digits (#x30-#x39) separated by a period as a decimal indicator. An optional leading sign is allowed. If the sign is omitted, "+" is assumed. Leading and trailing zeroes are optional. If the fractional part is zero, the period and following zero(es) can be omitted. For example: -1.23, 12678967.543233, +100000.00, 210.

3.2.3.2 Canonical representation
The canonical representation for decimal is defined by prohibiting certain options from the Lexical representation (§3.2.3.1). Specifically, the preceding optional "+" sign is prohibited. The decimal point is required. Leading and trailing zeroes are prohibited subject to the following: there must be at least one digit to the right and to the left of the decimal point which may be a zero.

https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/math/BigDecimal.html#toPlainString()

public String toPlainString()
Returns a string representation of this BigDecimal without an exponent field. For values with a positive scale, the number of digits to the right of the decimal point is used to indicate scale. For values with a zero or negative scale, the resulting string is generated as if the value were converted to a numerically equal value with zero scale and as if all the trailing zeros of the zero scale value were present in the result. The entire string is prefixed by a minus sign character '-' ('\u002D') if the unscaled value is less than zero. No sign character is prefixed if the unscaled value is zero or positive.
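The difference between the two string conversions can be checked with a short sketch. Note that "15e12" is accepted by the BigDecimal constructor even though it is not a legal xsd:decimal lexical form; the class name PlainStringSketch and the helper both are hypothetical.

```java
import java.math.BigDecimal;

final class PlainStringSketch {
    /** Render a value with both conversions, separated by " | ". */
    static String both(BigDecimal d) {
        return d.toString() + " | " + d.toPlainString();
    }

    public static void main(String[] args) {
        // toString() switches to scientific notation for a negative scale
        // or a very small magnitude; toPlainString() never does.
        System.out.println(both(new BigDecimal("15e12")));      // 1.5E+13 | 15000000000000
        System.out.println(both(new BigDecimal("0.00000001"))); // 1E-8 | 0.00000001
    }
}
```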

As a side note, perhaps it is desired in TDB2 to expose non-canonical decimals directly with toPlainString instead of canonicalDecimalStrWithDot. The implication is that legacy stored non-canonical decimals would be exposed without the trailing .0 - i.e. a change of existing behavior. However, legacy data could then be upgraded with this SPARQL update (at least in theory):
DELETE { ?s ?p ?o } INSERT { ?s ?p ?o } WHERE { ?s ?p ?o FILTER(datatype(?o) = <http://www.w3.org/2001/XMLSchema#decimal>) }

Without the change the legacy non-canonical decimals will be exposed as canonical ones and thus a delete will never match, thus eventually necessitating a data reload.

afs · 2025-09-02T17:41:24Z

The 15e12 was just to make it easier to read. It's an illegal lexical form for xsd:decimal.

afs · 2025-09-02T17:50:13Z

The implication is that legacy stored non-canonical decimals would be exposed without the trailing .0

BGP matching is by comparing the 64-bit binary NodeId, so there is a conversion on the way in from the query as well as on the way out.
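Why a binary comparison is sensitive to the decimal's form can be shown with a toy encoding. The bit layout below (8 bits of scale above 48 bits of unscaled value) is an assumption for illustration only and is not claimed to be TDB2's actual NodeId layout; the class InlineKeySketch and its methods are hypothetical.

```java
import java.math.BigDecimal;

final class InlineKeySketch {
    /** Toy packing, NOT TDB2's real layout. Values wider than 48 bits are silently truncated. */
    static long toyKey(BigDecimal d) {
        long scale = d.scale() & 0xFFL;
        long value = d.unscaledValue().longValueExact() & 0xFFFF_FFFF_FFFFL;
        return (scale << 48) | value;
    }

    /** No trailing zeros, but at least one fractional digit. */
    static BigDecimal canonical(BigDecimal d) {
        BigDecimal s = d.stripTrailingZeros();
        return s.scale() <= 0 ? s.setScale(1) : s;
    }

    public static void main(String[] args) {
        BigDecimal stored = new BigDecimal("18");
        BigDecimal queried = new BigDecimal("18.0");
        // Numerically equal values differ in (scale, unscaled value), so the keys differ.
        System.out.println(toyKey(stored) == toyKey(queried));                       // false
        // Canonicalizing both sides on the way in makes the keys agree.
        System.out.println(toyKey(canonical(stored)) == toyKey(canonical(queried))); // true
    }
}
```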

The canonicalDecimalStrWithDot works with Turtle short form.

Jena v6 is, hopefully, the release after next. That would be a good time to make the change systematically and have a data reload or a conversion SPARQL update.

The final jena 5 release could have "current" and "improved" (for you, mainly!) with "improved" needing to be explicitly enabled. (The user is responsible for not mixing them.)

Aklakan · 2025-09-02T18:10:39Z

I don't have need for separate modes as I have upgraded my data to the canonical form already. I was just thinking about how other users could fix their data in case they run into this issue. But perhaps just recommending a reload with jena 6 is the easiest way.

afs · 2025-09-08T13:27:02Z

OK - I'll put this on a branch in the Jena GH repo.

It can stay on the branch until jena6. It's not an area that is likely to cause conflicts up to jena6.

(I have some code tidy to apply as well)

afs · 2025-09-08T13:46:39Z

Pulled and rebased to the current main on my working repo (a safer place than apache/jena ... just in case of a misstep).

https://github.com/afs/jena/pull/new/decimal56

Continued on PR #3428

afs · 2025-09-08T15:24:34Z

Thanks!

Closing this and continuing on #3428.

Aklakan marked this pull request as draft August 29, 2025 17:25

Aklakan force-pushed the 20250829_fix_tdb2-decimal branch 2 times, most recently from 9500a18 to 890c10a Compare August 29, 2025 20:38

Aklakan marked this pull request as ready for review August 29, 2025 20:44

Aklakan force-pushed the 20250829_fix_tdb2-decimal branch from 890c10a to caa44c1 Compare August 30, 2025 03:10

afs reviewed Aug 30, 2025

Aklakan changed the title ~~GH-3404: Restore lexical forms of TDB2 NodeId-decimals as plain strings.~~ GH-3404: Canonicalize NodeId-decimals during inlining for TDB2. Aug 30, 2025

Aklakan changed the title ~~GH-3404: Canonicalize NodeId-decimals during inlining for TDB2.~~ GH-3404: Canonicalize decimals during inlining for TDB2. Aug 30, 2025

afs reviewed Aug 30, 2025

afs self-assigned this Aug 30, 2025

Aklakan force-pushed the 20250829_fix_tdb2-decimal branch 2 times, most recently from aa4dfbb to 44b7e22 Compare August 30, 2025 19:26

GH-3404: Canonicalize decimals during inlining for TDB2.

1376cb7

Aklakan force-pushed the 20250829_fix_tdb2-decimal branch from 44b7e22 to 1376cb7 Compare August 31, 2025 12:14

afs mentioned this pull request Sep 8, 2025

GH-3404: Canonicalize decimals during inlining for TDB2 #3428

Draft

4 tasks

afs added the TDB label Sep 8, 2025

afs added the Jena6 label Sep 8, 2025

afs closed this Sep 8, 2025
