Conversation

Aklakan
Contributor

@Aklakan Aklakan commented Aug 29, 2025

Pull request Description:

Proposal to canonicalize decimals during inlining as TDB2 NodeIds. This way, the canonical form is consistently used for storing and matching.


  • Tests are included.
  • Commits have been squashed to remove intermediate development commit messages.
  • Key commit messages start with the issue number (GH-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.


See the Apache Jena "Contributing" guide.

@Aklakan Aklakan marked this pull request as draft August 29, 2025 17:25
Contributor Author

Aklakan commented Aug 29, 2025

Hm, I see that this PR causes TestNormalizationTDB2 to fail.
I am not sure whether non-canonical decimals should be returned in their canonical form by TDB2 (the test cases suggest: yes) or whether this should only be the case when e.g. canonicalize literals is enabled on the parser.

The alternative approach to make round-tripping work - hopefully without breaking tests - would be to canonicalize decimals during inlining. Then the NodeId that becomes stored would be that of the canonical decimal.

@Aklakan Aklakan force-pushed the 20250829_fix_tdb2-decimal branch 2 times, most recently from 9500a18 to 890c10a Compare August 29, 2025 20:38
@Aklakan Aklakan marked this pull request as ready for review August 29, 2025 20:44
Contributor Author

Aklakan commented Aug 29, 2025

I added a pre-canonicalization step and both the existing tests and the new ones now pass.

@Aklakan Aklakan force-pushed the 20250829_fix_tdb2-decimal branch from 890c10a to caa44c1 Compare August 30, 2025 03:10
}

/** Return a canonical decimal with a trailing ".0". */
public static BigDecimal canonicalDecimalWithDot(BigDecimal decimal) {
Member

How does this differ from canonicalDecimalStrWithDot?

Contributor Author

@Aklakan Aklakan Aug 30, 2025

This new method is used on input, after parsing the lexical form into a decimal and just before inlining. The Str version is used on output, after converting the decimal to a string. I think the Str version could be removed in favor of BigDecimal.toPlainString(), since the binary representation is now canonicalized on input.
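
To illustrate the input-side step described here, a minimal sketch of canonicalizing a BigDecimal so that it always carries at least one fractional digit. This is not the code from the PR: the class name CanonicalDecimalSketch and the method canonicalWithDot are hypothetical, and only java.math.BigDecimal calls are used.

```java
import java.math.BigDecimal;

// Hypothetical helper for illustration only, not the PR's implementation.
class CanonicalDecimalSketch {

    /** Strip redundant trailing zeros, then ensure at least one fractional digit. */
    static BigDecimal canonicalWithDot(BigDecimal decimal) {
        BigDecimal stripped = decimal.stripTrailingZeros();
        // Whole numbers end up with scale <= 0 (e.g. 1500 becomes 1.5E+3);
        // raising the scale to 1 appends ".0" without changing the value.
        return stripped.scale() < 1 ? stripped.setScale(1) : stripped;
    }

    public static void main(String[] args) {
        for (String lex : new String[] {"18", "18.0", "18.00", "+1500", "0.50", "0"}) {
            BigDecimal canonical = canonicalWithDot(new BigDecimal(lex));
            System.out.println(lex + " -> " + canonical.toPlainString());
        }
    }
}
```

Under this sketch, "18", "18.0" and "18.00" all end up as the same BigDecimal (unscaled value 180, scale 1), so whatever binary form is derived from it would be identical for all three inputs.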

Contributor Author

Aklakan commented Aug 30, 2025

Inserts/Deletes/Lookups of quads with non-canonical decimals now work reliably 🥳

As a side note, with the PR applied, our (streaming) sameAs reasoner now also works as expected - no more exceptions about lookups failing to detect the lexicographically least physical quad in the store upon which to enrich the graph.find stream with all the sameAs inferences. There used to be another similar issue with non-canonical lang tags such as "foo"^^us-EN not matching in their retrieved form, but this has already been solved - most likely as part of the RDF1.2 work. So at least for now there are no more issues here for me.
We created this reasoner plugin because the performance of the OWL reasoner and the rule engine was too poor, and this approach lets one selectively enable reasoning for specific parts of a query.

SameAs inferencing using SERVICE <sameAs:> demo: https://api.triplydb.com/s/OI_O18TlJ

@Aklakan Aklakan changed the title GH-3404: Restore lexical forms of TDB2 NodeId-decimals as plain strings. GH-3404: Canonicalize NodeId-decimals during inlining for TDB2. Aug 30, 2025
@Aklakan Aklakan changed the title GH-3404: Canonicalize NodeId-decimals during inlining for TDB2. GH-3404: Canonicalize decimals during inlining for TDB2. Aug 30, 2025
{ test("18.0", NodeFactory.createLiteralDT("18.0", XSDDatatype.XSDdecimal)); }

@Test public void nodeId_decimal_20()
{ testNodeIdRoundtrip(NodeValue.makeDecimal("18").asNode()); }
Member

@afs afs Aug 30, 2025

Better to use NodeFactory.createLiteralDT, possibly via a new testNodeIdRoundtripDecimal(String).

NodeValue.makeDecimal may have other effects; in fact, adding it into the round-trip steps might be a good thing.

Contributor Author

@Aklakan Aklakan Aug 30, 2025

The test cases are now updated to use testNodeIdRoundtripDecimal(String).

@afs afs self-assigned this Aug 30, 2025
Member

afs commented Aug 30, 2025

also works as expected

Is this after reloading the data? Or with an old, existing TDB2 database?

"foo"^^us-EN

Is that a typo for "foo"@en-US (which is canonical)? Jena keeps language tags in RFC case format, not lower case.

Contributor Author

Aklakan commented Aug 30, 2025

Is this after reloading the data?

I needed to reload the offending graphs (I used 5.6.0-SNAPSHOT) - the TDB2 version used to load the data initially was around 4.8.0 or 4.9.0.

Is that a typo for "foo"@en-us

Right, my issue was not about the order of the components - but the use of lower case only. I was able to reproduce my issue with an older version of Jena:

./tdb2.tdbupdate --loc /tmp/tdb 'INSERT DATA { <urn:s> <urn:p> "foo"@en-gb }'
./tdb2.tdbquery --loc /tmp/tdb 'SELECT * { { BIND(1 AS ?id) <urn:s> <urn:p> "foo"@en-gb } UNION { BIND(2 AS ?id) <urn:s> <urn:p> "foo"@en-GB } }'

On my local 4.7.0 the query returns {(?id=1)} whereas on 5.6.0-SNAPSHOT I get {(?id=1), (?id=2)}.
Also, SELECT ?o { ?s ?p ?o } returns "foo"@en-gb on 4.7.0 and "foo"@en-GB on 5.6.0-SNAPSHOT - but the upper-case language tag would not match on the legacy tdb2 version.

@Aklakan Aklakan force-pushed the 20250829_fix_tdb2-decimal branch 2 times, most recently from aa4dfbb to 44b7e22 Compare August 30, 2025 19:26
Member

afs commented Aug 31, 2025

Jena 5.4.0 introduced jena-langtag.

The RDF 1.1 spec says "language tags MAY be converted to lower case. The value space of language tags is always in lower case." This leads to differences across systems (when are we in "value space" for a language tag?) but also differences between Jena dataset implementations.

Now RDF 1.2 Concepts says "MUST be treated accordingly, that is, in a case-insensitive manner."

Jena went with the algorithm in RFC 5646 section 2.1.1, without registry access, based on some previous user feedback. Language tags are now parsed or created as case-normalized, following the RFC 5646 form (language is lowercase, region is uppercase), and then compared. The original language tag is not preserved.

The WG did a survey of systems: it found systems providing lower case and systems providing the algorithm in RFC 5646 section 2.1.1. Jena went with the algorithm in RFC 5646 based on some previous user feedback.
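
As a rough illustration of the RFC 5646 section 2.1.1 case convention mentioned above, here is a minimal sketch. It is not the jena-langtag implementation: the class LangTagCaseSketch and its format method are hypothetical, there is no validation or registry lookup, and it assumes that every subtag following the first singleton stays lowercase, which is one reading of the RFC.

```java
import java.util.Locale;

// Hypothetical illustration of RFC 5646 section 2.1.1 case formatting; no validation.
class LangTagCaseSketch {

    static String format(String tag) {
        String[] subtags = tag.split("-");
        StringBuilder out = new StringBuilder();
        boolean afterSingleton = false;
        for (int i = 0; i < subtags.length; i++) {
            // Default: everything is lowercase.
            String s = subtags[i].toLowerCase(Locale.ROOT);
            // Exceptions apply only to subtags that are not first and not after a singleton.
            if (i > 0 && !afterSingleton) {
                if (s.length() == 2) {
                    s = s.toUpperCase(Locale.ROOT);                          // region, e.g. GB
                } else if (s.length() == 4) {
                    s = Character.toUpperCase(s.charAt(0)) + s.substring(1); // script, e.g. Latn
                }
            }
            if (s.length() == 1) {
                afterSingleton = true; // assumption: applies to all later subtags
            }
            if (i > 0) {
                out.append('-');
            }
            out.append(s);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("en-gb"));      // en-GB
        System.out.println(format("EN-LATN-us")); // en-Latn-US
        System.out.println(format("en-ca-x-ca")); // en-CA-x-ca
    }
}
```

With such a normalization applied both when storing and when matching, "foo"@en-gb and "foo"@en-GB would be treated as the same term, which fits the 5.6.0-SNAPSHOT behavior reported earlier in this thread.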

There is an entry in the RDF 1.2 change log.

For initial text direction, the strings are ltr or rtl and lower case is required.

CHANGES for jena 5.4.0:

== New artifact

Jena 5.4.0 introduces a new module jena-langtag for language tag parsing
in compliance with RFC 5646. Language tag validation is strengthened.
When parsing, language tag violations are still treated as warnings.

@Aklakan Aklakan force-pushed the 20250829_fix_tdb2-decimal branch from 44b7e22 to 1376cb7 Compare August 31, 2025 12:14
Contributor Author

Aklakan commented Aug 31, 2025

I just updated the commit message for the PR - other than that I think it's done(?)

Member

afs commented Sep 2, 2025

I just updated the commit message for the PR - other than that I think it's done(?)

From your side - yes.

I want to investigate the consequences, such as the range, which seems to have changed, e.g. for 15000000000000 (15e12) and, conversely, for very small numbers.

Contributor Author

Aklakan commented Sep 2, 2025

Just my two cents: The exponential notation of decimals (15e12) in BigDecimal seems to be a mere feature of certain "toString" functions. BigDecimal.toPlainString does not generate exponent fields.

https://www.w3.org/TR/xmlschema-2/#decimal-lexical-representation

3.2.3.1 Lexical representation
decimal has a lexical representation consisting of a finite-length sequence of decimal digits (#x30-#x39) separated by a period as a decimal indicator. An optional leading sign is allowed. If the sign is omitted, "+" is assumed. Leading and trailing zeroes are optional. If the fractional part is zero, the period and following zero(es) can be omitted. For example: -1.23, 12678967.543233, +100000.00, 210.

3.2.3.2 Canonical representation
The canonical representation for decimal is defined by prohibiting certain options from the Lexical representation (§3.2.3.1). Specifically, the preceding optional "+" sign is prohibited. The decimal point is required. Leading and trailing zeroes are prohibited subject to the following: there must be at least one digit to the right and to the left of the decimal point which may be a zero.

https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/math/BigDecimal.html#toPlainString()

public String toPlainString()
Returns a string representation of this BigDecimal without an exponent field. For values with a positive scale, the number of digits to the right of the decimal point is used to indicate scale. For values with a zero or negative scale, the resulting string is generated as if the value were converted to a numerically equal value with zero scale and as if all the trailing zeros of the zero scale value were present in the result. The entire string is prefixed by a minus sign character '-' ('\u002D') if the unscaled value is less than zero. No sign character is prefixed if the unscaled value is zero or positive.
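
The difference between the two string conversions can be checked with a small sketch. The class PlainStringSketch and its render helper are hypothetical names used only for this illustration; the BigDecimal calls are standard JDK.

```java
import java.math.BigDecimal;

// Hypothetical demo class comparing BigDecimal.toString() with toPlainString().
class PlainStringSketch {

    /** Returns { toString(), toPlainString() } for the given value. */
    static String[] render(BigDecimal d) {
        return new String[] { d.toString(), d.toPlainString() };
    }

    public static void main(String[] args) {
        // stripTrailingZeros() leaves a negative scale, which toString() shows as an exponent.
        BigDecimal large = new BigDecimal("15000000000000").stripTrailingZeros();
        BigDecimal small = new BigDecimal("0.0000001");

        System.out.println(render(large)[0]); // 1.5E+13
        System.out.println(render(large)[1]); // 15000000000000
        System.out.println(render(small)[0]); // 1E-7
        System.out.println(render(small)[1]); // 0.0000001
    }
}
```

In both cases the exponent appears only in toString(); the stored value is the same, and toPlainString() yields a form without an exponent field.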

As a side note, it may be desirable for TDB2 to expose non-canonical decimals directly with toPlainString instead of canonicalDecimalStrWithDot. The implication is that legacy stored non-canonical decimals would be exposed without the trailing .0, i.e. a change of existing behavior. However, legacy data could then be upgraded with this SPARQL statement (at least in theory):
DELETE { ?s ?p ?o } INSERT { ?s ?p ?o } WHERE { ?s ?p ?o FILTER(datatype(?o) = <http://www.w3.org/2001/XMLSchema#decimal>) }

Without the change the legacy non-canonical decimals will be exposed as canonical ones and thus a delete will never match, thus eventually necessitating a data reload.

Member

afs commented Sep 2, 2025

The 15e12 was just to make it easier to read. It's an illegal lexical form for xsd:decimal.
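
A short sketch of that distinction: Java's BigDecimal constructor accepts exponent notation, while the xsd:decimal lexical space does not. The class XsdDecimalLexicalSketch and its regular expression are assumptions for illustration, written from the lexical rules quoted earlier in this thread; this is not Jena's validation code.

```java
import java.math.BigDecimal;
import java.util.regex.Pattern;

// Hypothetical lexical check for xsd:decimal; not Jena's validator.
class XsdDecimalLexicalSketch {

    // Optional sign, then digits with an optional fraction, or a bare fraction. No exponent.
    private static final Pattern XSD_DECIMAL =
            Pattern.compile("[+-]?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)");

    static boolean isXsdDecimalLexical(String s) {
        return XSD_DECIMAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isXsdDecimalLexical("15000000000000")); // true
        System.out.println(isXsdDecimalLexical("+100000.00"));     // true
        System.out.println(isXsdDecimalLexical("15e12"));          // false

        // BigDecimal itself is more lenient and parses the exponent form.
        System.out.println(new BigDecimal("15e12").toPlainString()); // 15000000000000
    }
}
```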

Member

afs commented Sep 2, 2025

The implication is that legacy stored non-canonical decimals would be exposed without the trailing .0

BGP matching is by comparing the 64-bit binary NodeId, so there is a conversion on the way in from the query as well as on the way out.
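
To make the consequence concrete, here is a toy sketch of why matching on a binary id needs a canonical input. The bit layout below (scale in the high bits, unscaled value in the low 48 bits) is an assumption made up for this example and is not TDB2's actual NodeId encoding; InlineDecimalSketch, pack and canonical are hypothetical names.

```java
import java.math.BigDecimal;

// Toy inline encoding for illustration only; NOT TDB2's real NodeId layout.
class InlineDecimalSketch {

    /** Pack scale (0..255) and a signed 48-bit unscaled value into one long. */
    static long pack(BigDecimal d) {
        long value = d.unscaledValue().longValueExact();
        if (d.scale() < 0 || d.scale() > 255 || value < -(1L << 47) || value >= (1L << 47)) {
            throw new IllegalArgumentException("Does not fit the toy layout: " + d);
        }
        return ((long) d.scale() << 48) | (value & 0xFFFF_FFFF_FFFFL);
    }

    /** Same idea as the canonicalization discussed in this PR: one fractional digit minimum. */
    static BigDecimal canonical(BigDecimal d) {
        BigDecimal s = d.stripTrailingZeros();
        return s.scale() < 1 ? s.setScale(1) : s;
    }

    public static void main(String[] args) {
        BigDecimal a = new BigDecimal("18");
        BigDecimal b = new BigDecimal("18.0");

        // Same numeric value, different (scale, unscaled value) pairs, so different ids.
        System.out.println(pack(a) == pack(b));                       // false
        // After canonicalization both inputs produce the same id.
        System.out.println(pack(canonical(a)) == pack(canonical(b))); // true
    }
}
```

In a layout like this, a lookup for "18" would miss a stored "18.0" unless both pass through the same canonicalization before being packed.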

The canonicalDecimalStrWithDot works with Turtle short form.

Jena v6 is, hopefully, the release after next. That would be a good time to make the change systematically and have a data reload or a conversion SPARQL update.

The final jena 5 release could have "current" and "improved" (for you, mainly!) with "improved" needing to be explicitly enabled. (The user is responsible for not mixing them.)

Contributor Author

Aklakan commented Sep 2, 2025

I don't have need for separate modes as I have upgraded my data to the canonical form already. I was just thinking about how other users could fix their data in case they run into this issue. But perhaps just recommending a reload with jena 6 is the easiest way.

Member

afs commented Sep 8, 2025

OK - I'll put this on a branch in the Jena GH repo.

It can stay on the branch until jena6. It's not an area that is likely to cause conflicts up to jena6.

(I have some code tidy to apply as well)

Member

afs commented Sep 8, 2025

Pulled and rebased to the current main on my working repo (a safer place than apache/jena ... just in case of a misstep).

https://github.com/afs/jena/pull/new/decimal56

Continued on PR #3428

@afs afs added the Jena6 Changes relating to Jena6 label Sep 8, 2025
Member

afs commented Sep 8, 2025

Thanks!

Closing this and continuing on #3428.

@afs afs closed this Sep 8, 2025
Labels

Jena6 Changes relating to Jena6 TDB
