XML namespaces for XPath #4

Closed
DylanVanAssche opened this issue Oct 11, 2021 · 24 comments
Labels
proposal Proposal available for fixing this issue in the spec representation rml

@DylanVanAssche
Collaborator

XPath allows the use of XML namespaces when selecting parts of an XML document.
However, (most) implementations require these namespaces to be registered before running an XPath query.
RML currently does not specify how this should happen:

  • In the mapping rules?
  • By the implementation, with a CLI parameter or dynamically by parsing the XML document first to find any namespaces?
  • ...
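
To make the registration requirement concrete, here is a minimal sketch using Python's standard-library ElementTree, which supports only a subset of XPath. The sample document and its title are made up for illustration; only the namespace name is taken from this thread. Other XPath engines may fail differently, or silently return nothing, when a prefix is unregistered.

```python
import xml.etree.ElementTree as ET

# Made-up sample document reusing the namespace name from this thread.
XML = (
    '<ex:bookstore xmlns:ex="http://www.example.com/books/1.0/">'
    "<ex:book><ex:title>Sample title</ex:title></ex:book>"
    "</ex:bookstore>"
)
root = ET.fromstring(XML)

# Without a prefix map, ElementTree cannot resolve "ex:" and raises SyntaxError.
try:
    root.findall("ex:book/ex:title")
    failed_without_mapping = False
except SyntaxError:
    failed_without_mapping = True

# With the prefix registered, the same expression resolves.
namespaces = {"ex": "http://www.example.com/books/1.0/"}
titles = [t.text for t in root.findall("ex:book/ex:title", namespaces)]
```

The prefix map passed to findall is the information a mapping would have to supply, in whatever form the spec ends up choosing.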

CARML has an extension for this: https://github.com/carml/carml#xml-namespace-extension
and it has already come up a few times in the past without a clear solution:

@DylanVanAssche DylanVanAssche changed the title XML namespaces XML namespaces for XPath Oct 11, 2021
@dachafra
Member

@DylanVanAssche is this more a challenge or a "best practice" than a pure problem with the RML spec? Shall we transfer the issue?

@DylanVanAssche
Collaborator Author

@dachafra For me, it is a spec thing because it is related to the rml:iterator. Maybe a Literal is insufficient here?

@dachafra
Member

dachafra commented Mar 14, 2022

@DylanVanAssche So... given the proposal from CARML as well, it is more related to the Logical Source, right? Do we transfer it to that spec?

@DylanVanAssche
Collaborator Author

True! Fine for transferring it!

@DylanVanAssche DylanVanAssche transferred this issue from kg-construct/rml-core Mar 14, 2022
@DylanVanAssche
Collaborator Author

@pmaria I like the CARML approach for this issue:

rml:logicalSource [
    rml:source [
      a carml:Stream ;
      # or in case of a file source use:
      # carml:url "path-to-source" ;
      carml:declaresNamespace [
        carml:namespacePrefix "ex" ;
        carml:namespaceName "http://www.example.com/books/1.0/" ;
      ] ;
    ] ;
    rml:referenceFormulation ql:XPath ;
    rml:iterator "/ex:bookstore/*" ;
  ] ;

What do you think of using this?

rml:logicalSource [
  rml:source [
    # Any kind of source
  ] ;
  rml:iterator [ a ql:XPathIterator, rml:Iterator;
    rml:namespaceName "http://www.example.com/books/1.0/" ;
    rml:namespacePrefix "ex" ;
    rml:value "/ex:bookstore/*";
  ];
]

Changes:

  • Make the iterator an object instead of literal, drop rml:referenceFormulation
  • Move namespaces to the iterator, especially the XPath iterator
  • For JSON, CSV, etc. we would have the same, just not the namespace stuff
  • If, in the future, a reference formulation X appears that is totally different and also needs something like namespaces, we can support it.

@DylanVanAssche DylanVanAssche added proposal Proposal available for fixing this issue in the spec rml representation labels Apr 1, 2022
@pmaria
Contributor

pmaria commented Apr 1, 2022

Hmm I'm not sure the iterator is the most natural place to define the namespaces. Since you also want to be able to use these namespaces in non-iterator expressions.

@DylanVanAssche
Collaborator Author

@pmaria

Hmm I'm not sure the iterator is the most natural place to define the namespaces. Since you also want to be able to use these namespaces in non-iterator expressions.

When you use rml:reference, rr:column, rr:template, etc. you take the rml:iterator value, append the value of one of these references to retrieve what you need in a Triples Map.
That's why I found it a better fit there: it is specific to the reference formulation & iterator.
rml:source is only for defining how a source should be accessed such as location. Because of that, I would keep the namespace declaration away from that since those namespaces are only used for executing the iterator & references during the data processing after the data was retrieved from the source.

@pmaria
Contributor

pmaria commented Apr 5, 2022

When you use rml:reference, rr:column, rr:template, etc. you take the rml:iterator value, append the value of one of these references to retrieve what you need in a Triples Map.

Ah I don't see it that way necessarily. I see the rml:iterator, rml:reference, rr:template conceptually operating within the same scope/context. Wherein indeed, the iterator creates an iteration of sub documents on which the other expressions are evaluated. But I see the iterator as just another expression.

But I agree that source might not be the best place for the NS definition, because it is essentially a query concern, and the namespaces don't need to match the namespaces used in a source document.
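
A small sketch of why this is a query concern, again with Python's standard-library ElementTree and a made-up document: the source uses a default namespace and therefore contains no prefix at all, yet the query side can choose any prefix as long as the namespace name is the same string.

```python
import xml.etree.ElementTree as ET

# The document uses a default namespace, so no prefix appears in it at all.
XML = (
    '<bookstore xmlns="http://www.example.com/books/1.0/">'
    "<book/><book/>"
    "</bookstore>"
)
root = ET.fromstring(XML)

# The query side is free to pick any prefix, as long as the namespace name matches.
with_ex = root.findall("ex:book", {"ex": "http://www.example.com/books/1.0/"})
with_b = root.findall("b:book", {"b": "http://www.example.com/books/1.0/"})

# An unprefixed step does not match, because the elements are namespaced.
unprefixed = root.findall("book")
```

Both prefixed queries select the same two elements, while the unprefixed one selects none.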

Maybe it makes more sense then to add a new object to the logical source, next to the iterator? Similar to your idea, but keeping the iterator as is, i.e. as just another expression.

Something like rml:ExpressionContext.

rml:logicalSource [
  rml:source [
    # Any kind of source
  ] ;
  rml:iterator "/ex:bookstore/*" ;
  rml:expressionContext [ a rml:XPathExpressionContext;
    rml:namespace [
       rml:namespaceName "http://www.example.com/books/1.0/" ;
       rml:namespacePrefix "ex" ;
    ];
  ] ;
  rml:referenceFormulation ql:XPath;
]

We could possibly combine it with the reference formulation? The rationale would be that this defines how to interpret the expressions that are based on a logical source.

@pmaria
Contributor

pmaria commented Apr 5, 2022

So combining it with reference formulations could look like

rml:logicalSource [
  rml:source [
    # Any kind of source
  ] ;
  rml:iterator "/ex:bookstore/*" ;
  rml:referenceFormulation [ a ql:XPathReferenceFormulation;
    ql:namespace [
       ql:namespaceName "http://www.example.com/books/1.0/" ;
       ql:namespacePrefix "ex" ;
    ] ;
  ] ;
]

This would be a custom specified XPath reference formulation, next to the "default" ql:XPath.

@DylanVanAssche
Collaborator Author

Ah I don't see it that way necessarily. I see the rml:iterator, rml:reference, rr:template conceptually operating within the same scope/context. Wherein indeed, the iterator creates an iteration of sub documents on which the other expressions are evaluated. But I see the iterator as just another expression.

Ah depends on how you implement the spec :) Some implementations do not create subdocuments.
However, I agree with you :)

But I agree that source might not be the best place for the NS definition, because it is essentially a query concern, and the namespaces don't need to match the namespaces used in a source document.

Yes! I try to separate the concerns as much as possible so it is also reusable in the future.

rml:referenceFormulation definition:

The reference formulation (rml:referenceFormulation) defines the reference formulation used to refer to the elements of the data source. The reference formulation must be specified in the case of databases and XML and JSON data sources. By default SQL2008 for databases, as SQL2008 is the default for R2RML, XPath for XML and JSONPath for JSON data sources.

According to the definition, the last suggestion looks better to me.
Are we aware of something similar for other reference formulations?

This would be a custom specified XPath reference formulation, next to the "default" ql:XPath.

Ideally, we don't even need that and have 1 IRI for both (with and without namespaces), but I'm not sure how to achieve that in RDF? Properties can be optional, but if you have none, it becomes something weird like this:

rml:referenceFormulation [ a ql:XPathReferenceFormulation; ] ;

We could 'solve' this by having shortcuts:

rml:referenceFormulation ql:XPath;

This shortcut points to [ a ql:XPathReferenceFormulation; ].
I think this is what you meant above with the "default"?

@pmaria
Contributor

pmaria commented Apr 5, 2022

We could 'solve' this by having shortcuts:

rml:referenceFormulation ql:XPath;

Yes. I see that rml:ReferenceFormulation is already defined in the RML ontology.

rml:referenceFormulation rdfs:range rml:ReferenceFormulation .

rml:ReferenceFormulation rdf:type owl:Class ;
    rdfs:label   "Reference Formulation" ;
    rdfs:comment "Represents a Reference Formulation."@en .

And also defined is

ql:XPath rdf:type owl:NamedIndividual, rml:ReferenceFormulation  ;
    rdfs:label   "XPath" ; 
    rdfs:comment "Denotes the XPath reference formulation, used for referring to extracts of XML sources."@en ;
    ql:specification <http://www.w3.org/TR/xpath20/> ;
    rml:version "2.0".

So essentially the "shortcut" is just using the named individual.

Now all we would have to do is introduce a subclass of rml:ReferenceFormulation, rml:XPathReferenceFormulation, and define that further, adding namespace properties.

I don't think we should introduce a new named individual for XPath with namespaces. This would limit the namespaces you could define, since the individual's scope would be global. And you might want to define different namespaces per logical source.

@DylanVanAssche
Collaborator Author

@pmaria Alright! I agree, let's set up our battle plan then for this issue:

  1. Introduce rml:XPathReferenceFormulation
  2. Define ql:namespaceName and ql:namespacePrefix in there

Problem solved then?

@pmaria
Contributor

pmaria commented Apr 5, 2022

Yes I think so 🎉

Not forgetting ql:namespace to specify one or more ql:Namespaces

@DylanVanAssche DylanVanAssche added this to the v0.1 milestone Apr 5, 2022
@chrdebru
Contributor

Why put namespace URIs in literals rather than using resources?

@DylanVanAssche
Collaborator Author

@chrdebru

Why put namespace URIs in literals rather than using resources?

Spec: https://www.w3.org/TR/xml-names/

[URI references identifying namespaces are compared when determining whether a name belongs to a given namespace, and whether two names belong to the same namespace. Definition: The two URIs are treated as strings, and they are identical if and only if the strings are identical, that is, if they are the same sequence of characters. ] The comparison is case-sensitive, and no %-escaping is done or undone.

AFAIK, XML Namespaces are not like Linked Data and are compared through a string-based comparison without any resolving.
That's why they are a Literal here, but any insights are welcome!
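
The string-based comparison can be observed directly. In this hedged sketch with Python's standard-library ElementTree, the two variant namespace names are made up: after URI normalization they would typically be treated as equivalent to the declared one, but as character sequences they differ, so nothing matches.

```python
import xml.etree.ElementTree as ET

XML = '<ex:bookstore xmlns:ex="http://www.example.com/books/1.0/"><ex:book/></ex:bookstore>'
root = ET.fromstring(XML)

exact = {"ex": "http://www.example.com/books/1.0/"}
# Variants that differ only in case or %-escaping: different character sequences.
upper_host = {"ex": "http://WWW.EXAMPLE.COM/books/1.0/"}
escaped = {"ex": "http://www.example.com/books/1%2E0/"}

matches_exact = len(root.findall("ex:book", exact))
matches_upper = len(root.findall("ex:book", upper_host))
matches_escaped = len(root.findall("ex:book", escaped))
```

Only the exact string selects the element; no normalization or resolving takes place.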

@chrdebru
Contributor

Yes, but they can also be regarded as named resources that can be described (no matter whether they dereference and resolve). Having those as resources would facilitate writing SPARQL queries and inverse property paths, for instance. Just a thought, not questioning the proposal.

I would suggest renaming ql:namespaceName to namespaceIRI. Some namespaces have titles and a namespace contains names. Turtle mentions this: "The '@prefix' or 'PREFIX' directive associates a prefix label with an IRI".

@DylanVanAssche
Collaborator Author

Yes, but they can also be regarded as named resources that can be described (no matter whether they dereference and resolve). Having those as resources would facilitate writing SPARQL queries and inverse property paths, for instance. Just a thought, not questioning the proposal.

I don't have much experience in that regard, so if it helps, I don't mind :)
For me, it doesn't really matter as long as we have a mapping prefix <-> IRI

I would suggest renaming ql:namespaceName to namespaceIRI. Some namespaces have titles and a namespace contains names. Turtle mentions this: "The '@prefix' or 'PREFIX' directive associates a prefix label with an IRI".

Hmmm true, twice 'name' might be a bit weird :)
@pmaria Do you agree on this?

@pmaria
Contributor

pmaria commented Apr 12, 2022

Namespace name is what the spec calls it https://www.w3.org/TR/xml-names/#dt-NSName, so I would stick to that.

As far as I can tell we can't simply use IRIs, because XML expects URIs.

The main use case is to register the namespaces with an XPath engine for querying. Most implementations I've seen represent the namespace name as a string.

My feeling is that keeping it a string would be the more natural mapping to implementations, but if the arguments for using an IRI are strong I can live with that. We would however have to specify what happens when an IRI that is not a URI is used.
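
To illustrate the open question, a rough sketch with a made-up IRI containing a non-ASCII character. Here urllib.parse.quote is only a stand-in for a proper IRI-to-URI mapping and handles just the path characters in this example. The point is that the two forms are different strings, so an engine that compares namespace names as strings would treat them as different namespaces.

```python
from urllib.parse import quote

# Made-up IRI with a non-ASCII character in the path (assumption for illustration).
iri = "http://www.example.com/b\u00fccher/1.0/"

# Percent-encode the UTF-8 bytes of the non-ASCII character; ":" and "/" are kept.
uri = quote(iri, safe=":/")

same_namespace_name = iri == uri
```

Whichever form a mapping declares, it would have to be character-for-character the one used in the XML source.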

@andimou

andimou commented Apr 23, 2022

I don't disagree with @chrdebru, but we can just as well keep it as ql:namespace. Whether it is an IRI/URI or a Literal can be determined based on the range; we don't need to include it in the name of the property.

Then again, if we include the restrictions in SHACL shapes, then we can decide on shape level whether it's a string or an IRI. There we can even provide 2 alternatives with 2 different explanations.

@andimou

andimou commented Apr 23, 2022

Another thought, I debate myself. Newer libraries might read the namespaces from the file, would we still want to give the option to define the namespaces?

@pmaria
Contributor

pmaria commented Apr 23, 2022

Another thought, I debate myself. Newer libraries might read the namespaces from the file, would we still want to give the option to define the namespaces?

In my experience this is not that trivial, especially in non-DOM based approaches, e.g. a streaming implementation. Namespaces can be defined inline in a document, so in theory a new namespace can be declared and used at the end of a document.

I have a strong preference to be able to declare this in the mapping. Tools can always also by default provide namespace detection as a service if it fits their architecture.
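
The streaming concern can be sketched with Python's standard-library iterparse, which reports namespace declarations as "start-ns" events while it reads. The document is made up: its "late" namespace is declared only on the last child, so a streaming processor has already passed earlier elements before it learns about it.

```python
import io
import xml.etree.ElementTree as ET

# Made-up document: the "late" namespace is only declared on the last child.
XML = (
    b"<root>"
    b"<a/>"
    b'<late:b xmlns:late="http://www.example.com/late/"/>'
    b"</root>"
)

events = []
for event, payload in ET.iterparse(io.BytesIO(XML), events=("start", "start-ns")):
    if event == "start-ns":
        # For namespace events the payload is a (prefix, uri) tuple.
        prefix, uri = payload
        events.append(("ns", prefix, uri))
    else:
        # For element events the payload is the element itself.
        events.append(("element", payload.tag))
```

The namespace event arrives only after the elements preceding the declaration, so complete detection means reading the whole input before any query can be set up.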

@DylanVanAssche
Collaborator Author

I agree with @pmaria, extracting the XML namespaces is non-trivial and may require consuming all the XML first before any mapping can take place.

Then again, if we include the restrictions in SHACL shapes, then we can decide on shape level whether it's a string or an IRI. There we can even provide 2 alternatives with 2 different explanations.

SHACL can have an OR statement, but maybe to keep things straightforward we should have either a string or IRI, but not both?

@chrdebru
Contributor

@pmaria if they call them namespace names, then OK!

@DylanVanAssche XML namespaces are declared in attributes (strings) in XML. So maybe that definition comes from their technical constraints. The advantage of IRIs is that "sameness" is implied when reused, whereas now you have to explicitly state that two namespace objects (if you can call them that) are the same, or you infer it by comparing strings. So IRIs may help us in cases where we have different prefixes for the same namespace (e.g., combining mappings).
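
For the literal-based variant, inferring sameness by comparing strings could look roughly like this. The two prefix maps are hypothetical declarations collected from two mappings being combined. They are plain dictionaries here, not an RML API.

```python
from collections import defaultdict

# Hypothetical declarations collected from two mappings being combined.
mapping_a = {"ex": "http://www.example.com/books/1.0/"}
mapping_b = {
    "books": "http://www.example.com/books/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Group prefixes by namespace name, using plain string equality as the key.
prefixes_by_namespace = defaultdict(set)
for declarations in (mapping_a, mapping_b):
    for prefix, namespace_name in declarations.items():
        prefixes_by_namespace[namespace_name].add(prefix)

# Namespace names that are reachable through more than one prefix.
shared = {ns: sorted(p) for ns, p in prefixes_by_namespace.items() if len(p) > 1}
```

With IRIs as resources, the same grouping would follow from the RDF graph itself rather than from an extra comparison step.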

@DylanVanAssche
Collaborator Author

@chrdebru I don't have a specific preference, except that I prefer either strings or IRIs, just not both ;)


5 participants