Align filter field expression with PICA Path #579

Open
1 of 3 tasks
nichtich opened this issue Jan 26, 2023 · 9 comments
Labels
A-filter Area: The filter command C-documentation Category: documentation

@nichtich
Contributor

nichtich commented Jan 26, 2023

While working on #458 I realized that the filter command does not yet fully conform to PICA Path. What's missing:

  • patterns in tags
  • default occurrence for level 2
  • xtags (this is only relevant to level 2 fields so it could be postponed beyond version 1.0)

To support patterns in tags, a tag should be allowed to be given as any of [012.] [0-9.] [0-9.] [A-Z@.], for instance:

pica filter 044.    # get records with fields starting with 044
pica filter 1.../*  # get records with level 1 fields
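
The tag pattern rule above can be sketched in a few lines of Python. This is only an illustration of the proposed semantics, assuming that "." stands for exactly one arbitrary character; `tag_matches` is a hypothetical helper and not part of pica-rs.

```python
import re

# Assumption: a tag pattern has exactly four positions, each either a
# literal from the allowed class or "." as a single-character wildcard.
TAG_PATTERN = re.compile(r"[012.][0-9.][0-9.][A-Z@.]")


def tag_matches(pattern: str, tag: str) -> bool:
    """Return True if a four-character field tag matches the tag pattern."""
    if not TAG_PATTERN.fullmatch(pattern):
        raise ValueError(f"invalid tag pattern: {pattern!r}")
    if len(tag) != 4:
        return False
    # Compare position by position; "." in the pattern accepts anything.
    return all(p == "." or p == t for p, t in zip(pattern, tag))
```

With this, `tag_matches("044.", "044H")` and `tag_matches("1...", "101@")` are true, while `tag_matches("044.", "045R")` is false.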

Default occurrence for field expressions of level 2 should be /* because on level 2 the occurrence has a different role (counter). So these two should be identical:

pica filter 209C   # equals /*
pica filter 209C/*

Note that these are different on purpose:

pica filter 045R  # equals /00
pica filter 045R/*
@nwagner84
Member

nwagner84 commented Jan 27, 2023

I guess the terms path and filter (matcher) are used in different ways. In pica-rs a path expression always points to subfield values (e.g. 044[HK]{b == 'GND', 9} or 041A/*{7 =^ 'Ts' && 9?, 9}) and is used in the frequency command (or select). The filter uses predicates, which must evaluate to true or false for a given record. To get a boolean from a subfield value, a comparison operation must be performed (==, !=, ...). There is also a unary exists operator ?, which can be used at field or subfield level:

$ pica filter "044.?" DUMP.dat | pica count
$ pica filter "044[HK]?" DUMP.dat | pica count
$ pica filter "1.../*?" DUMP.dat | pica count
$ pica filter "1.../*.[3-8]?" DUMP.dat | pica count

To get your first examples to work, just add the ? operator.

The second part is not implemented. If there is a difference for occurrences on level two, this must be implemented.

I think the second part is not a problem for pica-rs. First, in pica-rs there is no /* occurrence; a field either contains an occurrence found in the data, or not. In a filter expression this occurrence can be matched against an occurrence matcher. Among others, there is an any (/*) variant, which matches (returns true) in any case (occurrence or no occurrence). It is then up to the user to express the desired filter criterion and to use the any variant for occurrences of level 2 fields if the occurrence doesn't matter.

(NB: There is only one little exception: an occurrence matcher of /00 matches both the None and the Some(/00) variants.)
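
To make the described matcher behaviour concrete, here is a minimal sketch in Python rather than Rust. It only models what is stated in this comment (an any variant, exact matching, and the /00 exception); the function name and the use of plain strings are assumptions made for illustration, not the actual pica-rs types.

```python
from typing import Optional


def occurrence_matches(matcher: str, occurrence: Optional[str]) -> bool:
    """Model of the occurrence matcher as described in this comment.

    `matcher` is "*" (any) or an occurrence such as "01";
    `occurrence` is None when the field carries no occurrence.
    """
    if matcher == "*":
        # The any variant matches with or without an occurrence.
        return True
    if matcher == "00":
        # The one exception: /00 also matches a field without occurrence.
        return occurrence is None or occurrence == "00"
    return occurrence == matcher
```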

@nichtich
Contributor Author

I re-evaluated the current implementation when writing #588: the use of patterns in tags is already supported as specified in PICA Path (by the way, I might rename it to "PICA Path Expression" if this helps). The following remains:

  • Occurrences as single digit (e.g. 045R/1 should match 045R/01 and 045R/0-2 should match 045R, 045R/01 and 045R/02)
  • Default occurrence /* when field tag starts with 2 (e.g. 209C is equivalent to 209C/*)
  • xtags (see examples below) e.g. 209Ax01 is 209A{x=='1'} is 209A/*{x=='1'} and 209Ax0-1 is 209A{x in ['0','1']} is 209A/*{x in ['0','1']} . The sequence-syntax is not included in current PICA Path specification but used in K10plus.
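
The first item could be implemented by comparing occurrences numerically. The following Python sketch assumes that a missing occurrence counts as 0 and that the range is inclusive; `occurrence_in_range` is a hypothetical name used only for this illustration.

```python
from typing import Optional


def occurrence_in_range(spec: str, occurrence: Optional[str]) -> bool:
    """Check an occurrence against a spec such as "1", "01" or "0-2".

    `occurrence` is None when the field has no occurrence; this is
    treated as occurrence 0 (an assumption based on the examples above).
    """
    low, _, high = spec.partition("-")
    high = high or low  # a single number is a range of length one
    value = int(occurrence) if occurrence is not None else 0
    return int(low) <= value <= int(high)
```

A side effect of the numeric comparison is that leading zeros are ignored altogether, so a spec of "1" would accept "01" as well as "001".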

@nwagner84
Member

nwagner84 commented Jan 31, 2023 via email

@nichtich
Contributor Author

nichtich commented Feb 1, 2023

One question regarding occurrences as single digits: Should 012A/1 match 012A/001?

012A/001 does not exist, because occurrences on levels 0 and 1 have two digits. So 012A/1 should match 012A/01. Relaxing this so that 012A/001 also matches the field seems OK.

On level 2 the occurrence can be two or three digits so 209A/1 and 209A/001 match 209A/01.

See https://unapi.k10plus.de/?&format=pp&id=opac-de-627:ppn:129435686 for an example of a record with occurrence on level 2 exceeding 99: it includes the field 209A/100 $B21/34$Dp$x00, an instance of this field.

This should be matched by any of:

  • 209A, 209A/*, 209Ax00, 209Ax00/*, 209Ax00-09, 209Ax00-09/*, 209A/*{x=='00'}
  • 209A/100, 209Ax00/100, 209Ax00-09/100, 209A/100{x in ['00','01','02','03','04','05','06','07','08','09']}

but not by 209Ax0 or 209A{x=='0'}!
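
A possible way to get this behaviour is to compare the counter with the value of $x as strings of equal width. The sketch below is written in Python under that assumption; `xtag_matches` is a hypothetical helper and the counter is expected to consist of digits only.

```python
def xtag_matches(counter: str, x_value: str) -> bool:
    """Check the value of subfield $x against an xtag counter.

    `counter` is e.g. "00", "0" or "00-09". Width is significant:
    a one-digit counter never matches a two-digit $x value.
    """
    low, _, high = counter.partition("-")
    high = high or low  # a single counter is a range of length one
    if not x_value.isdigit():
        return False
    if not (len(low) == len(high) == len(x_value)):
        return False
    # Equal-width digit strings compare correctly as plain strings.
    return low <= x_value <= high
```

For the field above with $x00, `xtag_matches("00", "00")` and `xtag_matches("00-09", "00")` are true, while `xtag_matches("0", "00")` is false.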

By the way, the K10plus format documentation uses the xtag syntax 209A/$x00-09, but other applications use the syntax 209Ax00-09. I'll convince my colleagues to stick to the latter at least.

Sorry, I did not invent PICA+ format!

@nwagner84 nwagner84 self-assigned this Feb 1, 2023
@nwagner84 nwagner84 added this to the v1.0.0 milestone Feb 1, 2023
@nwagner84
Member

I took some time to look at the current PICA Path specification and I think this is not the way pica-rs should go. Apart from the fact that the specification has some errors, it allows too many valid expressions whose semantics are not clear.

Just a few examples: Is there a difference between 200Ax0 (valid tag expression 200A.[x0]) and 200Ax0 (valid xtag expression)? Also, 047A/03- is a valid expression, which I tested with picadata against a record containing 047A/03. I would interpret this as 03 to infinity and had expected picadata to match this field, but it didn't. As a last example: 200Ax9/**/-3 is, at least for me, not readable anymore.

But most important, implementing the missing features and extensions would make the parser code more complicated and less maintainable. And even more important, I want to provide my colleagues with a clear syntax with clear semantics, oriented towards use cases and real data. It simply does not make sense to allow numbers with more than three digits in a range expression when occurrences can have only two or three digits.

I see two options to proceed: pica-rs uses another term than "path", or the PICA Path specification gets revised and a new, independent version 2.0 is released. This new version must specify a minimal core set with multiple extensions and a clear description of the syntax and semantics.

@nichtich
Contributor Author

nichtich commented Feb 3, 2023

Please don't follow xkcd 927. The current specification is based on use cases from working with real PICA+ data for more than a decade. Its complexity and apparent errors arise from the need to support multiple viewpoints and applications. pica-rs is not the first of these applications and it will not be the last, because people continuously create ad-hoc implementations based on what they already know and what is easy for them to implement.

I see two options to proceed: pica-rs uses another term than "path", or the PICA Path specification gets revised and a new, independent version 2.0 is released. This new version must specify a minimal core set with multiple extensions and a clear description of the syntax and semantics.

So let's go for option 2. I agree that the current document at https://format.gbv.de/query/picapath is far from clear, finished and easy to understand. First of all, please ignore all extensions. pica-rs will support its own extensions of PICA Path for sure, and these don't need to be named extensions or "path" at all. Just make sure that pica-rs accepts and interprets every core PICA Path Expression, so that the core is a full subset and basic interoperability between applications is ensured.

The absolute minimum baseline is the subset described here (including this restriction) to reference fields with common definition in cataloging rules:

field      ::=  tag01 ( "/" occurrence )? | tag2 ( "x" counter )?
tag01      ::=  [01] [0-9] [0-9] [A-Z@]
tag2       ::=  "2" [0-9] [0-9] [A-Z@]
occurrence ::=  [0-9] [0-9] ( "-" [0-9] [0-9] )?
counter    ::=  [0-9] ( "-" [0-9] )? | [0-9] [0-9] ( "-" [0-9] [0-9] )?

With the additional rule that occurrence zero (/00) must be mapped to no occurrence.
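
For illustration, the minimal grammar above translates fairly directly into a regular expression. This is only a sketch in Python (names such as `is_core_field` are made up here), not a description of how pica-rs parses expressions.

```python
import re

# One regex fragment per grammar rule of the minimal baseline.
TAG01 = r"[01][0-9][0-9][A-Z@]"
TAG2 = r"2[0-9][0-9][A-Z@]"
OCCURRENCE = r"[0-9][0-9](?:-[0-9][0-9])?"
COUNTER = r"(?:[0-9](?:-[0-9])?|[0-9][0-9](?:-[0-9][0-9])?)"

# field ::= tag01 ( "/" occurrence )? | tag2 ( "x" counter )?
FIELD = re.compile(rf"(?:{TAG01}(?:/{OCCURRENCE})?|{TAG2}(?:x{COUNTER})?)")


def is_core_field(expr: str) -> bool:
    """Return True if expr is a field expression of the minimal grammar."""
    return FIELD.fullmatch(expr) is not None
```

Under this grammar 045R/01-03 and 209Ax00-09 are accepted, whereas 209A/01 and 045R/1 are not, because occurrences are restricted to two digits on level 0 and 1 and level 2 tags only take a counter.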

This can be simplified and extended, e.g.

field      ::=  tag ( "x" counter )? ( "/" occurrence )?
tag        ::=  [012.] [0-9.] [0-9.] [A-Z@.]
occurrence ::=  [0-9]+ ( "-" [0-9]+ )? | "*"
counter    ::=  [0-9]+ ( "-" [0-9]+ )?

With the additional rule that occurrence is set to /* by default if tag starts with 2. I would further limit counter to level zero and one, as included in the first grammar with tag01 and tag2; there are multiple ways to implement this.
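
The extended grammar, together with the default occurrence rule, could be parsed as follows. This is a minimal Python sketch under the assumption that a tag pattern counts as level 2 when its first character is a literal 2; `parse_field` and the dictionary layout are hypothetical.

```python
import re

# field ::= tag ( "x" counter )? ( "/" occurrence )?
FIELD = re.compile(
    r"(?P<tag>[012.][0-9.][0-9.][A-Z@.])"
    r"(?:x(?P<counter>[0-9]+(?:-[0-9]+)?))?"
    r"(?:/(?P<occurrence>[0-9]+(?:-[0-9]+)?|\*))?"
)


def parse_field(expr: str) -> dict:
    """Split a field expression into tag, counter and occurrence."""
    match = FIELD.fullmatch(expr)
    if match is None:
        raise ValueError(f"invalid field expression: {expr!r}")
    parts = match.groupdict()
    # Default rule: no occurrence given on a level 2 tag means "/*".
    if parts["occurrence"] is None and parts["tag"].startswith("2"):
        parts["occurrence"] = "*"
    return parts
```

Here `parse_field("209C")` yields the occurrence "*", while `parse_field("045R")` leaves the occurrence as None, which a matcher can then treat like /00.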

Finally add minimum support of subfields:

path             ::=  field subfields?
subfields        ::=  "$" [A-Za-z0-9]+

That's all. Everything beyond that, e.g. allowing the subfield indicator . in addition to $, is your choice, and I would like to include some of your great additions in the PICA Path specification.
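
The subfield part of the grammar is small enough to handle with a plain string split. The Python sketch below only covers the $ indicator and does not validate the field part; `split_path` is a hypothetical helper.

```python
import re
from typing import List, Tuple


def split_path(path: str) -> Tuple[str, List[str]]:
    """Split a path into its field expression and a list of subfield codes.

    path ::= field subfields?   with   subfields ::= "$" [A-Za-z0-9]+
    """
    field, sep, codes = path.partition("$")
    # A "$" must be followed by at least one alphanumeric subfield code.
    if sep and not re.fullmatch(r"[A-Za-z0-9]+", codes):
        raise ValueError(f"invalid subfields in path: {path!r}")
    return field, list(codes)
```

For example, `split_path("021A$ad")` returns the field 021A with the subfield codes a and d, and `split_path("021A")` returns an empty list of codes.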

Is there a difference between 200Ax0 (valid tag expression 200A.[x0]) and 200Ax0 (valid xtag expression)

Good spot. I'm not happy with xtags at all, but that's how PICA is used. My colleagues introduced 200A/$x0 as yet another syntax for xtags (but only in some applications); I don't think this helps. How about making the subfield indicator (. or $) mandatory?

047A/03- is a valid expression, which I tested with picadata against a record containing 047A/03.

This is just a (possibly buggy) picadata extension of PICA Path, please ignore.

As a last example: 200Ax9/**/-3 is, at least for me, not readable anymore.

This is another PICA Path extension I don't like either, please ignore.

Sorry for not being clearer about what actually makes up the core of PICA Path and what is an optional extension to satisfy independent applications. Let's keep it simple but ensure basic interoperability.

@nichtich
Contributor Author

nichtich commented Jul 3, 2023

As mentioned in the initial issue description, support of xtags could be postponed beyond version 1.0 and discussed in another issue. For release 1.0 of pica-rs the only change needed for compatibility with other PICA tools is the default occurrence of level 2 tags if no occurrence is specified (e.g. pica filter 209C now equals pica filter 209C/00 but it should equal pica filter 209C/*). Rationale: occurrences on level 0 and 1 identify the field, but on level 2 occurrences are counters to group level 2 records. In contrast to level 0 and 1, the occurrence in level 2 fields does not change the semantics of a field (I would not be surprised if occurrences of level 2 fields can change, e.g. when a level 2 record is deleted and the remaining level 2 records change their occurrence value).

@nwagner84
Member

nwagner84 commented Jul 3, 2023

As part of the stabilization I simplified and cleaned up the syntax of pica-rs path expressions (e.g. removed the lazy- and $-syntax). Now I'm almost happy with the result and I would like to keep things as simple as possible.

How to proceed with this issue?

I'll discuss this issue with our DNB user group (next meeting 28.07.) and let you know what they think and how we proceed.

@nichtich
Contributor Author

nichtich commented Jul 3, 2023

Apart from the goal of compatibility across institutions and tools for PICA data, the following arguments may help colleagues from DNB to decide:

  • DNB uses the syntax forms 245Z/XX and 244Y to refer to PICA level 2 fields in its documentation. The second variant (read 244Y as 244Y/*) is also the one used in K10plus documentation.
  • Occurrences of PICA level 2 fields start from 01, so reading 244Y as 244Y/00 (as implemented now in pica-rs, in contrast to other tools) would never match any fields anyway.

@nwagner84 nwagner84 removed this from the v1.0.0 milestone Sep 15, 2023
@nwagner84 nwagner84 assigned nwagner84 and unassigned nwagner84 Sep 1, 2024
@nwagner84 nwagner84 added C-documentation Category: documentation A-filter Area: The filter command labels Sep 1, 2024
@nwagner84 nwagner84 added this to the v1.0.0 milestone Sep 1, 2024