PICA: filter out records #137

Closed
nichtich opened this issue Jun 13, 2022 · 7 comments

@nichtich
Collaborator

nichtich commented Jun 13, 2022

Some PICA records should be filtered out because they are internal, deleted, etc. To keep the filter rules stable, these filters should be applied in the import script and executed with pica-rs.

Example: filter out "mailbox" records, as they are only used internally (the example must be run in a script, because bash expands ! when the command is run directly at the command line):

pica filter "002@.0 !~ '^a'" sample.dat
@nichtich
Collaborator Author

nichtich commented Jun 15, 2022

Some more rules for which records to filter out can be found here:

002@.0 !~ '^L' &&             # Local catalog record (GBV only)
002@.0 !~ '^..[iktN]' &&      # Record for internal use, deleted, administrative record, or hidden
(002@.0 !~ '^.v' || 021A.a?)  # Volume record without title

With this issue resolved in pica-rs, the filters can be read by pica-rs from a file with comments for documentation.

@nichtich
Collaborator Author

Records can also be reduced to level 0 in this step by passing --reduce 0... to filter. To reduce to level 0 and keep only records with a PPN (required), call:

pica filter --file ignore-records.filter '003@?' --reduce 0... input.dat > reduced.dat

and use file ignore-records.filter:

002@.0 !~ '^L' &&
002@.0 !~ '^..[iktN]' &&
(002@.0 !~ '^.v' || 021A.a?)

@pkiraly
Owner

pkiraly commented Jun 17, 2022

@nichtich

  • should 002@.0 be interpreted as the 0 subfield of 002@, so it is equivalent to 002@$0 in another notation?
  • what does 021A.a? mean?
  • !~ comes from Perl, meaning not matching, right?
  • In filter --file ignore-records.filter '003@?', both --file ignore-records.filter and '003@?' are parameters of the filter command, so it should pass only those records in which the first character of 002@$0 is not L, the third character is not i, k, t, or N, and either the second character is not v or the record has 021A$a. I am a little bit confused with the '003@?' part. Should it be filter out or filter in?

@nichtich
Collaborator Author

nichtich commented Jun 20, 2022

Yes, 002@.0 and 002@$0 both denote subfield 0 of 002@ in PICA Path syntax. The latter is preferred for display, while the former does not require shell escaping. If put in a file as suggested, the syntax with $ is indeed better.

021A.a? and !~ are pica-rs filter syntax, see here and here for documentation.

(002@.0 !~ '^.v' || 021A.a?) encodes the rule to filter out records where position 2 of field 002@ subfield 0 is v, unless the record has a field 021A with subfield a (this applies to some special records for serials).
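
The combined rules can be sketched in Python to make the logic explicit. This is only an illustration of the boolean structure, not how pica-rs evaluates filters: the function name keep_record, the idea of passing the 002@$0 value as a string, and the has_021A_a flag are all assumptions made for the example, and it assumes a single 002@$0 value per record.

```python
import re


def keep_record(status: str, has_021A_a: bool) -> bool:
    """Hypothetical helper: True if a record passes all three rules.

    status      -- the value of 002@$0 (assumed to occur once per record)
    has_021A_a  -- whether the record has a field 021A with subfield a
    """
    # 002@.0 !~ '^L'  -> not a local catalog record
    not_local = re.search(r"^L", status) is None
    # 002@.0 !~ '^..[iktN]'  -> third character is none of i, k, t, N
    not_internal = re.search(r"^..[iktN]", status) is None
    # (002@.0 !~ '^.v' || 021A.a?)  -> not a volume record, or it has a title
    not_untitled_volume = re.search(r"^.v", status) is None or has_021A_a
    return not_local and not_internal and not_untitled_volume
```

A record whose 002@$0 has v at position 2 is kept only when has_021A_a is true; all other records are decided by the first two rules alone.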

I am a little bit confused with the '003@?' part. Should it be filter out or filter in?

Yes, that's confusing. It's ignored anyway because the filter expression is read from the file. A future version of pica-rs will get rid of this, so the calling syntax would become

pica filter --file ignore-records.filter --reduce 0... input.dat > reduced.dat

@pkiraly
Owner

pkiraly commented Jun 20, 2022

For the time being I will not implement the exact syntax, but something which is more easily parsable, and I will get back to this feature later. There is an existing feature, --ignorableRecords, which expects MARCSpec expressions separated by commas. The comma means OR, so a record is ignorable if expr1 OR expr2 OR ... exprN matches. If the input is PICA, it of course expects PICA path expressions.
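
The OR semantics described here can be sketched in Python. The sketch does not parse MARCSpec or PICA path expressions; each expression is represented by a plain Python predicate over the 002@$0 value, and the names is_ignorable and expressions are hypothetical, chosen only for this illustration.

```python
import re
from typing import Callable, Iterable

# Stand-in for a parsed expression: a predicate over the 002@$0 value.
Predicate = Callable[[str], bool]


def is_ignorable(status: str, expressions: Iterable[Predicate]) -> bool:
    """A record is ignorable if expr1 OR expr2 OR ... exprN matches."""
    return any(expr(status) for expr in expressions)


# Two example expressions, written positively because they select
# the records to leave out.
expressions = [
    lambda s: re.search(r"^L", s) is not None,         # local catalog record
    lambda s: re.search(r"^..[iktN]", s) is not None,  # internal, deleted, ...
]
```

With an empty list of expressions no record is ignorable, since any() over an empty sequence is false.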

@pkiraly
Owner

pkiraly commented Jun 21, 2022

@nichtich I am working on this, and it seems that filter has the opposite meaning to --ignorableRecords.

filter "002@.0 !~ '^a'" -- lets in records which are not mailboxes
--ignorableRecords "002@.0 !~ '^a'" -- leaves out records which are not mailboxes

I see two solutions:

  1. reverse the operator on the client side, so that when configuring, !~ becomes =~. This is not compatible with filter and needs some more development.
  2. add a new parameter --filterRecords which would be compatible with filter.

Which one do you prefer?
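
The difference between the two semantics, and why negating the expression bridges them, can be sketched in Python. The functions filter_records and ignore_records are hypothetical stand-ins for the two behaviours and operate on bare 002@$0 values rather than real records.

```python
import re


def is_mailbox(status: str) -> bool:
    # Corresponds to 002@.0 =~ '^a'
    return re.search(r"^a", status) is not None


def not_mailbox(status: str) -> bool:
    # Corresponds to 002@.0 !~ '^a'
    return not is_mailbox(status)


def filter_records(values, expr):
    """filter semantics: keep the records that match the expression."""
    return [v for v in values if expr(v)]


def ignore_records(values, expr):
    """Ignore semantics: drop the records that match the expression."""
    return [v for v in values if not expr(v)]


values = ["aau", "Aau", "Lau"]
kept_by_filter = filter_records(values, not_mailbox)  # mailbox record dropped
kept_by_ignore = ignore_records(values, not_mailbox)  # only the mailbox is left
kept_reversed = ignore_records(values, is_mailbox)    # same as kept_by_filter
```

Under these assumptions, option 1 amounts to handing the negated expression to the ignore logic, while option 2 amounts to exposing filter_records directly.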

pkiraly added a commit that referenced this issue Jun 21, 2022
pkiraly added a commit that referenced this issue Jun 22, 2022
@nichtich
Collaborator Author

Filtering out records should be left to pica-rs as far as possible. It's more reliable to have specialized tools for specific tasks, put together as modules.

pkiraly added a commit that referenced this issue Aug 15, 2022
pkiraly closed this as completed Nov 8, 2022