Specify PICA serialization XML format #9

Open
nichtich opened this issue Dec 9, 2021 · 2 comments

@nichtich
Contributor

nichtich commented Dec 9, 2021

Well, you created yet another PICA+ serialization format, so I would like to add its documentation to http://format.gbv.de/pica and support it in PICA::Data (see gbv/PICA-Data#83).

As far as I understand the script, PICA+ records are first transformed to XML with scripts/pica2xml.pl. There are examples of this XML format in scripts/test and in test. As far as I can tell, the format includes:

  • root element collection with (optional?) attribute count
    • repeatable element record
      • element header with mandatory attribute status, having one of the values deleted or upsert
        • element identifier with the PPN
      • element metadata
        • repeatable element datafield with attributes tag, fulltag (mandatory) and occurrence (optional)
          • element subfield with mandatory attribute code
        • repeatable optional element item with mandatory attribute epn
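To make the outline above concrete, here is a minimal Python sketch that parses a hand-written sample of this variant with the standard library. The sample record is hypothetical: the PPN, EPN, tags, and subfield values are invented, and nesting datafield elements inside item is an assumption, since the outline does not say what item contains.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; all values are invented for illustration only.
SAMPLE = """
<collection count="1">
  <record>
    <header status="upsert">
      <identifier>123456789</identifier>
    </header>
    <metadata>
      <datafield tag="021A" fulltag="021A">
        <subfield code="a">An example title</subfield>
      </datafield>
      <datafield tag="045Q" occurrence="01" fulltag="045Q/01">
        <subfield code="a">Example classification</subfield>
      </datafield>
      <item epn="987654321">
        <datafield tag="209A" occurrence="01" fulltag="209A/01">
          <subfield code="x">00</subfield>
        </datafield>
      </item>
    </metadata>
  </record>
</collection>
"""


def summarize(xml_text):
    """Return one dict per record: status, PPN, top-level fulltags, item EPNs."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.findall("record"):
        header = record.find("header")
        metadata = record.find("metadata")
        records.append({
            "status": header.get("status"),
            "ppn": header.findtext("identifier"),
            # findall matches direct children only, so datafields
            # nested inside item are not listed here
            "fields": [f.get("fulltag") for f in metadata.findall("datafield")],
            "epns": [i.get("epn") for i in metadata.findall("item")],
        })
    return records


records = summarize(SAMPLE)
print(records)
```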

Some files use a slightly different form:

  • root element collection with (optional?) attribute count
    • repeatable element record
      • element status having one of the values deleted or upsert
      • element hrid with the PPN
      • element metadata
        • repeatable element datafield with attributes tag, fulltag (mandatory) and occurrence (optional)
          • element subfield with mandatory attribute code
        • repeatable optional element item with mandatory attribute epn
    • optional (?) element rawrecord with full record (syntax of this is another issue)
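Until the two variants are consolidated, a consumer could read both with a small helper. This is a minimal sketch, assuming only the element and attribute names from the two outlines above; the function name `read_status_and_ppn` and the sample records are hypothetical.

```python
import xml.etree.ElementTree as ET

VALID_STATUS = {"deleted", "upsert"}


def read_status_and_ppn(record):
    """Extract (status, PPN) from a record element in either variant."""
    header = record.find("header")
    if header is not None:
        # variant with <header status="..."><identifier>PPN</identifier></header>
        status = header.get("status")
        ppn = header.findtext("identifier")
    else:
        # variant with <status> and <hrid> as direct children of <record>
        status = record.findtext("status")
        ppn = record.findtext("hrid")
    if status not in VALID_STATUS:
        raise ValueError(f"unexpected status: {status!r}")
    return status, ppn


# Hypothetical records, one per variant.
with_header = ET.fromstring(
    '<record><header status="deleted">'
    "<identifier>123456789</identifier></header></record>"
)
without_header = ET.fromstring(
    "<record><status>upsert</status><hrid>123456789</hrid><metadata/></record>"
)

print(read_status_and_ppn(with_header))
print(read_status_and_ppn(without_header))
```

Raising on an unknown status is a design choice for the sketch; a harvester might prefer to log and skip such records instead.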

Questions:

  • Why not PPXML or an extension of it? (Well, I guess it's too late now.)
  • Why two variants? Could both at least be consolidated?
  • What happens when a record contains multiple level 1 records? Can datafield and item be mixed, or is the format limited to one ILN?
  • Why are x-occurrences not included in fulltag (e.g. "209Ax00/01" for field 209Ax/01 with $x=00)? For some fields on level 2, subfield $x is crucial to distinguish the meaning of the field; see the formal specification at https://format.gbv.de/schema/avram/specification#field-identifier
  • Last but not least: what would be a proper name for the format? How about PICA Import XML (PIXML)?
@cledvina
Collaborator

cledvina commented Dec 9, 2021

I'm not sure the pica2xml.pl script is even up to date. I created it for a one-off proof-of-concept test for Leipzig University.

Right. My former colleague Heikki, who originally started the work on this project, created his own flavor of XML from the PICA text files. When I took over this project, I built upon his foundation. I did experiment with PICA::Data at that time, but it turned out to be a little easier to transform and harvest with the current format. NOTE: Our harvester does need to have certain delete signals and identifiers in the header, so this format leans toward OAI-PMH. I think the files with no header node are defunct now. There should only be one format. I apologize for not cleaning up the old files.

A record can have multiple level 1 data? We haven't come across this example when testing. Perhaps this is because this project only deals with single record updates.

I'm not sure why we are not using x-occurrences. This is probably because item data is in its own repeatable node.

An appropriate name for this format may be PICA FOLIO Import XML (PFIXML), or PIXML.

@nichtich
Contributor Author

Thanks for the quick answer!

> Our harvester does need to have certain delete signals and identifiers in the header [...] I think the files with no header node are defunct now

So the current format is the first one, with element header and identifier instead of elements status and hrid?

> A record can have multiple level 1 data? We haven't come across this example when testing. Perhaps this is because this project only deals with single record updates.

Yes, I think there is one FOLIO instance per level 1 identifier (ILN). The format could be extended to support more, but if the use case does not need it, better to keep it as it is.

> I'm not sure why we are not using x-occurrences.

An x-occurrence would make sense if fulltag is used to map PICA fields to FOLIO fields; otherwise an if statement on the value of subfield $x is needed to distinguish the meaning of some fields with the same tag and occurrence on level 2. On the other hand, it is hard to tell automatically whether a field on level 2 has an x-occurrence without a lookup in the cataloging standard. So better to keep it as it is: fulltag is just the tag, optionally followed by / and the occurrence.
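The fulltag convention described here, and the check on subfield $x that it implies for mapping, can be sketched in Python. This is an illustration only: the function names and the mapping table are hypothetical, and the target names are placeholders rather than real FOLIO fields.

```python
def fulltag(tag, occurrence=None):
    """Build fulltag: the tag, optionally followed by '/' and the occurrence."""
    return f"{tag}/{occurrence}" if occurrence else tag


# Hypothetical mapping keyed by (fulltag, value of subfield $x).
# The target names are placeholders, not real FOLIO fields.
MAPPING = {
    ("209A/01", "00"): "placeholder_target_a",
    ("209A/01", "09"): "placeholder_target_b",
}


def map_field(tag, occurrence, subfields):
    """Pick a target; $x must be checked separately because fulltag omits it."""
    x = subfields.get("x")
    return MAPPING.get((fulltag(tag, occurrence), x))


print(fulltag("021A"))
print(fulltag("209A", "01"))
print(map_field("209A", "01", {"x": "00"}))
```

Keying the lookup on the pair of fulltag and $x keeps the XML format unchanged while still letting a mapping distinguish level 2 fields that share a tag and occurrence.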
