Testing - Authority record samples #1431

Open
3 of 4 tasks
ahafele opened this issue Nov 12, 2024 · 18 comments · Fixed by #1489

@ahafele

ahafele commented Nov 12, 2024

  • Set up ebsco tools/airflow
  • Test load authority record samples
    • new - see comments below
    • updates - Per Jeremy: not possible with migration tools

Some records have been loaded to stage already

Sample files here

@ahafele ahafele added Authorities data export and removed data export labels Nov 12, 2024
@ahafele
Author

ahafele commented Nov 15, 2024

Documenting here differences in ebsco loaded authority records vs. data import

Jeremy posted a record using the ebsco tools to both /authority-storage/authorities and SRS
GET authority-storage/authorities/71b2cacf-4f96-5154-ae35-6e6a30986683 results in

{
    "id": "71b2cacf-4f96-5154-ae35-6e6a30986683",
    "_version": 0,
    "source": "MARC",
    "personalName": "Grimm, Wilhelm, 1786-1859",
    "sftPersonalName": [
        "Grim, Vilkhelm, 1786-1859",
        "Grimm, Guglielmo, 1786-1859",
        "Grimm, Vilʹgelʹm Karl, 1786-1859",
        "Grimm, Wilhelm Karl, 1786-1859",
        "Grimm Brothers",
        "Brothers Grimm",
        "Brüder Grimm",
        "Bratʹi͡a Grimm",
        "Braty Grimm",
        "Krim eghbayrner",
        "Гримм, Вильгельм, 1786-1859",
        "ברודער גרים",
        "גרים וילהלם",
        "גרים, ווילהלם",
        "גרים, וילהלם",
        "גרים, וילהלם, 1786־1859",
        "גרים, וילהלם, 1859־1786"
    ],
    "sftPersonalNameTitle": [
        "Grim, Vilkhelm, 1786-1859",
        "Grimm, Guglielmo, 1786-1859",
        "Grimm, Vilʹgelʹm Karl, 1786-1859",
        "Grimm, Wilhelm Karl, 1786-1859",
        "Grimm Brothers",
        "Brothers Grimm",
        "Brüder Grimm",
        "Bratʹi͡a Grimm",
        "Braty Grimm",
        "Krim eghbayrner",
        "Гримм, Вильгельм, 1786-1859",
        "ברודער גרים",
        "גרים וילהלם",
        "גרים, ווילהלם",
        "גרים, וילהלם",
        "גרים, וילהלם, 1786־1859",
        "גרים, וילהלם, 1859־1786"
    ],
    "identifiers": [
        {
            "value": "n  78095679",
            "identifierTypeId": "c858e4f2-2b6b-4385-842b-60732ee14abb"
        },
        {
            "value": "(OCoLC)oca00230755",
            "identifierTypeId": "7e591197-f335-4afb-bc6d-a6d76ca3bace"
        },
        {
            "value": "(DLC)n  78095679",
            "identifierTypeId": "7e591197-f335-4afb-bc6d-a6d76ca3bace"
        }
    ],
    "notes": [
        {
            "noteTypeId": "76c74801-afec-45a0-aad7-3ff23591e147",
            "note": "Machine-derived non-Latin script reference project."
        },
        {
            "noteTypeId": "76c74801-afec-45a0-aad7-3ff23591e147",
            "note": "Non-Latin script references not evaluated."
        },
        {
            "noteTypeId": "76c74801-afec-45a0-aad7-3ff23591e147",
            "note": "Brothers Grimm/Brüder Grimm not considered joint pseudonym of Wilhelm Grimm and Jacob Grimm; evidence indicates that original works were not issued under this name."
        }
    ],
    "sourceFileId": "af045f2f-e851-4613-984c-4bc13430454a",
    "naturalId": "n78095679",
    "metadata": {
        "createdDate": "2024-11-14T17:55:50.85306Z",
        "createdByUserId": "58d0aaf6-dcda-4d5e-92da-012e6b7dd766",
        "updatedDate": "2024-11-14T17:55:50.85306Z",
        "updatedByUserId": "58d0aaf6-dcda-4d5e-92da-012e6b7dd766"
    }
}

I loaded the same record to data import with the default authorities profile and did a
GET authority-storage/authorities/fa83bb1b-23af-4b44-b5ed-33a918e04594

{
    "id": "fa83bb1b-23af-4b44-b5ed-33a918e04594",
    "_version": 0,
    "source": "MARC",
    "personalName": "Grimm, Wilhelm, 1786-1859",
    "sftPersonalName": [
        "Grim, Vilkhelm, 1786-1859",
        "Grimm, Guglielmo, 1786-1859",
        "Grimm, Vilʹgelʹm Karl, 1786-1859",
        "Grimm, Wilhelm Karl, 1786-1859",
        "Grimm Brothers",
        "Brothers Grimm",
        "Brüder Grimm",
        "Bratʹi͡a Grimm",
        "Braty Grimm",
        "Krim eghbayrner",
        "Гримм, Вильгельм, 1786-1859",
        "ברודער גרים",
        "גרים וילהלם",
        "גרים, ווילהלם",
        "גרים, וילהלם",
        "גרים, וילהלם, 1786־1859",
        "גרים, וילהלם, 1859־1786"
    ],
    "subjectHeadings": "a",
    "identifiers": [
        {
            "value": "n  78095679",
            "identifierTypeId": "5d164f4b-0b15-4e42-ae75-cfcf85318ad9"
        },
        {
            "value": "n  78095679",
            "identifierTypeId": "c858e4f2-2b6b-4385-842b-60732ee14abb"
        },
        {
            "value": "(OCoLC)oca00230755",
            "identifierTypeId": "fe19bae4-da28-472b-be90-d442e2428ead"
        }
    ],
    "notes": [
        {
            "noteTypeId": "76c74801-afec-45a0-aad7-3ff23591e147",
            "note": "Machine-derived non-Latin script reference project."
        },
        {
            "noteTypeId": "76c74801-afec-45a0-aad7-3ff23591e147",
            "note": "Non-Latin script references not evaluated."
        },
        {
            "noteTypeId": "76c74801-afec-45a0-aad7-3ff23591e147",
            "note": "Brothers Grimm/Brüder Grimm not considered joint pseudonym of Wilhelm Grimm and Jacob Grimm; evidence indicates that original works were not issued under this name."
        }
    ],
    "sourceFileId": "af045f2f-e851-4613-984c-4bc13430454a",
    "naturalId": "n78095679",
    "metadata": {
        "createdDate": "2024-11-14T18:12:52.88211Z",
        "createdByUserId": "ffba9979-3f5d-4aac-a74f-18218dd2573f",
        "updatedDate": "2024-11-14T18:12:52.88211Z",
        "updatedByUserId": "ffba9979-3f5d-4aac-a74f-18218dd2573f"
    }
}

Linking worked for both the data import record and the ebsco loaded record.
Interestingly, the ebsco loaded record isn't reflected as LCSH in the thesaurus facet, but I think this is unrelated since that facet uses the 008.
Other differences include the identifierTypeId for the OCLC numbers, and the ebsco record includes sftPersonalNameTitle in addition to sftPersonalName, while the data import record has only sftPersonalName.
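
One way to make this kind of comparison repeatable is to diff the two GET responses programmatically. The following is only a minimal sketch, not the method used in this issue: the function name `diff_authority_records` and the choice of ignored keys are assumptions, and the two dicts are trimmed copies of the records above.

```python
def diff_authority_records(ebsco, data_import, ignore=("id", "metadata", "_version")):
    """Summarise top-level differences between two authority JSON records."""
    ebsco_keys = set(ebsco) - set(ignore)
    di_keys = set(data_import) - set(ignore)
    return {
        "only_in_ebsco": sorted(ebsco_keys - di_keys),
        "only_in_data_import": sorted(di_keys - ebsco_keys),
        "different_values": sorted(
            key for key in ebsco_keys & di_keys if ebsco[key] != data_import[key]
        ),
    }


# Trimmed copies of the two records above (most repeated values omitted).
ebsco_record = {
    "id": "71b2cacf-4f96-5154-ae35-6e6a30986683",
    "personalName": "Grimm, Wilhelm, 1786-1859",
    "sftPersonalName": ["Grimm Brothers"],
    "sftPersonalNameTitle": ["Grimm Brothers"],
    "identifiers": [
        {"value": "(OCoLC)oca00230755",
         "identifierTypeId": "7e591197-f335-4afb-bc6d-a6d76ca3bace"},
    ],
}
data_import_record = {
    "id": "fa83bb1b-23af-4b44-b5ed-33a918e04594",
    "personalName": "Grimm, Wilhelm, 1786-1859",
    "sftPersonalName": ["Grimm Brothers"],
    "subjectHeadings": "a",
    "identifiers": [
        {"value": "(OCoLC)oca00230755",
         "identifierTypeId": "fe19bae4-da28-472b-be90-d442e2428ead"},
    ],
}

report = diff_authority_records(ebsco_record, data_import_record)
print(report)
```

Run against the full responses, a report like this would list every key that exists on only one side, which is a quick first check before looking at individual values.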

@ahafele
Author

ahafele commented Nov 18, 2024

The $t is being ignored in the ebsco handling of 4xx fields. There is an open ticket from 2022 in the folio-migration-tools to support requiredSubfield

I haven't figured out exactly what is going on with the ID differences, but one cause is that the migration tools are concatenating the 001 and 003:
https://github.com/FOLIO-FSE/folio_migration_tools/blob/719e0c0a4175a1716d58eb768c[…]folio_migration_tools/marc_rules_transformation/hrid_handler.py

Something is missing regarding the 008 handling as well.

@ahafele
Author

ahafele commented Dec 3, 2024

Findings outlined here: https://docs.google.com/document/d/1W1oqZqhWcw7JZEeSCg6rKsJys6BWId5ftKJ2e6D2Inw/edit?tab=t.0#bookmark=kix.337n8d4ieos7

I think our options are

  1. Make the needed changes in the Ebsco tooling to match what is done through Data Import
    @jermnelson
  2. Use Data Import to load the initial Authority file(s). We only know of one institution that has done this - Michigan. They loaded 2.5 million and said it was time consuming and error prone.
  3. Write something in airflow to post MARC to SRS and create/post json to authority-storage/authorities. Not sure which APIs would be used for this. Would need to process the records in the same way Data Import does, e.g. use Data Import processing core rules and mapping.

@jermnelson what are your thoughts on the above?

@jermnelson
Collaborator

Hi Alissa, I believe a combination of 1 and 3 could be an option. In an Airflow DAG we would do the following tasks:

  1. Save the MARC Authority file to the Airflow server
  2. For each authority record in the MARC Authority file, generate an initial Authority JSON record using the Ebsco Authority Transformer.
  3. Modify the Authority JSON record based on Data Import processing rules and mapping (this might be easier than trying to update the Ebsco tooling)
  4. Append each authority record to a JSON line file
  5. Use the FOLIO Batch API to upload the JSON and MARC files
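
Steps 2 through 4 above could be sketched roughly as follows. This is a sketch under stated assumptions, not working DAG code: `transform` stands in for the Ebsco Authority Transformer and `adjust` for the Data Import-style corrections, both hypothetical callables supplied by the caller, and nothing here calls the real folio_migration_tools or FOLIO APIs. Saving the file (step 1) and the batch upload (step 5) are left out.

```python
import json
from pathlib import Path


def append_authorities_jsonl(marc_records, transform, adjust, out_path):
    """Transform each MARC record and append it to a JSON lines file.

    transform: hypothetical stand-in for the Ebsco Authority Transformer (step 2)
    adjust:    hypothetical stand-in for Data Import-style corrections (step 3)
    """
    written = 0
    with Path(out_path).open("a", encoding="utf-8") as handle:
        for marc_record in marc_records:
            authority = adjust(transform(marc_record))
            # One JSON document per line (step 4); keep non-Latin scripts readable.
            handle.write(json.dumps(authority, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Keeping the correction step as a separate callable means the Ebsco tooling stays untouched and each Data Import difference can be handled, and tested, on its own.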

@ahafele
Author

ahafele commented Dec 3, 2024

Thanks for thinking this through @jermnelson. Couple questions off the top of my head for steps 2-3

  • is there the possibility for data loss here? e.g. could we run into a situation in which the Ebsco generated JSON does not include data we would want before step 3 occurs?
  • based on our previous testing for the specific example records we loaded we have a sense of what the data import processing rules are doing differently from the Authority Transformer, but I assume there could be other differences we haven't noticed yet. Any ideas for how to understand this more systematically?

@jermnelson
Collaborator

* is there the possibility for data loss here? e.g. could we run into a situation in which the Ebsco generated JSON does not include data we would want before step 3 occurs?

Since Ebsco's AuthorityTransformer uses FOLIO Authority Mappings to generate the Authority JSON record (although, as we've discovered, the Ebsco mapping code doesn't support all of the different options in the FOLIO mapping, like requiredSubfield), I think the mapped MARC data should be present in the Authority record. I'm working on a report that takes the sample Authority MARC records, creates Authority JSON records with the AuthorityTransformer, and then compares them against the FOLIO Mapping to see if the expected fields are present in the records.

* based on our previous testing for the specific example records we loaded we have a sense of what the data import processing rules are doing differently from the Authority Transformer, but I assume there could be other differences we haven't noticed yet. Any ideas for how to understand this more systematically?

I'm not sure how to approach this question without having the corresponding Authority records created by data import to compare with the records created by the AuthorityTransformer. I'll do some more analysis and maybe see if I can extend the reporting from the previous question to this question.

@ahafele
Author

ahafele commented Dec 5, 2024

Great, thanks Jeremy! Let me know if you want a number of records loaded through Data Import to compare.

@jermnelson
Collaborator

Here is a spreadsheet that has the Authority Mapping MARC tags* as columns. Each row is a single authority record. If the MARC field is present and has a value, the cell is set to True; if the value is missing, the cell is set to False. If the tag isn't in the MARC record, a blank (null) value is recorded. From this analysis, the Ebsco AuthorityMapper always generates a value if a tag is present in the MARC record.

mapping-of-records.csv

*The Ebsco AuthorityMapper class has special handling for the 001 and 008 fields, so they are excluded; the 001 value is included as the first column to match each row to the corresponding MARC record.
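
For readers who want to reproduce a report of this shape, here is a rough sketch of how one row could be built. It is not the script that produced mapping-of-records.csv: `presence_row` and its inputs are hypothetical, with `marc_tags` being the set of tags found in the MARC record and `generated` mapping each tag to whatever values the transformer produced for it.

```python
def presence_row(control_number, marc_tags, generated, mapped_tags):
    """Build one report row: True / False / None per mapped MARC tag."""
    row = {"001": control_number}
    for tag in mapped_tags:
        if tag not in marc_tags:
            row[tag] = None  # tag absent from the MARC record: blank cell
        else:
            row[tag] = bool(generated.get(tag))  # present: was a value generated?
    return row


# Hypothetical inputs for illustration only.
row = presence_row(
    "n 78095679",
    marc_tags={"010", "100", "400"},
    generated={"010": ["n 78095679"], "100": ["Grimm, Wilhelm, 1786-1859"]},
    mapped_tags=["010", "100", "400", "670"],
)
print(row)
```

In this made-up input the 400 comes out False to show what a gap would look like; the actual analysis above found no such cases.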

@ahafele
Author

ahafele commented Dec 10, 2024

@jermnelson thanks for doing this! I've reviewed the spreadsheet and agree all looks good - confirms all data would be present in the json with exceptions for 008 and 001 that will need to be addressed.

My second question was about how we can best understand the differences in transformation happening in data import vs. the ebsco Authority Transformer. You said

* I'm not sure how to approach this question without having the corresponding Authority records created by data import to compare with the records created by the AuthorityTransformer. I'll do some more analysis and maybe see if I can extend the reporting from the previous question to this question.

I've loaded the same file authkey.sample3.mrk to -stage. Any thoughts on how we could compare the json from each process? Or do you think there is a better approach to take to try and figure out if there are additional differences not discovered yet?

Once we identify them do you think they should be ticketed in the FSE project?

@jermnelson
Collaborator

jermnelson commented Dec 11, 2024

For a first pass at analyzing the differences between the Authority records produced by the Ebsco tools and Data Import, I generated the following table comparing the identifiers from the two tools for all of the records in the authkey.sample3.mrc file. The first table contains all of the records that produced an equal number of identifiers with the two methods:

| Record 001 | FOLIO-FSE identifier | FOLIO-FSE value | Data Import identifier | Data Import value |
| --- | --- | --- | --- | --- |
| no2010101968 | System control number | (DLC)no2010101968 | Placeholder | (OCoLC)oca08536933 |
| | System control number | (OCoLC)oca08536933 | Control number | no2010101968 |
| | LCCN | no2010101968 | LCCN | no2010101968 |
| n 2012008843 | System control number | (DLC)n 2012008843 | Placeholder | (OCoLC)oca09109265 |
| | System control number | (OCoLC)oca09109265 | Control number | n 2012008843 |
| | LCCN | n 2012008843 | LCCN | n 2012008843 |
| n 93023694 | System control number | (DLC)n 93023694 | Placeholder | (OCoLC)oca03334744 |
| | System control number | (OCoLC)oca03334744 | Control number | n 93023694 |
| | LCCN | n 93023694 | LCCN | n 93023694 |
| nr 96025045 | System control number | (DLC)nr 96025045 | Placeholder | (OCoLC)oca04124484 |
| | System control number | (OCoLC)oca04124484 | Control number | nr 96025045 |
| | LCCN | nr 96025045 | LCCN | nr 96025045 |
| n 84207914 | System control number | (DLC)n 84207914 | Placeholder | (OCoLC)oca02406949 |
| | System control number | (OCoLC)oca02406949 | Control number | n 84207914 |
| | LCCN | n 84207914 | LCCN | n 84207914 |
| no2008057654 | System control number | (DLC)no2008057654 | Placeholder | (OCoLC)oca07735917 |
| | System control number | (OCoLC)oca07735917 | Control number | no2008057654 |
| | LCCN | no2008057654 | LCCN | no2008057654 |
| nr 95002253 | System control number | (DLC)nr 95002253 | Placeholder | (OCoLC)oca03758927 |
| | System control number | (OCoLC)oca03758927 | Control number | nr 95002253 |
| | LCCN | nr 95002253 | LCCN | nr 95002253 |
| n 78095680 | System control number | (DLC)n 78095680 | Placeholder | (OCoLC)oca00230756 |
| | System control number | (OCoLC)oca00230756 | Control number | n 78095680 |
| | LCCN | n 78095680 | LCCN | n 78095680 |
| n 78095679 | System control number | (DLC)n 78095679 | Placeholder | (OCoLC)oca00230755 |
| | System control number | (OCoLC)oca00230755 | Control number | n 78095679 |
| | LCCN | n 78095679 | LCCN | n 78095679 |
| n 79081460 | System control number | (DLC)n 79081460 | Placeholder | (OCoLC)oca00314001 |
| | System control number | (OCoLC)oca00314001 | Other standard identifier | 0000000121445760 isni |
| | Other standard identifier | 0000000121445760 isni | Other standard identifier | 96992551 viaf |
| | Other standard identifier | 96992551 viaf | Other standard identifier | http://cantic.bnc.cat/registres/fitxa/26163 uri |
| | Other standard identifier | http://cantic.bnc.cat/registres/fitxa/26163 uri | Other standard identifier | http://catalogue.bnf.fr/ark:/12148/cb119254833 uri |
| | Other standard identifier | http://catalogue.bnf.fr/ark:/12148/cb119254833 uri | Other standard identifier | http://ci.nii.ac.jp/author/DA00818556 uri |
| | Other standard identifier | http://ci.nii.ac.jp/author/DA00818556 uri | Other standard identifier | http://d-nb.info/gnd/118617338 uri |
| | Other standard identifier | http://d-nb.info/gnd/118617338 uri | Other standard identifier | http://dbpedia.org/page/JohnSteinbeck uri |
| | Other standard identifier | http://dbpedia.org/page/JohnSteinbeck uri | Other standard identifier | http://isni.org/isni/0000000121445760 uri |
| | Other standard identifier | http://isni.org/isni/0000000121445760 uri | Other standard identifier | http://nla.gov.au/anbd.aut-an35522183 uri |
| | Other standard identifier | http://nla.gov.au/anbd.aut-an35522183 uri | Other standard identifier | http://nla.gov.au/nla.party-983264 uri |
| | Other standard identifier | http://nla.gov.au/nla.party-983264 uri | Other standard identifier | http://openplaques.org/people/4813 uri |
| | Other standard identifier | http://openplaques.org/people/4813 uri | Other standard identifier | http://viaf.org/viaf/96992551 uri |
| | Other standard identifier | http://viaf.org/viaf/96992551 uri | Other standard identifier | http://vocab.getty.edu/ulan/500341772 uri |
| | Other standard identifier | http://vocab.getty.edu/ulan/500341772 uri | Other standard identifier | http://www.fantascienza.com/catalogo/autori/NILF15047 uri |
| | Other standard identifier | http://www.fantascienza.com/catalogo/autori/NILF15047 uri | Other standard identifier | http://www.idref.fr/02714805X uri |
| | Other standard identifier | http://www.idref.fr/02714805X uri | Other standard identifier | http://www.imdb.com/name/nm0825705 uri |
| | Other standard identifier | http://www.imdb.com/name/nm0825705 uri | Other standard identifier | http://www.wikidata.org/entity/Q39212 uri |
| | Other standard identifier | http://www.wikidata.org/entity/Q39212 uri | Other standard identifier | https://musicbrainz.org/artist/3306ba20-06ab-4af8-96f2-e8aeb3946 1ee uri |
| | Other standard identifier | https://musicbrainz.org/artist/3306ba20-06ab-4af8-96f2-e8aeb3946 1ee uri | Other standard identifier | https://openlibrary.org/authors/OL25788A uri |
| | Other standard identifier | https://openlibrary.org/authors/OL25788A uri | Other standard identifier | https://www.freebase.com/m/04107 uri |
| | Other standard identifier | https://www.freebase.com/m/04107 uri | Control number | n 79081460 |
| | LCCN | n 79081460 | LCCN | n 79081460 |
| n 42742214 | System control number | (DLC)n 42742214 | Placeholder | (OCoLC)oca00035656 |
| | System control number | (OCoLC)oca00035656 | Control number | n 42742214 |
| | LCCN | n 42742214 | LCCN | n 42742214 |
| sh 85119265 | System control number | (DLC)sh 85119265 | Control number | sh 85119265 |
| | LCCN | sh 85119265 | LCCN | sh 85119265 |
| sh2008111033 | System control number | (DLC)sh2008111033 | Control number | sh2008111033 |
| | LCCN | sh2008111033 | LCCN | sh2008111033 |
| sh 85046925 | System control number | (DLC)sh 85046925 | Control number | sh 85046925 |
| | LCCN | sh 85046925 | LCCN | sh 85046925 |
| sh2010118155 | System control number | (DLC)sh2010118155 | Control number | sh2010118155 |
| | LCCN | sh2010118155 | LCCN | sh2010118155 |
| sh 85069833 | System control number | (DLC)sh 85069833 | Control number | sh 85069833 |
| | LCCN | sh 85069833 | LCCN | sh 85069833 |
| sh 87002438 | System control number | (DLC)sh 87002438 | Control number | sh 87002438 |
| | LCCN | sh 87002438 | LCCN | sh 87002438 |
| sh 85004386 | System control number | (DLC)sh 85004386 | Control number | sh 85004386 |
| | LCCN | sh 85004386 | LCCN | sh 85004386 |
| sh 85004387 | System control number | (DLC)sh 85004387 | Control number | sh 85004387 |
| | LCCN | sh 85004387 | LCCN | sh 85004387 |
| gf2014027077 | System control number | (DLC)gf2014027077 | Control number | gf2014027077 |
| | LCCN | gf2014027077 | LCCN | gf2014027077 |
| gf2014026111 | System control number | (DLC)gf2014026111 | Control number | gf2014026111 |
| | LCCN | gf2014026111 | LCCN | gf2014026111 |
| gf2014026094 | System control number | (DLC)gf2014026094 | Control number | gf2014026094 |
| | LCCN | gf2014026094 | LCCN | gf2014026094 |
| gf2014026542 | System control number | (DLC)gf2014026542 | Control number | gf2014026542 |
| | LCCN | gf2014026542 | LCCN | gf2014026542 |
| gf2014026049 | System control number | (DLC)gf2014026049 | Control number | gf2014026049 |
| | LCCN | gf2014026049 | LCCN | gf2014026049 |
| gf2014026329 | System control number | (DLC)gf2014026329 | Control number | gf2014026329 |
| | LCCN | gf2014026329 | LCCN | gf2014026329 |
| gf2014026339 | System control number | (DLC)gf2014026339 | Control number | gf2014026339 |
| | LCCN | gf2014026339 | LCCN | gf2014026339 |
| gf2014026725 | System control number | (DLC)gf2014026725 | Control number | gf2014026725 |
| | LCCN | gf2014026725 | LCCN | gf2014026725 |
| gf2014026854 | System control number | (DLC)gf2014026854 | Control number | gf2014026854 |
| | LCCN | gf2014026854 | LCCN | gf2014026854 |
| n 80008740 | System control number | (DLC)n 80008740 | Placeholder | (OCoLC)oca00391295 |
| | System control number | (OCoLC)oca00391295 | Control number | n 80008740 |
| | LCCN | n 80008740 | LCCN | n 80008740 |
| n 79069821 | System control number | (DLC)n 79069821 | Placeholder | (OCoLC)oca00302642 |
| | System control number | (OCoLC)oca00302642 | Placeholder | (Uk)000382747 |
| | System control number | (Uk)000382747 | Control number | n 79069821 |
| | LCCN | n 79069821 | LCCN | n 79069821 |
| n 78034868 | System control number | (DLC)n 78034868 | Placeholder | (OCoLC)oca00171044 |
| | System control number | (OCoLC)oca00171044 | Control number | n 78034868 |
| | LCCN | n 78034868 | LCCN | n 78034868 |
| n 80092173 | System control number | (DLC)n 80092173 | Placeholder | (OCoLC)oca00473269 |
| | System control number | (OCoLC)oca00473269 | Control number | n 80092173 |
| | LCCN | n 80092173 | LCCN | n 80092173 |
| sh2012004126 | System control number | (DLC)sh2012004126 | Control number | sh2012004126 |
| | LCCN | sh2012004126 | LCCN | sh2012004126 |

@jermnelson
Collaborator

In the authkey.sample3.mrc file, four records had a different number of identifiers from the Ebsco tools compared to data import. Examining the differences, all four had a similar cause: different handling of multiple subfields in the 010 field.

For example, the MARC record with 001 value of sh 85007461 has the following 010 field:

=010  \\$ash 85007461 $zsh 85007854

The Authority Record generated by the Ebsco Tools creates only the following identifier, which combines the values of subfields a and z:

[{'identifierTypeId': 'c858e4f2-2b6b-4385-842b-60732ee14abb',
  'value': 'sh 85007461 sh 85007854'}]

The Data Import generated Authority Record creates a separate identifier entry for each of subfield a and subfield z:

{'value': 'sh 85007461',
   'identifierTypeId': 'c858e4f2-2b6b-4385-842b-60732ee14abb'},
  {'value': 'sh 85007854',
   'identifierTypeId': 'c858e4f2-2b6b-4385-842b-60732ee14abb'}
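
If the Ebsco-generated JSON were post-processed to match, the 010 handling could look something like the sketch below. This is an illustration under assumptions, not Data Import's actual code: `split_010_identifiers` is a hypothetical helper, the subfields are passed as plain (code, value) tuples, and trimming whitespace and limiting the split to subfields a and z are my guesses based on the output shown above.

```python
LCCN_TYPE_ID = "c858e4f2-2b6b-4385-842b-60732ee14abb"  # identifierTypeId shown above


def split_010_identifiers(subfields, type_id=LCCN_TYPE_ID):
    """Return one identifier dict per 010 $a / $z value instead of one joined value."""
    return [
        {"value": value.strip(), "identifierTypeId": type_id}
        for code, value in subfields
        if code in ("a", "z") and value.strip()
    ]


# Subfields of the 010 field from record sh 85007461.
subfields_010 = [("a", "sh 85007461 "), ("z", "sh 85007854")]
identifiers = split_010_identifiers(subfields_010)
print(identifiers)
```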

@jermnelson
Collaborator

Fields in the Data Import record that are not in the Ebsco Tools records

  • Because the Data Import records are in FOLIO, they contain two fields that aren't present in the Ebsco Tools authority records, metadata and _version
  • All Data Import records have a subjectHeadings field generated from the 008 that isn't in the Ebsco Tools records (the AuthorityMapper class ignores the FOLIO Authority mapping for the 001 and 008).
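
A possible way to fill in the missing subjectHeadings value is sketched below. In the MARC 21 authority format, 008 position 11 is the subject heading system/thesaurus code, and `a` there means Library of Congress Subject Headings, which matches the "subjectHeadings": "a" in the Data Import record earlier in this issue. That Data Import reads exactly this position is my assumption, the helper name is hypothetical, and the 008 string is an illustrative value rather than one from the sample file.

```python
def subject_heading_code(field_008):
    """Return the 008/11 code, or None if the field is too short or blank there."""
    if len(field_008) <= 11:
        return None
    code = field_008[11]
    return code if code.strip() else None


# Illustrative 40-character authority 008 (not taken from the sample file).
example_008 = "780914n| azannaabn" + " " * 10 + "|b aaa" + " " * 6

authority = {"personalName": "Grimm, Wilhelm, 1786-1859"}
code = subject_heading_code(example_008)
if code is not None:
    authority["subjectHeadings"] = code
print(authority)
```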

@ahafele
Author

ahafele commented Dec 13, 2024

Thanks for all of this analysis Jeremy! I'm glad to see there weren't any additional gotchas (except for the 010 subfield concatenation). Are you feeling confident that we could

Modify the Authority JSON record based on Data Import processing rules and mapping

If so let's chat next week to at least get some of this ticketed before winter break.

@ahafele
Author

ahafele commented Jan 10, 2025

FSE community has recommended a different path - see slack responses.

Jeremy has test loaded a few authority records using https://github.com/FOLIO-FSE/folio_data_import to folio-test. I looked at one record and all looks good. Default auth DI profile was used. We would like to test load a larger file to get a sense of timings. @jermnelson you can find those here authkey.de/dd

I'll find some bibs to test with and write that up in #1432.

@ahafele
Author

ahafele commented Jan 14, 2025

@jermnelson reports that 50k authority records took 6 hours.
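
To put that timing in perspective, here is a back-of-the-envelope estimate. It assumes load time scales linearly with record count, which has not been tested; the function name is hypothetical, and the 2.5 million figure is just the Michigan load size mentioned earlier in this issue, used as an example.

```python
def estimate_hours(total_records, sample_records=50_000, sample_hours=6.0):
    """Linear extrapolation from the observed 50k records in 6 hours."""
    return total_records * sample_hours / sample_records


records_per_second = 50_000 / (6 * 3600)  # roughly 2.3 records per second
hours_for_2_5_million = estimate_hours(2_500_000)
print(f"{records_per_second:.2f} records/s, {hours_for_2_5_million:.0f} hours for 2.5 million")
```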

@ahafele
Author

ahafele commented Jan 14, 2025

@jermnelson wrong PR is linked here

@ahafele ahafele reopened this Jan 14, 2025
@jermnelson
Collaborator

Thanks for the correction!

@ahafele
Author

ahafele commented Jan 17, 2025

@jermnelson what are your thoughts on next steps here? I assume we should wait till Q is up on test/stage and then try to improve the loading times? Currently what I have recorded is 50k authority records took 6 hours.
