Extra statements and blank nodes sometimes added to Fedora Collection objects causing slowdowns #2885

Open
wickr opened this issue Jul 17, 2023 · 1 comment
Labels: Metadata (issues related to metadata configuration, application, and representation); Priority - Medium (issues that should be prioritized ahead of low but not immediately critical - bulk of work cycles)

Comments

wickr commented Jul 17, 2023

Descriptive summary

We're seeing some Fedora Collection objects accumulate thousands of extra statements and blank nodes, which causes extreme slowdowns any time the Collection object is loaded, whether in the console, in another job, or in the web view (#2884).

While this seems to be happening when the Bulkrax::CreateRelationshipsJob runs after an ingest or update, I'm not sure if it's specifically a Bulkrax issue or related to how we added support for controlled vocabs to Collection objects.

These are visible in the Fedora web UI:

[Screenshot, 2023-07-17: Fedora web UI showing the extra statements]

The URI(s) for Institution (LC URIs) and Creator (from opaquenamespace.org) get fetched, and the statements returned by those fetches get saved to the Collection object itself. As a result, the Collection ends up carrying labels/titles, comments, related authority statements, etc. that belong to the authority records.

Example statements on osu-scarc Collection object

<http://id.loc.gov/authorities/names/n80017721> <http://www.w3.org/2004/02/skos/core#changeNote> _:t754 .
<http://id.loc.gov/authorities/names/n80017721> <http://www.w3.org/2004/02/skos/core#changeNote> _:t729 .
<http://id.loc.gov/authorities/names/n80017721> <http://www.w3.org/2004/02/skos/core#changeNote> _:t854 .
<http://id.loc.gov/authorities/names/n80017721> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.loc.gov/mads/rdf/v1#Organization> .
<http://id.loc.gov/authorities/names/n80017721> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.loc.gov/mads/rdf/v1#Authority> .
<http://id.loc.gov/authorities/names/n80017721> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
<http://id.loc.gov/authorities/names/n80017721> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.loc.gov/mads/rdf/v1#CorporateName> .
<http://id.loc.gov/authorities/names/n80017721> <http://www.w3.org/2004/02/skos/core#inScheme> <http://id.loc.gov/authorities/names> .
<http://id.loc.gov/authorities/names/n80017721> <http://www.w3.org/2004/02/skos/core#altLabel> "Oregon. State University" .
<http://id.loc.gov/authorities/names/n80017721> <http://www.w3.org/2004/02/skos/core#altLabel> "OSU (Oregon State University)" .

How to Fix

These can be deleted in the Fedora web UI by going to the Collection object, finding the 'Update Properties' on the right-hand side, scrolling down past the prefixes, and replacing the block:

DELETE {}
INSERT {}
WHERE {}

with the following statements (if there are too many statements on the object, you may have to submit these in smaller groups):

DELETE WHERE { <> ns034:adminMetadata ?o };

DELETE WHERE { <> ns034:elementList ?o };
DELETE WHERE { <> ns034:hasSource ?o };
DELETE WHERE { <> ns034:hasVariant ?o };
DELETE WHERE { <> ns044:changeNote ?o };
DELETE WHERE { <> ns048:altLabel ?o };

DELETE WHERE { <> ns034:hasCloseExternalAuthority ?o };
DELETE WHERE { <> ns034:hasExactExternalAuthority ?o };
DELETE WHERE { <> ns034:hasRelatedAuthority ?o };
DELETE WHERE { <> ns034:identifiesRWO ?o };
DELETE WHERE { <> ns034:isMemberOfMADSCollection ?o };
DELETE WHERE { <> ns034:isMemberOfMADSScheme ?o };
DELETE WHERE { <> ns034:hasEarlierEstablishedForm ?o };
DELETE WHERE { <> ns034:editorialNote ?o };
DELETE WHERE { <> ns044:altLabel ?o };
DELETE WHERE { <> ns044:closeMatch ?o };
DELETE WHERE { <> ns044:exactMatch ?o };
DELETE WHERE { <> ns044:inScheme ?o };
DELETE WHERE { <> ns044:prefLabel ?o };
DELETE WHERE { <> ns044:editorial ?o };
DELETE WHERE { <> ns034:authoritativeLabel ?o };
DELETE WHERE { <> ns051:lccn ?o };
DELETE WHERE { <> ns051:local ?o };
DELETE WHERE { <> ns044:semanticRelation ?o };

DELETE WHERE { <> rdf:type <http://www.loc.gov/mads/rdf/v1#Authority> };
DELETE WHERE { <> rdf:type <http://www.loc.gov/mads/rdf/v1#CorporateName> };
DELETE WHERE { <> rdf:type <http://www.loc.gov/mads/rdf/v1#Organization> };

DELETE WHERE { <> rdfs:label ?o };
DELETE WHERE { <> rdfs:comment ?o };
DELETE WHERE { <> rdf:type <http://www.w3.org/2004/02/skos/core#PersonalName> };
DELETE WHERE { <> rdf:type <http://www.w3.org/2004/02/skos/core#Concept> };
DELETE WHERE { <> ns002:date ?o };
DELETE WHERE { <> ns002:issued ?o };
DELETE WHERE { <> ns002:modified ?o };

These delete the unwanted statements without deleting the correct statements in Fedora. The prefixes are already defined in Fedora and in the text box. It can take a long time to submit and process, but the web view should reload with fewer listed statements eventually.
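Since the full list may need to be submitted in smaller groups, the grouping can be sketched in Ruby. This is only an illustration: `delete_batches` and the batch size are hypothetical, and the predicate list is a short excerpt of the one above. The `nsNNN` prefixes are copied as they appear in this repository's text box and are assumed to resolve there.

```ruby
# Hypothetical helper: builds batched SPARQL Update bodies to paste into the
# Fedora 'Update Properties' box. Predicates are an excerpt of the list above.
PREDICATES = %w[
  ns034:adminMetadata
  ns034:elementList
  ns034:hasSource
  ns034:hasVariant
  ns044:changeNote
  ns048:altLabel
].freeze

# Returns an array of strings, each holding at most batch_size DELETE WHERE
# operations separated by ";", mirroring the hand-written block.
def delete_batches(predicates, batch_size: 3)
  predicates.each_slice(batch_size).map do |group|
    group.map { |pred| "DELETE WHERE { <> #{pred} ?o }" }.join(";\n")
  end
end

delete_batches(PREDICATES).each_with_index do |batch, i|
  puts "# batch #{i + 1}" # "#" starts a comment in SPARQL, so this is safe to paste
  puts batch
end
```

Each printed batch can be submitted on its own; a smaller `batch_size` may help if a single submission takes too long.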

Expected behavior

Collection objects in Fedora only have the metadata they're supposed to and load quickly.

Related work

Link to related tickets or prior related work here.

Accessibility Concerns

Add any information here to indicate any known or suspected accessibility issues for this ticket

@wickr wickr added the Metadata Issues related to metadata configuration, application, and representation label Jul 17, 2023
@wickr wickr self-assigned this Jul 17, 2023
@KevinJonesMeta KevinJonesMeta added the Priority - Medium Issues that should be prioritized ahead of low but not immediately critical - bulk of work cycles label Apr 30, 2024

wickr commented Jun 21, 2024

I saw today that this occurred again on several collections: osu-scarc, osu-historical-publications, siuslaw, and aerial-photos-mid-willamette. These are definitely ones we've been ingesting into with Bulkrax. There were a few hundred blank nodes on each, so not nearly as bad as before. I cleaned these out and checked for any others by running Collection.all and scanning the output for Load LDP lines (normally around "Load LDP (14.9ms)") that report more than about 100ms.
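The manual scan could be roughly automated along these lines. This is a sketch under assumptions: only the "Load LDP (14.9ms)" fragment comes from the comment above, so the rest of each log line is treated as free-form detail, and the sample URLs and the `slow_ldp_loads` name are hypothetical.

```ruby
# Matches the timing fragment quoted above, e.g. "Load LDP (14.9ms)".
# Whatever follows the closing parenthesis is kept as free-form detail.
LOAD_LDP = /Load LDP \((\d+(?:\.\d+)?)ms\)\s*(.*)/

# Returns [milliseconds, detail] pairs for loads slower than threshold_ms,
# slowest first.
def slow_ldp_loads(log_text, threshold_ms: 100.0)
  log_text.each_line.filter_map do |line|
    match = LOAD_LDP.match(line)
    next unless match

    ms = match[1].to_f
    [ms, match[2].strip] if ms > threshold_ms
  end.sort_by { |ms, _| -ms }
end

# Hypothetical sample lines; real output may include color codes or extra fields.
sample = <<~LOG
  Load LDP (14.9ms) http://fedora.example/rest/prod/aa
  Load LDP (2310.4ms) http://fedora.example/rest/prod/bb
  Load LDP (187.0ms) http://fedora.example/rest/prod/cc
LOG

slow_ldp_loads(sample).each { |ms, detail| puts format("%8.1fms  %s", ms, detail) }
```

Pointing this at captured console or log output would list the candidate Collection objects to inspect in the Fedora web UI.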
