Use tags to populate subject areas #1233

Merged: 9 commits, Jan 15, 2025

@MatMoore MatMoore commented Jan 14, 2025

Now that the ingestion code is setting tags in DataHub for the subject areas, we can start to use these for search/browse instead of the DataHub domain.

This work will allow us to assign entities to more than one subject area, so we don't need to worry about something not precisely fitting into a single category when redefining the taxonomy.

Known issues

I've used the test environment for testing this since we've had some OpenSearch issues on the dev environment. When I compared before and after I noticed a few discrepancies, but the number of entities in each subject area is almost the same.

The discrepancies seem to be due to two things, neither of which seems like a blocker:

  1. Entries in the catalogue that are no longer being updated, but haven't been deleted automatically as part of a stateful ingestion, no longer show up in a subject area (example). These entries should be deleted altogether but for whatever reason they haven't been. This affects CaDeT databases, and some ESDAs (to be expected as these do not belong to an ingestion).
  2. When ingesting CaDeT databases, there are cases where the database contains models that are in different domains. We used to pick a single domain to use for the database, but now we assign tags for all of them (example). Such databases now show up in multiple subject areas. I hadn't intended to change this right away but I think it is the behaviour we ultimately want.
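The change described in item 2 can be sketched roughly as follows. This is an illustration only, not the actual ingestion code: the function names and the domain values are hypothetical, and the old rule for choosing a single domain is not stated in this PR, so "most common" is an assumption made purely for the example.

```python
from collections import Counter
from typing import List, Optional


def single_domain(model_domains: List[str]) -> Optional[str]:
    """Old behaviour (selection rule assumed): collapse to one domain.

    Here the most common domain wins; the real rule is not shown in the PR.
    """
    if not model_domains:
        return None
    return Counter(model_domains).most_common(1)[0][0]


def all_subject_area_tags(model_domains: List[str]) -> List[str]:
    """New behaviour: one tag per distinct domain, in first-seen order."""
    return list(dict.fromkeys(model_domains))


# Hypothetical database whose models sit in two different domains.
models = ["prison", "prison", "probation"]
print(single_domain(models))          # prison
print(all_subject_area_tags(models))  # ['prison', 'probation']
```

With the second approach the database is listed under both subject areas, which is why the per-area counts differ slightly from the domain-based results.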

Next steps

  • Currently, where multiple subject area tags are assigned to an entity, we just pick the first one. I'm planning to change this to show all of them. At the same time I can look into the CSS bug that's making it look wonky when the subject area tag flows onto a new line.
  • There is likely some further refactoring that can be done in the home app in Find MoJ data as well, e.g. some remaining references to "domains" that can be cleaned up.

The subject area labels were previously populated from the domain in
DataHub. They now come from a tag.

Where there are multiple tags, pick the first one for now. In a future
commit, I'll enable multiple subject areas to be displayed.
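
A minimal sketch of the "pick the first one" rule, assuming tags arrive as a list of names and that subject areas can be recognised against a known set. `SUBJECT_AREAS`, its values, and both function names are hypothetical; the actual logic in `datahub_client/parsers.py` is not shown in this conversation.

```python
from typing import List, Optional

# Hypothetical subject area names; the real taxonomy is not part of this PR text.
SUBJECT_AREAS = {"Prisons", "Courts", "Probation"}


def subject_areas_from_tags(tag_names: List[str]) -> List[str]:
    """Keep only the tags that name a subject area, preserving order."""
    return [name for name in tag_names if name in SUBJECT_AREAS]


def first_subject_area(tag_names: List[str]) -> Optional[str]:
    """Interim behaviour: show the first subject area tag, or None if there is none."""
    matches = subject_areas_from_tags(tag_names)
    return matches[0] if matches else None


tags = ["dc_display_in_catalogue", "Courts", "Prisons"]  # example values only
print(first_subject_area(tags))       # Courts
print(subject_areas_from_tags(tags))  # ['Courts', 'Prisons']
```

Keeping the list-returning helper separate means the planned follow-up (displaying every subject area) would only need to switch the caller from the first function to the second.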
@MatMoore MatMoore force-pushed the fetch-subject-areas-via-tags branch from 459458b to 0866be3 Compare January 14, 2025 16:10
@MatMoore MatMoore marked this pull request as ready for review January 14, 2025 16:16
@MatMoore MatMoore requested a review from a team as a code owner January 14, 2025 16:16

@hjribeiro-moj hjribeiro-moj left a comment

LGTM

datahub_client/parsers.py (review thread resolved)
@MatMoore MatMoore merged commit 9dc0038 into main Jan 15, 2025
8 checks passed
@MatMoore MatMoore deleted the fetch-subject-areas-via-tags branch January 15, 2025 13:48
sentry-io bot commented Jan 17, 2025

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ CatalogueError: Unable to execute list domains query
