Change tag upsert strategy to drop old provider tags #4732

Closed
sarayourfriend opened this issue Aug 8, 2024 · 2 comments · Fixed by #4752
Assignees
Labels
🗄️ aspect: data Concerns the data in our catalog and/or databases ✨ goal: improvement Improvement to an existing user-facing feature 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@sarayourfriend
Collaborator

Problem

As part of the discussion in #4452, we've decided that our historical strategy of merging old and new provider tags when reingesting a work is problematic. To quote @stacimc in that discussion:

Although I agree we should retain data even if we are initially skeptical of its usefulness (deleted/modified tags were likely low quality tags, as you point out), the issue to me remains privacy/ethical concerns with deliberately retaining this historical data that is intentionally changed at source. I can't think of a use case for this data that would avoid those issues.

To expand on what the ethical and privacy concerns are: Openverse has an ethical responsibility to represent works as they are represented by the upstream provider, and to make clear when we are intentionally augmenting the description of a work. To that end, if the set of tags changes in a provider, removed or modified tags must be reflected in Openverse's dataset. For example, if a museum decides on a new way of describing a work, and previous tags were culturally insensitive, it is important for Openverse not to reproduce that insensitivity, especially as we are representing those tags as specifically from the provider. Similarly, from a privacy perspective, if tags at the upstream source originally include privacy-invading information (e.g., the name of a pictured individual, sensitive location information, etc.), and the upstream source removes them, it's critical that Openverse also no longer retains those tags after reingestion.

Keep in mind that Openverse does not currently reingest most of its data (namely for dated provider DAGs like Flickr and Wikimedia), so for the vast majority of works, we will still have the problem of potentially retaining stale/incorrect tag information from the provider. Future work may be planned to selectively reingest works on a periodic basis (e.g., works returned in search queries, works for which the metadata may actually be seen by individuals).

Furthermore, as Staci pointed out in the priorities meeting yesterday, every other piece of metadata follows a replace, rather than merge approach. Provider tags have been unique in this way, and there's no reason to maintain this, and ample reason to change it.

Description

Change the jsonb_array column strategy to drop all existing provider tags. Non-provider tags (e.g., machine generated tags) must be retained. Existing provider tags should be dropped entirely in favour of incoming provider tags.

def _merge_jsonb_arrays(column: str) -> str:
    return f"""{column} = COALESCE(
        (
            SELECT jsonb_agg(DISTINCT x)
            FROM jsonb_array_elements(old.{column} || EXCLUDED.{column}) t(x)
        ),
        EXCLUDED.{column},
        old.{column}
    )"""

The current strategy is written in a generic form, with column parameterised, but I think we need to replace it with a tags-specific merge strategy, because the provider field on the tags is specific to tags. Other jsonb array columns should use their own approach.
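To make the difference between the two behaviours concrete, here is a minimal plain-Python sketch of the semantics only, not of the SQL. The function names and the sample tags (including the `machine_tagger` provider) are hypothetical, and the choice to retain old tags that have no `provider` key at all is an assumption rather than something decided in this issue.

```python
import json


def merge_distinct(old_tags, new_tags):
    """Mimic the current strategy: union of old and incoming tags, deduplicated."""
    seen = set()
    merged = []
    for tag in (old_tags or []) + (new_tags or []):
        # Serialise with sorted keys so equal objects compare equal.
        key = json.dumps(tag, sort_keys=True)
        if key not in seen:
            seen.add(key)
            merged.append(tag)
    return merged


def replace_provider_tags(old_tags, new_tags, provider):
    """Proposed behaviour: drop old provider tags, keep tags from other sources."""
    # Assumption: a tag without a "provider" key is treated as non-provider.
    retained = [t for t in (old_tags or []) if t.get("provider") != provider]
    return retained + list(new_tags or [])


# Hypothetical sample data for a Flickr record.
old_tags = [
    {"name": "cat", "provider": "flickr"},
    {"name": "feline", "provider": "machine_tagger"},
]
incoming = [{"name": "kitten", "provider": "flickr"}]

print(merge_distinct(old_tags, incoming))
print(replace_provider_tags(old_tags, incoming, "flickr"))
```

With the current strategy the stale `cat` tag survives reingestion; with the proposed one only the machine-generated tag and the incoming `kitten` tag remain. Ordering is incidental here, since `jsonb_agg(DISTINCT x)` does not promise insertion order.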

Something like the following might be a good starting point:

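from textwrap import dedent


def _merge_tags(*args) -> str:
    # Keep old tags whose provider differs from the record's provider
    # (e.g., machine-generated tags), then append the incoming tags.
    # $provider is a jsonpath variable, so it must not be quoted.
    return dedent(
        """
        tags = COALESCE(
            (
                SELECT jsonb_path_query_array(
                    old.tags,
                    '$[*] ? (@.provider != $provider)',
                    jsonb_build_object('provider', old.provider)
                ) || EXCLUDED.tags
            ),
            EXCLUDED.tags,
            old.tags
        )
        """
    ).strip()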

Though I'm still confused about whether EXCLUDED is definitely the new tags or something else (the name is confounding).
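For what it's worth, the PostgreSQL documentation for `INSERT ... ON CONFLICT DO UPDATE` describes `excluded` as the row proposed for insertion, with the existing row reachable through the table name or its alias. The sketch below illustrates that with Python's built-in `sqlite3` rather than PostgreSQL, on the assumption that SQLite's upsert clause (modelled on PostgreSQL's) treats `excluded` the same way. The `image` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image (id TEXT PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO image (id, title) VALUES ('a1', 'old title')")

# This insert conflicts on id. Inside DO UPDATE, image.title is the stored
# value and excluded.title is the value that was just proposed for insertion.
conn.execute(
    """
    INSERT INTO image (id, title) VALUES ('a1', 'new title')
    ON CONFLICT (id) DO UPDATE
    SET title = image.title || ' -> ' || excluded.title
    """
)

title = conn.execute("SELECT title FROM image WHERE id = 'a1'").fetchone()[0]
print(title)  # old title -> new title
```

If that reading carries over, `EXCLUDED.tags` in the snippets above is the incoming set of tags and `old.tags` is what is already stored.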

Additional context

Blocks #4452.

@sarayourfriend sarayourfriend added 🟧 priority: high Stalls work on the project or its dependents ✨ goal: improvement Improvement to an existing user-facing feature 🧱 stack: catalog Related to the catalog and Airflow DAGs 🗄️ aspect: data Concerns the data in our catalog and/or databases labels Aug 8, 2024
@sarayourfriend sarayourfriend self-assigned this Aug 8, 2024
@stacimc
Contributor

stacimc commented Aug 9, 2024

One minor correction: while it's true that Flickr reingestion is paused due to rate limiting concerns, we do actually have reingestion enabled for Wikimedia! Granted, Flickr is the significantly larger source, so your point very much stands. I would love to prioritize fixing that.

@sarayourfriend
Collaborator Author

we do actually have reingestion enabled for Wikimedia

Ah, thank you! I'll update the description. Is that ongoing reingestion? Based on the discussion in #3659 I assumed we did not have ongoing reingestion for Wikimedia.

Something to follow up on after this issue might be to decide a timeline for reingestion of records that haven't been looked at in a long time. If we take seriously the ethical necessity of representing works as they are described upstream, it would be disingenuous to not figure out ways of periodically making sure works are represented in Openverse as the upstream describes them.
