Change tag upsert strategy to drop old provider tags #4732

sarayourfriend · 2024-08-08T12:51:28Z

Problem

As part of the discussion in #4452, we've decided that our historical strategy of merging old and new provider tags when reingesting a work is problematic. To quote @stacimc in that discussion:

Although I agree we should retain data even if we are initially skeptical of its usefulness (deleted/modified tags were likely low quality tags, as you point out), the issue to me remains privacy/ethical concerns with deliberately retaining this historical data that is intentionally changed at source. I can't think of a use case for this data that would avoid those issues.

To expand on what the ethical and privacy concerns are: Openverse has an ethical responsibility to represent works as they are represented by the upstream provider, and to make clear when we are intentionally augmenting the description of a work. To that end, if the set of tags changes at a provider, removed or modified tags must be reflected in Openverse's dataset. For example, if a museum decides on a new way of describing a work, and previous tags were culturally insensitive, it is important for Openverse not to reproduce that insensitivity, especially as we are representing those tags as specifically from the provider. Similarly, from a privacy perspective, if tags at the upstream source originally include privacy-invading information (e.g., the name of a pictured individual, sensitive location information, etc.), and the upstream source removes them, it's critical that Openverse also no longer retains those tags after reingestion.

Keep in mind that Openverse does not currently reingest most of its data (namely for dated provider DAGs like Flickr and Wikimedia), so for the vast majority of works, we will still have the problem of potentially retaining stale/incorrect tag information from the provider. Future work may be planned to selectively reingest works on a periodic basis (e.g., works returned in search queries, works for which the metadata may actually be seen by individuals).

Furthermore, as Staci pointed out in the priorities meeting yesterday, every other piece of metadata follows a replace, rather than merge, approach. Provider tags have been unique in this way; there's no reason to maintain this, and ample reason to change it.

Description

Change the jsonb_array column strategy to drop all existing provider tags. Non-provider tags (e.g., machine generated tags) must be retained. Existing provider tags should be dropped entirely in favour of incoming provider tags.
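To make the intended change concrete, here is a minimal Python model of the two strategies. It is only a sketch of the semantics described above, not the catalog's SQL: the function names, the tag shape (`name` and `provider` keys), and the `machine_tagger` provider value are all assumptions made for illustration.

```python
def merge_all(old, new):
    """Historical behaviour (as described above): union of old and new tags."""
    if old is None or new is None:
        return new if new is not None else old
    merged = []
    for tag in old + new:
        if tag not in merged:  # deduplicate while keeping first-seen order
            merged.append(tag)
    return merged


def replace_provider_tags(old, new, provider):
    """Proposed behaviour: drop old provider tags, keep everything else."""
    if new is None:
        return old
    if old is None:
        return new
    # Retain only tags that did NOT come from the provider itself
    # (e.g., machine-generated tags), then append the incoming tags.
    retained = [tag for tag in old if tag.get("provider") != provider]
    return retained + new


# Hypothetical record: the provider has since removed the "jane doe" tag.
old_tags = [
    {"name": "sunset", "provider": "flickr"},
    {"name": "jane doe", "provider": "flickr"},
    {"name": "sky", "provider": "machine_tagger"},
]
new_tags = [{"name": "sunset", "provider": "flickr"}]

merged = merge_all(old_tags, new_tags)
replaced = replace_provider_tags(old_tags, new_tags, "flickr")
```

Under the merge strategy the removed `jane doe` tag survives reingestion; under the replace strategy only the machine-generated tag and the incoming provider tag remain.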

openverse/catalog/dags/common/storage/columns.py

Lines 70 to 78 in de0079d

    
def _merge_jsonb_arrays(column: str) -> str:
    return f"""{column} = COALESCE(
           (
             SELECT jsonb_agg(DISTINCT x)
             FROM jsonb_array_elements(old.{column} || EXCLUDED.{column}) t(x)
           ),
           EXCLUDED.{column},
           old.{column}
         )"""

The current strategy is written in a generic form, with column parameterised, but I think we need to replace it with a tags-specific merge strategy, because the provider field on the tags is specific to tags. Other jsonb array columns should use their own approach.

Something like the following might be a good starting point:

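from textwrap import dedent


def _merge_tags(*args) -> str:
    # Keep old tags whose provider differs from the record's own provider
    # (e.g., machine-generated tags), then append the incoming tags.
    # $provider is a jsonpath variable, so it must not be quoted.
    return dedent(
        """
        tags = COALESCE(
            (
                SELECT jsonb_path_query_array(
                    old.tags,
                    '$[*] ? (@.provider != $provider)',
                    jsonb_build_object('provider', old.provider)
                ) || EXCLUDED.tags
            ),
            EXCLUDED.tags,
            old.tags
        )
        """
    ).strip()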

Though I'm still confused about whether EXCLUDED is definitely the new tags or something else (the name is confounding).
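On the EXCLUDED question: in PostgreSQL's `INSERT ... ON CONFLICT DO UPDATE`, `EXCLUDED` refers to the row that was proposed for insertion, so `EXCLUDED.tags` would be the incoming tags. The sketch below illustrates the same semantics using SQLite (3.24 or later), which offers an equivalent `excluded` qualifier; it is a stand-in for illustration only, and the `image` table and its columns are hypothetical rather than the catalog's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image (id TEXT PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO image (id, title) VALUES ('a', 'old title')")

# The second insert conflicts on id. In the DO UPDATE clause, a bare column
# name refers to the existing row, while excluded.<column> refers to the
# value that the conflicting INSERT tried to write.
conn.execute(
    """
    INSERT INTO image (id, title) VALUES ('a', 'new title')
    ON CONFLICT (id) DO UPDATE
    SET title = title || ' -> ' || excluded.title
    """
)

row = conn.execute("SELECT title FROM image WHERE id = 'a'").fetchone()
conn.close()
```

After the upsert, the stored title combines the existing value and the incoming one, which shows which side of the conflict each name refers to.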

Additional context

Blocks #4452.

stacimc · 2024-08-09T18:30:12Z

One minor correction: while it's true that Flickr reingestion is paused due to rate limiting concerns, we do actually have reingestion enabled for Wikimedia! Granted, Flickr is the significantly larger source, so your point very much stands. I would love to prioritize fixing that.

sarayourfriend · 2024-08-10T10:14:45Z

we do actually have reingestion enabled for Wikimedia

Ah, thank you! I'll update the description. Is that ongoing reingestion? Based on the discussion in #3659 I assumed we did not have ongoing reingestion for Wikimedia.

Something to follow up on after this issue might be to decide a timeline for reingestion of records that haven't been looked at in a long time. If we take seriously the ethical necessity of representing works as they are described upstream, it would be disingenuous to not figure out ways of periodically making sure works are represented in Openverse as the upstream describes them.

sarayourfriend self-assigned this Aug 8, 2024

sarayourfriend mentioned this issue Aug 13, 2024

Always use only latest provider tags when reingesting #4752

Merged

sarayourfriend closed this as completed in #4752 Aug 14, 2024

krysal mentioned this issue Aug 16, 2024

Data normalization #430

Closed
