Add filtered index and promotion steps to data refresh (#4833)
* Use hard-coded index config for the new index, instead of basing it off the existing index

  The index configuration should be hard-coded here to match the behavior of the ingestion server. This is also important because we require permanent changes to the index configuration to go through the code review process. Other DAGs can be used to rapidly iterate on index configuration, and changes can be persisted here only if desired.

* Get sensitive terms list

* Add steps to create and populate the filtered index

* Pull out run_sql utility

* First attempt to generate table indices using dynamic mapping

* Don't dynamically map separate tasks for creating each index

  This isn't possible because dynamically mapping tasks within a dynamically mapped taskgroup is not supported in our version of Airflow (see the first sketch after this list).

* Add apply_constraints steps mirroring the ingestion server implementation

* Correctly drop orphans referencing records that do not exist in the temp table

* Do not remap unique constraints

  The problem is that in our local development environments, we initially set up the tables in the API database slightly differently from the way they are set up in production. Specifically, for the three unique field constraints (url, identifier, foreign_identifier/provider) we apply a UNIQUE CONSTRAINT using a statement like this:

  ```
  ALTER TABLE ONLY public.audio ADD CONSTRAINT audio_identifier_key UNIQUE (identifier);
  ```

  Under the covers, Postgres creates a unique index with this same name. In production, we do not have an explicit CONSTRAINT, just the unique index. This is a subtle distinction and basically an implementation detail that achieves the same thing. You can see the difference by running `\d audio` in each case. These are the indices on the audio table in local development (**before running any data refresh**, even the one that is run in load_sample_data) -- note the 'CONSTRAINT':

  ```
  Indexes:
      "audio_pkey" PRIMARY KEY, btree (id)
      "audio_identifier_key" UNIQUE CONSTRAINT, btree (identifier)
      "audio_url_key" UNIQUE CONSTRAINT, btree (url)
      "unique_provider_audio" UNIQUE CONSTRAINT, btree (foreign_identifier, provider)
      ...
  ```

  Compared to the indices on the production audio table:

  ```
  Indexes:
      "audio_pkey" PRIMARY KEY, btree (id)
      "audio_identifier_key" UNIQUE, btree (identifier)
      "audio_url_key" UNIQUE, btree (url)
      "unique_provider_audio" UNIQUE, btree (foreign_identifier, provider)
      ...
  ```

  In fact, the only explicit CONSTRAINTs the production media tables have are primary key constraints and some foreign key constraints.

  Now: when a data refresh runs, after it creates the temp table and copies in all records from the catalog, it first copies all the indices from the media table onto the temp table, and then copies over all the constraints. Remember that even if you have an explicit UNIQUE CONSTRAINT, Postgres automatically gives you an index as well. So in that first step, regardless of whether the table being operated on has CONSTRAINTs set explicitly, the unique *indices* are all copied over. In the second step, in theory, the constraints should be copied over: the ingestion server generates a bunch of ALTER TABLE statements to drop the constraints from the media table and then apply them to the temp table instead. But they are all generated incorrectly and essentially do nothing.
  What we get looks like:

  ```
  ALTER TABLE audio DROP CONSTRAINT audio_url_key
  ALTER TABLE audio ADD UNIQUE (url)
  ```

  We'd expect to see something like:

  ```
  ALTER TABLE temp_import_audio ADD CONSTRAINT audio_url_key UNIQUE (url)
  ```

  But if we fix this statement to be correct, it actually causes an error with the unique constraints -- because we have already copied over the indices in a previous step, and the index has the same name as the constraint! It is not possible to add the CONSTRAINT on top of the existing index without changing the name -- and at any rate, it's not necessary, because the index enforces the uniqueness on its own, and no such explicit constraint exists in production anyway.

  So, TLDR: what the ingestion server is doing here ends up being a NO-OP, but that just happens to be okay because those constraints are also enforced via the indices, and because the primary key and foreign key constraints that we actually need are handled differently (and correctly). Since these steps are literally not doing anything, we could just delete them; but that would cause an issue in the future if we ever add any new non-primary-key/non-foreign-key constraints to the API media tables (they would be wiped away during a data refresh). But if we *fix* the generated ALTER TABLE statements, we introduce a new and different problem. The solution proposed here is to exclude UNIQUE constraints from the remapping step, in the same way that primary keys are excluded (see the second sketch after this list).

  **This was also extremely confusing to debug closely, because all those NO-OP ALTERs get run __only the first time a data refresh is run__. Thereafter, those CONSTRAINTs don't exist and the output is no longer confusing. And since a data refresh is run as part of the init scripts, the evidence of the constraints is gone by the time you look, which makes this hard to debug!**

  Fixing the local dev env to match production is also recommended.

* Add promote table steps

* Promote indices

* Add map index template to dynamic promotion tasks

* Fix bug with index offsets when running consecutive data refreshes

* Add docstrings, split index/constraints out into separate files

* Do not promote if reindexing fails!

* Add tests

* Make filtered index timeouts configurable

* Ensure all copy data steps run one at a time

* Fix dependencies between index and constraint remapping

  There was an error in the task dependencies, caused by the fact that we were returning the output of transform_index_defs from the remap_indices taskgroup. Normally, the last task in a TaskGroup is the task that is set upstream of any downstream dependencies (in this case, the remap_constraints group). Apparently, if you add a return statement to a TaskGroup (which we were doing in order to make the index_mapping the XCom result for the entire taskgroup), this borks the dependency chain. As a result, transform_index_defs was being set as the upstream task of get_all_existing_constraints. This meant that if there was an error in create_table_indices, it would NOT stop the remap_constraints tasks from running. By updating the create_table_indices task to return what we want via XCom as well, we avoid this issue (see the third sketch after this list).

* Remove unused function, fix typo

* Correctly report the record count of the promoted index, not the old one

* Use EmptyOperator where possible

* Clean up variables

* Fix incorrect comment

* Clean up unused code, variable name

* Further break out tests

* Protect against table names containing special characters

* Comment on static route
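For context on the dynamic-mapping limitation noted above, here is a minimal sketch of the unsupported pattern, assuming Airflow's TaskFlow API; all task names and bodies are illustrative, not the actual DAG code:

```
from airflow.decorators import task, task_group


@task
def get_index_defs(table: str) -> list[str]:
    # Illustrative stand-in for reading a table's index definitions.
    return [f"CREATE INDEX ON {table} (provider)"]


@task
def create_index(index_def: str):
    ...  # run the CREATE INDEX statement


@task_group
def refresh_table(table: str):
    # A dynamically mapped task *inside* the group...
    create_index.expand(index_def=get_index_defs(table))


# ...cannot also sit inside a dynamically mapped taskgroup:
#
#   refresh_table.expand(table=["image", "audio"])
#
# That nesting is what our version of Airflow rejects, which is why the
# indices are created without a separate mapped task per index.
```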
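The unique-constraint exclusion could look roughly like the sketch below. The helper names and the shape of the constraint rows are hypothetical; only the `pg_constraint.contype` codes (`p` for primary key, `u` for unique) are real Postgres:

```
def is_exempt_from_remapping(contype: str) -> bool:
    # Primary key ("p") and unique ("u") constraints are both backed by
    # indices that were already copied onto the temp table, so remapping
    # them is redundant at best and, for uniques, actively breaks.
    return contype in ("p", "u")


def generate_remap_statements(
    constraints: list[tuple[str, str, str]],  # (name, contype, definition)
    media_table: str,
    temp_table: str,
) -> list[str]:
    statements = []
    for name, contype, definition in constraints:
        if is_exempt_from_remapping(contype):
            continue
        statements.append(f'ALTER TABLE "{media_table}" DROP CONSTRAINT "{name}"')
        statements.append(
            f'ALTER TABLE "{temp_table}" ADD CONSTRAINT "{name}" {definition}'
        )
    return statements
```

Quoting the identifiers here also gestures at the later "Protect against table names containing special characters" fix, though the real implementation may handle escaping differently.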
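Finally, a minimal sketch of the taskgroup dependency fix, again with illustrative bodies; the task and group names are taken from the description above:

```
from airflow.decorators import task, task_group


@task
def transform_index_defs(index_defs: list[str]) -> dict:
    # Illustrative mapping of existing index names to new index names.
    return {"audio_url_idx": "temp_import_audio_url_idx"}


@task
def create_table_indices(index_mapping: dict) -> dict:
    # ... run the CREATE INDEX statements ...
    # Returning the mapping from this *last* task makes it both the source
    # of the taskgroup's XCom result and the upstream of downstream groups.
    return index_mapping


@task_group(group_id="remap_indices")
def remap_indices(index_defs: list[str]) -> dict:
    index_mapping = transform_index_defs(index_defs)
    # Before the fix, this function ended with `return index_mapping`,
    # which made transform_index_defs the upstream of remap_constraints --
    # so a failure in create_table_indices did not block it.
    return create_table_indices(index_mapping)
```

With this shape, remap_constraints is only scheduled once create_table_indices has succeeded, and the taskgroup's XCom result is still the index mapping.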