[Data Liberation] WP_Stream_Importer: User-driven incremental import #2013

Merged
merged 36 commits into from
Nov 29, 2024

Collaborator

@adamziel adamziel commented Nov 21, 2024

Adds wp-admin support for incrementally importing data from WXR files:

[Screenshot: CleanShot 2024-11-27 at 19 07 23@2x]

This is a part of #1894

Implementation details

There can be only one active import session at any given time. A session is started by uploading a WXR file or specifying a URL, and the approach can be extended to any number of data sources. Once a session is created, the admin page shows the current import progress. This PR adds a WP_Import_Session model class to store the progress information and the current import cursor.

Given an active importing session, the admin page will show the current stage and the number of imported entities accompanied by a "Continue Importing" button. When pressed, it calls WP_Stream_Importer::next_step() one or more times to perform a small unit of work. After each call, we collect the progress information from WP_Stream_Importer – be it the number of downloaded asset bytes, the number of inserted database records, the current importing cursor, etc.

next_step() returns true when some progress was made, even if that was a failed image download attempt. It returns false when it reaches the end of the current importing stage, at which point the advance_to_next_stage() method must be called.
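The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: Toy_Importer, continue_importing(), get_stage(), and the stage names are hypothetical, and the boolean return value of advance_to_next_stage() is an assumption. Only the next_step() and advance_to_next_stage() method names come from the description.

```php
<?php
// Hypothetical stand-in for WP_Stream_Importer. Only the method names
// next_step() and advance_to_next_stage() come from the PR description.
final class Toy_Importer {
    /** @var array<string, int> Stage name => units of work left. */
    private array $remaining;
    /** @var string[] Stage names in execution order. */
    private array $stages;
    private int $stage_index = 0;

    public function __construct( array $work_per_stage ) {
        $this->remaining = $work_per_stage;
        $this->stages    = array_keys( $work_per_stage );
    }

    // Name of the current stage, or null once every stage is finished.
    public function get_stage(): ?string {
        return $this->stages[ $this->stage_index ] ?? null;
    }

    // True when a unit of work was performed, false at the end of the stage.
    public function next_step(): bool {
        $stage = $this->get_stage();
        if ( null === $stage || 0 === $this->remaining[ $stage ] ) {
            return false;
        }
        --$this->remaining[ $stage ];
        return true;
    }

    // Assumption: returns false when there is no further stage to run.
    public function advance_to_next_stage(): bool {
        if ( $this->stage_index < count( $this->stages ) ) {
            ++$this->stage_index;
        }
        return null !== $this->get_stage();
    }
}

// One "Continue Importing" click: performs at most $max_steps units of work
// and returns how many were actually performed.
function continue_importing( Toy_Importer $importer, int $max_steps ): int {
    $done = 0;
    while ( $done < $max_steps ) {
        if ( $importer->next_step() ) {
            ++$done;
            continue;
        }
        // End of the current stage: move on, or stop if the import is over.
        if ( ! $importer->advance_to_next_stage() ) {
            break;
        }
    }
    return $done;
}
```

In this sketch a caller would invoke continue_importing() once per button press and read the progress from the importer afterwards.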

After each next_step() or advance_to_next_stage() call, WP_Stream_Importer::get_reentrancy_cursor() returns a string that can be used to create a new importer that resumes from the exact same place. The cursor means "we got this far", not "we got this far and no further". The record the cursor points to may have already been processed. In the upcoming PRs we'll need to either point to the next entity, or invent idempotent import semantics where processing the same record twice leads to the same outcome as processing it once.
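To illustrate why idempotency matters with this kind of cursor, here is a small sketch under stated assumptions. The JSON cursor format, import_batch(), and the record shape are all invented for the example; the real format returned by get_reentrancy_cursor() is not described in this PR text. The cursor names the last record reached, so a resumed run processes that record again, and an upsert keyed by the record ID keeps the outcome unchanged.

```php
<?php
// Hypothetical sketch: the cursor format and all names below are invented.
// Processes up to $limit records starting AT the cursor position and returns
// a new cursor pointing at the last record reached.
function import_batch( array $records, string $cursor, int $limit, array &$store ): string {
    $state = json_decode( $cursor, true );
    $index = $state['index'];
    $end   = min( count( $records ), $index + $limit );

    for ( $i = $index; $i < $end; $i++ ) {
        $record = $records[ $i ];
        // Idempotent write: keyed by the record's own ID, so processing the
        // same record twice leaves the store in the same state.
        $store[ $record['id'] ] = $record['title'];
        $index                  = $i;
    }

    return json_encode( array( 'index' => $index ) );
}
```

An append-style write (for example `$store[] = $record`) would duplicate the record at the cursor on every resume, which is the problem the paragraph above anticipates.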

Resource Budgets

This PR starts exploring resource budgets by introducing a soft time limit and a minimum number of files downloaded during a single frontloading session. We don't support partial download and resuming yet, so we can't settle for downloading less than one file. On the next attempt we'd just discard the result and likely download less than one file again, meaning we would never get past the frontloading step.
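A soft time limit combined with a minimum amount of work could look roughly like the sketch below. It is an illustration only: frontload_files(), its parameters, and the injectable clock are hypothetical and do not reflect the actual implementation in this PR. The limit is checked between downloads and never interrupts one in progress, and the minimum guarantees each session makes progress even when the budget is already spent.

```php
<?php
// Hypothetical sketch of a soft time limit with a guaranteed minimum of work.
// $download is any callable that fetches one URL; $clock returns the current
// time in seconds and exists only to make the function testable.
function frontload_files(
    array $urls,
    callable $download,
    float $soft_limit_seconds,
    int $min_files = 1,
    ?callable $clock = null
): array {
    $clock      = $clock ?? static fn(): float => microtime( true );
    $started    = $clock();
    $downloaded = array();

    foreach ( $urls as $url ) {
        $over_budget = ( $clock() - $started ) >= $soft_limit_seconds;
        // Stop only once the minimum has been met. Without partial downloads,
        // finishing fewer than $min_files files would make no lasting progress.
        if ( $over_budget && count( $downloaded ) >= $min_files ) {
            break;
        }
        $download( $url );
        $downloaded[] = $url;
    }

    return $downloaded;
}
```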

Testing instructions

  1. cd packages/playground/data-liberation/tests/import
  2. bash run.sh
  3. Go to wp-admin
  4. Go to the Data Liberation page
  5. Upload the a11y xml file from the WXR test set shipped in packages/playground/data-liberation/tests/wxr/a11y-unit-test-data.xml
  6. Click through all the import steps
  7. Confirm the assets are downloaded as expected and that, eventually, every click of the "continue" button imports one more entity

Base automatically changed from reentrant-WP_Stream_Importer to trunk November 22, 2024 11:27
@adamziel adamziel changed the title [Data Liberation] Expose progress information from WP_Stream_Importer [Data Liberation] WP_Stream_Importer: Partial import, pausing, resuming, communicating progress Nov 27, 2024
@adamziel adamziel changed the title [Data Liberation] WP_Stream_Importer: Partial import, pausing, resuming, communicating progress [Data Liberation] WP_Stream_Importer: Incremental import Nov 28, 2024
@adamziel adamziel marked this pull request as ready for review November 28, 2024 10:47
@adamziel adamziel requested a review from a team as a code owner November 28, 2024 10:47
Collaborator

@zaerl zaerl left a comment

An excellent step forward, Adam, I like it. Using custom post types is a good idea. I am OK with merging this. I just left a couple of comments.

break;
}

$post_id = wp_insert_post(
Collaborator

Using custom post types is a great idea. 👍

Collaborator Author

Thank you! I was looking for a way to reuse as much of what we already have as possible. A custom table crossed my mind, and we still might need one for the vector clock, but for managing metadata, post types and meta seem perfect.

Collaborator

Yes. In the near future, we will probably need a place to save binary and similar data. But for now, using custom post types for this is perfectly fine.

@zaerl zaerl merged commit a2cb181 into trunk Nov 29, 2024
9 of 10 checks passed
@zaerl zaerl deleted the wp-steram-importer-monitor-progress branch November 29, 2024 12:49
@adamziel adamziel changed the title [Data Liberation] WP_Stream_Importer: Incremental import [Data Liberation] WP_Stream_Importer: User-driven incremental import Nov 29, 2024

2 participants