[Data Liberation] Re-entrant WP_Stream_Importer (#2004)
Adds re-entrancy semantics to the importer API to enable pausing and resuming data imports:

```php
$wxr_path = __DIR__ . '/tests/fixtures/wxr-simple.xml';
$importer = WP_Stream_Importer::create_for_wxr_file( $wxr_path );

// Do some work
for ( $i = 0; $i < 10; $i++ ) {
	$importer->next_step();
}

// Save our progress
$cursor = $importer->get_reentrancy_cursor();

// Continue where we left off later on
$new_importer = WP_Stream_Importer::create_for_wxr_file( $wxr_path, [], $cursor );
$new_importer->next_step();
```

## Motivation

Most WordPress importers fail because they assume a happy path: we have enough memory, we have enough time, all the assets will be available, and so on.

In Data Liberation, I want to assume the worst possible path through thorny quicksand in full sun with venomous wasps stinging us. We'll run out of memory after the first post, all the assets will be 40GB large, and half of them won't be possible to download.

Pausing, resuming, and recovering from errors should be a basic primitive of the system. The first step to supporting that is the ability to suspend the import operation and restart it from the same spot later on. And that's exactly what this PR adds.

## Re-entrancy interface

This PR doesn't store any information in the database yet. It merely adds the plumbing for pausing and resuming the `WP_Stream_Importer` instance.

### WP_Byte_Stream re-entrancy

The `WP_Byte_Stream` interface directly exposes the `tell(): int` and `seek($offset)` methods. There's no need for anything fancier than that: we're only interested in an offset in the stream.

It seems to work well for simple byte streams. My only worry is that we may need to revisit this interface later on to support fetching fixed-size chunks from large files using byte ranges.

### WP_XML_Processor re-entrancy

`WP_XML_Processor` supports exporting and restoring state via:

* Exporting a cursor with the `get_reentrancy_cursor()` method.
* Resuming via a static `create($xml, $options, $cursor=null)`.
* Seeking the input stream to the correct location via `get_token_byte_offset_in_the_input_stream()`.

No method in the XML processor API will ever accept the cursor or the byte offset as a way of moving to another location in the document. You can only create a new XML processor at `$cursor`. This is a measure to:

* Discourage using the byte offsets for manual string operations on the XML document. It's a footgun, and most API consumers who tried that would just introduce bugs into their codebase.
* Make it impossible to misuse the re-entrancy API for `seek()`-ing. We already have named bookmarks for that.

Usage:

```php
$xml = WP_XML_Processor::create_from_string( $xml_bytes );
for ( $i = 0; $i < 10; $i++ ) {
	$xml->next_step();
}

$cursor       = $xml->get_reentrancy_cursor();
$unparsed_xml = substr( $xml_bytes, $xml->get_token_byte_offset_in_the_input_stream() );

$xml2 = WP_XML_Processor::create_from_string( $unparsed_xml, $cursor );
$xml2->next_step();
```

### WP_WXR_Reader re-entrancy

The `WP_WXR_Reader` class uses the same `get_reentrancy_cursor()` interface as `WP_XML_Processor`.

### WP_Stream_Importer re-entrancy

The `WP_Stream_Importer` class uses the same `get_reentrancy_cursor()` interface as `WP_XML_Processor`. See the example at the top of this description.

## Testing instructions

TBD. We don't yet have a good way of running PHPUnit in the WordPress context. @zaerl is working on running the import in the CLI; we may need to wait for that before adding tests to this PR and shipping it.
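
To illustrate the offset-based re-entrancy described in the WP_Byte_Stream section, here is a minimal sketch built on PHP's native stream functions (`fopen()`, `ftell()`, `fseek()`) over a `php://memory` stream. It is not the `WP_Byte_Stream` implementation. The `open_stream()` helper and the sample data are hypothetical and exist only for this example. The only assumption carried over from the description is that a single integer offset is enough to resume reading.

```php
<?php
// Hypothetical stand-in for a byte stream: a plain PHP stream resource.
// This is NOT WP_Byte_Stream, only an illustration of pausing and
// resuming a read using nothing but a byte offset.
function open_stream( string $bytes, int $offset = 0 ) {
	$handle = fopen( 'php://memory', 'w+b' );
	fwrite( $handle, $bytes );
	// Equivalent of seek( $offset ): jump to the saved position
	// (0 when starting from scratch).
	fseek( $handle, $offset );
	return $handle;
}

$bytes = 'first chunk|second chunk|third chunk';

// First run: consume one chunk, then record where we stopped.
$stream = open_stream( $bytes );
$first  = fread( $stream, 12 ); // 'first chunk|'
$offset = ftell( $stream );     // Equivalent of tell(): the only state to persist.
fclose( $stream );

// Later run: reopen the same data and continue from the saved offset.
$resumed = open_stream( $bytes, $offset );
$second  = fread( $resumed, 13 ); // 'second chunk|'
fclose( $resumed );
```

The saved `$offset` could be stored anywhere between the two runs (a file, an option, a job queue). Nothing else about the first stream instance has to survive.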
Showing 14 changed files with 695 additions and 370 deletions.
packages/playground/data-liberation/src/byte-readers/WP_Byte_Reader.php: 4 changes (2 additions & 2 deletions)