Description
Next Gen importers
This issue tracks the work related to Data Liberation Phase 2: Importing and Exporting Structured Data, that is:
- Parsers
- Importers
- User and developer tools.
WordPress needs parsers. Not just any parsers, but parsers that are streaming, re-entrant, fast, standards-compliant, and tested against a large body of possible inputs. Even a seemingly simple task such as moving a post to another website requires rewriting the URLs in that post, downloading the assets, and handling network failures. More complex tasks, such as importing a WXR file or transferring an entire site, are even more demanding.
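To make "streaming" and "re-entrant" concrete, here is a minimal sketch. It is written in Python purely for illustration and is not the WP_XML_Processor API; the class `ChunkedTagScanner` and its methods are hypothetical names. It finds opening tags in input that arrives in arbitrary chunks, and it exposes a cursor that can be stored and later used to continue from the same spot. It deliberately ignores comments, CDATA sections, and attribute values containing `>`, all of which a standards-compliant parser has to handle.

```python
import json


class ChunkedTagScanner:
    """Hypothetical illustration: finds <tag ...> openers in chunked input."""

    def __init__(self, cursor=None):
        state = json.loads(cursor) if cursor else {"offset": 0, "pending": ""}
        self.offset = state["offset"]    # characters fully consumed so far
        self.pending = state["pending"]  # unparsed tail carried between chunks

    def feed(self, chunk):
        """Consume one chunk and return a list of (absolute_offset, tag_name)."""
        buffer = self.pending + chunk
        tags = []
        pos = 0
        while True:
            start = buffer.find("<", pos)
            if start == -1:
                pos = len(buffer)  # no tag starts here; everything is consumed
                break
            end = buffer.find(">", start)
            if end == -1:
                pos = start        # incomplete tag; keep it for the next chunk
                break
            parts = buffer[start + 1:end].split()
            # Skip closers (</a>), declarations (<!...>) and instructions (<?...>).
            if parts and parts[0][0] not in "/!?":
                tags.append((self.offset + start, parts[0].rstrip("/")))
            pos = end + 1
        self.offset += pos
        self.pending = buffer[pos:]
        return tags

    def get_cursor(self):
        """Serialize the state so that another process can resume the scan."""
        return json.dumps({"offset": self.offset, "pending": self.pending})
```

A caller can store `get_cursor()` after every chunk, for example in a database row, and pass it to a fresh `ChunkedTagScanner` in a later request. The scan then continues as if it had never been interrupted.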
WordPress also needs importers. Not just any importers, but importers that can handle large quantities of data from a multitude of data formats, are extensible, and can proceed even when they encounter an error in the middle of the process. The WP_Stream_Importer class explored in this project is designed to fulfill these goals (see the specific PRs below).
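As a rough sketch of what "proceed even when they encounter an error" can look like, the loop below records failures instead of aborting and keeps a resume position. It is Python for illustration only and does not mirror the WP_Stream_Importer interface; `ImportReport`, `import_entities`, and the `insert` callback are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ImportReport:
    next_index: int = 0                         # where a later run resumes
    imported: list = field(default_factory=list)
    failed: list = field(default_factory=list)  # (index, error message) pairs


def import_entities(entities, insert, report=None, max_steps=None):
    """Import entities one by one, recording failures instead of aborting."""
    if report is None:
        report = ImportReport()
    steps = 0
    while report.next_index < len(entities):
        if max_steps is not None and steps >= max_steps:
            break  # pause; calling again with the same report resumes
        index = report.next_index
        try:
            insert(entities[index])
            report.imported.append(index)
        except Exception as error:  # sketch only: real code would be narrower
            report.failed.append((index, str(error)))
        report.next_index = index + 1
        steps += 1
    return report
```

The failed entries stay available after the run, so a UI could offer a retry or an alternative source for each of them rather than discarding the whole import.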
Finally, WordPress needs user and developer tools to use these importers. Not just any tools, but tools that work on the web, in the CLI, and in the Playground, guide the user with useful progress updates, and provide useful recovery paths when the inevitable errors occur. The work tracked here focuses on a wp-admin page, but the PHP software components are designed for easy reuse outside of wp-admin.
Tracking – ongoing Issues and PRs
Parsing
- Streaming ZIP64 parser
- [Data Liberation] Add XML API, Stream API, WXR URL Rewriter API #1952
- [Data Liberation] Merge both XML processors into a single WP_XML_Processor #1960
- [Data Liberation] Add blueprints-library as a submodule #1967
- Port ZipStreamReader from adamziel/wxr-normalize php-toolkit#116
- [Data Liberation] Fork humanmade/WordPress-Importer #1968
- [Data Liberation] WP_WXR_Reader #1972
Exporting
Importing
- Rethink mapping IDs.
- A conflict resolution mechanism with filters for plugin authors. Perhaps we won't need one, though.
- [Data Liberation] Define WP_IMPORTING during import #2064
- [Data Liberation] Topological sorter, entities remapping and add missing imports #2030
- [Data Liberation] Don't download assets in WP_Entity_Importer, use the same entity shape as the WP_Stream_Importer produces
- [Data Liberation] Support sourcing Attachments from non-local filesystems
- [Data Liberation] Filesystem entity reader #2125
- [Blueprints] Support Data Liberation importer in the importWxr step #2058
- [Data Liberation] wp_rewrite_urls() #1893
- [Data Liberation] WP_Stream_Importer with support for WXR and Markdown files #1982
- [Data Liberation] wp-admin importer page #2003
- [Data Liberation] Re-entrant WP_Stream_Importer #2004
- [Data Liberation] WP_Stream_Importer: User-driven incremental import #2013
- [Data Liberation] Add WXR import CLI script #2012
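The topological sorter item above is about import order: an entity such as a comment or an attachment can only be inserted once the entities it refers to already exist. Below is a minimal sketch using `graphlib` from Python's standard library. The entity IDs and the `depends_on` mapping are made-up sample data, and `import_order` is a hypothetical helper; none of this reflects the entity shape used in #2030.

```python
from graphlib import CycleError, TopologicalSorter

# Sample data (assumption): each entity maps to the entities it depends on.
depends_on = {
    "post:10": {"user:1", "term:5"},
    "attachment:20": {"post:10"},
    "comment:30": {"post:10", "user:2"},
    "term:5": set(),
    "user:1": set(),
    "user:2": set(),
}


def import_order(graph):
    """Return entity IDs so that every dependency precedes its dependents."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as error:
        # args[1] holds the detected cycle as a list of nodes.
        raise ValueError(f"Circular reference between entities: {error.args[1]}") from error


order = import_order(depends_on)
```

Circular references, for example two posts that link to each other, cannot be ordered this way. An importer would need a second pass that patches such references once both entities exist.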
Data formats
- [Data Liberation] Add EPub to Blocks converter #2097
- [Data Liberation] Refactor Entity Readers class diagram #2096
- [Data Liberation] Add HTML to Blocks converter #2095
- [Data Liberation] Markdown <-> Blocks converters
- [Data Liberation] Block markup consumers and producers #2121
- [Data Liberation] Recognize self-closing blocks in WP_Block_Markup_Processor #2120
- [Data Liberation] Build markdown importer as phar #2094
- [Data Liberation] Move Markdown importer to a separate package #2093
- [Data Liberation] Add Markdown parsing libraries #2092
Reliability
UI
- Beautiful design for the admin page
- [Data Liberation] "Fetch from a different URL" button for failed media downloads, Interactivity API support #2040
Other
- Move wp_kses_uri_attributes filter to import start/end #2047
- Extension points for plugin-provided URL treatment, e.g. base64_decode specific block attributes before rewriting the URLs
- Streaming SQL import and export
- Streaming ZIP import and export
- Per-row version control (like @dmsnell's vector clock idea from https://core.trac.wordpress.org/ticket/60375)
- Test with 300GB XML file
- PHP dependency management – should we ship all the PHP classes in this repo? Or publish independent plugins for others to start adapting in their work – but with no BC guarantees?
- Move data-liberation WP-CLI command to separate class #2025
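For readers unfamiliar with the vector clock idea mentioned in the list above, here is a generic textbook sketch of how two versions of the same row could be compared. It is not taken from the linked Trac ticket; the function names and the site identifiers are hypothetical.

```python
def tick(clock, site):
    """Return a copy of the clock with this site's counter incremented."""
    updated = dict(clock)
    updated[site] = updated.get(site, 0) + 1
    return updated


def merge(a, b):
    """Combine two clocks by taking the highest counter seen for each site."""
    return {site: max(a.get(site, 0), b.get(site, 0)) for site in a.keys() | b.keys()}


def compare(a, b):
    """Classify clock a relative to b: before, after, equal, or concurrent."""
    sites = a.keys() | b.keys()
    a_ahead = any(a.get(site, 0) > b.get(site, 0) for site in sites)
    b_ahead = any(b.get(site, 0) > a.get(site, 0) for site in sites)
    if a_ahead and b_ahead:
        return "concurrent"  # both sides changed the row independently
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"
```

A "concurrent" result is the interesting case for a transfer or sync tool: neither version supersedes the other, so some conflict resolution step has to decide what to keep.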
Related resources
- Site Transfer Protocol
- Kickoff Data Liberation: Let's Build WordPress-first Data Migration Tools
- 💬 Solving rewriting site URLs in WordPress using the HTML API and URL parser
- 💬 WordPress for Docs
Next phases: Future Data Liberation roadmap
Note
The ideas below are the next phases of the project. They stretch far beyond the medium-term importers work tracked in this issue and only live here to paint the big picture.
- WXR imports
- Fork https://github.com/humanmade/WordPress-Importer. Give attribution to the original team, ping them and start a conversation
- Port it to WP_XML_Tag_Processor
- Start using that fork for importing WXR files in Playground
- Rewrite the imported site URLs
- Use AsyncHTTP\Client for fetching assets
- Make it resumable if it fails halfway through
- Report progress information to the user
- Surface errors to the user, ask how to handle them
- Use in Blueprints
- Sort the imported entities in topological order
- Test with tricky inputs
- Create WP CLI command
- Create a good looking wp-admin page
- Publish it as a standalone plugin to start gathering feedback and bug reports
- Extensibility
- Include extension points to enable custom treatment of any imported entity, block attribute, database row etc. See Solving rewriting site URLs in WordPress using the HTML API and URL parser data-liberation#74 and one of the GitHub discussions referenced in [Data Liberation] wp_rewrite_urls() #1893
- Markdown workflow for editing existing documentation sites from GitHub
- Markdown importer
- Markdown exporter – migrate @dmsnell's Markdown <-> Block markup TypeScript converter from https://github.com/dmsnell/blocky-formats to PHP
- Discuss using Playground to edit Playground docs, Gutenberg docs, and potentially all WordPress docs
- Discuss using it as a drop-in static site generator replacement (e.g. Jekyll)
- Static block markup editor
- Build a simple plugin to import and export .html files representing specific WordPress pages from GitHub.
- Ship a Blueprint that loads Playground Docs into Playground
- We need to have a real use-case for interacting with data liberation on a daily basis and this is one. It's a super low-friction way of maintaining the Playground documentation and WordPress-on-GitHub-pages in general. (cc @bph @akirk)
- Reliable Playground ZIP export / import
- Fork the Sandbox Site plugin
- Improve the SQL export to make it streamable and ensure there are absolutely no issues with escaping
- Rewrite the exported and imported site URLs
- Include extension points to enable custom treatment of any block attribute, database row etc. See one of the GitHub discussions referenced in Kickoff Data Liberation: Let's Build WordPress-first Data Migration Tools #1888
- Consider shipping .sql files with the export to potentially enable importing the resulting .zip in a regular MySQL-based server environment
- ...anything else actually?
- "Duplicate Playground" feature
- Iteration 1: Pipe the ZIP export to ZIP import
- Iteration 2: Mount /wordpress-new in the duplicated Playground instance, run the PHP export/import code to migrate the site from /wordpress there
- Iteration 3: Keep track of progress, make it resumable regardless of when the process is interrupted. This would enable exporting really big sites
- Direct WordPress <-> WordPress transfer
- Conceptually, this is like running Duplicate Playground over the internet
- Important to keep track of progress and resource versions using a vector clock
- Export / Import UI with scope (users? posts? etc.), error info (image.jpg couldn't be fetched after 3 retries), and error resolution mechanism (specify a different URL? upload that image? retry a 4th time?)
- Live WordPress <-> WordPress data sync
- Run the WordPress <-> WordPress transfer in a continuous way.
- This is not about collaborative editing in the block editor, although there is likely an overlap around data synchronization.
- Importers version 2 and beyond
- Subtasks outlined in [Data Liberation] Entity Stream Importer
- Import one post at a time, not "all static assets" and then "all posts". Identify each post's dependency graph and frontload that post's dependent data first.
- Resume .partial assets download upon import pause and resume.
- Resource quotas
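The `.partial` download item in the list above could work roughly as sketched below, in Python for illustration only. Bytes are appended to a `.partial` file, a later run continues from that file's size, and the file is renamed only once the stream ends. `download_resumable` and the `fetch_from(offset)` callable are hypothetical; in practice the callable might be backed by an HTTP Range request, assuming the remote server supports one.

```python
import os


def download_resumable(fetch_from, destination, chunk_limit=None):
    """Append to destination + '.partial' and rename once the stream ends.

    fetch_from(offset) is a hypothetical callable yielding byte chunks that
    start at the given offset. Returns True when the file is complete and
    False when the download was paused after chunk_limit chunks.
    """
    if os.path.exists(destination):
        return True  # already downloaded in an earlier run
    partial = destination + ".partial"
    offset = os.path.getsize(partial) if os.path.exists(partial) else 0
    with open(partial, "ab") as handle:
        for count, chunk in enumerate(fetch_from(offset)):
            if chunk_limit is not None and count >= chunk_limit:
                return False  # paused; the .partial file keeps what was written
            handle.write(chunk)
    os.replace(partial, destination)
    return True
```

Because the resume offset is derived from the size of the `.partial` file, no extra bookkeeping is needed to continue after an interruption. The sketch does not verify that the remote file is unchanged between runs, which a real importer would likely want to check.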