Tracking Issue: Next-gen PHP Importers for Data Liberation

## Next Gen importers

This issue tracks the work related to [Data Liberation Phase 2: Importing and Exporting Structured Data](https://github.com/WordPress/data-liberation/discussions/78), that is:

* Parsers
* Importers
* User and developer tools

WordPress needs parsers. Not just any parsers, but parsers that are streaming, re-entrant, fast, standards-compliant, and tested against a large body of possible inputs. A seemingly simple task such as moving a post to another website requires rewriting the URLs in that post, downloading the assets, and handling network failures. More complex tasks, such as importing a WXR file or transferring an entire site, are even more demanding.
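
To make the URL-rewriting step concrete, here is a minimal sketch of moving a single absolute URL from one site root to another. It is written in Python purely for illustration and is not the project's PHP code; `rewrite_url`, `old_base`, and `new_base` are hypothetical names. It assumes the URL has already been extracted from the post, which is the harder part and the subject of the HTML API and URL parser discussion linked under Related resources.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_url(url: str, old_base: str, new_base: str) -> str:
    """Move `url` from `old_base` to `new_base`; return it unchanged if it lives elsewhere."""
    u, old, new = urlsplit(url), urlsplit(old_base), urlsplit(new_base)
    old_path = old.path.rstrip("/")
    same_site = (u.scheme, u.netloc.lower()) == (old.scheme, old.netloc.lower())
    # Match on a path-segment boundary so "/blog" does not capture "/blog-archive".
    under_base = u.path == old_path or u.path.startswith(old_path + "/")
    if not (same_site and under_base):
        return url
    new_path = new.path.rstrip("/") + u.path[len(old_path):]
    # Query string and fragment are carried over untouched.
    return urlunsplit((new.scheme, new.netloc, new_path, u.query, u.fragment))
```

The sketch deliberately ignores relative URLs, scheme changes (`http` vs. `https`), and URLs hidden inside encoded block attributes, all of which a real importer would have to decide how to treat.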

WordPress also needs importers. Not just any importers, but importers that can handle large quantities of data from a multitude of data formats, are extensible, and can proceed even when they encounter an error in the middle of the process. The `WP_Stream_Importer` class explored in this project is designed to fulfill these goals; see the specific PRs below.
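
As a rough illustration of "proceed even when they encounter an error" combined with resumability, the sketch below imports entities within a budget, records failures instead of aborting, and returns a serializable cursor that a later run can pick up. It is a Python sketch under stated assumptions, not the `WP_Stream_Importer` API: `run_import`, `import_one`, `budget`, and the cursor layout are all hypothetical.

```python
import json
from typing import Callable, Iterable, Optional


def run_import(
    entities: Iterable[dict],
    import_one: Callable[[dict], None],
    cursor: Optional[str] = None,
    budget: Optional[int] = None,
) -> str:
    """Import up to `budget` entities, starting from `cursor`; return a new cursor."""
    state = json.loads(cursor) if cursor else {"offset": 0, "errors": []}
    handled = 0
    for index, entity in enumerate(entities):
        if index < state["offset"]:
            continue  # Already handled in an earlier run.
        if budget is not None and handled >= budget:
            break  # Out of budget: stop here and let the caller resume later.
        try:
            import_one(entity)
        except Exception as exc:
            # Record the failure and move on instead of aborting the whole import.
            state["errors"].append({"index": index, "message": str(exc)})
        state["offset"] = index + 1
        handled += 1
    return json.dumps(state)
```

A caller would store the returned cursor between requests and pass it back in on the next run. This sketch skips already-handled entities by re-reading them; a truly re-entrant parser would instead seek straight to a saved position in the input stream.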

Finally, WordPress needs user and developer tools to use these importers. Not just any tools, but tools that work on the web, in the CLI, and in the Playground, guide the user with useful progress updates, and provide useful recovery paths when the inevitable errors occur. The work tracked here focuses on a `wp-admin` page, but the PHP software components are designed for easy reuse outside of `wp-admin`.

## Tracking – ongoing Issues and PRs

### Parsing

- [ ] Streaming ZIP64 parser
- [x] https://github.com/WordPress/wordpress-playground/pull/1952
- [x] https://github.com/WordPress/wordpress-playground/pull/1960
- [x] https://github.com/WordPress/wordpress-playground/pull/1967
- [x] https://github.com/WordPress/blueprints-library/pull/116
- [x] https://github.com/WordPress/wordpress-playground/pull/1968
- [x] https://github.com/WordPress/wordpress-playground/pull/1972

### Exporting

- [ ] https://github.com/WordPress/wordpress-playground/issues/2055

### Importing

- [ ] Rethink mapping IDs.
- [ ] A conflict resolution mechanism with filters for plugin authors. Perhaps we won't need one, though.
- [ ] https://github.com/WordPress/wordpress-playground/issues/2064
- [ ] https://github.com/WordPress/wordpress-playground/pull/2030
- [x] [[Data Liberation] Don't download assets in WP_Entity_Importer, use the same entity shape as the WP_Stream_Importer produces](https://github.com/WordPress/wordpress-playground/commit/0080f2a9927f03ee41c357ee94e2318a81aa2826)
- [x] [[Data Liberation] Support sourcing Attachments from non-local filesystems](https://github.com/WordPress/wordpress-playground/commit/e74077da10195782660a39c4b7d0fa71cc4e990b)
- [x] https://github.com/WordPress/wordpress-playground/pull/2125
- [x] https://github.com/WordPress/wordpress-playground/pull/2058
- [x] https://github.com/WordPress/wordpress-playground/pull/1893
- [x] #1982
- [x] https://github.com/WordPress/wordpress-playground/pull/2003
- [x] https://github.com/WordPress/wordpress-playground/pull/2004
- [x] https://github.com/WordPress/wordpress-playground/pull/2013
- [x] https://github.com/WordPress/wordpress-playground/pull/2012

### Data formats

- [x] https://github.com/WordPress/wordpress-playground/pull/2097
- [x] https://github.com/WordPress/wordpress-playground/pull/2096
- [x] https://github.com/WordPress/wordpress-playground/pull/2095
- [x] [[Data Liberation] Markdown <-> Blocks converters](https://github.com/WordPress/wordpress-playground/commit/ac7efd392942e9d0b0f3c33bc639a5d64a26d040)
- [x] https://github.com/WordPress/wordpress-playground/pull/2121
- [x] https://github.com/WordPress/wordpress-playground/pull/2120
- [x] https://github.com/WordPress/wordpress-playground/pull/2094
- [x] https://github.com/WordPress/wordpress-playground/pull/2093
- [x] https://github.com/WordPress/wordpress-playground/pull/2092

### Reliability

- [ ] https://github.com/WordPress/wordpress-playground/issues/2019

### UI

- [ ] Beautiful design for the admin page
- [x] https://github.com/WordPress/wordpress-playground/pull/2040

### Other

- [ ] https://github.com/WordPress/wordpress-playground/issues/2047
- [ ] Extension points for plugin-provided URL treatment, e.g. running `base64_decode` on specific block attributes before rewriting the URLs
- [ ] Streaming SQL import and export
- [ ] Streaming ZIP import and export
- [ ] Per-row version control (like @dmsnell's vector clock idea from https://core.trac.wordpress.org/ticket/60375)
- [ ] Test with 300GB XML file
- [ ] PHP dependency management – should we ship all the PHP classes in this repo? Or publish independent plugins for others to start adapting in their work – but with no BC guarantees?
- [ ] https://github.com/WordPress/wordpress-playground/issues/2025
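
For readers unfamiliar with the vector clock idea mentioned in the per-row version control item, here is a generic sketch of the data structure in Python. It shows only the textbook operations (tick, merge, compare); how clocks would be attached to rows or exchanged between sites is not specified in this issue, and the names `tick`, `merge`, and `compare` are hypothetical.

```python
from typing import Dict

Clock = Dict[str, int]  # site identifier -> number of changes seen from that site


def tick(clock: Clock, site: str) -> Clock:
    """Return a copy of `clock` with one more local change recorded for `site`."""
    updated = dict(clock)
    updated[site] = updated.get(site, 0) + 1
    return updated


def merge(a: Clock, b: Clock) -> Clock:
    """Combine two clocks by taking the per-site maximum."""
    return {site: max(a.get(site, 0), b.get(site, 0)) for site in a.keys() | b.keys()}


def compare(a: Clock, b: Clock) -> str:
    """Classify `a` relative to `b` as 'equal', 'before', 'after', or 'concurrent'."""
    sites = a.keys() | b.keys()
    a_le_b = all(a.get(s, 0) <= b.get(s, 0) for s in sites)
    b_le_a = all(b.get(s, 0) <= a.get(s, 0) for s in sites)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # Neither version has seen the other: a real conflict.
```

The `"concurrent"` result is the interesting case for a transfer tool: it marks two versions of a row that were edited independently and need a conflict resolution step.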

## Related resources

* [Site Transfer Protocol](https://core.trac.wordpress.org/ticket/60375)
* [Kickoff Data Liberation: Let's Build WordPress-first Data Migration Tools](https://github.com/WordPress/wordpress-playground/pull/1888)
* [💬 Solving rewriting site URLs in WordPress using the HTML API and URL parser](https://github.com/WordPress/data-liberation/discussions/74)
* [💬 WordPress for Docs](https://github.com/WordPress/wordpress-playground/discussions/1524)


## Next phases: Future Data Liberation roadmap

> [!NOTE]  
> The ideas below are the next phases of the project. They stretch far beyond the medium-term importers work tracked in this issue and only live here to paint the big picture.

- [ ] WXR imports
    - [x] Fork https://github.com/humanmade/WordPress-Importer.  Give attribution to the original team, ping them and start a conversation
    - [x] Port it to [WP_XML_Tag_Processor](https://github.com/WordPress/wordpress-develop/pull/6713)
    - [x] Start using that fork for importing WXR files in Playground
    - [x] Rewrite the imported site URLs
    - [x] Use [AsyncHTTP\Client](https://github.com/WordPress/blueprints-library/pull/113) for fetching assets
    - [x] Make it resumable if it fails halfway through
    - [x] Report progress information to the user
    - [x] Surface errors to the user, ask how to handle them
    - [x] Use in Blueprints
    - [ ] Sort the imported entities in topological order
    - [ ] Test with tricky inputs
    - [ ] Create WP CLI command
    - [ ] Create a good-looking wp-admin page
    - [ ] Publish it as a standalone plugin to start gathering feedback and bug reports
- [ ] Extensibility
    - [ ] Include extension points to enable custom treatment of any imported entity, block attribute, database row etc. See https://github.com/WordPress/data-liberation/discussions/74 and one of the GitHub discussions referenced in #1893
- [ ] Markdown workflow for editing existing documentation sites from GitHub
    - [x] Markdown importer 
    - [ ] Markdown exporter – migrate @dmsnell's Markdown <-> Block markup TypeScript converter from https://github.com/dmsnell/blocky-formats to PHP
    - [ ] Discuss using Playground to edit Playground docs, Gutenberg docs, and potentially all WordPress docs
    - [ ] Discuss using it as a drop-in static site generator replacement (e.g. Jekyll)
- [ ] Static block markup editor
    - [ ] Build a simple plugin to import and export .html files representing specific WordPress pages from GitHub. 
    - [ ] Ship a Blueprint that loads [Playground Docs](https://github.com/WordPress/wordpress-playground/tree/trunk/packages/docs/site) into Playground
    - [ ] We need to have a real use-case for interacting with data liberation on a daily basis and this is one. It's a [super low-friction way of maintaining the Playground documentation](https://github.com/WordPress/wordpress-playground/discussions/1524) and WordPress-on-GitHub-pages in general.  (cc @bph @akirk)
- [ ] Reliable Playground ZIP export / import
    - [ ] Fork the [Sandbox Site](https://wordpress.org/plugins/playground/) plugin
    - [ ] Improve the SQL export to make it streamable and ensure there are absolutely no issues with escaping
    - [ ] Rewrite the exported and imported site URLs
    - [ ] Include extension points to enable custom treatment of any block attribute, database row etc. See one of the GitHub discussions referenced in #1888 
    - [ ] Consider shipping `.sql` files with the export to potentially enable importing the resulting `.zip` in a regular MySQL-based server environment
    - [ ] ...anything else actually?
- [ ] "Duplicate Playground" feature
    - [ ] Iteration 1: Pipe the ZIP export to ZIP import
    - [ ] Iteration 2: Mount `/wordpress-new` in the duplicated Playground instance, run the PHP export/import code to migrate the site from `/wordpress` there
    - [ ] Iteration 3: Keep track of progress, make it resumable regardless of when the process is interrupted. This would enable exporting really big sites
- [ ] Direct WordPress <-> WordPress transfer
    - [ ] Conceptually, this is like running _Duplicate Playground_ over the internet
    - [ ] Important to keep track of progress and resource versions using a vector clock
    - [ ] Export / Import UI with scope (users? posts? etc.), error info (image.jpg couldn't be fetched after 3 retries), and error resolution mechanism (specify a different URL? upload that image? retry a 4th time?)
- [ ] Live WordPress <-> WordPress data sync
    - [ ] Run the WordPress <-> WordPress transfer in a continuous way.
    - [ ] This is not about [_collaborative editing_](https://github.com/WordPress/gutenberg/discussions/65012) in the block editor, although there is likely an overlap around data synchronization.
- [ ] Importers version 2 and beyond
    - [ ] Subtasks outlined in [[Data Liberation] Entity Stream Importer](https://github.com/WordPress/wordpress-playground/issues/1980)
    - [ ] Import one post at a time, not "all static assets" and then "all posts". Identify each post's dependency graph and frontload that post's dependent data first.
    - [ ] Resume `.partial` asset downloads when a paused import is resumed.
    - [ ] Resource quotas

