[Data Liberation] WP_Stream_Importer with support for WXR and Markdown files #1982
packages/playground/data-liberation/src/WP_Directory_Reader.php
require_once __DIR__ . '/bootstrap.php';
$reader = new WP_Serialized_Pages_Reader(__DIR__ . '/../../docs/site/docs');
Just want to make sure I'm following correctly.
Is the idea here that any project would implement their own instance of this WP_Serialized_Pages_Reader in order to customize how front-matter should be read from that specific project?
For example, in my WPGraphQL Docs, I have frontmatter like so:
title: Contributing
uri: `/docs/contributing`
So, following this example, I could map my specific front-matter in the .md files I'm importing to whatever field(s) I want them to map to when being imported as a WordPress post?
Or is this something that's always running when front-matter is detected in .md and there would be a different mechanism (add_filter, for example) to custom map front-matter keys/values to WordPress (wxr) keys/values?
> So, following this example, I could map my specific front-matter in the .md files I'm importing to whatever field(s) I want them to map to when being imported as a WordPress post?
Exactly! I think every project will have to handle its own frontmatter. I'm not aware of any unified schema for frontmatter metadata and I've seen a few different variations.
Although I like add_filter() maybe even more, since it makes extensibility easier.
Ya, I was thinking there could be some documented defaults.
i.e.
| Frontmatter | WXR |
|---|---|
| title | post_title |
| date | post_date |
| status | post_status |
| ... | ... |
Then if folks use those documented defaults in frontmatter already, things will "just work", but if they need to customize the mappings they could do so via a custom PHP snippet added in "steps", a custom plugin, or whatever 🤔
Here's a (pseudo) example:
add_filter( 'wp_playground_map_front_matter_to_wxr', function( $wxr, $unfiltered_frontmatter ) {
	// Do some logic based on the frontmatter key and map it to WXR.
	if ( isset( $unfiltered_frontmatter['something'] ) ) {
		$wxr['wp:some_meta_key'] = $unfiltered_frontmatter['something'];
	}
	return $wxr;
}, 10, 2 ); // Priority 10; the callback accepts 2 arguments.
I like that! I'd just keep the entire WXR pipeline completely separate and treat Markdown -> WordPress independently. In this scenario, we'd map the frontmatter keys directly to wp_insert_post keys. So there would be a WXR reader, Markdown reader using that filter, and a single unified Importer accepting inputs from these and other readers.
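A minimal sketch of that direct mapping, assuming the default key table proposed earlier in this thread. The function name `sketch_map_frontmatter_to_post_args` and its `$custom_mapping` parameter are hypothetical and not part of this PR; in WordPress the customization point could just as well be a filter. Sending unmapped keys to `meta_input` is also only one possible choice.

```php
<?php
/**
 * Hypothetical helper: maps frontmatter keys to wp_insert_post() argument keys.
 * Keys without a mapping are collected under meta_input rather than dropped.
 */
function sketch_map_frontmatter_to_post_args( array $frontmatter, array $custom_mapping = array() ) {
	// Defaults taken from the table proposed in the discussion; custom entries win.
	$mapping = array_merge(
		array(
			'title'  => 'post_title',
			'date'   => 'post_date',
			'status' => 'post_status',
		),
		$custom_mapping
	);

	$post_args = array();
	foreach ( $frontmatter as $key => $value ) {
		if ( isset( $mapping[ $key ] ) ) {
			$post_args[ $mapping[ $key ] ] = $value;
		} else {
			$post_args['meta_input'][ $key ] = $value;
		}
	}
	return $post_args;
}

// Usage: "published" is a made-up project-specific key mapped to post_date.
$post_args = sketch_map_frontmatter_to_post_args(
	array(
		'title'     => 'Contributing',
		'uri'       => '/docs/contributing',
		'published' => '2024-11-01 10:00:00',
	),
	array( 'published' => 'post_date' )
);
```

The resulting array could then be passed to wp_insert_post() by whichever importer consumes the Markdown reader's output.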
…XR/Markdown importer abstraction to work
…create_from_string"
WXRReader is now a real pull reader – it automatically pulls data from the upstream byte reader, whether it's a local file, gzipped file, or a remote HTTP resource.
Motivation for the change, related issues
Adds WP_Stream_Importer – a generalized importer for arbitrary data. It comes with two data sources:
- WP_WXR_Reader that streams entities from a WXR file
- WP_Markdown_Directory_Tree_Reader that turns a markdown directory into page entities
WP_Stream_Importer
This is a draft of a re-entrant stream importer designed for importing very large datasets with minimal overhead. The few core ideas are:
SELECT * FROM wp_posts WHERE guid = :guid
huge_file.zip
downloaded).
Entities
This is a generalized data importer, not a WXR importer. WXR is just one of the possible data sources. This design enables importing markdown files, Blogger exports, Tumblr blogs etc. without having to rewrite that data as WXR.
The basic unit of data is an "entity" – a simple PHP array with post, tag, comment etc. data. Entities can be sourced from WXR and Markdown files – the relevant classes are described below.
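To illustrate the idea, here is a minimal sketch of entities from two different sources flowing into one consumer. The array shape (`type` plus `data`) and every function name below are assumptions made for illustration, not the structures this PR defines.

```php
<?php
// Hypothetical entity source standing in for a WXR reader.
function sketch_wxr_entities() {
	yield array( 'type' => 'post', 'data' => array( 'post_title' => 'Hello world' ) );
	yield array( 'type' => 'comment', 'data' => array( 'comment_content' => 'Nice post!' ) );
}

// Hypothetical entity source standing in for a Markdown reader.
function sketch_markdown_entities() {
	yield array(
		'type' => 'post',
		'data' => array( 'post_title' => 'Contributing', 'post_type' => 'page' ),
	);
}

// A single consumer that does not care where the entities came from.
function sketch_import( iterable $entities ) {
	$counts = array();
	foreach ( $entities as $entity ) {
		// A real importer would write to the database here; this sketch only counts.
		$counts[ $entity['type'] ] = ( $counts[ $entity['type'] ] ?? 0 ) + 1;
	}
	return $counts;
}

$wxr_counts      = sketch_import( sketch_wxr_entities() );
$markdown_counts = sketch_import( sketch_markdown_entities() );
```

Because the consumer only sees plain arrays, adding another source would mean writing another generator rather than converting the data to WXR first.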
Multiple passes
Every import will require multiple passes over the stream of entities to:
User input
The proposed importer is not a single "start and forget" device. It could be configured as such, but by default it will require the user to review the process, sometimes multiple times. Here are a few examples of such touchpoints:
<img> tags from the content? Because they are referenced in these posts: (list of posts)
parent_id 23, but there is no such parent. Do you want to set another parent? Or make it a top-level post? Or ignore it?
If a webhost would rather avoid asking the user all these questions, the future importer API may enable forcing each of these decisions.
WP_WXR_Reader
Streaming
The WXR reader supports the usual streaming interface with append_bytes(), is_paused_on_incomplete_input() et al.
It also comes with a new connect_upstream( $byte_source ) method that allows it to automatically pull new data chunks from a data source.
This way the consumer code never needs to worry about appending bytes, checking for EOF and such.
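The pull pattern can be sketched in isolation as follows. This is not the PR's WP_WXR_Reader or any of its byte sources: both classes, the `next_chunk()` and `next_line()` methods, and the line-based parsing are hypothetical stand-ins; only the `connect_upstream()` method name is borrowed from the description above.

```php
<?php
// Hypothetical byte source: hands out a string in fixed-size chunks.
class Sketch_String_Byte_Source {
	private $bytes;
	private $offset = 0;
	private $chunk_size;

	public function __construct( $bytes, $chunk_size = 4 ) {
		$this->bytes      = $bytes;
		$this->chunk_size = $chunk_size;
	}

	// Returns the next chunk, or false once the source is exhausted.
	public function next_chunk() {
		if ( $this->offset >= strlen( $this->bytes ) ) {
			return false;
		}
		$chunk         = substr( $this->bytes, $this->offset, $this->chunk_size );
		$this->offset += strlen( $chunk );
		return $chunk;
	}
}

// Hypothetical pull reader: returns lines and fetches more bytes on its own.
class Sketch_Line_Pull_Reader {
	private $upstream;
	private $buffer = '';

	public function connect_upstream( $byte_source ) {
		$this->upstream = $byte_source;
	}

	// Returns the next line, or false when no more data is available.
	public function next_line() {
		while ( false === ( $newline_at = strpos( $this->buffer, "\n" ) ) ) {
			$chunk = $this->upstream->next_chunk();
			if ( false === $chunk ) {
				// Upstream is exhausted: flush whatever is left in the buffer.
				if ( '' === $this->buffer ) {
					return false;
				}
				$line         = $this->buffer;
				$this->buffer = '';
				return $line;
			}
			$this->buffer .= $chunk;
		}
		$line         = substr( $this->buffer, 0, $newline_at );
		$this->buffer = (string) substr( $this->buffer, $newline_at + 1 );
		return $line;
	}
}

$reader = new Sketch_Line_Pull_Reader();
$reader->connect_upstream( new Sketch_String_Byte_Source( "first\nsecond\nthird" ) );

$lines = array();
while ( false !== ( $line = $reader->next_line() ) ) {
	$lines[] = $line;
}
```

The consuming loop at the bottom never appends bytes or checks for EOF itself; the reader requests another chunk whenever its buffer does not yet hold a complete line.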
This PR also ships a few byte sources. Shaping more than one helped me notice patterns and propose v1 of the interface:
- WP_File_Reader – streams bytes from a local file
- WP_GZ_File_Reader – streams bytes from a gzipped local file
- WP_Remote_File_Reader – streams bytes over HTTPS
- WP_Remote_File_Ranged_Reader – streams specific byte ranges over HTTPS
WP_Markdown_Directory_Tree_Reader
This class traverses a directory tree and transforms all the .md files into page entity objects that can be processed by WP_Entity_Importer.
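A rough sketch of such a traversal, using PHP's built-in SPL iterators. The function name and the entity shape are hypothetical and do not reflect how WP_Markdown_Directory_Tree_Reader is actually implemented; the sketch also stores raw Markdown as the content purely to stay short.

```php
<?php
/**
 * Hypothetical generator: walks a directory tree and yields one "page"
 * entity per .md file. The entity shape is an assumption for illustration.
 */
function sketch_markdown_directory_entities( $root_dir ) {
	$files = new RecursiveIteratorIterator(
		new RecursiveDirectoryIterator( $root_dir, FilesystemIterator::SKIP_DOTS )
	);
	foreach ( $files as $file ) {
		if ( ! $file->isFile() || 'md' !== strtolower( $file->getExtension() ) ) {
			continue;
		}
		yield array(
			'type' => 'post',
			'data' => array(
				'post_type'    => 'page',
				'post_title'   => $file->getBasename( '.md' ),
				// Raw Markdown is kept here only to keep the sketch short.
				'post_content' => file_get_contents( $file->getPathname() ),
			),
		);
	}
}

// Usage: build a tiny directory tree in a temporary location and read it back.
$root = sys_get_temp_dir() . '/sketch-md-' . uniqid();
mkdir( $root . '/guides', 0777, true );
file_put_contents( $root . '/index.md', '# Home' );
file_put_contents( $root . '/guides/setup.md', '# Setup' );
file_put_contents( $root . '/notes.txt', 'not markdown' );

$titles = array();
foreach ( sketch_markdown_directory_entities( $root ) as $entity ) {
	$titles[] = $entity['data']['post_title'];
}
sort( $titles );
```

Using a generator keeps memory use flat regardless of how many files the tree contains, which matches the streaming approach of the other readers.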
WP_Markdown_To_Blocks
We don't just save raw Markdown data as post_content. Not at all!
This PR ships a WP_Markdown_To_Blocks class that:
League\CommonMark library. It supports frontmatter and GitHub-flavored syntax such as tables, but it's also bulky and likely not PHP 7.2-compatible. For inclusion in WordPress core, we may need to roll out our own Markdown parser, or fork the League\CommonMark one and downgrade it to PHP 7.2.
Other stuff
This PR also:
- @php-wasm/compile – Adds more Asyncify functions to the PHP WASM Dockerfile
- @wp-playground/cli – buffers the downloads to a .partial file to avoid assuming the file is already cached in case the download has failed.
Follow-up work
@TODOs and implement them
Testing instructions
Confirm the CI tests pass. This code isn't actually used anywhere yet so there isn't a better way.