More HTMLRewriter docs

oven-sh · Jan 13, 2025 · b754959 · b754959
1 parent 70362c3
commit b754959
Show file tree

Hide file tree

Showing 3 changed files with 621 additions and 27 deletions.
diff --git a/docs/api/html-rewriter.md b/docs/api/html-rewriter.md
@@ -1,31 +1,350 @@
-Bun provides a fast native implementation of the `HTMLRewriter` pattern developed by Cloudflare. It provides a convenient, `EventListener`-like API for traversing and transforming HTML documents.
+HTMLRewriter lets you use CSS selectors to transform HTML documents. Bun's implementation is based on Cloudflare's [lol-html](https://github.com/cloudflare/lol-html).
+
+## Usage
+
+A common usecase is rewriting URLs in HTML content. Here's an example that rewrites image sources and link URLs to use a CDN domain:
 
 ```ts
-const rewriter = new HTMLRewriter();
+// Replace all images with a rickroll
+const rewriter = new HTMLRewriter().on("img", {
+  element(img) {
+    // Famous rickroll video thumbnail
+    img.setAttribute(
+      "src",
+      "https://img.youtube.com/vi/dQw4w9WgXcQ/maxresdefault.jpg",
+    );
 
-rewriter.on("*", {
-  element(el) {
-    console.log(el.tagName); // "body" | "div" | ...
+    // Wrap the image in a link to the video
+    img.before(
+      '<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ" target="_blank">',
+      { html: true },
+    );
+    img.after("</a>", { html: true });
+
+    // Add some fun alt text
+    img.setAttribute("alt", "Definitely not a rickroll");
   },
 });
-```
-
-To parse and/or transform the HTML:
 
-```ts#rewriter.ts
-rewriter.transform(
-  new Response(`
-<!DOCTYPE html>
+// Example usage
+const html = `
 <html>
-<!-- comment -->
-<head>
-  <title>My First HTML Page</title>
-</head>
 <body>
-  <h1>My First Heading</h1>
-  <p>My first paragraph.</p>
+  <img src="/cat.jpg">
+  <img src="dog.png">
+  <img src="https://example.com/bird.webp">
 </body>
-`));
+</html>
+`;
+
+const result = rewriter.transform(html);
+console.log(result);
+```
+
+This replaces all images with a thumbnail of Rick Astley and wraps each `<img>` in a link, producing a diff like this:
+
+```html-diff
+<html>
+  <body>
+-    <img src="/cat.jpg">
+-    <img src="dog.png">
+-    <img src="https://example.com/bird.webp">
++    <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ" target="_blank">
++      <img src="https://img.youtube.com/vi/dQw4w9WgXcQ/maxresdefault.jpg" alt="Definitely not a rickroll">
++    </a>
++    <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ" target="_blank">
++      <img src="https://img.youtube.com/vi/dQw4w9WgXcQ/maxresdefault.jpg" alt="Definitely not a rickroll">
++    </a>
++    <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ" target="_blank">
++      <img src="https://img.youtube.com/vi/dQw4w9WgXcQ/maxresdefault.jpg" alt="Definitely not a rickroll">
++    </a>
+  </body>
+</html>
+```
+
+Now every image on the page will be replaced with a thumbnail of Rick Astley, and clicking any image will lead to [a very famous video](https://www.youtube.com/watch?v=dQw4w9WgXcQ).
+
+## Features
+
+### Element Handlers
+
+The `on(selector, handlers)` method allows you to register handlers for HTML elements that match a CSS selector. The handlers are called for each matching element during parsing:
+
+```ts
+rewriter.on("div.content", {
+  // Handle elements
+  element(element) {
+    element.setAttribute("class", "new-content");
+    element.append("<p>New content</p>", { html: true });
+  },
+  // Handle text nodes
+  text(text) {
+    text.replace("new text");
+  },
+  // Handle comments
+  comments(comment) {
+    comment.remove();
+  },
+});
 ```
 
-View the full documentation on the [Cloudflare website](https://developers.cloudflare.com/workers/runtime-apis/html-rewriter/).
+The handlers can be asynchronous and return a Promise. Note that async operations will block the transformation until they complete:
+
+```ts
+rewriter.on("div", {
+  async element(element) {
+    await Bun.sleep(1000);
+    element.setInnerContent("<span>replace</span>", { html: true });
+  },
+});
+```
+
+### CSS Selector Support
+
+The `on()` method supports a wide range of CSS selectors:
+
+```ts
+// Tag selectors
+rewriter.on("p", handler);
+
+// Class selectors
+rewriter.on("p.red", handler);
+
+// ID selectors
+rewriter.on("h1#header", handler);
+
+// Attribute selectors
+rewriter.on("p[data-test]", handler); // Has attribute
+rewriter.on('p[data-test="one"]', handler); // Exact match
+rewriter.on('p[data-test="one" i]', handler); // Case-insensitive
+rewriter.on('p[data-test="one" s]', handler); // Case-sensitive
+rewriter.on('p[data-test~="two"]', handler); // Word match
+rewriter.on('p[data-test^="a"]', handler); // Starts with
+rewriter.on('p[data-test$="1"]', handler); // Ends with
+rewriter.on('p[data-test*="b"]', handler); // Contains
+rewriter.on('p[data-test|="a"]', handler); // Dash-separated
+
+// Combinators
+rewriter.on("div span", handler); // Descendant
+rewriter.on("div > span", handler); // Direct child
+
+// Pseudo-classes
+rewriter.on("p:nth-child(2)", handler);
+rewriter.on("p:first-child", handler);
+rewriter.on("p:nth-of-type(2)", handler);
+rewriter.on("p:first-of-type", handler);
+rewriter.on("p:not(:first-child)", handler);
+
+// Universal selector
+rewriter.on("*", handler);
+```
+
+### Element Operations
+
+Elements provide various methods for manipulation. All modification methods return the element instance for chaining:
+
+```ts
+rewriter.on("div", {
+  element(el) {
+    // Attributes
+    el.setAttribute("class", "new-class").setAttribute("data-id", "123");
+
+    const classAttr = el.getAttribute("class"); // "new-class"
+    const hasId = el.hasAttribute("id"); // boolean
+    el.removeAttribute("class");
+
+    // Content manipulation
+    el.setInnerContent("New content"); // Escapes HTML by default
+    el.setInnerContent("<p>HTML content</p>", { html: true }); // Parses HTML
+    el.setInnerContent(""); // Clear content
+
+    // Position manipulation
+    el.before("Content before")
+      .after("Content after")
+      .prepend("First child")
+      .append("Last child");
+
+    // HTML content insertion
+    el.before("<span>before</span>", { html: true })
+      .after("<span>after</span>", { html: true })
+      .prepend("<span>first</span>", { html: true })
+      .append("<span>last</span>", { html: true });
+
+    // Removal
+    el.remove(); // Remove element and contents
+    el.removeAndKeepContent(); // Remove only the element tags
+
+    // Properties
+    console.log(el.tagName); // Lowercase tag name
+    console.log(el.namespaceURI); // Element's namespace URI
+    console.log(el.selfClosing); // Whether element is self-closing (e.g. <div />)
+    console.log(el.canHaveContent); // Whether element can contain content (false for void elements like <br>)
+    console.log(el.removed); // Whether element was removed
+
+    // Attributes iteration
+    for (const [name, value] of el.attributes) {
+      console.log(name, value);
+    }
+
+    // End tag handling
+    el.onEndTag(endTag => {
+      endTag.before("Before end tag");
+      endTag.after("After end tag");
+      endTag.remove(); // Remove the end tag
+      console.log(endTag.name); // Tag name in lowercase
+    });
+  },
+});
+```
+
+### Text Operations
+
+Text handlers provide methods for text manipulation. Text chunks represent portions of text content and provide information about their position in the text node:
+
+```ts
+rewriter.on("p", {
+  text(text) {
+    // Content
+    console.log(text.text); // Text content
+    console.log(text.lastInTextNode); // Whether this is the last chunk
+    console.log(text.removed); // Whether text was removed
+
+    // Manipulation
+    text.before("Before text").after("After text").replace("New text").remove();
+
+    // HTML content insertion
+    text
+      .before("<span>before</span>", { html: true })
+      .after("<span>after</span>", { html: true })
+      .replace("<span>replace</span>", { html: true });
+  },
+});
+```
+
+### Comment Operations
+
+Comment handlers allow comment manipulation with similar methods to text nodes:
+
+```ts
+rewriter.on("*", {
+  comments(comment) {
+    // Content
+    console.log(comment.text); // Comment text
+    comment.text = "New comment text"; // Set comment text
+    console.log(comment.removed); // Whether comment was removed
+
+    // Manipulation
+    comment
+      .before("Before comment")
+      .after("After comment")
+      .replace("New comment")
+      .remove();
+
+    // HTML content insertion
+    comment
+      .before("<span>before</span>", { html: true })
+      .after("<span>after</span>", { html: true })
+      .replace("<span>replace</span>", { html: true });
+  },
+});
+```
+
+### Document Handlers
+
+The `onDocument(handlers)` method allows you to handle document-level events. These handlers are called for events that occur at the document level rather than within specific elements:
+
+```ts
+rewriter.onDocument({
+  // Handle doctype
+  doctype(doctype) {
+    console.log(doctype.name); // "html"
+    console.log(doctype.publicId); // public identifier if present
+    console.log(doctype.systemId); // system identifier if present
+  },
+  // Handle text nodes
+  text(text) {
+    console.log(text.text);
+  },
+  // Handle comments
+  comments(comment) {
+    console.log(comment.text);
+  },
+  // Handle document end
+  end(end) {
+    end.append("<!-- Footer -->", { html: true });
+  },
+});
+```
+
+## Input Types
+
+HTMLRewriter can transform HTML from various sources. The input is automatically handled based on its type:
+
+```ts
+// From Response
+rewriter.transform(new Response("<div>content</div>"));
+
+// From string
+rewriter.transform("<div>content</div>");
+
+// From ArrayBuffer
+rewriter.transform(new TextEncoder().encode("<div>content</div>").buffer);
+
+// From Blob
+rewriter.transform(new Blob(["<div>content</div>"]));
+
+// From File
+rewriter.transform(Bun.file("index.html"));
+```
+
+### Response Handling
+
+When transforming a Response:
+
+- The status code, headers, and other response properties are preserved
+- The body is transformed while maintaining streaming capabilities
+- Content-encoding (like gzip) is handled automatically
+- The original response body is marked as used after transformation
+- Headers are cloned to the new response
+
+## Error Handling
+
+HTMLRewriter operations can throw errors in several cases:
+
+- Invalid selector syntax in `on()` method
+- Invalid HTML content in transformation methods
+- Stream errors when processing Response bodies
+- Memory allocation failures
+- Invalid input types (e.g., passing Symbol)
+- Body already used errors
+
+Errors should be caught and handled appropriately:
+
+```ts
+try {
+  const result = rewriter.transform(input);
+  // Process result
+} catch (error) {
+  console.error("HTMLRewriter error:", error);
+}
+```
+
+## Performance
+
+Bun's HTMLRewriter is built on [lol-html](https://github.com/cloudflare/lol-html), a low-level HTML rewriting library written in Rust. This provides excellent performance characteristics:
+
+- Streaming processing of HTML content
+- Minimal memory overhead
+- No DOM construction unless necessary
+- Efficient CSS selector matching
+- Configurable parsing buffer sizes (defaults to max(input_size, 1024) bytes)
+
+The implementation uses efficient memory management and streaming techniques:
+
+- Content is processed in chunks
+- Backpressure is handled automatically
+- Memory is released as soon as possible
+- Large files are handled efficiently
+- Automatic handling of compressed content (gzip)
+
+You can also read the [Cloudflare documentation](https://developers.cloudflare.com/workers/runtime-apis/html-rewriter/), which this API is intended to be compatible with.