Skip to content

Commit

Permalink
More HTMLRewriter docs
Browse files Browse the repository at this point in the history
  • Loading branch information
Jarred-Sumner committed Jan 13, 2025
1 parent 70362c3 commit b754959
Show file tree
Hide file tree
Showing 3 changed files with 621 additions and 27 deletions.
359 changes: 339 additions & 20 deletions docs/api/html-rewriter.md
Original file line number Diff line number Diff line change
@@ -1,31 +1,350 @@
Bun provides a fast native implementation of the `HTMLRewriter` pattern developed by Cloudflare. It provides a convenient, `EventListener`-like API for traversing and transforming HTML documents.
HTMLRewriter lets you use CSS selectors to transform HTML documents. Bun's implementation is based on Cloudflare's [lol-html](https://github.com/cloudflare/lol-html).

## Usage

A common usecase is rewriting URLs in HTML content. Here's an example that rewrites image sources and link URLs to use a CDN domain:

```ts
const rewriter = new HTMLRewriter();
// Replace all images with a rickroll
const rewriter = new HTMLRewriter().on("img", {
element(img) {
// Famous rickroll video thumbnail
img.setAttribute(
"src",
"https://img.youtube.com/vi/dQw4w9WgXcQ/maxresdefault.jpg",
);

rewriter.on("*", {
element(el) {
console.log(el.tagName); // "body" | "div" | ...
// Wrap the image in a link to the video
img.before(
'<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ" target="_blank">',
{ html: true },
);
img.after("</a>", { html: true });

// Add some fun alt text
img.setAttribute("alt", "Definitely not a rickroll");
},
});
```

To parse and/or transform the HTML:

```ts#rewriter.ts
rewriter.transform(
new Response(`
<!DOCTYPE html>
// Example usage
const html = `
<html>
<!-- comment -->
<head>
<title>My First HTML Page</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
<img src="/cat.jpg">
<img src="dog.png">
<img src="https://example.com/bird.webp">
</body>
`));
</html>
`;

const result = rewriter.transform(html);
console.log(result);
```

This replaces all images with a thumbnail of Rick Astley and wraps each `<img>` in a link, producing a diff like this:

```html-diff
<html>
<body>
- <img src="/cat.jpg">
- <img src="dog.png">
- <img src="https://example.com/bird.webp">
+ <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ" target="_blank">
+ <img src="https://img.youtube.com/vi/dQw4w9WgXcQ/maxresdefault.jpg" alt="Definitely not a rickroll">
+ </a>
+ <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ" target="_blank">
+ <img src="https://img.youtube.com/vi/dQw4w9WgXcQ/maxresdefault.jpg" alt="Definitely not a rickroll">
+ </a>
+ <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ" target="_blank">
+ <img src="https://img.youtube.com/vi/dQw4w9WgXcQ/maxresdefault.jpg" alt="Definitely not a rickroll">
+ </a>
</body>
</html>
```

Now every image on the page will be replaced with a thumbnail of Rick Astley, and clicking any image will lead to [a very famous video](https://www.youtube.com/watch?v=dQw4w9WgXcQ).

## Features

### Element Handlers

The `on(selector, handlers)` method allows you to register handlers for HTML elements that match a CSS selector. The handlers are called for each matching element during parsing:

```ts
rewriter.on("div.content", {
// Handle elements
element(element) {
element.setAttribute("class", "new-content");
element.append("<p>New content</p>", { html: true });
},
// Handle text nodes
text(text) {
text.replace("new text");
},
// Handle comments
comments(comment) {
comment.remove();
},
});
```

View the full documentation on the [Cloudflare website](https://developers.cloudflare.com/workers/runtime-apis/html-rewriter/).
The handlers can be asynchronous and return a Promise. Note that async operations will block the transformation until they complete:

```ts
rewriter.on("div", {
async element(element) {
await Bun.sleep(1000);
element.setInnerContent("<span>replace</span>", { html: true });
},
});
```

### CSS Selector Support

The `on()` method supports a wide range of CSS selectors:

```ts
// Tag selectors
rewriter.on("p", handler);

// Class selectors
rewriter.on("p.red", handler);

// ID selectors
rewriter.on("h1#header", handler);

// Attribute selectors
rewriter.on("p[data-test]", handler); // Has attribute
rewriter.on('p[data-test="one"]', handler); // Exact match
rewriter.on('p[data-test="one" i]', handler); // Case-insensitive
rewriter.on('p[data-test="one" s]', handler); // Case-sensitive
rewriter.on('p[data-test~="two"]', handler); // Word match
rewriter.on('p[data-test^="a"]', handler); // Starts with
rewriter.on('p[data-test$="1"]', handler); // Ends with
rewriter.on('p[data-test*="b"]', handler); // Contains
rewriter.on('p[data-test|="a"]', handler); // Dash-separated

// Combinators
rewriter.on("div span", handler); // Descendant
rewriter.on("div > span", handler); // Direct child

// Pseudo-classes
rewriter.on("p:nth-child(2)", handler);
rewriter.on("p:first-child", handler);
rewriter.on("p:nth-of-type(2)", handler);
rewriter.on("p:first-of-type", handler);
rewriter.on("p:not(:first-child)", handler);

// Universal selector
rewriter.on("*", handler);
```

### Element Operations

Elements provide various methods for manipulation. All modification methods return the element instance for chaining:

```ts
rewriter.on("div", {
element(el) {
// Attributes
el.setAttribute("class", "new-class").setAttribute("data-id", "123");

const classAttr = el.getAttribute("class"); // "new-class"
const hasId = el.hasAttribute("id"); // boolean
el.removeAttribute("class");

// Content manipulation
el.setInnerContent("New content"); // Escapes HTML by default
el.setInnerContent("<p>HTML content</p>", { html: true }); // Parses HTML
el.setInnerContent(""); // Clear content

// Position manipulation
el.before("Content before")
.after("Content after")
.prepend("First child")
.append("Last child");

// HTML content insertion
el.before("<span>before</span>", { html: true })
.after("<span>after</span>", { html: true })
.prepend("<span>first</span>", { html: true })
.append("<span>last</span>", { html: true });

// Removal
el.remove(); // Remove element and contents
el.removeAndKeepContent(); // Remove only the element tags

// Properties
console.log(el.tagName); // Lowercase tag name
console.log(el.namespaceURI); // Element's namespace URI
console.log(el.selfClosing); // Whether element is self-closing (e.g. <div />)
console.log(el.canHaveContent); // Whether element can contain content (false for void elements like <br>)
console.log(el.removed); // Whether element was removed

// Attributes iteration
for (const [name, value] of el.attributes) {
console.log(name, value);
}

// End tag handling
el.onEndTag(endTag => {
endTag.before("Before end tag");
endTag.after("After end tag");
endTag.remove(); // Remove the end tag
console.log(endTag.name); // Tag name in lowercase
});
},
});
```

### Text Operations

Text handlers provide methods for text manipulation. Text chunks represent portions of text content and provide information about their position in the text node:

```ts
rewriter.on("p", {
text(text) {
// Content
console.log(text.text); // Text content
console.log(text.lastInTextNode); // Whether this is the last chunk
console.log(text.removed); // Whether text was removed

// Manipulation
text.before("Before text").after("After text").replace("New text").remove();

// HTML content insertion
text
.before("<span>before</span>", { html: true })
.after("<span>after</span>", { html: true })
.replace("<span>replace</span>", { html: true });
},
});
```

### Comment Operations

Comment handlers allow comment manipulation with similar methods to text nodes:

```ts
rewriter.on("*", {
comments(comment) {
// Content
console.log(comment.text); // Comment text
comment.text = "New comment text"; // Set comment text
console.log(comment.removed); // Whether comment was removed

// Manipulation
comment
.before("Before comment")
.after("After comment")
.replace("New comment")
.remove();

// HTML content insertion
comment
.before("<span>before</span>", { html: true })
.after("<span>after</span>", { html: true })
.replace("<span>replace</span>", { html: true });
},
});
```

### Document Handlers

The `onDocument(handlers)` method allows you to handle document-level events. These handlers are called for events that occur at the document level rather than within specific elements:

```ts
rewriter.onDocument({
// Handle doctype
doctype(doctype) {
console.log(doctype.name); // "html"
console.log(doctype.publicId); // public identifier if present
console.log(doctype.systemId); // system identifier if present
},
// Handle text nodes
text(text) {
console.log(text.text);
},
// Handle comments
comments(comment) {
console.log(comment.text);
},
// Handle document end
end(end) {
end.append("<!-- Footer -->", { html: true });
},
});
```

## Input Types

HTMLRewriter can transform HTML from various sources. The input is automatically handled based on its type:

```ts
// From Response
rewriter.transform(new Response("<div>content</div>"));

// From string
rewriter.transform("<div>content</div>");

// From ArrayBuffer
rewriter.transform(new TextEncoder().encode("<div>content</div>").buffer);

// From Blob
rewriter.transform(new Blob(["<div>content</div>"]));

// From File
rewriter.transform(Bun.file("index.html"));
```

### Response Handling

When transforming a Response:

- The status code, headers, and other response properties are preserved
- The body is transformed while maintaining streaming capabilities
- Content-encoding (like gzip) is handled automatically
- The original response body is marked as used after transformation
- Headers are cloned to the new response

## Error Handling

HTMLRewriter operations can throw errors in several cases:

- Invalid selector syntax in `on()` method
- Invalid HTML content in transformation methods
- Stream errors when processing Response bodies
- Memory allocation failures
- Invalid input types (e.g., passing Symbol)
- Body already used errors

Errors should be caught and handled appropriately:

```ts
try {
const result = rewriter.transform(input);
// Process result
} catch (error) {
console.error("HTMLRewriter error:", error);
}
```

## Performance

Bun's HTMLRewriter is built on [lol-html](https://github.com/cloudflare/lol-html), a low-level HTML rewriting library written in Rust. This provides excellent performance characteristics:

- Streaming processing of HTML content
- Minimal memory overhead
- No DOM construction unless necessary
- Efficient CSS selector matching
- Configurable parsing buffer sizes (defaults to max(input_size, 1024) bytes)

The implementation uses efficient memory management and streaming techniques:

- Content is processed in chunks
- Backpressure is handled automatically
- Memory is released as soon as possible
- Large files are handled efficiently
- Automatic handling of compressed content (gzip)

You can also read the [Cloudflare documentation](https://developers.cloudflare.com/workers/runtime-apis/html-rewriter/), which this API is intended to be compatible with.
Loading

0 comments on commit b754959

Please sign in to comment.