-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
70362c3
commit b754959
Showing
3 changed files
with
621 additions
and
27 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,31 +1,350 @@ | ||
Bun provides a fast native implementation of the `HTMLRewriter` pattern developed by Cloudflare. It provides a convenient, `EventListener`-like API for traversing and transforming HTML documents. | ||
HTMLRewriter lets you use CSS selectors to transform HTML documents. Bun's implementation is based on Cloudflare's [lol-html](https://github.com/cloudflare/lol-html). | ||
|
||
## Usage | ||
|
||
A common usecase is rewriting URLs in HTML content. Here's an example that rewrites image sources and link URLs to use a CDN domain: | ||
|
||
```ts | ||
const rewriter = new HTMLRewriter(); | ||
// Replace all images with a rickroll | ||
const rewriter = new HTMLRewriter().on("img", { | ||
element(img) { | ||
// Famous rickroll video thumbnail | ||
img.setAttribute( | ||
"src", | ||
"https://img.youtube.com/vi/dQw4w9WgXcQ/maxresdefault.jpg", | ||
); | ||
|
||
rewriter.on("*", { | ||
element(el) { | ||
console.log(el.tagName); // "body" | "div" | ... | ||
// Wrap the image in a link to the video | ||
img.before( | ||
'<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ" target="_blank">', | ||
{ html: true }, | ||
); | ||
img.after("</a>", { html: true }); | ||
|
||
// Add some fun alt text | ||
img.setAttribute("alt", "Definitely not a rickroll"); | ||
}, | ||
}); | ||
``` | ||
|
||
To parse and/or transform the HTML: | ||
|
||
```ts#rewriter.ts | ||
rewriter.transform( | ||
new Response(` | ||
<!DOCTYPE html> | ||
// Example usage | ||
const html = ` | ||
<html> | ||
<!-- comment --> | ||
<head> | ||
<title>My First HTML Page</title> | ||
</head> | ||
<body> | ||
<h1>My First Heading</h1> | ||
<p>My first paragraph.</p> | ||
<img src="/cat.jpg"> | ||
<img src="dog.png"> | ||
<img src="https://example.com/bird.webp"> | ||
</body> | ||
`)); | ||
</html> | ||
`; | ||
|
||
const result = rewriter.transform(html); | ||
console.log(result); | ||
``` | ||
|
||
This replaces all images with a thumbnail of Rick Astley and wraps each `<img>` in a link, producing a diff like this: | ||
|
||
```html-diff | ||
<html> | ||
<body> | ||
- <img src="/cat.jpg"> | ||
- <img src="dog.png"> | ||
- <img src="https://example.com/bird.webp"> | ||
+ <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ" target="_blank"> | ||
+ <img src="https://img.youtube.com/vi/dQw4w9WgXcQ/maxresdefault.jpg" alt="Definitely not a rickroll"> | ||
+ </a> | ||
+ <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ" target="_blank"> | ||
+ <img src="https://img.youtube.com/vi/dQw4w9WgXcQ/maxresdefault.jpg" alt="Definitely not a rickroll"> | ||
+ </a> | ||
+ <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ" target="_blank"> | ||
+ <img src="https://img.youtube.com/vi/dQw4w9WgXcQ/maxresdefault.jpg" alt="Definitely not a rickroll"> | ||
+ </a> | ||
</body> | ||
</html> | ||
``` | ||
|
||
Now every image on the page will be replaced with a thumbnail of Rick Astley, and clicking any image will lead to [a very famous video](https://www.youtube.com/watch?v=dQw4w9WgXcQ). | ||
|
||
## Features | ||
|
||
### Element Handlers | ||
|
||
The `on(selector, handlers)` method allows you to register handlers for HTML elements that match a CSS selector. The handlers are called for each matching element during parsing: | ||
|
||
```ts | ||
rewriter.on("div.content", { | ||
// Handle elements | ||
element(element) { | ||
element.setAttribute("class", "new-content"); | ||
element.append("<p>New content</p>", { html: true }); | ||
}, | ||
// Handle text nodes | ||
text(text) { | ||
text.replace("new text"); | ||
}, | ||
// Handle comments | ||
comments(comment) { | ||
comment.remove(); | ||
}, | ||
}); | ||
``` | ||
|
||
View the full documentation on the [Cloudflare website](https://developers.cloudflare.com/workers/runtime-apis/html-rewriter/). | ||
The handlers can be asynchronous and return a Promise. Note that async operations will block the transformation until they complete: | ||
|
||
```ts | ||
rewriter.on("div", { | ||
async element(element) { | ||
await Bun.sleep(1000); | ||
element.setInnerContent("<span>replace</span>", { html: true }); | ||
}, | ||
}); | ||
``` | ||
|
||
### CSS Selector Support | ||
|
||
The `on()` method supports a wide range of CSS selectors: | ||
|
||
```ts | ||
// Tag selectors | ||
rewriter.on("p", handler); | ||
|
||
// Class selectors | ||
rewriter.on("p.red", handler); | ||
|
||
// ID selectors | ||
rewriter.on("h1#header", handler); | ||
|
||
// Attribute selectors | ||
rewriter.on("p[data-test]", handler); // Has attribute | ||
rewriter.on('p[data-test="one"]', handler); // Exact match | ||
rewriter.on('p[data-test="one" i]', handler); // Case-insensitive | ||
rewriter.on('p[data-test="one" s]', handler); // Case-sensitive | ||
rewriter.on('p[data-test~="two"]', handler); // Word match | ||
rewriter.on('p[data-test^="a"]', handler); // Starts with | ||
rewriter.on('p[data-test$="1"]', handler); // Ends with | ||
rewriter.on('p[data-test*="b"]', handler); // Contains | ||
rewriter.on('p[data-test|="a"]', handler); // Dash-separated | ||
|
||
// Combinators | ||
rewriter.on("div span", handler); // Descendant | ||
rewriter.on("div > span", handler); // Direct child | ||
|
||
// Pseudo-classes | ||
rewriter.on("p:nth-child(2)", handler); | ||
rewriter.on("p:first-child", handler); | ||
rewriter.on("p:nth-of-type(2)", handler); | ||
rewriter.on("p:first-of-type", handler); | ||
rewriter.on("p:not(:first-child)", handler); | ||
|
||
// Universal selector | ||
rewriter.on("*", handler); | ||
``` | ||
|
||
### Element Operations | ||
|
||
Elements provide various methods for manipulation. All modification methods return the element instance for chaining: | ||
|
||
```ts | ||
rewriter.on("div", { | ||
element(el) { | ||
// Attributes | ||
el.setAttribute("class", "new-class").setAttribute("data-id", "123"); | ||
|
||
const classAttr = el.getAttribute("class"); // "new-class" | ||
const hasId = el.hasAttribute("id"); // boolean | ||
el.removeAttribute("class"); | ||
|
||
// Content manipulation | ||
el.setInnerContent("New content"); // Escapes HTML by default | ||
el.setInnerContent("<p>HTML content</p>", { html: true }); // Parses HTML | ||
el.setInnerContent(""); // Clear content | ||
|
||
// Position manipulation | ||
el.before("Content before") | ||
.after("Content after") | ||
.prepend("First child") | ||
.append("Last child"); | ||
|
||
// HTML content insertion | ||
el.before("<span>before</span>", { html: true }) | ||
.after("<span>after</span>", { html: true }) | ||
.prepend("<span>first</span>", { html: true }) | ||
.append("<span>last</span>", { html: true }); | ||
|
||
// Removal | ||
el.remove(); // Remove element and contents | ||
el.removeAndKeepContent(); // Remove only the element tags | ||
|
||
// Properties | ||
console.log(el.tagName); // Lowercase tag name | ||
console.log(el.namespaceURI); // Element's namespace URI | ||
console.log(el.selfClosing); // Whether element is self-closing (e.g. <div />) | ||
console.log(el.canHaveContent); // Whether element can contain content (false for void elements like <br>) | ||
console.log(el.removed); // Whether element was removed | ||
|
||
// Attributes iteration | ||
for (const [name, value] of el.attributes) { | ||
console.log(name, value); | ||
} | ||
|
||
// End tag handling | ||
el.onEndTag(endTag => { | ||
endTag.before("Before end tag"); | ||
endTag.after("After end tag"); | ||
endTag.remove(); // Remove the end tag | ||
console.log(endTag.name); // Tag name in lowercase | ||
}); | ||
}, | ||
}); | ||
``` | ||
|
||
### Text Operations | ||
|
||
Text handlers provide methods for text manipulation. Text chunks represent portions of text content and provide information about their position in the text node: | ||
|
||
```ts | ||
rewriter.on("p", { | ||
text(text) { | ||
// Content | ||
console.log(text.text); // Text content | ||
console.log(text.lastInTextNode); // Whether this is the last chunk | ||
console.log(text.removed); // Whether text was removed | ||
|
||
// Manipulation | ||
text.before("Before text").after("After text").replace("New text").remove(); | ||
|
||
// HTML content insertion | ||
text | ||
.before("<span>before</span>", { html: true }) | ||
.after("<span>after</span>", { html: true }) | ||
.replace("<span>replace</span>", { html: true }); | ||
}, | ||
}); | ||
``` | ||
|
||
### Comment Operations | ||
|
||
Comment handlers allow comment manipulation with similar methods to text nodes: | ||
|
||
```ts | ||
rewriter.on("*", { | ||
comments(comment) { | ||
// Content | ||
console.log(comment.text); // Comment text | ||
comment.text = "New comment text"; // Set comment text | ||
console.log(comment.removed); // Whether comment was removed | ||
|
||
// Manipulation | ||
comment | ||
.before("Before comment") | ||
.after("After comment") | ||
.replace("New comment") | ||
.remove(); | ||
|
||
// HTML content insertion | ||
comment | ||
.before("<span>before</span>", { html: true }) | ||
.after("<span>after</span>", { html: true }) | ||
.replace("<span>replace</span>", { html: true }); | ||
}, | ||
}); | ||
``` | ||
|
||
### Document Handlers | ||
|
||
The `onDocument(handlers)` method allows you to handle document-level events. These handlers are called for events that occur at the document level rather than within specific elements: | ||
|
||
```ts | ||
rewriter.onDocument({ | ||
// Handle doctype | ||
doctype(doctype) { | ||
console.log(doctype.name); // "html" | ||
console.log(doctype.publicId); // public identifier if present | ||
console.log(doctype.systemId); // system identifier if present | ||
}, | ||
// Handle text nodes | ||
text(text) { | ||
console.log(text.text); | ||
}, | ||
// Handle comments | ||
comments(comment) { | ||
console.log(comment.text); | ||
}, | ||
// Handle document end | ||
end(end) { | ||
end.append("<!-- Footer -->", { html: true }); | ||
}, | ||
}); | ||
``` | ||
|
||
## Input Types | ||
|
||
HTMLRewriter can transform HTML from various sources. The input is automatically handled based on its type: | ||
|
||
```ts | ||
// From Response | ||
rewriter.transform(new Response("<div>content</div>")); | ||
|
||
// From string | ||
rewriter.transform("<div>content</div>"); | ||
|
||
// From ArrayBuffer | ||
rewriter.transform(new TextEncoder().encode("<div>content</div>").buffer); | ||
|
||
// From Blob | ||
rewriter.transform(new Blob(["<div>content</div>"])); | ||
|
||
// From File | ||
rewriter.transform(Bun.file("index.html")); | ||
``` | ||
|
||
### Response Handling | ||
|
||
When transforming a Response: | ||
|
||
- The status code, headers, and other response properties are preserved | ||
- The body is transformed while maintaining streaming capabilities | ||
- Content-encoding (like gzip) is handled automatically | ||
- The original response body is marked as used after transformation | ||
- Headers are cloned to the new response | ||
|
||
## Error Handling | ||
|
||
HTMLRewriter operations can throw errors in several cases: | ||
|
||
- Invalid selector syntax in `on()` method | ||
- Invalid HTML content in transformation methods | ||
- Stream errors when processing Response bodies | ||
- Memory allocation failures | ||
- Invalid input types (e.g., passing Symbol) | ||
- Body already used errors | ||
|
||
Errors should be caught and handled appropriately: | ||
|
||
```ts | ||
try { | ||
const result = rewriter.transform(input); | ||
// Process result | ||
} catch (error) { | ||
console.error("HTMLRewriter error:", error); | ||
} | ||
``` | ||
|
||
## Performance | ||
|
||
Bun's HTMLRewriter is built on [lol-html](https://github.com/cloudflare/lol-html), a low-level HTML rewriting library written in Rust. This provides excellent performance characteristics: | ||
|
||
- Streaming processing of HTML content | ||
- Minimal memory overhead | ||
- No DOM construction unless necessary | ||
- Efficient CSS selector matching | ||
- Configurable parsing buffer sizes (defaults to max(input_size, 1024) bytes) | ||
|
||
The implementation uses efficient memory management and streaming techniques: | ||
|
||
- Content is processed in chunks | ||
- Backpressure is handled automatically | ||
- Memory is released as soon as possible | ||
- Large files are handled efficiently | ||
- Automatic handling of compressed content (gzip) | ||
|
||
You can also read the [Cloudflare documentation](https://developers.cloudflare.com/workers/runtime-apis/html-rewriter/), which this API is intended to be compatible with. |
Oops, something went wrong.