Commit 9d91cc3

RFC-002: Rule-based content manipulation
1 parent 66f75b3 commit 9d91cc3

1 file changed: 145 additions & 0 deletions

@@ -0,0 +1,145 @@
---
state: Draft
start-date: 2024-07-07
author: Rodrigo Arias Mallo <[email protected]>
---

# Dillo RFC 002 - Rule-based content manipulation

## Abstract

This RFC defines a rule-based language that describes how to manipulate content
as it is fetched or requested by Dillo. The rule mechanism allows rewriting web
pages, translating other file formats to HTML and implementing new protocols.
It supersedes the current DPI infrastructure.

## Motivation

One of the shortcomings of the Dillo plugin mechanism (DPI) is that it can only
operate at the protocol level. That is, a program is assigned to a protocol, for
example "gemini:", and then every request for a URI of that protocol is
forwarded to the given plugin.

The drawback of this design is that it mixes the content with the protocol. In
the case of the Gemini protocol, the usual file format is Gemtext, which is
similar to Markdown. However, if a Gemtext file is fetched via HTTP or locally
via the "file:" protocol, there is currently no way to translate it into HTML
in the same way a Gemini plugin would.

Another problem with the current design is that it can only operate at the
granularity of complete requests. For example, when the user clicks on a link
that uses a given protocol, the whole request is forwarded to the corresponding
plugin, with no other possibility.

By allowing plugins to rewrite content on their own, they can use information
from the current request to determine how to perform the rewrite. For example,
a plugin may operate only on a set of domains, or only when certain HTTP
headers are present in the response.

## Design considerations

The goal of the design is to have a flexible mechanism to describe how to
perform the manipulation, while keeping it simple for users to understand.

### Rule language

Using a simple rule language we can build a set of rules that can be evaluated
quickly at runtime. These rules can run arbitrary commands specified by the
user, and those commands can manipulate the traffic.

Rules can also behave as endpoints, so they can implement protocols on their
own.

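As a rough illustration of a rule running a user-specified command, the sketch below pipes a response body through an external program and uses its output as the rewritten body. This RFC does not define the rule syntax or how commands are invoked, so the `run_filter` helper, the timeout and the example command are all assumptions made for the example, and Python is used only for brevity.

```python
import subprocess
import sys


def run_filter(command, body):
    """Pipe `body` (bytes) through `command` and return its standard output.

    `command` is an argument list, as accepted by subprocess.run().
    """
    result = subprocess.run(
        command,
        input=body,
        stdout=subprocess.PIPE,
        check=True,   # raise CalledProcessError on a non-zero exit status
        timeout=5,    # assumption: don't let a stuck filter stall the page
    )
    return result.stdout


# Hypothetical rule action. An inline Python one-liner stands in for a user
# script so the example does not depend on external tools.
UPPERCASE = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

page = b"<p>hello</p>"
filtered = run_filter(UPPERCASE, page)
```

Spawning a process per response is expensive, which is one reason to decide cheaply which rules apply before any command is run.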
### Performance

As users can add a long list of manipulations with complicated matching
criteria, we should ensure that we don't introduce much overhead in each
request or response.

One way to avoid this overhead is to have a restricted set of rules that can
only operate on data that Dillo has already parsed, so it doesn't have to be
parsed again by each plugin.

### Domain matching

Let's consider the case where we want to match a particular domain. If we let
each plugin determine whether the domain has to be intercepted, every plugin
would be executed on each request. However, by having a single hash table that
maps each domain to the plugins that should process it, we can determine where
to reroute the request in O(1) time.

Similarly, we could allow users to match domains by using a regex, but that
would be much more costly, as we would have to evaluate all the regex rules for
every request. A simple solution is to match the domain first, and then use the
regex to further restrict the match. This distributes the regex matching
overhead among the domains, as only the regexes attached to the matched domain
are evaluated.

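The two-step lookup described above can be sketched as follows. This is a minimal illustration rather than Dillo code: the `RuleTable` class, the plugin names, and the choice to apply the regex to the full URL are assumptions made for the example.

```python
import re
from collections import defaultdict


class RuleTable:
    """Hypothetical rule index: exact domain first, optional regex second."""

    def __init__(self):
        # domain -> list of (compiled regex or None, plugin name)
        self._by_domain = defaultdict(list)

    def add(self, domain, plugin, pattern=None):
        regex = re.compile(pattern) if pattern is not None else None
        self._by_domain[domain.lower()].append((regex, plugin))

    def match(self, domain, url):
        """Return the plugins whose rules match this request."""
        plugins = []
        # Average O(1) dictionary lookup. Regexes that belong to other
        # domains are never evaluated.
        for regex, plugin in self._by_domain.get(domain.lower(), ()):
            if regex is None or regex.search(url):
                plugins.append(plugin)
        return plugins


table = RuleTable()
table.add("example.org", "fix-css")
table.add("example.org", "feeds", pattern=r"/blog/")
table.add("example.net", "redirect")

hits = table.match("example.org", "https://example.org/blog/post.html")
```

A request for a domain without rules costs a single failed lookup, and the `/blog/` regex only runs for requests to `example.org`.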
### HTTP header matching

Rules may match when a header is present (or absent), or when it is present and
contains a given value. To avoid parsing the HTTP headers again, Dillo parses
them once and then matches the rules against the result.

Only the rules that match the domain (with the optional domain regex), or the
ones that apply to any domain, should be processed here.

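A header condition of this kind could look like the sketch below. It assumes that the already-parsed headers are available as a mapping with lower-cased names. The `header_matches` function and its parameters are hypothetical, and comparing values case-insensitively is a choice made for the example.

```python
def header_matches(headers, name, present=True, contains=None):
    """Check one header condition against already-parsed headers.

    `headers` maps lower-cased header names to their values.
    """
    value = headers.get(name.lower())
    if not present:
        return value is None          # rule asks for the header to be absent
    if value is None:
        return False                  # rule needs the header, but it is missing
    return contains is None or contains.lower() in value.lower()


# Headers as they might look after the response has been parsed once.
parsed = {"content-type": "text/gemini; charset=utf-8", "server": "nginx"}

is_gemtext = header_matches(parsed, "Content-Type", contains="text/gemini")
no_encoding = header_matches(parsed, "Content-Encoding", present=False)
has_cookie = header_matches(parsed, "Set-Cookie")
```

Because the function only reads the parsed mapping, evaluating many such conditions does not require touching the raw response again.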
## Implementation details

Dillo currently builds a chain of modules that perform some processing on the
incoming and outgoing data:

     (0) +--------+(1) +-------+(2) +------+(3) +-------+
    ---->| TLS IO |--->|  IO   |--->| HTTP |--->| CACHE |-...
     Net +--------+    +-------+    +------+    +-------+
         src/tls.c     src/IO.c    src/http.c   src/capi.c

The user should be able to decide at which stage the rules are hooked. For
example, at (0) the TLS traffic is still encrypted, so only a limited set of
actions can be performed there.

At (1,2) we see the HTTP traffic, but it is still compressed (if compression is
used). At (3) we see it uncompressed, and it is the last step before being
cached.

Here is an example where we introduce a new module "SED" that sees the incoming
uncompressed HTTP traffic and can perform modifications:

     Net +--------+    +-------+    +------+    +=====+    +-------+
    ---->| TLS IO |--->|  IO   |--->| HTTP |---># SED #--->| CACHE |-...
         +--------+    +-------+    +------+    +=====+    +-------+
         src/tls.c     src/IO.c    src/http.c      |       src/capi.c
                                                   |
                                              +---------+
                                              | rulesrc |
                                              |   ...   |
                                              +---------+

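The effect of hooking a filter between the HTTP and CACHE stages can be modelled with a toy pipeline. This is only a sketch: the stage functions below are stand-ins written for the example and do not mirror the interfaces in `src/http.c` or `src/capi.c`, and the rewrite rule is hard-coded instead of being read from a rules file.

```python
import gzip


def http_stage(raw):
    """Stand-in for the HTTP module: undo the gzip content encoding."""
    return gzip.decompress(raw)


def sed_stage(body):
    """Hypothetical SED module: apply one hard-coded rewrite rule."""
    return body.replace(b"<blink>", b"<b>").replace(b"</blink>", b"</b>")


def cache_stage(body, cache, url):
    """Stand-in for the cache: store whatever body reaches it."""
    cache[url] = body
    return body


def run_chain(raw, cache, url, filters=()):
    body = http_stage(raw)
    for apply_filter in filters:   # filters sit between HTTP and CACHE
        body = apply_filter(body)
    return cache_stage(body, cache, url)


cache = {}
wire = gzip.compress(b"<blink>hi</blink>")
run_chain(wire, cache, "http://example.org/", filters=[sed_stage])
```

Placing the filter before the cache means the rewritten body is what gets stored, so in this model a cached page would not need to be filtered again.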
## Feature creep

This design introduces more complexity into the Dillo code base. However,
managing this feature outside Dillo doesn't seem to be possible, as we need to
be able to reroute traffic at the different layers.

On the other hand, we can design the rule language so that it only allows
operations that are quick to evaluate at runtime, which reduces the overhead.

## Validation

When implemented, we should be able to do the following:

- Rewrite HTML pages to correct bugs or introduce new content, such as meta
  information in the `<head>` that is rewritten as visible HTML elements. RSS
  feeds are an example of such elements.

- Patch CSS per page. As we can hook the rules to match different properties, we
  can use them to inject new CSS rules or patch the given ones to the user's
  liking. This allows fixing broken rules or using fallback features while we
  add support for new CSS features.

- Handle HTTP error status codes like 404 or 500 and redirect them to the web
  archive.

- Redirect JS-only pages to alternatives that can be rendered in Dillo,
  similar to the [libredirect plugin](https://libredirect.github.io/).

- Replace the current limited DPI mechanism for plugins.
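
To make the first item of the list more concrete, here is a sketch of a filter that finds feed links declared in the `<head>` and repeats them as a visible paragraph at the top of the page. It is an illustration under assumptions, not a proposed implementation: the `FeedFinder` class, the `show_feeds` function and the banner markup are made up for the example.

```python
import re
from html import escape
from html.parser import HTMLParser

FEED_TYPES = ("application/rss+xml", "application/atom+xml")


class FeedFinder(HTMLParser):
    """Collect (title, href) pairs from <link rel="alternate"> feed tags."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link"
                and (a.get("rel") or "").lower() == "alternate"
                and a.get("type") in FEED_TYPES
                and a.get("href")):
            self.feeds.append((a.get("title") or a["href"], a["href"]))


def show_feeds(html):
    """Return `html` with the declared feeds listed right after <body>."""
    finder = FeedFinder()
    finder.feed(html)
    if not finder.feeds:
        return html
    links = " ".join(
        '<a href="%s">%s</a>' % (escape(href), escape(title))
        for title, href in finder.feeds
    )
    banner = "<p>Feeds: %s</p>" % links
    body = re.search(r"<body[^>]*>", html, re.IGNORECASE)
    if body is None:
        return banner + html       # no <body> tag: put the banner first
    return html[:body.end()] + banner + html[body.end():]


page = ('<html><head><link rel="alternate" type="application/rss+xml" '
        'title="News" href="/feed.xml"></head><body><h1>Hi</h1></body></html>')
rewritten = show_feeds(page)
```

A rule would run such a filter only on responses that are already known to be HTML, for instance by matching the `Content-Type` header first.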
