HTML search: serialize search index as JSON #13097

jayaddison · 2024-11-02T02:21:50Z

Feature or Bugfix

Bugfix

Purpose

Prevent queries for the JavaScript class prototype reference propery name from causing errors in the HTML search JavaScript code provided in the basic theme.

Detail

Ensure that the user input query term is found as a named property of the relevant index objects before accessing them, using Object.hasOwnProperty.
~~Represent the HTML search indices as JavaScript Map instances instead of Object literals.~~
Serialize the search index content as JSON, so that terms named __proto__ can be serialized and loaded.

Relates

Resolves HTML search: bug with JS-prototype-property query #13096.
~~May resolve # 13098 (conditional on inclusion of JSON.parse reviver / Object.freeze logic)~~

Edit: update description to reflect JavaScript Map-based approach.
Edit: update description to reflect JSON-serialization approach.

…fore accessing them

[skip ci]

wlach

I can see how this would be a problem but I'm a little unclear on how the solution actually works? See comments

wlach · 2024-11-05T16:14:33Z

sphinx/themes/basic/static/searchtools.js

@@ -514,8 +514,8 @@ const Search = {
    searchTerms.forEach((word) => {
      const files = [];
      const arr = [
-        { files: terms[word], score: Scorer.term },
-        { files: titleTerms[word], score: Scorer.title },
+        { files: terms.hasOwnProperty(word) ? terms[word] : [], score: Scorer.term },


I'm actually a little confused by how this might work, as it doesn't appear that JavaScript supports setting the __proto__ property in an object. Am I missing something? Should we be using a map to store these terms instead?

When parsing JSON, an object key named "__proto__" is accepted, and converted to the JavaScript object key ["__proto__"] -- distinct from the __proto__ property (a reference to a class prototype).

During read-access, introducing hasOwnProperty allows us to filter out the latter, so that we don't mistakenly retrieve the prototype class reference from terms.

I'm not certain whether the below example will help to clarify, but I find experimentation helpful to understand some of the behaviours:

> x = JSON.parse('{}') {} > y = JSON.parse('{"__proto__": "testing"}') { ['__proto__']: 'testing' } > '__proto__' in x true > x.hasOwnProperty('__proto__') false > '__proto__' in y true > y.hasOwnProperty('__proto__') true

I forgot to respond to your Map question: that is certainly possible, although I feel that doing so might add some runtime overhead (since we'd need to load the native JS object contents from the index into respective Map objects).

As I understand it, Map objects are also more difficult to make immutable in JavaScript than Object and Array instances. Freezing a Map object doesn't prevent its private contents from being updated:

> x = {'foo': 'bar'} { foo: 'bar' } > Object.freeze(x) { foo: 'bar' } > x['foo'] = 'baz' 'baz' > x['foo'] 'bar' > y = new Map([['foo', 'bar']]) Map(1) { 'foo' => 'bar' } > Object.freeze(y) Map(1) { 'foo' => 'bar' } > y.set('foo', 'baz') Map(1) { 'foo' => 'baz' } > y.get('foo') 'baz'

So another factor is that, to the extent that I'd like to see the search index contents become immutable at load-time in future (#13098), I'd prefer not to migrate to the Map type unless there is a compelling reason to (maybe avoiding prototype pollution is that reason -- but I don't think we're exposed to that currently).

Ok got it, this makes sense to me re: the property. TIL!

I'm a little bit less sure about whether the index should be immutable, but happy to defer that discussion to another day.

Actually, one extra piece of optional feedback: since this is kind of counter-intuitive, you might want to add a comment on this line explaining what you're doing.

Yep, OK - I'll try to add some phrasing to explain this succinctly. Thanks for the code review!

tests/js/roots/ecmascript/index.rst

The JS prototype property should _not_ be found in unrelated documents

wlach · 2024-11-11T13:53:00Z

sphinx/themes/basic/static/searchtools.js

@@ -514,8 +514,8 @@ const Search = {
    searchTerms.forEach((word) => {
      const files = [];
      const arr = [
-        { files: terms[word], score: Scorer.term },
-        { files: titleTerms[word], score: Scorer.title },
+        { files: terms.hasOwnProperty(word) ? terms[word] : [], score: Scorer.term },


Ok got it, this makes sense to me re: the property. TIL!

I'm a little bit less sure about whether the index should be immutable, but happy to defer that discussion to another day.

sphinx/themes/basic/static/searchtools.js

AA-Turner · 2024-11-13T22:03:42Z

Would this work as an alternative?

    // in Search.setIndex()
    const frozen_index = JSON.parse(index, (key, value) => {
      typeof value === "object" && value !== null
        ? Object.freeze(value)
        : value;
    });

If we change Search.setIndex() to taking a JSON string, it seems we could use the 'reviver' function to freeze arrays and objects?

A

jayaddison · 2024-11-14T01:08:49Z

@AA-Turner although I'd recommend some performance testing, that could be useul for #13098, yep.

jayaddison · 2024-11-14T11:53:40Z

In fact, no, my mistake: JSON.parse may in fact be necessary to solve this robustly. The test appears only to be passing due to the fact that __proto__ is found in the objects part of the searchindex.js file. The terms entry for __proto__ is not being handled correctly.

… index contents Co-authored-by: Adam Turner <[email protected]>

jayaddison · 2024-11-14T12:04:31Z

I've pushed the change to use JSON.parse and reviver -- a note of caution, though: when I was testing that against a JSON-ized version of the Python documentation searchindex.js file yesterday, I was encountering ~200ms duration within the setIndex function in my browser.

wlach · 2024-11-14T12:13:56Z

Personally, I'd be inclined to merge the previous version of this and address that (if at all) in a followup. Trying to "freeze" the index here feels like scope creep to me and it would be nice to get this fix in.

jayaddison · 2024-11-14T12:19:35Z

That's reasonable, yep, and I've no strong opinions on whether we include or omit the Object.freeze aspect. However, without 462c859, I don't believe this branch would genuinely fix the reported bug.

The `JSON.dumps` call escapes single-quotes within strings; all JSON strings are enclosed by double-quotes

jayaddison · 2024-11-14T12:28:08Z

Without the reviver function, I recall the setIndex + JSON.parse performance overhead being in the order of ~60ms on my local machine. I'll confirm those stats in some more detail within the next day or so. I get the feeling that some of this will be coupled to the browser+implementation (Firefox) details, so I may try to compare to Chrom[e|ium] also.

…e search index contents" This reverts commit 443fd21.

[skip ci]

Relates-to commit d4bba1e. [skip ci]

jayaddison · 2024-11-16T11:47:32Z

tests/test_search.py

+    indices = ast.literal_eval(
+        searchindex[16:-1]
+        .replace('new Map', '')
+        .replace('null', 'None')
+        .replace('false', 'False')
+    )


This is fragile and inelegant; it's a compromise to avoid introducing the complexity of a Python-implemented JavaScript parser dependency given the use of new Map(...) within searchindex.js files.

Another alternative would be to dump the index as JSON, then reload it into the actual format we want. Would almost certainly be a little slower, but probably not disastrously so.

Could you explain that in a little more detail? I think I understand what you mean, but I'd like to make sure.

Yeah I mean actually just writing out a JSON file and having the frontend code load it seperately via a fetch call. I'm not actually sure it would be slower, now that I think about it-- this post suggests this will be faster on the interpreter side: https://v8.dev/blog/cost-of-javascript-2019#json

Ok, thanks! (I wasn't entirely sure whether you were suggesting JSON format output only for the unit test code - using it in both seems more robust).

Let's give searchindex.json a try, in that case. It'll make a few things simpler, although the searchtools.js code may lose some of the readability improvements (we'll want to begin/reintroduce using obj.hasOwnProperty again in a few places).

We may also want to be careful about where we place the fetch call so that the browser doesn't block for unnecessarily long before invoking that (perhaps especially relevant for the case of downstream themes that include their own scripting that may also run on each page load).

Let's give searchindex.json a try, in that case. It'll make a few things simpler, although the searchtools.js code may lose some of the readability improvements (we'll want to begin/reintroduce using obj.hasOwnProperty again in a few places).

Personally I would read the search index as JSON, then convert it into a Map (i.e. building on the work you already did).

I'd prefer to avoid multiple read passes over the search index data on the client-side, if possible. Similarly that makes me reluctant to use the reviver function in JSON.parse, and I considered it a downside of the freeze-and-deproto code I had implemented in #13098.

Aside from code clarity, do you anticipate many benefits from the transformation to Map datastructures?

Aside from code clarity, do you anticipate many benefits from the transformation to Map datastructures?

Not really, my guess is that getting entries out of a map will be a little faster (since it's a datastructure purely oriented around storing and retrieving keys/values) but probably not noticeably so. I think it's at least worth exploring though. The conversion should just be a few extra lines of code after you've loaded the JSON structure.

Mm, I agree it'd only be a few lines of code, but I don't think that it'd be worth the resulting compute resource expenditure to do that.

Generally I feel inclined to disentangle the two aspects of this pull request: adding support for __proto__ as an indexed/queryable term, and preventing __proto__ queries from accessing JavaScript object prototypes.

The latter seems more important (it's a bug) - the former is a feature request that hasn't really been requested by anyone other than myself during development.

I've extracted the bugfix part of this changeset into #13153; I don't have any plans currently to add feature support for term/title-term matching of the query term __proto__ itself.

[skip ci]

jayaddison · 2024-11-16T17:45:57Z

Without the reviver function, I recall the setIndex + JSON.parse performance overhead being in the order of ~60ms on my local machine. I'll confirm those stats in some more detail within the next day or so. I get the feeling that some of this will be coupled to the browser+implementation (Firefox) details, so I may try to compare to Chrom[e|ium] also.

I haven't gotten around to this comprehensive performance testing yet, but want to share some notes:

Initial results

Initially the native Map approach (up to fae81a0) seems similar in runtime performance to the existing/baseline approach (object literals), based on my testing using Sphinx's self-built documentation.

The page-load searchindex-evaluation time on my machine for a query for phin (~700 results) is ~30ms in Firefox, compared to a ~20ms baseline. Those are both relatively small compared to a ~0.5s duration for display/rendering of the search results, but arguably that is a pessimal query (a more-precise match would reduce the display duration).

There is a small amount of index size bloat -- not so much from the new Map( and ) text overhead, but more from the additional Array literals (e.g. square brackets) used to format their entries instead of object literals. This is approximately 5% overhead on the test dataset; 541857 bytes compared to 516691 baseline.

Observations so far

There seem to be multiple factors to balance: Python complexity, JavaScript complexity, code safety (both datastructure immutability, and also prototype access), runtime performance and index filesize.

I'm not sure how much to explore and how to present my thoughts/findings. Generally my feeling is that the Map approach here (again, up to fae81a0) seems good on most criteria except for Python complexity (particularly the unit tests, but also the application serialization code is more complex than it was) and immutability (that it does not address, in my opinion). The tc39 Records-and-Tuples proposal, I think, would be near-optimal on all factors -- but we could be years away from that reaching widespread adoption.

wlach

There is a small amount of index size bloat -- not so much from the new Map( and ) text overhead, but more from the additional Array literals (e.g. square brackets) used to format their entries instead of object literals. This is approximately 5% overhead on the test dataset; 541857 bytes compared to 516691 baseline.

My guess is that most of this difference (which is relatively small) would disappear with compression.

There seem to be multiple factors to balance: Python complexity, JavaScript complexity, code safety (both datastructure immutability, and also prototype access), runtime performance and index filesize.

Yep, these things are always about tradeoffs.

I'm not sure how much to explore and how to present my thoughts/findings. Generally my feeling is that the Map approach here (again, up to fae81a0) seems good on most criteria except for Python complexity (particularly the unit tests, but also the application serialization code is more complex than it was) and immutability (that it does not address, in my opinion). The tc39 Records-and-Tuples proposal, I think, would be near-optimal on all factors -- but we could be years away from that reaching widespread adoption.

I think it's a pretty good solution. I'll defer approving in case you want to experiment / document more, but I think this fixes the bug while still being reasonable to maintain.

wlach · 2024-11-19T12:43:12Z

sphinx/search/__init__.py

+            for name, entries in sorted(data.items())
+        }
+        data_js = '{' + ','.join(f'{k}:{v}' for k, v in js_indices.items()) + '}'
+        return self.PREFIX + data_js + self.SUFFIX


Could we use a jinja template for the search index? That might make this look a little better. Seems to be one of Sphinx's dependencies:

sphinx/pyproject.toml

Line 65 in 80ed1b4

"Jinja2>=3.1",

Definitely worth adding a comment or two explaining what's going on.

I'm mulling over the suggestion; making the logic and output format of the method clearer would both be helpful, but I'd probably prefer not to include an additional technology in the mix (even though as you mention, it's already available as an application dependency).

I think we should use it if it's available and it makes things clearer. It's usage elsewhere in Sphinx is pretty extensive: https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20jinja&type=code

(although dumping the index as JSON is perhaps a better alternative)

@wlach I'm fairly sure Jinja was originally written for use in Sphinx / Flask -- they're all originally Pocco projects!

sphinx/themes/basic/static/searchtools.js

wlach · 2024-11-19T12:46:11Z

tests/test_search.py

+    indices = ast.literal_eval(
+        searchindex[16:-1]
+        .replace('new Map', '')
+        .replace('null', 'None')
+        .replace('false', 'False')
+    )


Another alternative would be to dump the index as JSON, then reload it into the actual format we want. Would almost certainly be a little slower, but probably not disastrously so.

jayaddison · 2024-11-19T22:02:20Z

There is a small amount of index size bloat -- not so much from the new Map( and ) text overhead, but more from the additional Array literals (e.g. square brackets) used to format their entries instead of object literals. This is approximately 5% overhead on the test dataset; 541857 bytes compared to 516691 baseline.

My guess is that most of this difference (which is relatively small) would disappear with compression.

I should have thought to evaluate that at the same time - from a quick check with brotli - indeed, the difference becomes fairly insigificant (102833 vs 102137, less than 1%). Perhaps it's helpful for compression that the characterset becomes slightly more predictable.

Conflicts: CHANGES.rst

Ref: https://fetch.spec.whatwg.org/#fetch-api

jayaddison · 2024-11-24T17:29:50Z

That's reasonable, yep, and I've no strong opinions on whether we include or omit the Object.freeze aspect. However, without 462c859, I don't believe this branch would genuinely fix the reported bug.

This was an error in my reasoning, and it may have caused a lot of unnecessary confusion (I confused myself with it, too).

The important part of the bug to fix here, in my opinion, is that a specific query -- for the term __proto__ -- is not handled correctly and produces an error and search result entitled undefined.

My change in 462c859 introduced something tangentially related, but different -- it added support for search index representation of, and subsequent querying of, __proto__ -- and doing that unnecessarily complicates the fix, because it introduces the various other tradeoff considerations we've been discussing.

My apologies for the extra work and meandering explorations caused by this.

wlach · 2024-11-28T12:32:21Z

My apologies for the extra work and meandering explorations caused by this.

No worries! I think this was a useful exercise and I think we might still want some version of your later work on this. Agreed that getting a small fix in for the originally reported bug is a good incremental step worth doing on its own though.

jayaddison added 3 commits November 2, 2024 02:09

search: check that query terms exist as properties in term indices be…

47f6547

…fore accessing them

Tests: fixtures: add document title for ECMAScript testroot

713d6bb

Tests: add JavaScript test coverage for prototype-property-name query

0fff170

jayaddison added html search javascript Pull requests that update Javascript code labels Nov 2, 2024

Add CHANGES.rst entry

64d6e09

[skip ci]

jayaddison mentioned this pull request Nov 3, 2024

HTML search: sanitise 'searchindex.js' contents #13099

Closed

jayaddison requested a review from wlach November 4, 2024 21:02

wlach reviewed Nov 5, 2024

View reviewed changes

Tests: add negative-test case

9bab17c

The JS prototype property should _not_ be found in unrelated documents

wlach approved these changes Nov 11, 2024

View reviewed changes

jayaddison commented Nov 11, 2024

View reviewed changes

sphinx/themes/basic/static/searchtools.js Show resolved Hide resolved

jayaddison and others added 4 commits November 11, 2024 17:27

search: add explanatory comment for document title/text term lookup

9b1edac

search: retain undefined result instead of [] for non-matching terms

52b9de3

search: add explanatory comment regarding hasOwnProperty usage

bcb02fe

Merge branch 'master' into issue-13096/html-search-handle-proto-query

7267990

jayaddison added the DO NOT MERGE label Nov 14, 2024

jayaddison and others added 3 commits November 14, 2024 12:01

HTML search: encapsulate the search index content within a JSON string

462c859

HTML search: use the JSON.parse reviver argument to freeze search…

443fd21

… index contents Co-authored-by: Adam Turner <[email protected]>

Merge branch 'master' into issue-13096/html-search-handle-proto-query

295d230

jayaddison added the type:performance label Nov 14, 2024

Tests: fixup: accommodate adjusted setIndex prefix/suffix lengths

3498755

HTML search: replace repr with simple embedded single-quotes

f38ce20

The `JSON.dumps` call escapes single-quotes within strings; all JSON strings are enclosed by double-quotes

jayaddison added 12 commits November 15, 2024 19:00

Revert "HTML search: use the JSON.parse reviver argument to freez…

0d9a26c

…e search index contents" This reverts commit 443fd21.

WIP: HTML search: use JS Maps in serialized index

d4bba1e

[skip ci]

HTML search: fixup for object search

d9dee52

Relates-to commit d4bba1e. [skip ci]

HTML search: ensure array-format index entries are sorted

d056a8a

HTML search: refactor-out redundant code

15a7dfa

HTML search: rectify variable names

6e670ea

HTML search: compress index representation

f2848ed

HTML search: enclose index names in double-quotes

798cb48

Tests: HTML search: use ast.literal_eval to parse index contents

af61293

Tests: HTML search: update test expectation

2bcf7a6

HTML search: cleanup: remove redundant assert statement

381c26f

HTML search: typing/lint fixups for ruff and mypy

ac281dc

jayaddison commented Nov 16, 2024

View reviewed changes

Edit CHANGES.rst entry

fae81a0

[skip ci]

jayaddison changed the title ~~HTML search: check for query terms presence in index term properties~~ HTML search: represent indices using JavaScript Map instead of object literals Nov 16, 2024

wlach self-requested a review November 19, 2024 12:40

wlach reviewed Nov 19, 2024

View reviewed changes

jayaddison added 2 commits November 24, 2024 12:58

HTML search: use JSON format for the search index

6d5a121

Merge branch 'master' into issue-13096/html-search-handle-proto-query

39b4a4e

Conflicts: CHANGES.rst

jayaddison changed the title ~~HTML search: represent indices using JavaScript Map instead of object literals~~ HTML search: serialize search index as JSON Nov 24, 2024

jayaddison added 2 commits November 24, 2024 13:07

Tests: cleanup: remove unused ast import

7428932

HTML search: refactor loader to use Fetch API

22acc49

Ref: https://fetch.spec.whatwg.org/#fetch-api

wlach self-requested a review November 24, 2024 13:32

jayaddison closed this Nov 24, 2024

jayaddison deleted the issue-13096/html-search-handle-proto-query branch November 24, 2024 17:05

jayaddison mentioned this pull request Nov 24, 2024

html search: add safety check before index property accesses #13153

Open

HTML search: serialize search index as JSON #13097

HTML search: serialize search index as JSON #13097

Conversation

jayaddison commented Nov 2, 2024 • edited Loading

Feature or Bugfix

Purpose

Detail

Relates

wlach left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AA-Turner commented Nov 13, 2024

jayaddison commented Nov 14, 2024

jayaddison commented Nov 14, 2024

jayaddison commented Nov 14, 2024

wlach commented Nov 14, 2024

jayaddison commented Nov 14, 2024

jayaddison commented Nov 14, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wlach Nov 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jayaddison commented Nov 16, 2024

Initial results

Observations so far

wlach left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jayaddison commented Nov 19, 2024

jayaddison commented Nov 24, 2024

wlach commented Nov 28, 2024

jayaddison commented Nov 2, 2024 •

edited

Loading

wlach Nov 24, 2024 •

edited

Loading