-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTML search: serialize search index as JSON #13097
HTML search: serialize search index as JSON #13097
Conversation
[skip ci]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see how this would be a problem but I'm a little unclear on how the solution actually works? See comments
@@ -514,8 +514,8 @@ const Search = { | |||
searchTerms.forEach((word) => { | |||
const files = []; | |||
const arr = [ | |||
{ files: terms[word], score: Scorer.term }, | |||
{ files: titleTerms[word], score: Scorer.title }, | |||
{ files: terms.hasOwnProperty(word) ? terms[word] : [], score: Scorer.term }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually a little confused by how this might work, as it doesn't appear that JavaScript supports setting the __proto__
property in an object. Am I missing something? Should we be using a map to store these terms instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When parsing JSON, an object key named "__proto__"
is accepted, and converted to the JavaScript object key ["__proto__"]
-- distinct from the __proto__
property (a reference to a class prototype).
During read-access, introducing hasOwnProperty
allows us to filter out the latter, so that we don't mistakenly retrieve the prototype class reference from terms
.
I'm not certain whether the below example will help to clarify, but I find experimentation helpful to understand some of the behaviours:
> x = JSON.parse('{}')
{}
> y = JSON.parse('{"__proto__": "testing"}')
{ ['__proto__']: 'testing' }
> '__proto__' in x
true
> x.hasOwnProperty('__proto__')
false
> '__proto__' in y
true
> y.hasOwnProperty('__proto__')
true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to respond to your Map
question: that is certainly possible, although I feel that doing so might add some runtime overhead (since we'd need to load the native JS object contents from the index into respective Map
objects).
As I understand it, Map
objects are also more difficult to make immutable in JavaScript than Object
and Array
instances. Freezing a Map
object doesn't prevent its private contents from being updated:
> x = {'foo': 'bar'}
{ foo: 'bar' }
> Object.freeze(x)
{ foo: 'bar' }
> x['foo'] = 'baz'
'baz'
> x['foo']
'bar'
> y = new Map([['foo', 'bar']])
Map(1) { 'foo' => 'bar' }
> Object.freeze(y)
Map(1) { 'foo' => 'bar' }
> y.set('foo', 'baz')
Map(1) { 'foo' => 'baz' }
> y.get('foo')
'baz'
So another factor is that, to the extent that I'd like to see the search index contents become immutable at load-time in future (#13098), I'd prefer not to migrate to the Map
type unless there is a compelling reason to (maybe avoiding prototype pollution is that reason -- but I don't think we're exposed to that currently).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok got it, this makes sense to me re: the property. TIL!
I'm a little bit less sure about whether the index should be immutable, but happy to defer that discussion to another day.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, one extra piece of optional feedback: since this is kind of counter-intuitive, you might want to add a comment on this line explaining what you're doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, OK - I'll try to add some phrasing to explain this succinctly. Thanks for the code review!
The JS prototype property should _not_ be found in unrelated documents
@@ -514,8 +514,8 @@ const Search = { | |||
searchTerms.forEach((word) => { | |||
const files = []; | |||
const arr = [ | |||
{ files: terms[word], score: Scorer.term }, | |||
{ files: titleTerms[word], score: Scorer.title }, | |||
{ files: terms.hasOwnProperty(word) ? terms[word] : [], score: Scorer.term }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok got it, this makes sense to me re: the property. TIL!
I'm a little bit less sure about whether the index should be immutable, but happy to defer that discussion to another day.
Would this work as an alternative? // in Search.setIndex()
const frozen_index = JSON.parse(index, (key, value) => {
typeof value === "object" && value !== null
? Object.freeze(value)
: value;
}); If we change A |
@AA-Turner although I'd recommend some performance testing, that could be useul for #13098, yep. |
In fact, no, my mistake: |
I've pushed the change to use |
Personally, I'd be inclined to merge the previous version of this and address that (if at all) in a followup. Trying to "freeze" the index here feels like scope creep to me and it would be nice to get this fix in. |
That's reasonable, yep, and I've no strong opinions on whether we include or omit the |
The `JSON.dumps` call escapes single-quotes within strings; all JSON strings are enclosed by double-quotes
Without the reviver function, I recall the |
tests/test_search.py
Outdated
indices = ast.literal_eval( | ||
searchindex[16:-1] | ||
.replace('new Map', '') | ||
.replace('null', 'None') | ||
.replace('false', 'False') | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fragile and inelegant; it's a compromise to avoid introducing the complexity of a Python-implemented JavaScript parser dependency given the use of new Map(...)
within searchindex.js
files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative would be to dump the index as JSON, then reload it into the actual format we want. Would almost certainly be a little slower, but probably not disastrously so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain that in a little more detail? I think I understand what you mean, but I'd like to make sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I mean actually just writing out a JSON file and having the frontend code load it seperately via a fetch call. I'm not actually sure it would be slower, now that I think about it-- this post suggests this will be faster on the interpreter side: https://v8.dev/blog/cost-of-javascript-2019#json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, thanks! (I wasn't entirely sure whether you were suggesting JSON format output only for the unit test code - using it in both seems more robust).
Let's give searchindex.json
a try, in that case. It'll make a few things simpler, although the searchtools.js
code may lose some of the readability improvements (we'll want to begin/reintroduce using obj.hasOwnProperty
again in a few places).
We may also want to be careful about where we place the fetch
call so that the browser doesn't block for unnecessarily long before invoking that (perhaps especially relevant for the case of downstream themes that include their own scripting that may also run on each page load).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's give searchindex.json a try, in that case. It'll make a few things simpler, although the searchtools.js code may lose some of the readability improvements (we'll want to begin/reintroduce using obj.hasOwnProperty again in a few places).
Personally I would read the search index as JSON, then convert it into a Map (i.e. building on the work you already did).
I'd prefer to avoid multiple read passes over the search index data on the client-side, if possible. Similarly that makes me reluctant to use the reviver
function in JSON.parse
, and I considered it a downside of the freeze-and-deproto code I had implemented in #13098.
Aside from code clarity, do you anticipate many benefits from the transformation to Map
datastructures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside from code clarity, do you anticipate many benefits from the transformation to Map datastructures?
Not really, my guess is that getting entries out of a map will be a little faster (since it's a datastructure purely oriented around storing and retrieving keys/values) but probably not noticeably so. I think it's at least worth exploring though. The conversion should just be a few extra lines of code after you've loaded the JSON structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mm, I agree it'd only be a few lines of code, but I don't think that it'd be worth the resulting compute resource expenditure to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally I feel inclined to disentangle the two aspects of this pull request: adding support for __proto__
as an indexed/queryable term, and preventing __proto__
queries from accessing JavaScript object prototypes.
The latter seems more important (it's a bug) - the former is a feature request that hasn't really been requested by anyone other than myself during development.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've extracted the bugfix part of this changeset into #13153; I don't have any plans currently to add feature support for term/title-term matching of the query term __proto__
itself.
[skip ci]
I haven't gotten around to this comprehensive performance testing yet, but want to share some notes: Initial resultsInitially the native The page-load searchindex-evaluation time on my machine for a query for There is a small amount of index size bloat -- not so much from the Observations so farThere seem to be multiple factors to balance: Python complexity, JavaScript complexity, code safety (both datastructure immutability, and also prototype access), runtime performance and index filesize. I'm not sure how much to explore and how to present my thoughts/findings. Generally my feeling is that the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a small amount of index size bloat -- not so much from the new Map( and ) text overhead, but more from the additional Array literals (e.g. square brackets) used to format their entries instead of object literals. This is approximately 5% overhead on the test dataset; 541857 bytes compared to 516691 baseline.
My guess is that most of this difference (which is relatively small) would disappear with compression.
There seem to be multiple factors to balance: Python complexity, JavaScript complexity, code safety (both datastructure immutability, and also prototype access), runtime performance and index filesize.
Yep, these things are always about tradeoffs.
I'm not sure how much to explore and how to present my thoughts/findings. Generally my feeling is that the Map approach here (again, up to fae81a0) seems good on most criteria except for Python complexity (particularly the unit tests, but also the application serialization code is more complex than it was) and immutability (that it does not address, in my opinion). The tc39 Records-and-Tuples proposal, I think, would be near-optimal on all factors -- but we could be years away from that reaching widespread adoption.
I think it's a pretty good solution. I'll defer approving in case you want to experiment / document more, but I think this fixes the bug while still being reasonable to maintain.
sphinx/search/__init__.py
Outdated
for name, entries in sorted(data.items()) | ||
} | ||
data_js = '{' + ','.join(f'{k}:{v}' for k, v in js_indices.items()) + '}' | ||
return self.PREFIX + data_js + self.SUFFIX |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm mulling over the suggestion; making the logic and output format of the method clearer would both be helpful, but I'd probably prefer not to include an additional technology in the mix (even though as you mention, it's already available as an application dependency).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use it if it's available and it makes things clearer. It's usage elsewhere in Sphinx is pretty extensive: https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20jinja&type=code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(although dumping the index as JSON is perhaps a better alternative)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wlach I'm fairly sure Jinja was originally written for use in Sphinx / Flask -- they're all originally Pocco projects!
tests/test_search.py
Outdated
indices = ast.literal_eval( | ||
searchindex[16:-1] | ||
.replace('new Map', '') | ||
.replace('null', 'None') | ||
.replace('false', 'False') | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative would be to dump the index as JSON, then reload it into the actual format we want. Would almost certainly be a little slower, but probably not disastrously so.
I should have thought to evaluate that at the same time - from a quick check with |
This was an error in my reasoning, and it may have caused a lot of unnecessary confusion (I confused myself with it, too). The important part of the bug to fix here, in my opinion, is that a specific query -- for the term My change in 462c859 introduced something tangentially related, but different -- it added support for search index representation of, and subsequent querying of, My apologies for the extra work and meandering explorations caused by this. |
No worries! I think this was a useful exercise and I think we might still want some version of your later work on this. Agreed that getting a small fix in for the originally reported bug is a good incremental step worth doing on its own though. |
Feature or Bugfix
Purpose
basic
theme.Detail
Object.hasOwnProperty
.Represent the HTML search indices as JavaScriptMap
instances instead ofObject
literals.__proto__
can be serialized and loaded.Relates
May resolve # 13098 (conditional on inclusion ofJSON.parse
reviver /Object.freeze
logic)Edit: update description to reflect JavaScript Map-based approach.
Edit: update description to reflect JSON-serialization approach.