Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Precompile regexes #382

Merged
merged 7 commits into from
Sep 21, 2023
Merged

Precompile regexes #382

merged 7 commits into from
Sep 21, 2023

Conversation

GioSensation
Copy link
Member

@GioSensation GioSensation commented Sep 21, 2023

Reviewer: @shakyShane
Asana: https://app.asana.com/0/0/1205509987156723/f

Description

Precompiles the regexes at build time instead of at runtime. This is to avoid calling new RegExp up to several thousand times in certain edge cases. The actual performance improvement is negligible on most cases, but in certain edge cases it can go up to 10-12% of total scan time. On the test suite, it shaves off around a second (on my machine). The idea came initially from Lucas.

Steps to test

All tests passing 🎉. CI should be good as well, because we just use the same old grunt workflow.

Signed-off-by: Emanuele Feliziani <[email protected]>
Signed-off-by: Emanuele Feliziani <[email protected]>
@GioSensation GioSensation self-assigned this Sep 21, 2023
Signed-off-by: Emanuele Feliziani <[email protected]>
Signed-off-by: Emanuele Feliziani <[email protected]>
.gitattributes Outdated
@@ -1,2 +1,3 @@
dist/** binary linguist-generated
swift-package/Resources/assets/** binary linguist-generated
src/Form/matching-config/__generated__/** binary linguist-generated
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added this to the generated files so it doesn't pollute the diffs in reviews. A counterargument could be that manually reviewing it could ensure proper output, but the tests are in charge of that 💪. Let me know if you disagree.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this instance, I think we shouldn't hide the diff - mostly just because of what the Node docs say about inspect

The output of util.inspect may change at any time and should not be depended upon programmatically

likewise I'm not 100% how it's handling unicode output - I think we should add a unit test to this PR - just a very simple one that snapshots the output of util.inspect on a handful of the more exotic regexes. In that same test file you can require the compiled output and test against a few strings.

^ none of that has to be exhaustive - but given the amount of strings/arrays we're compressing here, I think a few tests would really help us understand what's happening - not to mention it would be a great place to add regression tests if we need them

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not hiding the diff makes sense. I'll do that.

On the testing, I'm a little unclear what exactly you would test. We have a lot of tests that include characters like メールアドレス, 電話, 姓, plus Russian characters and whatnot. I could snapshot the output on a few of these, but overall if inspect changes and starts failing on certain inputs, it would happen during a node upgrade and will break several of the existing tests. I guess a specific unit test could decrease debugging time. Is that your rationale?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that because we're checking in the file, it makes it much less risky. As you say, so much testing is done elsewhere too - so we could come back here if/when inspect changes it's output. Happy with that approach :)

`/* DO NOT EDIT, this file was generated by scripts/precompile-regexes.js */\n\n`,
`/** @type {MatchingConfiguration} */\n`,
'const matchingConfiguration = ',
inspect(matchingConfiguration, {maxArrayLength: Infinity, depth: Infinity, maxStringLength: Infinity}),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a node function that prints out objects as strings. It's used by the console, for example, when you log objects and such. It works fine for our needs, in particular for outputting RegExp objects as literals. All the alternatives seemed much more effort-intensive for no apparent benefit.

Signed-off-by: Emanuele Feliziani <[email protected]>
@@ -1007,4 +1006,4 @@ const matchingConfiguration = {
}
}

export { matchingConfiguration }
module.exports = { matchingConfiguration }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using CommonJS since these are handled in node at build time.

@@ -128,15 +126,12 @@ class Matching {
console.warn('CSS selector not found for %s, using a default value', selectorName)
return ''
}
if (Array.isArray(match)) {
return match.join(',')
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done at build time 🎉.

Signed-off-by: Emanuele Feliziani <[email protected]>
return undefined
}
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regexes are sanitized at build time.

@GioSensation GioSensation marked this pull request as ready for review September 21, 2023 08:35
Copy link
Collaborator

@shakyShane shakyShane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GioSensation this looks great - I like the idea of pre-compiling for sure.

I've just added one comment about how I think some basic unit-tests would help here, and how I don't think we should bury the diff output of this generated file - let me know what you think or if you have counter-ideas :)

.gitattributes Outdated
@@ -1,2 +1,3 @@
dist/** binary linguist-generated
swift-package/Resources/assets/** binary linguist-generated
src/Form/matching-config/__generated__/** binary linguist-generated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this instance, I think we shouldn't hide the diff - mostly just because of what the Node docs say about inspect

The output of util.inspect may change at any time and should not be depended upon programmatically

likewise I'm not 100% how it's handling unicode output - I think we should add a unit test to this PR - just a very simple one that snapshots the output of util.inspect on a handful of the more exotic regexes. In that same test file you can require the compiled output and test against a few strings.

^ none of that has to be exhaustive - but given the amount of strings/arrays we're compressing here, I think a few tests would really help us understand what's happening - not to mention it would be a great place to add regression tests if we need them

@GioSensation GioSensation merged commit 4a0fd69 into main Sep 21, 2023
1 check passed
@GioSensation GioSensation deleted the ema/precompile-regexes branch September 21, 2023 10:48
brindy added a commit to duckduckgo/macos-browser that referenced this pull request Sep 28, 2023
Task/Issue URL:
https://app.asana.com/0/1205602928782192/1205602928782192
Autofill Release:
https://github.com/duckduckgo/duckduckgo-autofill/releases/tag/8.4.1
BSK PR: duckduckgo/BrowserServicesKit#517

## Description
Updates Autofill to version
[8.4.1](https://github.com/duckduckgo/duckduckgo-autofill/releases/tag/8.4.1).

### Autofill 8.4.1 release notes
## What's Changed
* Precompile regexes by @GioSensation in
duckduckgo/duckduckgo-autofill#382
* Add perf tests by @GioSensation in
duckduckgo/duckduckgo-autofill#383
* deps: remove deprecated content-scope-utils by @shakyShane in
duckduckgo/duckduckgo-autofill#384
* Update dependencies by @GioSensation in
duckduckgo/duckduckgo-autofill#368
* Fix perf regressions by @GioSensation in
duckduckgo/duckduckgo-autofill#385


**Full Changelog**:
duckduckgo/duckduckgo-autofill@8.4.0...8.4.1

## Steps to test
This release has been tested during autofill development. For smoke test
steps see [this
task](https://app.asana.com/0/1198964220583541/1200583647142330/f).

---------

Co-authored-by: GioSensation <[email protected]>
Co-authored-by: Chris Brind <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants