Precompile regexes #382

Merged
merged 7 commits into from
Sep 21, 2023
Changes from 5 commits
1 change: 1 addition & 0 deletions .eslintrc
@@ -30,6 +30,7 @@
"integration-test/extension/autofill.js",
"integration-test/extension/autofill-debug.js",
"src/deviceApiCalls/__generated__/*",
"src/Form/matching-config/__generated__/*",
"playwright-report/*"
]
}
1 change: 1 addition & 0 deletions .gitattributes
@@ -1,2 +1,3 @@
dist/** binary linguist-generated
swift-package/Resources/assets/** binary linguist-generated
src/Form/matching-config/__generated__/** binary linguist-generated
Member Author

I've added this to the generated files so it doesn't pollute the diffs in reviews. A counterargument could be that manually reviewing it could ensure proper output, but the tests are in charge of that 💪. Let me know if you disagree.

Collaborator

in this instance, I think we shouldn't hide the diff - mostly just because of what the Node docs say about inspect

"The output of util.inspect may change at any time and should not be depended upon programmatically."

likewise I'm not 100% sure how it's handling unicode output - I think we should add a unit test to this PR - just a very simple one that snapshots the output of util.inspect on a handful of the more exotic regexes. In that same test file you can require the compiled output and test against a few strings.

^ none of that has to be exhaustive - but given the amount of strings/arrays we're compressing here, I think a few tests would really help us understand what's happening - not to mention it would be a great place to add regression tests if we need them

Member Author

Not hiding the diff makes sense. I'll do that.

On the testing, I'm a little unclear what exactly you would test. We have a lot of tests that include characters like メールアドレス, 電話, 姓, plus Russian characters and whatnot. I could snapshot the output on a few of these, but overall if inspect changes and starts failing on certain inputs, it would happen during a node upgrade and will break several of the existing tests. I guess a specific unit test could decrease debugging time. Is that your rationale?

Collaborator

I think that because we're checking in the file, it makes it much less risky. As you say, so much testing is done elsewhere too - so we could come back here if/when inspect changes its output. Happy with that approach :)
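
The snapshot-style check suggested earlier in this thread was not added to the PR, but it could look roughly like the sketch below. The sample patterns are hypothetical stand-ins (not the project's real matchers), and plain Node `assert` is used instead of whatever test runner the repository relies on. It prints a few Unicode-heavy regexes with `util.inspect`, evaluates the printed text as JavaScript, and checks that the regexes survive the round trip.

```javascript
const {strictEqual, ok} = require('assert')
const {inspect} = require('util')

// Hypothetical sample patterns, not the project's real matchers
const samples = {
    email: new RegExp('e.?mail|メールアドレス', 'ui'),
    phone: new RegExp('phone|電話', 'ui'),
    surname: new RegExp('surname|姓', 'ui')
}

// Print the object with the same options the generator script passes to inspect()
const printed = inspect(samples, {maxArrayLength: Infinity, depth: Infinity, maxStringLength: Infinity})

// Evaluate the printed text to confirm it is valid JS source that round-trips
const roundTripped = new Function(`return ${printed}`)()

for (const [name, regex] of Object.entries(samples)) {
    strictEqual(roundTripped[name].source, regex.source)
    strictEqual(roundTripped[name].flags, regex.flags)
}

// Spot-check a couple of strings against the round-tripped regexes
ok(roundTripped.email.test('メールアドレス'))
ok(roundTripped.phone.test('電話番号'))
```

A check like this would fail close to the cause if a Node upgrade changed how `inspect` prints RegExp values, which is the debugging-time benefit raised in the thread.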

8 changes: 7 additions & 1 deletion Gruntfile.js
@@ -78,7 +78,8 @@ module.exports = function (grunt) {
},
exec: {
copyAssets: 'npm run copy-assets',
- schemaCompile: 'npm run schema:generate'
+ schemaCompile: 'npm run schema:generate',
+ precompileRegexes: 'npm run precompile-regexes'
},
/**
* Run predefined tasks whenever watched files are added,
@@ -89,6 +90,10 @@ module.exports = function (grunt) {
files: ['src/deviceApiCalls/**/*.{json,js}', 'packages/device-api/**/*.{json,js}'],
tasks: ['exec:schemaCompile']
},
+ precompileRegexes: {
+ files: ['src/Form/matching-config/*'],
+ tasks: ['exec:precompileRegexes']
+ },
scripts: {
files: ['src/**/*.{json,js}', 'packages/password/**/*.{json,js}', 'packages/device-api/**/*.{json,js}'],
tasks: ['browserify:dist', 'browserify:debug', 'exec:copyAssets']
@@ -105,6 +110,7 @@ module.exports = function (grunt) {
})

grunt.registerTask('default', [
+ 'exec:precompileRegexes',
'exec:schemaCompile',
'browserify:dist',
'browserify:debug',
861 changes: 215 additions & 646 deletions dist/autofill-debug.js

Large diffs are not rendered by default.

861 changes: 215 additions & 646 deletions dist/autofill.js

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions package.json
@@ -13,6 +13,7 @@
"lint": "eslint .",
"lint:fix": "npm run lint -- --fix",
"copy-assets": "node scripts/copy-assets.js",
"precompile-regexes": "node scripts/precompile-regexes.js",
"open-test-extension": "npx web-ext run -t chromium -u https://privacy-test-pages.site/ -s integration-test/extension",
"schema:generate": "node scripts/api-call-generator.js",
"test": "npm run test:unit && npm run lint && tsc",
72 changes: 72 additions & 0 deletions scripts/precompile-regexes.js
@@ -0,0 +1,72 @@
const {matchingConfiguration} = require('../src/Form/matching-config/matching-config-source.js')
const {writeFileSync} = require('fs')
const {join} = require('path')
const {inspect} = require('util')

/**
 * DDGRegexes are stored as strings so we can annotate them with comments, here we transform them into RegExp
 */

/**
 * Loop through Object.entries and transform all values to RegExp
 * @param {Object} obj
 */
function convertAllValuesToRegex (obj) {
    for (const [key, value] of Object.entries(obj)) {
        const source = String(value).normalize('NFKC')
        obj[key] = new RegExp(source, 'ui')
    }
    return obj
}
for (const [key, value] of Object.entries(matchingConfiguration.strategies.ddgMatcher.matchers)) {
    matchingConfiguration.strategies.ddgMatcher.matchers[key] = convertAllValuesToRegex(value)
}

/**
 * Prepare CSS rules by concatenating arrays and removing whitespace
 */
Object.entries(matchingConfiguration.strategies.cssSelector.selectors).forEach(([name, selector]) => {
    if (Array.isArray(selector)) {
        selector = selector.join(',')
    }
    matchingConfiguration.strategies.cssSelector.selectors[name] = selector.replace(/\n/g, ' ').replace(/\s{2,}/g, ' ').trim()
})

/**
 * VendorRules come from different providers, here we merge them all together in one RegEx per inputType
 */

/**
 * Merge our vendor rules into a single RegEx
 * @param {keyof VendorRegexRules} ruleName
 * @param {VendorRegexConfiguration["ruleSets"]} ruleSets
 * @return {{RULES: Record<keyof VendorRegexRules, RegExp | undefined>}}
 */
function mergeVendorRules (ruleName, ruleSets) {
    let rules = []
    ruleSets.forEach(set => {
        if (set[ruleName]) {
            rules.push(`(${set[ruleName]?.toLowerCase()})`.normalize('NFKC'))
        }
    })
    return new RegExp(rules.join('|'), 'iu')
}
const ruleSets = matchingConfiguration.strategies.vendorRegex.ruleSets
for (const ruleName of Object.keys(matchingConfiguration.strategies.vendorRegex.rules)) {
    matchingConfiguration.strategies.vendorRegex.rules[ruleName] = mergeVendorRules(ruleName, ruleSets)
}

/**
 * Build the file contents
 */
const fileContents = [
    `/* DO NOT EDIT, this file was generated by scripts/precompile-regexes.js */\n\n`,
    `/** @type {MatchingConfiguration} */\n`,
    'const matchingConfiguration = ',
    inspect(matchingConfiguration, {maxArrayLength: Infinity, depth: Infinity, maxStringLength: Infinity}),
Member Author

This is a node function that prints out objects as strings. It's used by the console, for example, when you log objects and such. It works fine for our needs, in particular for outputting RegExp objects as literals. All the alternatives seemed much more effort-intensive for no apparent benefit.

    '\n\nexport { matchingConfiguration }\n'
].join('')

// Write to file
const outputPath = join(__dirname, '../src/Form/matching-config/__generated__', '/compiled-matching-config.js')
writeFileSync(outputPath, fileContents)
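
The inline note on `inspect` in this script says it was chosen because it prints RegExp objects as literals. Below is a condensed sketch of that flow, assuming a tiny hypothetical config (the `matchers` and `ruleSets` values are invented for illustration and are not the project's data). It compiles string patterns with the same `'ui'` flags and NFKC normalization as the script, merges vendor alternatives into one grouped pattern, and then compares `JSON.stringify` with `inspect` as ways of serializing the result.

```javascript
const {inspect} = require('util')

// Hypothetical stand-in for the real matching configuration
const config = {
    // Patterns stored as strings, as in the source config
    matchers: {email: 'e.?mail|メールアドレス'},
    // Hypothetical vendor rule sets that share one rule name
    ruleSets: [{email: 'E-?Mail'}, {email: 'courriel'}]
}

const compiled = {
    // Compile a string pattern into a RegExp, normalizing to NFKC first
    email: new RegExp(config.matchers.email.normalize('NFKC'), 'ui'),
    // Merge the vendor alternatives into one grouped, lowercased pattern
    vendorEmail: new RegExp(
        config.ruleSets.map(set => `(${set.email.toLowerCase()})`).join('|'),
        'iu'
    )
}

// JSON has no RegExp type, so each pattern serializes as an empty object
const asJson = JSON.stringify(compiled)

// inspect() prints each RegExp as a literal, which is valid JS source
const asSource = inspect(compiled, {depth: Infinity})

console.log(asJson)
console.log(asSource)
```

Here `asJson` is `{"email":{},"vendorEmail":{}}`, so the patterns are lost, while `asSource` keeps them as regex literals that can be written straight into a generated module.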
2 changes: 1 addition & 1 deletion src/Form/Form.js
@@ -205,7 +205,7 @@ class Form {
// If we have a password but no username, let's search further
const hiddenFields = /** @type [HTMLInputElement] */([...this.form.querySelectorAll('input[type=hidden]')])
const probableField = hiddenFields.find((field) => {
- const regex = safeRegex('email|' + this.matching.ddgMatcher('username')?.match)
+ const regex = safeRegex('email|' + this.matching.getDDGMatcherRegex('username')?.source)
const attributeText = field.id + ' ' + field.name
return regex?.test(attributeText)
})
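
The change above switches from a string pattern (`.match`) to the `.source` of a precompiled RegExp. A minimal sketch of why `.source` is needed when extending a compiled regex, assuming a hypothetical `usernameRegex` in place of the value returned by `getDDGMatcherRegex('username')` and plain `new RegExp` in place of the project's `safeRegex` helper (whose behaviour is not shown in this diff):

```javascript
// Hypothetical precompiled matcher standing in for getDDGMatcherRegex('username')
const usernameRegex = /user.?(name|id)|login/iu

// RegExp#source yields the pattern text without the slashes or flags,
// so it can be concatenated into a larger pattern
const combined = new RegExp('email|' + usernameRegex.source, 'ui')

// Concatenating the RegExp object itself stringifies it with slashes and flags,
// which would end up inside the new pattern as literal characters
const naive = 'email|' + usernameRegex

console.log(combined.test('hidden_login_id')) // true
console.log(naive) // email|/user.?(name|id)|login/iu
```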
2 changes: 1 addition & 1 deletion src/Form/FormAnalyzer.js
@@ -1,6 +1,6 @@
import { removeExcessWhitespace, Matching } from './matching.js'
import { constants } from '../constants.js'
- import { matchingConfiguration } from './matching-configuration.js'
+ import { matchingConfiguration } from './matching-config/__generated__/compiled-matching-config.js'
import { getTextShallow, isLikelyASubmitButton } from '../autofill-utils.js'

class FormAnalyzer {