Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve non-English matching #374

Merged
merged 45 commits into from
Sep 19, 2023
Merged
Show file tree
Hide file tree
Changes from 20 commits
Commits
Show all changes
45 commits
Select commit Hold shift + click to select a range
f9f8fe5
Fix CSS selector naming
GioSensation Aug 22, 2023
8faf88f
Move more regexes to config and centralise access
GioSensation Aug 22, 2023
a1b7112
Minor type fix
GioSensation Aug 22, 2023
176445a
Improve and rename getText to getTextShallow
GioSensation Aug 22, 2023
eaef31e
Avoid evaluating buttons twice in FormAnalyzer
GioSensation Aug 22, 2023
e3dcdde
Evaluate URL when analyzing the form
GioSensation Aug 22, 2023
bf79aa5
Centralise safeUniversalSelector
GioSensation Aug 22, 2023
4d28c3e
Fix form scanning when markup is broken
GioSensation Aug 22, 2023
2ad2ac2
Simpler regex handling (no toLowerCase)
GioSensation Aug 22, 2023
8dcdf18
Minor accuracy improvement
GioSensation Aug 22, 2023
56cf736
Rename email strategy for consistency
GioSensation Aug 22, 2023
9b351b8
Add/update test cases
GioSensation Aug 23, 2023
3215447
Add performance logging capability
GioSensation Aug 23, 2023
12b886d
Remove strict/conservative regex distinction
GioSensation Aug 25, 2023
1ee12ef
Log when site is disabled remotely
GioSensation Aug 25, 2023
f479ca7
Add the unknown matcher
GioSensation Aug 25, 2023
c5c6e01
Update fixed test
GioSensation Aug 25, 2023
eb9ec9e
Fix comment and unused export
GioSensation Aug 28, 2023
058a395
Merge branch 'main' of github.com:duckduckgo/duckduckgo-autofill into…
GioSensation Aug 29, 2023
b6f6886
Fix tests
GioSensation Aug 29, 2023
da9b508
Initial regexes commit
GioSensation Aug 29, 2023
61cddb3
Add missing regexes
GioSensation Aug 31, 2023
9cf852a
Update selectors
GioSensation Sep 6, 2023
59e619e
Better safeguard for password type
GioSensation Sep 6, 2023
ac14ae8
Improve matchers after first tests
GioSensation Sep 6, 2023
5933e8a
Improve submit button identification
GioSensation Sep 6, 2023
adc1317
Add test case
GioSensation Sep 6, 2023
1aef9ba
Mock getComputedStyles to speed up tests
GioSensation Sep 11, 2023
99b590b
Batch form file reading
GioSensation Sep 11, 2023
3253b7e
Memoize isCCForm
GioSensation Sep 11, 2023
51aa77b
Minimise impact of Array.from
GioSensation Sep 11, 2023
2338e41
Limit test logs to surface failures
GioSensation Sep 11, 2023
97d4553
Centralize scan stopping and disconnect observer
GioSensation Sep 11, 2023
87bca48
Update snapshots and commit compiled files
GioSensation Sep 11, 2023
9edae2f
Commit compiled code
GioSensation Sep 11, 2023
1562715
Add performance logging capability
GioSensation Aug 23, 2023
e66a74e
Commit files
GioSensation Sep 11, 2023
2ebe89e
Add comment
GioSensation Sep 12, 2023
98c7d94
Minor change
GioSensation Sep 12, 2023
73ac4cd
Improve stopping the scanner
GioSensation Sep 12, 2023
e528cd4
Merge branch 'ema/speed-up-tests' into ema/improve-non-english-matching
GioSensation Sep 12, 2023
b3a65c8
More accuracy improvements after review
GioSensation Sep 15, 2023
eab7bf5
Merge branch 'main' of github.com:duckduckgo/duckduckgo-autofill into…
GioSensation Sep 18, 2023
6aeae78
Final improvements
GioSensation Sep 18, 2023
bdd6e61
Fix test
GioSensation Sep 19, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
416 changes: 261 additions & 155 deletions dist/autofill-debug.js

Large diffs are not rendered by default.

416 changes: 261 additions & 155 deletions dist/autofill.js

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion src/DeviceInterface/InterfacePrototype.js
Original file line number Diff line number Diff line change
Expand Up @@ -275,7 +275,7 @@ class InterfacePrototype {
await this.postInit()

if (this.settings.featureToggles.credentials_saving) {
initFormSubmissionsApi(this.scanner.forms)
initFormSubmissionsApi(this.scanner.forms, this.scanner.matching)
}
}

Expand Down
13 changes: 7 additions & 6 deletions src/DeviceInterface/initFormSubmissionsApi.js
Original file line number Diff line number Diff line change
@@ -1,12 +1,13 @@
import {SUBMIT_BUTTON_SELECTOR} from '../Form/selectors-css.js'
shakyShane marked this conversation as resolved.
Show resolved Hide resolved
import {buttonMatchesFormType, getText} from '../autofill-utils.js'
import {buttonMatchesFormType, getTextShallow} from '../autofill-utils.js'
import {extractElementStrings} from '../Form/label-util.js'

/**
* This is a single place to contain all functionality relating to form submission detection
*
* @param {Map<HTMLElement, import("../Form/Form").Form>} forms
* @param {import("../Form/matching").Matching} matching
*/
export function initFormSubmissionsApi (forms) {
export function initFormSubmissionsApi (forms, matching) {
/**
* Global submit events
*/
Expand Down Expand Up @@ -44,13 +45,13 @@ export function initFormSubmissionsApi (forms) {
matchingForm?.submitHandler('global pointerdown event + matching form')

if (!matchingForm) {
const selector = SUBMIT_BUTTON_SELECTOR + ', a[href="#"], a[href^=javascript], *[onclick]'
const selector = matching.cssSelector('submitButtonSelector') + ', a[href="#"], a[href^=javascript], *[onclick], [class*=button i]'
// check if the click happened on a button
const button = /** @type HTMLElement */(event.target)?.closest(selector)
if (!button) return

const text = getText(button)
const hasRelevantText = /(log|sign).?(in|up)|continue|next|submit/i.test(text)
const text = getTextShallow(button) || extractElementStrings(button).join(' ')
shakyShane marked this conversation as resolved.
Show resolved Hide resolved
const hasRelevantText = matching.getDDGMatcherRegex('submitButtonRegex')?.test(text)
if (hasRelevantText && text.length < 25) {
// check if there's a form with values
const filledForm = [...forms.values()].find(form => form.hasValues())
Expand Down
18 changes: 11 additions & 7 deletions src/Form/Form.js
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ import {
isEventWithinDax,
isLikelyASubmitButton,
isPotentiallyViewable, buttonMatchesFormType,
safeExecute, getText, wasAutofilledByChrome, shouldLog
safeExecute, getTextShallow, wasAutofilledByChrome, shouldLog
} from '../autofill-utils.js'

import {getInputSubtype, getInputMainType, createMatching, safeRegex} from './matching.js'
Expand Down Expand Up @@ -216,8 +216,8 @@ class Form {
formValues.credentials.username = formValues.identities.phone
} else {
// If we still don't have a username, try scanning the form's text for an email address
this.form.querySelectorAll('*:not(select):not(option)').forEach((el) => {
const elText = getText(el)
this.form.querySelectorAll(this.matching.cssSelector('safeUniversalSelector')).forEach((el) => {
const elText = getTextShallow(el)
// Ignore long texts to avoid false positives
if (elText.length > 70) return

Expand Down Expand Up @@ -358,11 +358,15 @@ class Form {
}

categorizeInputs () {
const selector = this.matching.cssSelector('FORM_INPUTS_SELECTOR')
const selector = this.matching.cssSelector('formInputsSelector')
if (this.form.matches(selector)) {
this.addInput(this.form)
} else {
const foundInputs = this.form.querySelectorAll(selector)
let foundInputs = this.form.querySelectorAll(selector)
// If the markup is broken form.querySelectorAll may not return the fields, so we select from the parent
if (foundInputs.length === 0 && this.form instanceof HTMLFormElement && this.form.length > 0) {
foundInputs = this.form.parentElement?.querySelectorAll(selector) || foundInputs
}
if (foundInputs.length < MAX_INPUTS_PER_FORM) {
foundInputs.forEach(input => this.addInput(input))
} else {
Expand All @@ -375,12 +379,12 @@ class Form {
}

get submitButtons () {
const selector = this.matching.cssSelector('SUBMIT_BUTTON_SELECTOR')
const selector = this.matching.cssSelector('submitButtonSelector')
const allButtons = /** @type {HTMLElement[]} */([...this.form.querySelectorAll(selector)])

return allButtons
.filter((btn) =>
isPotentiallyViewable(btn) && isLikelyASubmitButton(btn) && buttonMatchesFormType(btn, this)
isPotentiallyViewable(btn) && isLikelyASubmitButton(btn, this.matching) && buttonMatchesFormType(btn, this)
)
}

Expand Down
60 changes: 37 additions & 23 deletions src/Form/FormAnalyzer.js
Original file line number Diff line number Diff line change
@@ -1,16 +1,7 @@
import { removeExcessWhitespace, Matching } from './matching.js'
import { constants } from '../constants.js'
import { matchingConfiguration } from './matching-configuration.js'
import { getText, isLikelyASubmitButton } from '../autofill-utils.js'

const loginRegex = new RegExp(/sign(ing)?.?in(?!g)|log.?(i|o)n|log.?out|unsubscri|(forgot(ten)?|reset) (your )?password|password (forgotten|lost)|unlock|logged in as|mfa-submit-form/i)
const signupRegex = new RegExp(
/sign(ing)?.?up|join|\bregist(er|ration)|newsletter|\bsubscri(be|ption)|contact|create|start|enroll|settings|preferences|profile|update|checkout|guest|purchase|buy|order|schedule|estimate|request|new.?customer|(confirm|retype|repeat) password|password confirm?/i
)
const conservativeSignupRegex = new RegExp(/sign.?up|join|register|enroll|newsletter|subscri(be|ption)|settings|preferences|profile|update/i)
const strictSignupRegex = new RegExp(/sign.?up|join|register|(create|new).+account|enroll|settings|preferences|profile|update/i)
const resetPasswordLink = new RegExp(/(forgot(ten)?|reset|don't remember) (your )?password|password forgotten/i)
const loginProvidersRegex = new RegExp(/ with /i)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved them all to the matching configuration, so they're centralized and easier to translated.

import { getTextShallow, isLikelyASubmitButton } from '../autofill-utils.js'

class FormAnalyzer {
/** @type HTMLElement */
Expand Down Expand Up @@ -126,16 +117,16 @@ class FormAnalyzer {
shouldCheckUnifiedForm = false,
shouldBeConservative = false
}) {
const matchesLogin = /current.?password/i.test(string) || loginRegex.test(string) || resetPasswordLink.test(string)
const matchesLogin = /current.?password/i.test(string) || this.matching.getDDGMatcherRegex('loginRegex')?.test(string) || this.matching.getDDGMatcherRegex('resetPasswordLink')?.test(string)

// Check explicitly for unified login/signup forms
if (shouldCheckUnifiedForm && matchesLogin && strictSignupRegex.test(string)) {
if (shouldCheckUnifiedForm && matchesLogin && this.matching.getDDGMatcherRegex('conservativeSignupRegex')?.test(string)) {
this.increaseHybridSignal(strength, signalType)
return this
}

const signupRegexToUse = shouldBeConservative ? conservativeSignupRegex : signupRegex
const matchesSignup = /new.?password/i.test(string) || signupRegexToUse.test(string)
const signupRegexToUse = this.matching.getDDGMatcherRegex(shouldBeConservative ? 'conservativeSignupRegex' : 'signupRegex')
const matchesSignup = /new.?password/i.test(string) || signupRegexToUse?.test(string)

// In some cases a login match means the login is somewhere else, i.e. when a link points outside
if (shouldFlip) {
Expand All @@ -162,6 +153,24 @@ class FormAnalyzer {
})
}

evaluateUrl () {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just another signal.

const path = window.location.pathname

const matchesLogin = this.matching.getDDGMatcherRegex('loginRegex')?.test(path)
const matchesSignup = this.matching.getDDGMatcherRegex('conservativeSignupRegex')?.test(path)

// If the url matches both, do nothing: the signal is probably confounding
if (matchesLogin && matchesSignup) return

if (matchesLogin) {
this.decreaseSignalBy(1, 'url matches login')
}

if (matchesSignup) {
this.increaseSignalBy(1, 'url matches signup')
}
}

evaluatePageTitle () {
const pageTitle = document.title
this.updateSignal({string: pageTitle, strength: 2, signalType: `page title: ${pageTitle}`, shouldCheckUnifiedForm: true})
Expand All @@ -185,7 +194,7 @@ class FormAnalyzer {
this.evaluatePageTitle()
this.evaluatePageHeadings()
// Check for submit buttons
const buttons = document.querySelectorAll(this.matching.cssSelector('SUBMIT_BUTTON_SELECTOR'))
const buttons = document.querySelectorAll(this.matching.cssSelector('submitButtonSelector'))
buttons.forEach(button => {
// if the button has a form, it's not related to our input, because our input has no form here
if (button instanceof HTMLButtonElement) {
Expand All @@ -198,7 +207,7 @@ class FormAnalyzer {
}

evaluateElement (el) {
const string = getText(el)
const string = getTextShallow(el)

if (el.matches(this.matching.cssSelector('password'))) {
// These are explicit signals by the web author, so we weigh them heavily
Expand All @@ -207,12 +216,13 @@ class FormAnalyzer {
strength: 5,
signalType: `explicit: ${el.getAttribute('autocomplete')}`
})
return
Copy link
Member Author

@GioSensation GioSensation Aug 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added these returns, otherwise certain elements could be counted multiple times.

}

// check button contents
if (el.matches(this.matching.cssSelector('SUBMIT_BUTTON_SELECTOR'))) {
if (el.matches(this.matching.cssSelector('submitButtonSelector') + ', *[class*=button]')) {
// If we're confident this is the submit button, it's a stronger signal
let likelyASubmit = isLikelyASubmitButton(el)
let likelyASubmit = isLikelyASubmitButton(el, this.matching)
if (likelyASubmit) {
this.form.querySelectorAll('input[type=submit], button[type=submit]').forEach(
(submit) => {
Expand All @@ -225,6 +235,7 @@ class FormAnalyzer {
}
const strength = likelyASubmit ? 20 : 2
this.updateSignal({string, strength, signalType: `submit: ${string}`})
return
}
// if an external link matches one of the regexes, we assume the match is not pertinent to the current form
if (
Expand All @@ -234,8 +245,8 @@ class FormAnalyzer {
) {
let shouldFlip = true
if (
resetPasswordLink.test(string) || // Don't flip forgotten password links
loginProvidersRegex.test(string) // Don't flip login providers links
this.matching.getDDGMatcherRegex('resetPasswordLink')?.test(string) || // Don't flip forgotten password links
this.matching.getDDGMatcherRegex('loginProvidersRegex')?.test(string) // Don't flip login providers links
) {
shouldFlip = false
}
Expand All @@ -250,14 +261,17 @@ class FormAnalyzer {
}

evaluateForm () {
// Check page url
this.evaluateUrl()

// Check page title
this.evaluatePageTitle()

// Check form attributes
this.evaluateElAttributes(this.form)

// Check form contents (skip select and option because they contain too much noise)
this.form.querySelectorAll('*:not(select):not(option):not(script)').forEach(el => {
// Check form contents (noisy elements are skipped with the safeUniversalSelector)
this.form.querySelectorAll(this.matching.cssSelector('safeUniversalSelector')).forEach(el => {
// Check if element is not hidden. Note that we can't use offsetHeight
// nor intersectionObserver, because the element could be outside the
// viewport or its parent hidden
Expand All @@ -266,7 +280,7 @@ class FormAnalyzer {
})

// A form with many fields is unlikely to be a login form
const relevantFields = this.form.querySelectorAll(this.matching.cssSelector('GENERIC_TEXT_FIELD'))
const relevantFields = this.form.querySelectorAll(this.matching.cssSelector('genericTextField'))
if (relevantFields.length >= 4) {
this.increaseSignalBy(relevantFields.length * 1.5, 'many fields: it is probably not a login')
}
Expand Down
4 changes: 2 additions & 2 deletions src/Form/formatters.js
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ const FOUR_DIGIT_YEAR_REGEX = /(\D)\1{3}|\d{4}/i
* @returns {string}
*/
const formatCCYear = (input, year, form) => {
const selector = form.matching.cssSelector('FORM_INPUTS_SELECTOR')
const selector = form.matching.cssSelector('formInputsSelector')
if (
input.maxLength === 4 ||
checkPlaceholderAndLabels(input, FOUR_DIGIT_YEAR_REGEX, form.form, selector)
Expand All @@ -34,7 +34,7 @@ const formatCCYear = (input, year, form) => {
const getUnifiedExpiryDate = (input, month, year, form) => {
const formattedYear = formatCCYear(input, year, form)
const paddedMonth = `${month}`.padStart(2, '0')
const cssSelector = form.matching.cssSelector('FORM_INPUTS_SELECTOR')
const cssSelector = form.matching.cssSelector('formInputsSelector')
const separator = matchInPlaceholderAndLabels(input, DATE_SEPARATOR_REGEX, form.form, cssSelector)?.groups?.separator || '/'

return `${paddedMonth}${separator}${formattedYear}`
Expand Down
4 changes: 2 additions & 2 deletions src/Form/input-classifiers.test.js
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,6 @@ import { getInputSubtype, createMatching } from './matching.js'
import { Form } from './Form.js'
import InterfacePrototype from '../DeviceInterface/InterfacePrototype.js'

import {SUBMIT_BUTTON_SELECTOR} from './selectors-css.js'
import {createAvailableInputTypes} from '../../integration-test/helpers/utils.js'

/**
Expand Down Expand Up @@ -167,7 +166,8 @@ describe.each(testCases)('Test $html fields', (testCase) => {
baseWrapper.innerHTML = testContent
document.title = title

const buttons = document.querySelectorAll(SUBMIT_BUTTON_SELECTOR)
const matching = createMatching()
const buttons = document.querySelectorAll(matching.cssSelector('submitButtonSelector'))
buttons.forEach((button) => {
// We're doing this so that isPotentiallyViewable(button) === true. See jest.setup.js for more info
// @ts-ignore
Expand Down
2 changes: 1 addition & 1 deletion src/Form/label-util.js
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ const EXCLUDED_TAGS = ['SCRIPT', 'NOSCRIPT', 'OPTION', 'STYLE']
* "element.textContent" is a string which is merged of all children nodes,
* which can cause issues with things like script tags etc.
*
* @param {HTMLElement} element
* @param {Element} element
* A DOM element to be extracted.
* @returns {string[]}
* All strings in an element.
Expand Down
Loading