Workaround for backreference #512

Open
Kibukx opened this issue Sep 19, 2024 · 1 comment
Kibukx commented Sep 19, 2024

I'm currently trying to find strings that contain 3 or more consecutive identical characters, using the following regex:

SELECT *
FROM test.test
WHERE REGEXP_CONTAINS(first_name, r'(.)\1{2,}')

For example: "mikeeee", "dylaaan", etc. But BigQuery complains:

Cannot parse regular expression: invalid escape sequence: \1

How can I get around this limitation? I'm aware that backreferences aren't supported. Any help would be much appreciated!

DecimalTurn commented Mar 15, 2025

As far as I can tell, there isn't a simple workaround that covers backreferences in general, but for this case an equivalent pattern can be generated easily by listing every possible letter as its own alternative (limiting ourselves to lowercase ASCII characters):

a{3,}|b{3,}|c{3,}|d{3,}|e{3,}|f{3,}|g{3,}|h{3,}|i{3,}|j{3,}|k{3,}|l{3,}|m{3,}|n{3,}|o{3,}|p{3,}|q{3,}|r{3,}|s{3,}|t{3,}|u{3,}|v{3,}|w{3,}|x{3,}|y{3,}|z{3,}

I know it's probably not the answer you were hoping for and it's not pretty, but that's the only option I see for your use case.
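As a quick sanity check, the alternation can be compared with the original backreference pattern in an engine that does support backreferences. The sketch below uses JavaScript for that; the sample names and variable names are made up for illustration, and the backreference is restricted to `[a-z]` so both patterns cover the same lowercase ASCII letters (the original `(.)` would also match digits, punctuation, and so on).

```javascript
// Build the alternation for lowercase ASCII letters: a{3,}|b{3,}|...|z{3,}
const letters = Array.from({ length: 26 }, (_, i) => String.fromCharCode(97 + i));
const alternation = new RegExp(letters.map((c) => `${c}{3,}`).join('|'));

// Backreference version, limited to [a-z] for a like-for-like comparison.
// JavaScript supports \1; BigQuery's regex engine does not.
const backreference = /([a-z])\1{2,}/;

// Illustrative inputs: some with a run of 3+ identical letters, some without.
const samples = ['mikeeee', 'dylaaan', 'anna', 'bob', 'aabbcc', 'zzz'];

const results = samples.map((input) => ({
    input,
    alternation: alternation.test(input),
    backreference: backreference.test(input),
}));

console.table(results);
```

Both columns should agree for every row. This only shows that the two patterns behave the same on these samples in JavaScript; it is not a test of BigQuery itself.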

When the set of values the backreference could capture is relatively small, you can usually get away with generating every possibility with a script. That approach may not suit more complex problems, but for this case it can be done in JS like this:

function generateRegexPattern() {
    // One alternative per lowercase ASCII letter: a{3,}|b{3,}|...|z{3,}
    return Array.from({ length: 26 }, (_, i) => {
        const char = String.fromCodePoint(97 + i); // ASCII 97 is 'a'
        return `${char}{3,}`;
    }).join('|');
}

console.log(generateRegexPattern());
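If the names can also contain uppercase letters, digits, or punctuation, the same idea extends to any explicit list of characters. The following is only a sketch: `escapeRegexChar` and `generateRepeatPattern` are hypothetical helper names, and the escaping assumes the target engine accepts a backslash before punctuation.

```javascript
// Hypothetical helper: escape characters that have a special meaning in regex syntax.
function escapeRegexChar(char) {
    return char.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: one alternative per character, each repeated at least minRepeats times.
function generateRepeatPattern(chars, minRepeats = 3) {
    return [...chars]
        .map((c) => `${escapeRegexChar(c)}{${minRepeats},}`)
        .join('|');
}

// Small example, including a character that needs escaping.
console.log(generateRepeatPattern('abc.')); // a{3,}|b{3,}|c{3,}|\.{3,}

// Larger alphabet: lowercase, uppercase, and digits.
const alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
const widePattern = generateRepeatPattern(alphabet);
console.log(new RegExp(widePattern).test('JAAAck')); // true
```

The pattern grows linearly with the alphabet, so this stays practical for a few dozen characters but not for "any character at all".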

Side note for disconnected matching

Because the backreference here points at the immediately preceding character, the workaround is much easier than when the two occurrences being matched are separated by other text. For simple cases we can still manage, though. For instance, if we want to match a small set of HTML tags, we can generate a regex that pairs each opening tag with its own closing tag, e.g.:

function generateRegexFromTemplate(template, values) {
    let pattern = values.map(value => template.replaceAll("{{matchingOption}}", value)).join('|');
    return `(${pattern})`;
}

// Example usage
let regexTemplate = "<{{matchingOption}}>.*?<\\/{{matchingOption}}>";
let matchingOptions = ["div", "span", "p", "a", "ul", "li", "table", "tr", "td"];

console.log(generateRegexFromTemplate(regexTemplate, matchingOptions));

Which gives:

(<div>.*?<\/div>|<span>.*?<\/span>|<p>.*?<\/p>|<a>.*?<\/a>|<ul>.*?<\/ul>|<li>.*?<\/li>|<table>.*?<\/table>|<tr>.*?<\/tr>|<td>.*?<\/td>)
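To see that the generated alternation behaves like a backreference on the tag name, it can be run against a few inputs. This is a minimal sketch that repeats the generator so it is self-contained and uses a shorter tag list; the sample strings are made up for illustration.

```javascript
function generateRegexFromTemplate(template, values) {
    const pattern = values
        .map((value) => template.replaceAll('{{matchingOption}}', value))
        .join('|');
    return `(${pattern})`;
}

// Shorter tag list than above, just to keep the example small.
const tagRegex = new RegExp(
    generateRegexFromTemplate('<{{matchingOption}}>.*?<\\/{{matchingOption}}>', ['div', 'span', 'p'])
);

console.log(tagRegex.test('<div>hello</div>'));       // true: same tag opens and closes
console.log(tagRegex.test('<div>hello</span>'));      // false: closing tag differs
console.log(tagRegex.test('<p>a</p><span>b</span>')); // true: first pair already matches
```

Note that the template as written only matches bare tags, so something like `<a href="...">` would need a more permissive template.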
