Datatable parsing, wrongly escape regexp "\w" #364

Open
neskk opened this issue Feb 4, 2025 · 5 comments
Labels
🐛 bug Defect / Bug

Comments

@neskk

neskk commented Feb 4, 2025

👓 What did you see?

I use regular expressions in data tables to perform assertions.
Parsing the data table below:

And the response body contains:
  | var1      | [matches regexp] abc\|cde     |
  | var2      | [matches regexp] \w+\|cde     |
  | var3      | [matches regexp] \\w+\|cde    |

I get ['var2', '[matches regexp] \\\\w+|cde'] and ['var3', '[matches regexp] \\\\w+|cde'], which breaks my matcher.

✅ What did you expect to see?

I would expect the second row to be parsed as:

  • ['var2', '[matches regexp] \w+|cde']
    but instead I get:
  • ['var2', '[matches regexp] \\\\w+|cde']
    which breaks my matcher.

I would expect the third row to be parsed as:

  • ['var3', '[matches regexp] \w+|cde']
    but instead I also get:
  • ['var3', '[matches regexp] \\\\w+|cde']
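
To illustrate why the extra backslash breaks a matcher, here is a minimal sketch. It assumes the values shown above are Python reprs, so the received cell holds two literal backslashes before the w; the variable names are only for illustration.

```python
import re

# Expected cell content: one backslash, so \w is the "word character" class.
expected = r"\w+|cde"
# Received cell content (assumption: two literal backslashes, as the repr suggests).
# As a regex this means: a literal backslash followed by one or more "w", or "cde".
received = r"\\w+|cde"

print(re.fullmatch(expected, "abc"))     # a match object
print(re.fullmatch(received, "abc"))     # None
print(re.fullmatch(received, "\\www"))   # a match object, for the text \www
```

The second alternative, cde, still matches with either pattern, so only the \w+ branch changes behaviour.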

📦 Which tool/library version are you using?

python 3.10
pytest-bdd 8.1.0
gherkin-official 29.0.0

🔬 How could we reproduce it?

  1. Create a step definition that expects a datatable.
  2. Create a feature file that passes a datatable containing \w+ or another regex pattern.
  3. Log/print the received datatable.

📚 Any additional context?

No response

@neskk
Author

neskk commented Feb 4, 2025 via email

@mpkorstanje
Contributor

Oh yeah. That's definitely a bug.

@mpkorstanje mpkorstanje added the 🐛 bug Defect / Bug label Feb 4, 2025
@neskk
Author

neskk commented Feb 4, 2025

I was using a custom data-table parser I built and it handles these situations much better:

import re

# match every '|' character that is not preceded by a '\' character
COLUMN_SPLIT_REGEXP = re.compile(r"(?<!\\)\|")

def parse_datatable(input_str: str) -> list[list[str]]:
    res = []
    for line in input_str.split("\n"):
        line = line.strip()  # noqa: PLW2901
        if not line:
            continue  # skip empty lines
        if line.startswith("#"):
            continue  # skip comment lines

        cells = [col.strip().replace(r"\|", "|") for col in COLUMN_SPLIT_REGEXP.split(line)]

        # discard content before and after the table delimiter
        if cells[0] != "" or cells[-1] != "":
            raise ValueError("failed to parse datatable: bad syntax")
        res.append(cells[1:-1])

    return res

@mpkorstanje
Contributor

mpkorstanje commented Feb 4, 2025

Spaces are not required to separate the pipes. So at a glance, that would fail against

|hello|world|
|\\|\|\||

Which should contain hello, world, \ and ||.

We do have a test case covering this functionality, so it's kinda surprising that it passes. Feel free to look into this deeper!

@neskk
Author

neskk commented Feb 4, 2025

My implementation does fail with your example, but the Python gherkin parser also fails to return the correct content:

|hello|world|
|\\|\|\||

returns:

[['hello', 'world'], ['\\\\', '||']]
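
For comparison, a single left-to-right scan avoids the lookbehind problem, because each backslash consumes the character that follows it. The sketch below is not the gherkin-official implementation; `split_row` is a hypothetical helper, and it assumes the escapes described in the Gherkin reference (\| for a pipe, \\ for a backslash, \n for a newline), with any other backslash sequence such as \w kept verbatim.

```python
def split_row(line: str) -> list[str]:
    """Split one table row into unescaped cells (hypothetical helper)."""
    line = line.strip()
    if not line.startswith("|"):
        raise ValueError("failed to parse datatable: row must start with '|'")

    cells: list[str] = []
    current: list[str] = []
    chars = iter(line[1:])  # skip the leading pipe
    for ch in chars:
        if ch == "|":
            # Unescaped pipe: the current cell ends here.
            # Simplification: stripping after unescaping also trims an escaped
            # newline at the edge of a cell.
            cells.append("".join(current).strip())
            current = []
        elif ch == "\\":
            nxt = next(chars, None)  # the backslash consumes the next character
            if nxt is None:
                current.append("\\")       # dangling backslash at end of line
            elif nxt == "n":
                current.append("\n")       # \n becomes a newline
            elif nxt in ("|", "\\"):
                current.append(nxt)        # \| becomes |, \\ becomes \
            else:
                current.append("\\" + nxt) # assumption: unknown escapes kept as-is
        else:
            current.append(ch)

    if "".join(current).strip():
        raise ValueError("failed to parse datatable: content after last '|'")
    return cells


for row in (r"|hello|world|", r"|\\|\|\||", r"| var2 | [matches regexp] \w+\|cde |"):
    print(split_row(row))
```

With these rules, the second row yields a single backslash and two pipes, and both \w+ and \\w+ in the original report come out as \w+.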
