Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Many serious TOML parsing/loading bugs caused by underlying toml library, consider switching? #439

Closed
pirate opened this issue Sep 24, 2024 · 5 comments
Assignees
Labels
bug Something isn't working

Comments

@pirate
Copy link

pirate commented Sep 24, 2024

Thanks so much for building python-benedict, it's really awesome!

So unfortunatley the underlying TOML library https://github.com/uiri/toml used by benedict has a bunch of long-time outstanding bugs that break parsing/loading and generally make it unsafe to load->dump->load the same string.

For example, you cannot dump any dict containing escape sequences without it throwing an exception:

>>> benedict({
	"reset": "\033[00;00m",
	"lightblue": "\033[01;30m",
}).to_toml()

...
File ~/test/.venv/lib/python3.11/site-packages/toml/encoder.py:113, in _dump_str(v)
    111     else:
    112         joiner = "u00"
--> 113     v = [v[0] + joiner + v[1]] + v[2:]
    114 return unicode('"' + v[0] + '"')

IndexError: list index out of range

But thats not it, there are many other fairly major string escaping, quoting, and parse/dump cycle inconsistency bugs that have bitten other projects using uiri/toml:

Almost all of these are 2yr+ old, indicating they're probably not going to all get fixed anytime soon without significant increase in velocity. No harm no foul, it's open source we can't demand they go faster and they don't owe us anything, but maybe benedict could consider a different library with fewer major outstanding parser consistency issues?

Is benedict open to switching to a library without these issues? Maybe one of these:

I'll chip in $20 towards the work to make the switch if it's an option ⬇️

Upvote & Fund

  • We're using Polar.sh so you can upvote and help fund this issue.
  • We receive the funding once the issue is completed & confirmed by you.
  • Thank you in advance for helping prioritize & fund our backlog.
Fund with Polar
@pirate pirate added the bug Something isn't working label Sep 24, 2024
@pirate
Copy link
Author

pirate commented Sep 24, 2024

After looking more into the other options unfortunately they are worse in other ways, for example they can barely handle a fraction of the types that uiri/toml supports :/

I guess we are stuck writing a custom TomlEncoder class to work around the bugs for now.

import re
from pathlib import Path

import toml

def better_toml_dump_str(val):
    try:
        return toml.encoder._dump_str(val)
    except Exception:
        # if we hit any of toml's numerous encoding bugs,
        # fall back to using json representation of string
        return json.dumps(str(val), default=repr)

class CustomTOMLEncoder(toml.encoder.TomlEncoder):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        
        # override the dumper funcs for any types you need to support here:
        self.dump_funcs[str] 			= better_toml_dump_str
		self.dump_funcs[Path] 			= better_toml_dump_str
        self.dump_funcs[re.RegexFlag] 	= better_toml_dump_str
        ...
# this works:
>>> test_dict = {
	"string_with_escape_sequences": "\033[00;00m",
	"some_Path": "/some/path/example.txt",
	"example_weird_type": re.IGNORECASE | re.UNICODE | re.MULTILINE,
	# etc. try more weird types here
}
>>> print(benedict(test_dict).to_toml(encoder=CustomTOMLEncoder()))

string_with_escape_sequences = "\u001b[00;00m"
example_weird_type = "re.IGNORECASE|re.UNICODE|re.MULTILINE"

You're welcome to close this issue 🤷, up to you.

@fabiocaccamo
Copy link
Owner

@pirate thank you for reporting this problem.

Actually, for decoding toml on Python >= 3.11, the tomlib standard lib is used:
https://github.com/fabiocaccamo/python-benedict/blob/main/benedict/serializers/toml.py#L10

I think it would be cool to improve TOML support, would you like to submit a PR with a list of failing test cases (that should succeed with the perfect library)?
So it would be much easier to test other TOML libraries or doing fixes in a different way.

@fabiocaccamo
Copy link
Owner

@pirate is it in your plans to submit a PR for this?

@pirate
Copy link
Author

pirate commented Oct 30, 2024

I think given there are only two real python toml library options and you're already using the best one, I think it's not worth trying to patch the toml bugs on top within benedict.

You can leave it up to end users to figure out a workaround if they really need one (e.g. my CustomTOMLEncoder above). I just wanted to document this issue so people could find it via Google if they have the same problem and wondered the cause.

Here is a good test suite for TOML parsing though if you want one: https://github.com/toml-lang/toml-test

@pirate pirate closed this as completed Oct 30, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in Open Source Oct 30, 2024
@fabiocaccamo
Copy link
Owner

@pirate totally agree, it would make more sense to fix these issue at toml lib level.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Archived in project
Development

No branches or pull requests

2 participants