Newline characters cause rows in PostgreSQL table to be broken inadvertently. #31

Open
YPCrumble opened this issue Jun 17, 2020 · 2 comments

Comments

@YPCrumble

This issue replaces #30.

The problem is that user-entered data containing any of these newline characters:

  • \u2028
  • \u2029
  • \x85

causes the dump to be parsed as if a single row were split across several lines. As a result, the dump raises:

ValueError("Mismatch between column names and values.")
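A minimal sketch of why this can happen, assuming the dump output is split with `str.splitlines()` or something equivalent (the thread does not confirm this). Python's `str.splitlines()` treats all three characters as line boundaries, while splitting on `"\n"` does not. The sample row is made up for illustration.

```python
# A hypothetical tab-separated row whose text column contains the three
# characters from the report.
row = "1\tfirst\u2028second\u2029third\x85fourth"

# str.splitlines() treats U+2028, U+2029 and U+0085 as line boundaries,
# so one row becomes several "lines".
pieces_splitlines = row.splitlines()

# Splitting on "\n" only keeps the row in one piece.
pieces_newline = row.split("\n")

print(len(pieces_splitlines))  # 4
print(len(pieces_newline))     # 1
```

If the parser then matches each piece against the column names, the extra pieces would produce exactly this kind of column/value mismatch.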

To work around it, I piped the pg_dump output through sed in the Python code that spawns the processes:

    process = subprocess.Popen(
        (
            "pg_dump",
            # Force output to be UTF-8 encoded.
            "--encoding=utf-8",
            # Quote all table and column names, just in case.
            "--quote-all-identifiers",
            # Luckily `pg_dump` supports DB URLs, so we can just pass it the
            # URL as argument to the command.
            "--dbname",
            url.geturl().replace('postgis://', 'postgresql://'),
        ) + tuple(extra_params),
        stdout=subprocess.PIPE,
    )

    # Replace the Unicode line separators with spaces. A single `sed` call
    # with an argument list avoids relying on the shell's `$'...'` quoting,
    # which plain POSIX `sh` does not support.
    process = subprocess.Popen(
        [
            "sed",
            "-e", "s/\u2028/ /g",
            "-e", "s/\u2029/ /g",
            "-e", "s/\x85/ /g",
        ],
        stdin=process.stdout,
        stdout=subprocess.PIPE,
    )

I'd be happy to submit this as a PR if it's helpful. Or is there a better way to handle the issue?
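One possible alternative to the sed pipeline above is to do the replacement in Python after decoding, so it works on whole code points rather than on bytes. This is only a sketch: `LINE_BREAK_TABLE` and `clean_lines` are hypothetical names, not part of this project.

```python
# Map each problematic code point to a plain space.
LINE_BREAK_TABLE = str.maketrans({"\u2028": " ", "\u2029": " ", "\x85": " "})


def clean_lines(lines):
    """Yield each line with the Unicode line separators replaced by spaces."""
    for line in lines:
        yield line.translate(LINE_BREAK_TABLE)


# Made-up sample lines standing in for decoded pg_dump output.
sample = ["a\u2028b\n", "c\x85d\n", "e\u2029f\n"]
cleaned = list(clean_lines(sample))
print(cleaned)
```

The input could be the pg_dump pipe wrapped in `io.TextIOWrapper(process.stdout, encoding="utf-8", newline="\n")`. Iterating over a text-mode stream opened that way splits on `"\n"` only, so the rows arrive intact before they are cleaned.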

@azin634
Contributor

azin634 commented Dec 10, 2021

I had a similar issue in MySQL. See if the fix in #29 would work.

@YPCrumble
Author

@azin634 this seems to help with the first two types of newlines, but not all of them. I'm now getting this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 1: invalid continuation byte
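One way an error of this shape can arise (a guess, not confirmed anywhere in this thread) is replacing the raw byte 0x85 before decoding. In UTF-8, 0x85 is also a valid continuation byte inside other characters, so a byte-level replacement can cut a multi-byte sequence in half. The sample bytes below are made up to illustrate that.

```python
# "a" followed by U+0505, which UTF-8 encodes as the two bytes 0xD4 0x85.
encoded = "a\u0505".encode("utf-8")

# A byte-level replacement of 0x85 leaves the lead byte 0xD4 stranded.
damaged = encoded.replace(b"\x85", b" ")

try:
    damaged.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print(decode_error)
```

Replacing the characters after decoding, or matching the full UTF-8 byte sequence 0xC2 0x85 for U+0085, would avoid this particular failure mode.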
