Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[destination-postgres] Intelligently handle truncation of long column names #35268

Closed
1 task
alvinwanyeki opened this issue Feb 14, 2024 · 9 comments
Closed
1 task
Labels
area/connectors Connector related issues certified community connectors/destination/postgres frozen Not being actively worked on team/destinations Destinations team's backlog type/bug Something isn't working

Comments

@alvinwanyeki
Copy link

alvinwanyeki commented Feb 14, 2024

Connector Name

source-airtable

Connector Version

4.1.6

What step the error happened?

During the sync

Relevant information

Issue

  • When syncing data from Airtable to a database or datawarehouse destination. Columns exceeding the character limit in the destination are trunctated (expected behavior).
  • Airbyte fails to intelligently handle instances where multiple columns from the same table that have the same 'leading' characters which leads the sync failing because the destination throws an error around there being multiple instances of the same column.

Illustration

  • If we have table foo from Airtable with the following columns (note that the last word is different across both columns):
    • this_is_a_column_with_an_overly_exceedingly_long_name
    • this_is_a_column_with_an_overly_exceedingly_long_title
  • The above columns will be truncated to:
    • this_is_a_column_with_an_overly_exceedingly_long_
    • this_is_a_column_with_an_overly_exceedingly_long_
  • This will lead to the db throwing an error (and the sync failing) because the columns have the same name.

Suggested Solution

  • Airbyte should detect instances where column names being truncated will lead to instances of different columns ending up with the same name -- it could append a number to the column (the nth instance of the column), so the fields in the example above would be:
    • this_is_a_column_with_an_overly_exceedingly_long1
    • this_is_a_column_with_an_overly_exceedingly_long2

Relevant log output

No response

Contribute

  • Yes, I want to contribute
@marcosmarxm
Copy link
Member

What destination are you using? Are you still using normalization or Dv2 destination?

@alvinwanyeki
Copy link
Author

@marcosmarxm -- we're using destination-postgres, Dv2.

@marcosmarxm
Copy link
Member

Thanks @alvinwanyeki it looks the problem should be solved by the destination and not by the source. I'll update the ticket.

@marcosmarxm marcosmarxm changed the title [source-airtable] Intelligently handle truncation of long column names [destination-postgres] Intelligently handle truncation of long column names Feb 15, 2024
@marcosmarxm
Copy link
Member

@gisripa can you take a look in this issue. Is this expected?

@jbfbell
Copy link
Contributor

jbfbell commented Feb 20, 2024

We think other destinations will also have this problem and will look into a broader solution

@gisripa
Copy link
Contributor

gisripa commented Feb 20, 2024

@marcosmarxm Yeah the default behavior in postgres is to truncate. we plan to handle this as part of certification later.

@jbfbell
Copy link
Contributor

jbfbell commented Mar 1, 2024

Looked into this more today - we would like to give users more control over the name result and don't want to make assumptions in the destination. As a bandaid, we created https://github.com/airbytehq/airbyte-internal-issues/issues/6645 to at least detect when this happens and throw an error.

If you are running your own postgres instance, there is a workaround but its not great, from the docs

The system uses no more than NAMEDATALEN-1 bytes of an identifier; longer names can be written in commands, but they will be truncated. By default, NAMEDATALEN is 64 so the maximum identifier length is 63 bytes. If this limit is problematic, it can be raised by changing the NAMEDATALEN constant in src/include/pg_config_manual.h.

So you can change this value, but you have to recompile postgres - which is likely not possible for managed postgres offerings.

@bleonard bleonard added the frozen Not being actively worked on label Mar 22, 2024
@jbfbell jbfbell added frozen Not being actively worked on and removed frozen Not being actively worked on labels Mar 26, 2024
@edgao
Copy link
Contributor

edgao commented Apr 10, 2024

@alvinwanyeki can you give destination-postgres v2.0.8 a try? It includes #36805, which adds somewhat better handling for long column names. Specifically, it keeps the start and end of the long name, and cuts out the middle section.

It's not perfect (the collision checker does a dumb truncation for the first colliding name), so e.g. if your destination has a 16-char limit, and your stream has two columns example_very_long_name_the_first and example_very_long_name_the_second you'll end up with

  1. example_very_lon
  2. examp23econd (23 = the number of characters removed)

but that's just to keep backwards-compatibility with previous logic (not married to it, but it's a bigger + breaking change to do a consistent transform over all columns. tracking in https://github.com/airbytehq/airbyte-internal-issues/issues/6989 for improving the overall logic)

@alvinwanyeki
Copy link
Author

@edgao, sorry for the delayed response here -- thanks for the update, i can confirm that the name collisions no longer break a sync on >v2.0.8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues certified community connectors/destination/postgres frozen Not being actively worked on team/destinations Destinations team's backlog type/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants