Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: fix athena partitions limit #360

Merged
merged 9 commits into from
Aug 15, 2023
Merged

feat: fix athena partitions limit #360

merged 9 commits into from
Aug 15, 2023

Conversation

svdimchenko
Copy link
Contributor

@svdimchenko svdimchenko commented Jul 28, 2023

Description

Resolves: #87

Models used to test

For table materialization (iceberg):

{{
  config(
    schema='sandbox',
    materialized='table',
    partitioned_by=['DAY(date_column)', 'doy'],
    table_type='iceberg'
  )
}}

SELECT
    CAST(date_column AS DATE) as date_column,
    doy(date_column) as doy,
    random() as rnd
FROM (
    VALUES (
        SEQUENCE(FROM_ISO8601_DATE('2023-01-01'), FROM_ISO8601_DATE('2023-07-24'), INTERVAL '1' DAY)
    )
) AS t1(date_array)
CROSS JOIN UNNEST(date_array) AS t2(date_column)

For table materialization (hive):

{{
  config(
    schema='sandbox',
    materialized='table',
    partitioned_by=['date_column', 'doy'],
    table_type='hive'
  )
}}

SELECT
    random() as rnd,
    CAST(date_column AS DATE) as date_column,
    doy(date_column) as doy
FROM (
    VALUES (
        SEQUENCE(FROM_ISO8601_DATE('2020-01-01'), FROM_ISO8601_DATE('2023-07-24'), INTERVAL '1' DAY)
    )
) AS t1(date_array)
CROSS JOIN UNNEST(date_array) AS t2(date_column)

For incremental materialization:

{{
  config(
    schema='sandbox',
    materialized='incremental',
    incremental_strategy='merge',
    partitioned_by=['DAY(date_column)', 'doy'],
    table_type='iceberg',
    unique_key=['doy']
  )
}}

select * from {{ ref('mat_table') }}

Checklist

  • You followed contributing section
  • You kept your Pull Request small and focused on a single feature or bug fix.
  • You added unit testing when necessary
  • You added functional testing when necessary

@svdimchenko svdimchenko changed the title [feat] Fix athena partitions limit feat: fix athena partitions limit Jul 28, 2023
@svdimchenko svdimchenko marked this pull request as ready for review July 28, 2023 16:12
Jrmyy
Jrmyy previously approved these changes Jul 31, 2023
Copy link
Contributor

@Jrmyy Jrmyy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing 🔥

@Jrmyy Jrmyy added the enable-functional-tests Label to trigger functional testing label Jul 31, 2023
dbt/adapters/athena/impl.py Outdated Show resolved Hide resolved
@Jrmyy
Copy link
Contributor

Jrmyy commented Aug 1, 2023

How is handled when we want to redo a certain batch ? Is there a way to delete overlapping partitions as it was done in the insert_by_period made by Jesse ?
Especially I don't see examples with is_incremental condition.

@svdimchenko
Copy link
Contributor Author

svdimchenko commented Aug 3, 2023

How is handled when we want to redo a certain batch ? Is there a way to delete overlapping partitions as it was done in the insert_by_period made by Jesse ? Especially I don't see examples with is_incremental condition.

@Jrmyy all I've changed the way how the tmp table is created and how the data from it is inserted into target relation. All other logic including is_incremental condition and delete overlapping partitions stay as they were before. I'm now testing these changes on my own project (only iceberg tables to be honest) and everything works OK. But of course, I would appreciate help with testing on different models to catch possible edge cases.

@nicor88 nicor88 requested a review from Jrmyy August 8, 2023 14:52
Copy link
Contributor

@nicor88 nicor88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@svdimchenko this is an amazing piece of work 💯
I left an improvement comment - also do you think that we can add some integration tests where we test that the actually written data is correct:

  • integrity - e.g. the dataset return by sql mode match what we have in the final table
  • covered by the integrity check, but maybe a check to see that all the values of the partitions are in the actual final table

@nicor88
Copy link
Contributor

nicor88 commented Aug 14, 2023

@svdimchenko I did some more tests - let's fix the conflicts and merge ok? - I'm looking forward to having this merge before we release 1.6 :)

@nicor88 nicor88 self-requested a review August 14, 2023 17:49
Copy link
Contributor

@nicor88 nicor88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@nicor88 nicor88 merged commit ea3cd1d into dbt-labs:main Aug 15, 2023
9 checks passed
@antonysouthworth-halter
Copy link
Contributor

hey does this work for (non-iceberg) tables that are bucketed in addition to partitioned?

@nicor88
Copy link
Contributor

nicor88 commented Sep 25, 2023

AFIK for bucketing (hive and not hive) you won't get any partition limitation. Did you face partition limitations with bucketing?
Bucketing allows you to write objects with the same hash value in the same object, it does not create any prefix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enable-functional-tests Label to trigger functional testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Partition Limitations
4 participants