Specifying external_location isn't cleaning up the existing data. #107

VDFaller · 2022-06-21T14:29:05Z

Problem

I'm not understanding the accepted way to make external_location dynamic from dev to production.

I unfortunately can't use the s3_staging_dir from profiles.yml because of a constraint of my IT. Also the standard {s3_staging_dir}/tables/uuid doesn't make it easy to know which data goes with which table while looking through s3.

Because right now if I run it twice I'm getting

HIVE_PATH_ALREADY_EXISTS: Target directory for table 'dev_test.tbl_test' already exists: s3://my_external_location/PARQUET_STORE/dev_test/tbl_test. You may need to manually clean the data at location 's3://my_s3_staging_dir/tables/4d8dcf55-5fc0-424d-8668-0b6f182fb254' before retrying. Athena will not delete data in your account.

my tbl_test is simply

{{ config(materialized='table', external_location='s3://my_external_location/PARQUET_STORE/dev_test/tbl_test/')  }}

SELECT 'Test' AS "Fun"

I'm hoping I don't have to manually drop the table

Pipe dream

I'd love it if I could have the external_location in the form of

s3://my-bucket/some_extra_path/tbl_schema/tbl_name

Where I give the external_location prefix (different one for dev, qa, prod) and the file would know what table schema it is going to and what the table name is without me having to configure it.

Possible fix

PR 74 seems to fix the issue.

The text was updated successfully, but these errors were encountered:

ojaved-equip · 2022-06-24T15:52:43Z

The tables not getting cleaned up was resolved by #73

For making external_location dynamic in environments I specify a different staging directory for dev vs. prod and use the target.s3_staging_dir variable when setting the external location.

{{ config(materialized='table', external_location=target.s3_staging_dir+'_tables/[table name]') }}

VDFaller · 2022-06-24T19:36:00Z

I saw that one, but I updated to the latest version of DBT and it's still happening. Has #73 not been released yet?

Also, I can't use the same bucket for staging as I do for external_location (a constraint from my IT).
I tried to add a ex_loc var to my profiles and do

{{ config(materialized='table', external_location=target.ex_loc+target.schema+'/table_workorder_component/')  }}

But that's no bueno.

'ex_loc' is undefined. This can happen when calling a macro that does not exist. Check for typos and/or install package dependencies with "dbt deps".

ojaved-equip · 2022-06-24T20:09:31Z

Did you update just DBT or dbt-athena plugin? I just updated the dbt-athena plugin and its working fine for me.

I'm not sure if there is a way to get around the different buckets but did see #50 still open which looks like what we really need.

VDFaller · 2022-06-24T20:46:03Z

Just double checked and upgraded again and it's still giving me the issue. And from the code change of #73 that looks like it only has to do with incremental tables. Mine is not incremental. Maybe that's the disconnect?

and yeah #50 looks like what I'm looking for for dev env things.

mrshu · 2022-06-24T20:48:58Z

@VDFaller #73 has already been released (in the sense that it's in the master branch). That change now applies to all tables, not just the incremental ones.

VDFaller · 2022-06-24T21:41:18Z

@mrshu Well I'm still getting this issue. /shrug I'll try dbt-glue

VDFaller · 2022-07-07T18:02:45Z

Okay so I think I found the problem. clean_up_table is using my default credentials, (which don't have permission to get table for this environment, not even remote related actually). So it's failing. And since the error code was AccessDeniedException it was failing blindly. Cascading into something hard to debug.

I changed my profile to use default and changed that to the credentials that actually matter for this and it's running fine. With or without external location.

I also changed the function to this and it worked even with default being unrelated.

    @available
    def clean_up_table(
        self, database_name: str, table_name: str
    ):
        # Look up Glue partitions & clean up
        conn = self.connections.get_thread_connection()
        session = boto3.session.Session(profile_name=conn._credentials.aws_profile_name)
        with boto3_client_lock:
            glue_client = session.client('glue')
        try:
            print("trying to get table")
            table = glue_client.get_table(
                DatabaseName=database_name,
                Name=table_name
            )
        except ClientError as e:
            if e.response['Error']['Code'] == 'EntityNotFoundException':
                logger.debug("Table '{}' does not exists - Ignoring", table_name)
                return
        if table is not None:
            logger.debug("Deleting table data from'{}'", table["Table"]["StorageDescriptor"]["Location"])
            p = re.compile('s3://([^/]*)/(.*)')
            m = p.match(table["Table"]["StorageDescriptor"]["Location"])
            if m is not None:
                bucket_name = m.group(1)
                prefix = m.group(2)
                s3_resource = session.resource('s3')
                s3_bucket = s3_resource.Bucket(bucket_name)
                s3_bucket.objects.filter(Prefix=prefix).delete()

Edit Oh but this won't work for you CI Pipeline since you use env vars. crap.

Gatsby-Lee · 2022-07-21T23:37:29Z

@VDFaller Thank you for sharing your research in the issue!!
I had the similar issue like sth failed blindly.

VDFaller changed the title ~~Documentation: Process to change external_location by environment~~ Specifying external_location isn't cleaning up the existing data. Jun 21, 2022

VDFaller mentioned this issue Jul 19, 2022

Seeds fail to update when already applied #113

Open

Avinash-1394 pushed a commit to Avinash-1394/dbt-athena that referenced this issue Jan 12, 2023

Update __version__.py (Tomme#107)

1281338

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Specifying external_location isn't cleaning up the existing data. #107

Specifying external_location isn't cleaning up the existing data. #107

VDFaller commented Jun 21, 2022 •

edited

Loading

ojaved-equip commented Jun 24, 2022

VDFaller commented Jun 24, 2022 •

edited

Loading

ojaved-equip commented Jun 24, 2022

VDFaller commented Jun 24, 2022

mrshu commented Jun 24, 2022

VDFaller commented Jun 24, 2022

VDFaller commented Jul 7, 2022 •

edited

Loading

Gatsby-Lee commented Jul 21, 2022

Specifying external_location isn't cleaning up the existing data. #107

Specifying external_location isn't cleaning up the existing data. #107

Comments

VDFaller commented Jun 21, 2022 • edited Loading

Problem

Pipe dream

Possible fix

ojaved-equip commented Jun 24, 2022

VDFaller commented Jun 24, 2022 • edited Loading

ojaved-equip commented Jun 24, 2022

VDFaller commented Jun 24, 2022

mrshu commented Jun 24, 2022

VDFaller commented Jun 24, 2022

VDFaller commented Jul 7, 2022 • edited Loading

Gatsby-Lee commented Jul 21, 2022

VDFaller commented Jun 21, 2022 •

edited

Loading

VDFaller commented Jun 24, 2022 •

edited

Loading

VDFaller commented Jul 7, 2022 •

edited

Loading