Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specifying external_location isn't cleaning up the existing data. #107

Open
VDFaller opened this issue Jun 21, 2022 · 8 comments
Open

Specifying external_location isn't cleaning up the existing data. #107

VDFaller opened this issue Jun 21, 2022 · 8 comments

Comments

@VDFaller
Copy link

VDFaller commented Jun 21, 2022

Problem

I'm not understanding the accepted way to make external_location dynamic from dev to production.

I unfortunately can't use the s3_staging_dir from profiles.yml because of a constraint of my IT. Also the standard {s3_staging_dir}/tables/uuid doesn't make it easy to know which data goes with which table while looking through s3.

Because right now if I run it twice I'm getting

HIVE_PATH_ALREADY_EXISTS: Target directory for table 'dev_test.tbl_test' already exists: s3://my_external_location/PARQUET_STORE/dev_test/tbl_test. You may need to manually clean the data at location 's3://my_s3_staging_dir/tables/4d8dcf55-5fc0-424d-8668-0b6f182fb254' before retrying. Athena will not delete data in your account.

my tbl_test is simply

{{ config(materialized='table', external_location='s3://my_external_location/PARQUET_STORE/dev_test/tbl_test/')  }}

SELECT 'Test' AS "Fun"

I'm hoping I don't have to manually drop the table

Pipe dream

I'd love it if I could have the external_location in the form of

s3://my-bucket/some_extra_path/tbl_schema/tbl_name

Where I give the external_location prefix (different one for dev, qa, prod) and the file would know what table schema it is going to and what the table name is without me having to configure it.

Possible fix

PR 74 seems to fix the issue.

@VDFaller VDFaller changed the title Documentation: Process to change external_location by environment Specifying external_location isn't cleaning up the existing data. Jun 21, 2022
@ojaved-equip
Copy link

The tables not getting cleaned up was resolved by #73

For making external_location dynamic in environments I specify a different staging directory for dev vs. prod and use the target.s3_staging_dir variable when setting the external location.

{{ config(materialized='table', external_location=target.s3_staging_dir+'_tables/[table name]') }}

@VDFaller
Copy link
Author

VDFaller commented Jun 24, 2022

I saw that one, but I updated to the latest version of DBT and it's still happening. Has #73 not been released yet?

Also, I can't use the same bucket for staging as I do for external_location (a constraint from my IT).
I tried to add a ex_loc var to my profiles and do

{{ config(materialized='table', external_location=target.ex_loc+target.schema+'/table_workorder_component/')  }}

But that's no bueno.

'ex_loc' is undefined. This can happen when calling a macro that does not exist. Check for typos and/or install package dependencies with "dbt deps".

@ojaved-equip
Copy link

Did you update just DBT or dbt-athena plugin? I just updated the dbt-athena plugin and its working fine for me.

I'm not sure if there is a way to get around the different buckets but did see #50 still open which looks like what we really need.

@VDFaller
Copy link
Author

Just double checked and upgraded again and it's still giving me the issue. And from the code change of #73 that looks like it only has to do with incremental tables. Mine is not incremental. Maybe that's the disconnect?

and yeah #50 looks like what I'm looking for for dev env things.

@mrshu
Copy link
Contributor

mrshu commented Jun 24, 2022

@VDFaller #73 has already been released (in the sense that it's in the master branch). That change now applies to all tables, not just the incremental ones.

@VDFaller
Copy link
Author

@mrshu Well I'm still getting this issue. /shrug I'll try dbt-glue

@VDFaller
Copy link
Author

VDFaller commented Jul 7, 2022

Okay so I think I found the problem. clean_up_table is using my default credentials, (which don't have permission to get table for this environment, not even remote related actually). So it's failing. And since the error code was AccessDeniedException it was failing blindly. Cascading into something hard to debug.

I changed my profile to use default and changed that to the credentials that actually matter for this and it's running fine. With or without external location.

I also changed the function to this and it worked even with default being unrelated.

    @available
    def clean_up_table(
        self, database_name: str, table_name: str
    ):
        # Look up Glue partitions & clean up
        conn = self.connections.get_thread_connection()
        session = boto3.session.Session(profile_name=conn._credentials.aws_profile_name)
        with boto3_client_lock:
            glue_client = session.client('glue')
        try:
            print("trying to get table")
            table = glue_client.get_table(
                DatabaseName=database_name,
                Name=table_name
            )
        except ClientError as e:
            if e.response['Error']['Code'] == 'EntityNotFoundException':
                logger.debug("Table '{}' does not exists - Ignoring", table_name)
                return
        if table is not None:
            logger.debug("Deleting table data from'{}'", table["Table"]["StorageDescriptor"]["Location"])
            p = re.compile('s3://([^/]*)/(.*)')
            m = p.match(table["Table"]["StorageDescriptor"]["Location"])
            if m is not None:
                bucket_name = m.group(1)
                prefix = m.group(2)
                s3_resource = session.resource('s3')
                s3_bucket = s3_resource.Bucket(bucket_name)
                s3_bucket.objects.filter(Prefix=prefix).delete()

Edit Oh but this won't work for you CI Pipeline since you use env vars. crap.

@Gatsby-Lee
Copy link

@VDFaller Thank you for sharing your research in the issue!!
I had the similar issue like sth failed blindly.

Avinash-1394 pushed a commit to Avinash-1394/dbt-athena that referenced this issue Jan 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants