version to 0.3.6 - add seed_format and seed_mode
menuetb committed Nov 24, 2022
1 parent 90aace4 commit 70b8af7
Showing 5 changed files with 50 additions and 29 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,9 @@
 ## dbt-glue 1.0.0 (Release TBD)

+## v0.3.6
+- improvement of seed
+- add seed_format and seed_mode to configuration
+
 ## v0.3.5
 - Add error checking for HUDI incremental materializations
 - Specify location for tmp table
46 changes: 24 additions & 22 deletions README.md
@@ -208,28 +208,30 @@ location: "s3://dbt_demo_bucket/dbt_demo_data"
 The table below describes all the options.
-|Option | Description | Mandatory |
-|---|---|---|
-|project_name | The dbt project name. This must be the same as the one configured in the dbt project. |yes|
-|type | The driver to use. |yes|
-|query-comment | A string to inject as a comment in each query that dbt runs. |no|
-|role_arn | The ARN of the AWS Glue interactive session IAM role. |yes|
-|region | The AWS Region where you run the data pipeline. |yes|
-|workers | The number of workers of a defined workerType that are allocated when a job runs. |yes|
-|worker_type | The type of predefined worker that is allocated when a job runs. Accepts a value of Standard, G.1X, or G.2X. |yes|
-|schema | The schema used to organize data stored in Amazon S3. It is also the database in AWS Lake Formation that stores metadata tables in the Data Catalog. |yes|
-|session_provisioning_timeout_in_seconds | The timeout in seconds for AWS Glue interactive session provisioning. |yes|
-|location | The Amazon S3 location of your target data. |yes|
-|query_timeout_in_seconds | The timeout in seconds for a single query. Default is 300. |no|
-|idle_timeout | The AWS Glue session idle timeout in minutes. (The session stops after being idle for the specified amount of time) |no|
-|glue_version | The version of AWS Glue for this session to use. Currently, the only valid options are 2.0 and 3.0. The default value is 3.0. |no|
-|security_configuration | The security configuration to use with this session. |no|
-|connections | A comma-separated list of connections to use in the session. |no|
-|conf | Specific configuration used at the startup of the Glue Interactive Session (arg --conf) |no|
-|extra_py_files | Extra Python libraries that can be used by the interactive session. |no|
-|delta_athena_prefix | A prefix used to create Athena compatible tables for Delta tables (if not specified, then no Athena compatible table will be created) |no|
-|tags | The map of key value pairs (tags) belonging to the session. Ex: `KeyName1=Value1,KeyName2=Value2` |no|
-|default_arguments | The map of key value pairs parameters belonging to the session. More information on [Job parameters used by AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html). Ex: `--enable-continuous-cloudwatch-log=true,--enable-continuous-log-filter=true` |no|
+| Option | Description | Mandatory |
+|---|---|---|
+| project_name | The dbt project name. This must be the same as the one configured in the dbt project. |yes|
+| type | The driver to use. |yes|
+| query-comment | A string to inject as a comment in each query that dbt runs. |no|
+| role_arn | The ARN of the AWS Glue interactive session IAM role. |yes|
+| region | The AWS Region where you run the data pipeline. |yes|
+| workers | The number of workers of a defined workerType that are allocated when a job runs. |yes|
+| worker_type | The type of predefined worker that is allocated when a job runs. Accepts a value of Standard, G.1X, or G.2X. |yes|
+| schema | The schema used to organize data stored in Amazon S3. It is also the database in AWS Lake Formation that stores metadata tables in the Data Catalog. |yes|
+| session_provisioning_timeout_in_seconds | The timeout in seconds for AWS Glue interactive session provisioning. |yes|
+| location | The Amazon S3 location of your target data. |yes|
+| query_timeout_in_seconds | The timeout in seconds for a single query. Default is 300. |no|
+| idle_timeout | The AWS Glue session idle timeout in minutes. (The session stops after being idle for the specified amount of time) |no|
+| glue_version | The version of AWS Glue for this session to use. Currently, the only valid options are 2.0 and 3.0. The default value is 3.0. |no|
+| security_configuration | The security configuration to use with this session. |no|
+| connections | A comma-separated list of connections to use in the session. |no|
+| conf | Specific configuration used at the startup of the Glue Interactive Session (arg --conf) |no|
+| extra_py_files | Extra Python libraries that can be used by the interactive session. |no|
+| delta_athena_prefix | A prefix used to create Athena compatible tables for Delta tables (if not specified, then no Athena compatible table will be created) |no|
+| tags | The map of key value pairs (tags) belonging to the session. Ex: `KeyName1=Value1,KeyName2=Value2` |no|
+| seed_format | The format used to write seed data. Defaults to `parquet`; any Spark-compatible format such as `csv` or `json` can be used. |no|
+| seed_mode | How seed data is written. Defaults to `overwrite`, which replaces the existing seed data; set it to `append` to add new data to the existing dataset. |no|
+| default_arguments | The map of key value pairs parameters belonging to the session. More information on [Job parameters used by AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html). Ex: `--enable-continuous-cloudwatch-log=true,--enable-continuous-log-filter=true` |no|

 ## Configs

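The two new README rows document a default and a small set of values for each option. A minimal sketch of how those documented defaults could be resolved is shown below. The helper `resolve_seed_options` and the constant `ALLOWED_SEED_MODES` are hypothetical names used only for illustration; the commit itself adds no such validation, and restricting `seed_mode` to the two values named in the README is an assumption.

```python
from typing import Optional, Tuple

# Assumption: only the two modes named in the README row are accepted.
ALLOWED_SEED_MODES = ("overwrite", "append")


def resolve_seed_options(seed_format: Optional[str] = None,
                         seed_mode: Optional[str] = None) -> Tuple[str, str]:
    """Return (seed_format, seed_mode), filling in the documented defaults."""
    resolved_format = seed_format or "parquet"   # README default
    resolved_mode = seed_mode or "overwrite"     # README default
    if resolved_mode not in ALLOWED_SEED_MODES:
        raise ValueError(
            f"seed_mode must be one of {ALLOWED_SEED_MODES}, got {resolved_mode!r}"
        )
    return resolved_format, resolved_mode
```

Calling `resolve_seed_options()` with no arguments yields the README defaults, while `resolve_seed_options("csv", "append")` passes both values through unchanged.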
4 changes: 4 additions & 0 deletions dbt/adapters/glue/credentials.py
@@ -24,6 +24,8 @@ class GlueCredentials(Credentials):
     delta_athena_prefix: Optional[str] = None
     tags: Optional[str] = None
     database: Optional[str]  # type: ignore
+    seed_format: Optional[str] = "parquet"
+    seed_mode: Optional[str] = "overwrite"
     default_arguments: Optional[str] = None

     @property
@@ -73,5 +75,7 @@ def _connection_keys(self):
             'extra_py_files',
             'delta_athena_prefix',
             'tags',
+            'seed_format',
+            'seed_mode',
             'default_arguments'
         ]
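The two hunks above add the fields with defaults and then list them in `_connection_keys`. A minimal sketch of that pattern follows, using a plain dataclass as a stand-in. `SeedCredentialsSketch` and its `connection_info` method are hypothetical and only mimic the idea; they are not dbt's `Credentials` base class or the adapter's `GlueCredentials`.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple


@dataclass
class SeedCredentialsSketch:
    """Stand-in for the credentials class; only the fields relevant here."""
    location: str
    seed_format: Optional[str] = "parquet"   # same default as in the diff
    seed_mode: Optional[str] = "overwrite"   # same default as in the diff

    def _connection_keys(self) -> Tuple[str, ...]:
        # Keys that should be reported alongside the connection.
        return ("location", "seed_format", "seed_mode")

    def connection_info(self) -> Iterator[Tuple[str, object]]:
        # Hypothetical helper: yield (key, value) for every listed key.
        for key in self._connection_keys():
            yield key, getattr(self, key)
```

A profile that sets only `seed_mode` keeps the default format: `SeedCredentialsSketch(location="s3://bucket/prefix", seed_mode="append")` reports `seed_format` as `parquet` and `seed_mode` as `append`.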
23 changes: 17 additions & 6 deletions dbt/adapters/glue/impl.py
@@ -394,17 +394,28 @@ def create_csv_table(self, model, agate_table):
         logger.debug(model)
         f = io.StringIO("")
         agate_table.to_json(f)
+        if session.credentials.seed_mode == "overwrite":
+            mode = "True"
+        else:
+            mode = "False"
+
         code = f'''
 custom_glue_code_for_dbt_adapter
 csv = {f.getvalue()}
 df = spark.createDataFrame(csv)
 table_name = '{model["schema"]}.{model["name"]}'
-writer = (
-    df.write.mode("append")
-    .format("parquet")
-    .option("path", "{session.credentials.location}/{model["schema"]}/{model["name"]}")
-)
-writer.saveAsTable(table_name, mode="append")
+try:
+    # if the table exists, add data
+    df.write\
+        .mode("{session.credentials.seed_mode}")\
+        .format("{session.credentials.seed_format}")\
+        .insertInto(table_name, overwrite={mode})
+except:
+    # create a table and add data
+    df.write\
+        .option("path", "{session.credentials.location}/{model["schema"]}/{model["name"]}")\
+        .format("{session.credentials.seed_format}")\
+        .saveAsTable(table_name)
 SqlWrapper2.execute("""select * from {model["schema"]}.{model["name"]} limit 1""")
 '''
         try:
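The hunk above turns `seed_mode` into the `overwrite` argument of `insertInto` and falls back to `saveAsTable` when the insert fails. The sketch below isolates that templating step so it can be inspected without a Spark session. `overwrite_flag` and `render_seed_write` are hypothetical helpers, not functions in the adapter, and the sketch deliberately uses `except Exception:` where the commit uses a bare `except:`.

```python
def overwrite_flag(seed_mode: str) -> bool:
    """Map seed_mode onto the boolean passed as insertInto's overwrite argument."""
    return seed_mode == "overwrite"


def render_seed_write(schema: str, name: str, location: str,
                      seed_format: str = "parquet",
                      seed_mode: str = "overwrite") -> str:
    """Render the PySpark snippet that writes a seed DataFrame called df."""
    table_name = f"{schema}.{name}"
    path = f"{location}/{schema}/{name}"
    return f'''
table_name = '{table_name}'
try:
    # Table already exists: insert using the configured mode.
    df.write.mode("{seed_mode}").format("{seed_format}").insertInto(
        table_name, overwrite={overwrite_flag(seed_mode)})
except Exception:
    # Insert failed (for example, the table is missing): create it at the seed path.
    df.write.option("path", "{path}").format("{seed_format}").saveAsTable(table_name)
'''


snippet = render_seed_write("dbt_demo", "customers", "s3://bucket/data",
                            seed_format="csv", seed_mode="append")
print(snippet)
```

Any value other than `overwrite` renders `overwrite=False`, which matches the `if`/`else` added in the commit. Using a real boolean rather than the strings `"True"` and `"False"` is a design choice of this sketch; both produce the same text once interpolated into the template.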
2 changes: 1 addition & 1 deletion setup.py
@@ -28,7 +28,7 @@
     long_description = f.read()

 package_name = "dbt-glue"
-package_version = "0.3.5"
+package_version = "0.3.6"
 dbt_version = "1.3.0"
 description = """dbt (data build tool) adapter for Aws Glue"""
 setup(