Releases: astronomer/astro-sdk
1.1.0b2
Features
- Add native autodetect schema feature (#780)
- Allow users to disable auto addition of inlets/outlets via airflow.cfg (#858)
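A minimal `airflow.cfg` sketch of the new toggle. The section and option names below are assumptions for illustration only; check the SDK configuration reference for the exact names:

```ini
# Hypothetical example -- the option name is an assumption, not confirmed by these notes
[astro_sdk]
auto_add_inlets_outlets = False
```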
Improvements
- Avoid loading whole file into memory with load_operator for schema detection (#805)
- Directly pass the file to native library when native support is enabled (#802)
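The schema-detection improvement above can be pictured with a small stdlib sketch that samples only the first rows of a file instead of reading it whole. `sample_rows` is a hypothetical helper for illustration, not the SDK's implementation:

```python
import csv
import io

def sample_rows(fileobj, n=100):
    """Read the header and at most n data rows for schema inference,
    without loading the whole file into memory."""
    reader = csv.reader(fileobj)
    header = next(reader)
    rows = [row for _, row in zip(range(n), reader)]
    return header, rows

data = io.StringIO("title,rating\nUp,8.3\nCoco,8.4\n")
header, rows = sample_rows(data, n=1)
```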
Bug fixes
- Add a compat module for typing the `execute` `context` in operators (#770)
- Fix SQL injection issues (#807)
- Stop generating Datasets for temp tables (#862)(#871)
Docs
1.1.0b1
Features
- Support for Datasets introduced in Airflow 2.4 (#786, #808)
- `inlets` and `outlets` will be automatically set for all the operators.
- Users can now schedule DAGs on `File` and `Table` objects. Example:

```python
input_file = File(path="https://raw.githubusercontent.com/astronomer/astro-sdk/main/tests/data/imdb_v2.csv")
imdb_movies_table = Table(name="imdb_movies", conn_id="sqlite_default")
top_animations_table = Table(name="top_animation", conn_id="sqlite_default")
START_DATE = datetime(2022, 9, 1)


@aql.transform()
def get_top_five_animations(input_table: Table):
    return """
        SELECT title, rating
        FROM {{input_table}}
        WHERE genre1='Animation'
        ORDER BY rating desc
        LIMIT 5;
    """


with DAG(
    dag_id="example_dataset_producer",
    schedule=None,
    start_date=START_DATE,
    catchup=False,
) as load_dag:
    imdb_movies = aql.load_file(
        input_file=input_file,
        task_id="load_csv",
        output_table=imdb_movies_table,
    )

with DAG(
    dag_id="example_dataset_consumer",
    schedule=[imdb_movies_table],
    start_date=START_DATE,
    catchup=False,
) as transform_dag:
    top_five_animations = get_top_five_animations(
        input_table=imdb_movies_table,
        output_table=top_animations_table,
    )
```

- Dynamic Task Templates: Tasks that can be used with Dynamic Task Mapping (Airflow 2.3+)
- Create `upstream_tasks` parameter for dependencies independent of data transfers (#585)
Bug fixes
- Add `response_size` to `run_raw_sql` and warn about DB thrashing (#815)
Docs
1.0.2
1.0.1
Bug fixes
- Added a check to create the table only when `if_exists` is `replace` in `aql.load_file` for Snowflake. #729
- Fix the file type for NDJSON files in the data transfer job from AWS S3 to Google BigQuery. #724
- Create a new version of imdb.csv with lowercase column names and update the examples to use it, so this change is backwards-compatible. #721, #727
- Skip folders while processing paths in the `load_file` operator when a file pattern is passed. #733
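The folder-skipping fix above can be illustrated with a small stdlib sketch; `matching_files` is a hypothetical helper (folders are marked by a trailing slash here), not the SDK's code:

```python
import fnmatch
from pathlib import PurePosixPath

def matching_files(paths, pattern):
    """Keep only file paths whose basename matches the pattern,
    skipping folder entries (marked by a trailing slash)."""
    return [
        p for p in paths
        if not p.endswith("/") and fnmatch.fnmatch(PurePosixPath(p).name, pattern)
    ]

paths = ["data/", "data/a.csv", "data/b.json", "data/sub/"]
selected = matching_files(paths, "*.csv")
```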
Docs
- Updated the benchmark docs for the GCS to Snowflake and S3 to Snowflake transfers of `aql.load_file`. #712, #707
- Restructured the documentation in `project.toml`, the quickstart, readthedocs and README.md. #698, #704, #706
- Make astro-sdk-python compatible with the major version of Google Providers. #703
Misc
- Consolidate the documentation requirements for sphinx. #699
- Add CI/CD triggers on release branches with dependency on tests. #672
cc: @kaxil @tatiana @dimberman @utkarsharma2 @sunank200 @pankajastro @pankajkoti @vikramkoka
1.0.0
Summary
Features
- Improved the performance of `aql.load_file` by supporting database-specific (native) load methods. This is now the default behaviour. Previously, the Astro SDK Python would always use Pandas to load files to SQL databases, which passed the data through the worker node and slowed performance. (#557, #481)

  Introduced new arguments to `aql.load_file`:
  - `use_native_support`: use native data transfer if available on the destination (defaults to `use_native_support=True`)
  - `native_support_kwargs`: keyword arguments to be used by the method involved in the native support flow.
  - `enable_native_fallback`: can be used to fall back to the default transfer (defaults to `enable_native_fallback=True`).

  There are now three modes:
  - Native: the default; uses a BigQuery Load Job in the case of BigQuery, and Snowflake `COPY INTO` using an external stage in the case of Snowflake.
  - Pandas: how datasets were previously loaded. To enable this mode, use the argument `use_native_support=False` in `aql.load_file`.
  - Hybrid: attempts to use the native strategy to load a file to the database and, if the native strategy fails, falls back to Pandas with relevant log warnings. #557
- Allow users to specify the table schema (column types) into which a file is being loaded by using `table.columns`. If this table attribute is not set, the Astro SDK still tries to infer the schema by using Pandas (the previous behaviour). #532
- Add an example DAG for Dynamic Task Mapping with the Astro SDK. #377, airflow-2.3.0
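The three load modes and the fallback behaviour can be sketched in plain Python. `native_load` and `pandas_load` below are hypothetical stand-ins, not SDK functions:

```python
import logging

def native_load(path):
    # Stand-in for a database-specific load (e.g. a BigQuery Load Job).
    raise RuntimeError("native path unavailable in this sketch")

def pandas_load(path):
    # Stand-in for the previous Pandas-based load.
    return f"loaded {path} via pandas"

def load_file(path, use_native_support=True, enable_native_fallback=True):
    if not use_native_support:
        return pandas_load(path)          # "Pandas" mode
    try:
        return native_load(path)          # "Native" mode
    except Exception:
        if not enable_native_fallback:
            raise
        logging.warning("native load failed, falling back to Pandas")
        return pandas_load(path)          # "Hybrid" fallback

result = load_file("s3://bucket/data.csv")
```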
Breaking Change
- The `aql.dataframe` argument `identifiers_as_lower` (which was `boolean`, with the default set to `False`) was replaced by the argument `columns_names_capitalization` (`string`, with possible values `["upper", "lower", "original"]`; the default is `"lower"`). #564
- `aql.load_file` previously changed the capitalization of all column titles to uppercase by default; it now makes them lowercase by default. The old behaviour can be achieved by using the argument `columns_names_capitalization="upper"`. #564
- `aql.load_file` attempts to load files to BigQuery and Snowflake by using native methods, which may have prerequisites to work. To disable this mode, use the argument `use_native_support=False` in `aql.load_file`. #557, #481
- `aql.dataframe` will raise an exception if the default Airflow XCom backend is being used. To solve this, either use an external XCom backend, such as S3 or GCS, or set the configuration `AIRFLOW__ASTRO_SDK__DATAFRAME_ALLOW_UNSAFE_STORAGE=True`. #444
- Changed the declaration for the default Astro SDK temporary schema from `AIRFLOW__ASTRO__SQL_SCHEMA` to `AIRFLOW__ASTRO_SDK__SQL_SCHEMA`. #503
- Renamed `aql.truncate` to `aql.drop_table`. #554
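The `columns_names_capitalization` behaviour can be pictured with a small pure-Python sketch; `apply_capitalization` is a hypothetical helper, not the SDK's implementation:

```python
def apply_capitalization(columns, mode="lower"):
    """Mimic the new default: lowercase column names unless told otherwise."""
    if mode == "lower":
        return [c.lower() for c in columns]
    if mode == "upper":
        return [c.upper() for c in columns]
    return list(columns)  # "original": leave names untouched

cols = ["Title", "Rating"]
```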
Bug fixes
- Add missing Airflow task terminal states to `CleanupOperator`. #525
- Allow chaining `aql.drop_table` (previously `truncate`) tasks using the Task Flow API syntax. #554, #515
Enhancements
- Improved the performance of `aql.load_file` for the file transfers detailed in the 1.0.0b1 notes below.
- Get configurations via the Airflow Configuration manager. #503
- Changed catching `ValueError` and `AttributeError` to `DatabaseCustomError`. #595
- Unpin the pandas upper-bound dependency. #620
- Remove markupsafe from dependencies. #623
- Added `extend_existing` to the SQLAlchemy `Table` object. #626
- Move the config to store dataframes in XCom to the settings file. #537
- Make the operator names consistent. #634
- Use `exc_info` for exception logging. #643
- Update the query for getting the BigQuery table schema. #661
- Use lazily evaluated type annotations from PEP 563. #650
- Provide the Google Cloud credentials env var for BigQuery. #679
- Handle breaking changes for Snowflake provider versions 3.2.0 and 3.1.0. #686
What's Changed (Full Changelog)
- Get configs via Airflow Configuration manager by @utkarsharma2 in #505
- Load files from GCS to Bigquery using BigqueryHook by @utkarsharma2 in #489
- Fix benchmark permissions by @tatiana in #513
- Fix benchmark to work without table metadata by @tatiana in #514
- Add performance result to load_file to snowflake using Python SDK 0.11.0 by @sunank200 in #480
- Fix docs sidebar by @dimberman in #517
- Benchmark postgres by @dimberman in #510
- Rename the `BaseSQLOperator` class by @dimberman in #518
- Read default config if airflow's isn't defined by @utkarsharma2 in #520
- Add Snowflake stage methods by @tatiana in #523
- Fix broken link to tutorial in README by @jlaneve in #526
- Simplify debugging issues when building docs by @tatiana in #527
- create a possible solution to users passing large dataframes between … by @dimberman in #522
- Use native path for S3 to Bigquery in load_file operator by @utkarsharma2 in #519
- Update links in Contribution Guidelines section in README by @josh-fell in #536
- Add CI job to check for dead links by @kaxil in #528
- Adjust storage integration so it is consistent for AWS and GCP by @tatiana in #539
- improve docs to @kaxil and @tatiana's comments by @dimberman in #521
- Refactor table creation in load_file by @tatiana in #538
- Add missing task terminal states to CleanUp Operator by @utkarsharma2 in #540
- Optimize postgres performance by @dimberman in #531
- Benchmark reporting to expose the information from GCS to Markdown by @sunank200 in #547
- Refactor db.load_file_to_table, make json config optional by @tatiana in #549
- Allow running tests on PRs from forks + label by @kaxil in #546
- Add native path from local to bigquery by @utkarsharma2 in #535
- Fix MyPY issue of 'path' and 'conn_id' property of Class File by @utkarsharma2 in #545
- Improve (Sphinx) gitignore by @tatiana in #548
- Handle nrows for export_to_dataframe() by @utkarsharma2 in #559
- Add benchmarking result from GCS to Bigquery after optimization by @sunank200 in #563
- Optimize Snowflake load_file using native COPY INTO by @tatiana in #544
- Fix DistutilsOptionError #570 by @tatiana in #571
- Add benchmarking results for S3 to Bigquery transfer by @utkarsharma2 in #568
- ...
0.11.1
1.0.0b1
Feature:
- Improved the performance of `aql.load_file` by supporting database-specific (native) load methods. This is now the default behaviour. Previously, the Astro SDK Python would always use Pandas to load files to SQL databases, which passed the data through the worker node and slowed performance. #557, #481

  Introduced new arguments to `aql.load_file`:
  - `use_native_support`: use native data transfer if available on the destination (defaults to `use_native_support=True`)
  - `native_support_kwargs`: keyword arguments to be used by the method involved in the native support flow.
  - `enable_native_fallback`: can be used to fall back to the default transfer (defaults to `enable_native_fallback=True`).

  There are now three modes:
  - Native: the default; uses a BigQuery Load Job in the case of BigQuery, and Snowflake `COPY INTO` using an external stage in the case of Snowflake.
  - Pandas: how datasets were previously loaded. To enable this mode, use the argument `use_native_support=False` in `aql.load_file`.
  - Hybrid: attempts to use the native strategy to load a file to the database and, if the native strategy fails, falls back to Pandas with relevant log warnings.
- Allow users to specify the table schema (column types) into which a file is being loaded by using `table.columns`. If this table attribute is not set, the Astro SDK still tries to infer the schema by using Pandas (the previous behaviour). #532
- Implement a fallback mechanism in case native support fails, defaulting to the standard transfer with a log warning about the native-support problem. #557
- Add an example DAG for Dynamic Task Mapping with the Astro SDK. #377, airflow-2.3.0
Community:
- Allow running tests on PRs from forks + label #179
Breaking Change:
- The `aql.dataframe` argument `identifiers_as_lower` (which was `boolean`, with the default set to `False`) was replaced by the argument `columns_names_capitalization` (`string`, with possible values `["upper", "lower", "original"]`; the default is `"lower"`). #564
- `aql.load_file` previously changed the capitalization of all column titles to uppercase by default; it now makes them lowercase by default. The old behaviour can be achieved by using the argument `columns_names_capitalization="upper"`. #564
- `aql.load_file` attempts to load files to BigQuery and Snowflake by using native methods, which may have prerequisites to work. To disable this mode, use the argument `use_native_support=False` in `aql.load_file`. #557, #481
- `aql.dataframe` will raise an exception if the default Airflow XCom backend is being used. To solve this, either use an external XCom backend, such as S3 or GCS, or set the configuration `AIRFLOW__ASTRO_SDK__DATAFRAME_ALLOW_UNSAFE_STORAGE=True`. #444
- Changed the declaration for the default Astro SDK temporary schema from `AIRFLOW__ASTRO__SQL_SCHEMA` to `AIRFLOW__ASTRO_SDK__SQL_SCHEMA`. #503
- Renamed `aql.truncate` to `aql.drop_table`. #554
Bug fix:
- Add missing Airflow task terminal states to `CleanupOperator`. #525
- Allow chaining `aql.drop_table` (previously `truncate`) tasks using the Task Flow API syntax. #554, #515
Enhancement:
- Improved the performance of `aql.load_file` for files from AWS S3 to Google BigQuery by up to 94%. #429, #568
- Improved the performance of `aql.load_file` for files from Google Cloud Storage to Google BigQuery by up to 93%. #429, #562
- Improved the performance of `aql.load_file` for files from AWS S3/Google Cloud Storage to Snowflake by up to 76%. #430, #544
- Improved the performance of `aql.load_file` for files from GCS to Postgres in K8s by up to 93%. #428, #531
- Fix the Sphinx docs sidebar. #472
- Get configurations via the Airflow Configuration manager. #503
- Add a CI job to check for dead links. #526
@tatiana @kaxil @dimberman @utkarsharma2 @sunank200 @pankajastro @jlaneve @guohui-gao @mikeshwe @vikramkoka
0.11.0
Feature:
Internals:
Enhancement:
0.10.0
Feature:
Breaking Change:
- The `aql.merge` interface changed: the argument `merge_table` changed to `target_table`; `target_columns` and `merge_column` were combined into the `columns` argument; `merge_keys` changed to `target_conflict_columns`; and `conflict_strategy` changed to `if_conflicts`. More details can be found at #422, #466.
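A before/after sketch of the renamed arguments. The argument names come from the note above; the table name and values are illustrative only:

```diff
 aql.merge(
-    merge_table=target_table,
-    target_columns=["id"],
-    merge_column=["id"],
-    merge_keys=["id"],
-    conflict_strategy="ignore",
+    target_table=target_table,
+    columns=["id"],
+    target_conflict_columns=["id"],
+    if_conflicts="ignore",
 )
```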
Enhancement:
- Document (new) load_file benchmark datasets #449
- Made improvements to the benchmark scripts and configurations. #458, #434, #461, #460, #437, #462
- Performance evaluation for loading datasets with Astro Python SDK 0.9.2 into BigQuery #437
@tatiana @kaxil @utkarsharma2 @dimberman @sunank200 @mikeshwe @vikramkoka