
Releases: astronomer/astro-sdk

1.1.0b2

16 Sep 23:45
Pre-release

Features

  • Add native autodetect schema feature (#780)
  • Allow users to disable auto addition of inlets/outlets via airflow.cfg (#858)

Improvements

  • Avoid loading whole file into memory with load_operator for schema detection (#805)
  • Directly pass the file to native library when native support is enabled (#802)

Bug fixes

  • Add compat module for typing execute context in operators (#770)
  • Fix sql injection issues (#807)
  • Stop generating Datasets for temp tables (#862, #871)
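Parameterized queries are the standard remedy for the class of issue fixed in #807; the sketch below is a generic sqlite3 illustration of why interpolating user input into SQL is dangerous, not the SDK's actual fix.

```python
import sqlite3

# Generic illustration of SQL injection (not the SDK's actual fix for #807).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, rating REAL)")
conn.execute("INSERT INTO movies VALUES ('Spirited Away', 8.6)")

user_input = "Spirited Away' OR '1'='1"  # a typical injection payload

# Unsafe: interpolating user input into SQL lets the payload alter the query.
unsafe = conn.execute(
    f"SELECT rating FROM movies WHERE title = '{user_input}'"
).fetchall()

# Safe: placeholders send the input as data, never as SQL.
safe = conn.execute(
    "SELECT rating FROM movies WHERE title = ?", (user_input,)
).fetchall()

print(len(unsafe), len(safe))  # → 1 0 (the unsafe query matched every row)
```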

Docs

  • Update quick start example (#819)
  • Add links to docs from README (#832)
  • Fix Astro CLI doc link (#842)
  • Add docs on Dataset (AIP-48) support (#852)
  • Add configuration details from settings.py (#861)

1.1.0b1

10 Sep 00:48
Pre-release

Features

  • Add support for Redshift (#639, #753, #700)

  • Support for Datasets introduced in Airflow 2.4 (#786, #808)

    • inlets and outlets will be automatically set for all the operators.

    • Users can now schedule DAGs on File and Table objects. Example:

      from datetime import datetime

      from airflow import DAG
      from astro import sql as aql
      from astro.files import File
      from astro.table import Table

      input_file = File(
          path="https://raw.githubusercontent.com/astronomer/astro-sdk/main/tests/data/imdb_v2.csv"
      )
      imdb_movies_table = Table(name="imdb_movies", conn_id="sqlite_default")
      top_animations_table = Table(name="top_animation", conn_id="sqlite_default")
      START_DATE = datetime(2022, 9, 1)
      
      
      @aql.transform()
      def get_top_five_animations(input_table: Table):
          return """
              SELECT title, rating
              FROM {{input_table}}
              WHERE genre1='Animation'
              ORDER BY rating desc
              LIMIT 5;
          """
      
      
      with DAG(
          dag_id="example_dataset_producer",
          schedule=None,
          start_date=START_DATE,
          catchup=False,
      ) as load_dag:
          imdb_movies = aql.load_file(
              input_file=input_file,
              task_id="load_csv",
              output_table=imdb_movies_table,
          )
      
      with DAG(
          dag_id="example_dataset_consumer",
          schedule=[imdb_movies_table],
          start_date=START_DATE,
          catchup=False,
      ) as transform_dag:
          top_five_animations = get_top_five_animations(
              input_table=imdb_movies_table,
              output_table=top_animations_table,
          )
  • Dynamic Task Templates: Tasks that can be used with Dynamic Task Mapping (Airflow 2.3+)

    • Get list of files from a Bucket - get_file_list (#596)
    • Get list of values from a DB - get_value_list (#673)
  • Add an upstream_tasks parameter for declaring dependencies that are independent of data transfers (#585)
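Conceptually, a dynamic task template paired with Dynamic Task Mapping fans out one task instance per element returned by an upstream task. The plain-Python analogy below uses hypothetical stand-in functions (not Airflow's or the SDK's API) to show the data flow that `get_file_list` feeds into a mapped load task.

```python
# Plain-Python analogy of Dynamic Task Mapping. The function names are
# illustrative stand-ins, not the SDK's API.

def get_file_list(bucket_prefix):
    # Stand-in for the SDK's get_file_list task template, which lists
    # files in a bucket at runtime.
    return [f"{bucket_prefix}/part-{i}.csv" for i in range(3)]

def load_file(path):
    # Stand-in for a mapped load task operating on a single file.
    return f"loaded {path}"

# In Airflow, .expand() fans the upstream list out into parallel task
# instances; a plain list comprehension expresses the same data flow.
results = [load_file(p) for p in get_file_list("s3://my-bucket/data")]
print(results)
```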

Bug fixes

  • Add response_size to run_raw_sql and warn about db thrashing (#815)

Docs

  • Add section explaining table metadata (#774)
  • Fix docstring for run_raw_sql (#817)
  • Add missing docs for Table class (#788)
  • Add the readme.md example dag to example dags folder (#681)
  • Add reason for enabling XCOM pickling (#747)

1.0.2

24 Aug 17:49

Bug fixes

  • Skip folders while processing paths in the load_file operator when a file pattern is passed. #733

Misc

  • Limit Google Protobuf for compatibility with bigquery client. #742

1.0.1

23 Aug 19:54

Bug fixes

  • Added a check to create a table only when if_exists is replace in aql.load_file for Snowflake. #729
  • Fix the file type for NDJSON files in the data transfer job from AWS S3 to Google BigQuery. #724
  • Create a new version of imdb.csv with lowercase column names and update the examples to use it, so this change is backwards-compatible. #721, #727
  • Skip folders while processing paths in the load_file operator when a file pattern is passed. #733

Docs

  • Updated the benchmark docs for aql.load_file from GCS to Snowflake and from S3 to Snowflake. #712, #707

  • Restructured the documentation in pyproject.toml, quickstart, readthedocs and README.md #698, #704, #706

  • Make astro-sdk-python compatible with the major version of Google Providers. #703

Misc

  • Consolidate the documentation requirements for sphinx. #699
  • Add CI/CD triggers on release branches with dependency on tests. #672

cc: @kaxil @tatiana @dimberman @utkarsharma2 @sunank200 @pankajastro @pankajkoti @vikramkoka

1.0.0

18 Aug 19:24

Summary

Features

  • Improved the performance of aql.load_file by supporting database-specific (native) load methods.
    This is now the default behaviour. Previously, the Astro SDK Python always used Pandas to load files to
    SQL databases, which routed the data through the worker node and slowed performance. (#557, #481)

    Introduced new arguments to aql.load_file:

    • use_native_support: use native data transfer if available on the destination (defaults to use_native_support=True)
    • native_support_kwargs: keyword arguments passed to the method used in the native support flow.
    • enable_native_fallback: fall back to the default transfer if the native one fails (defaults to enable_native_fallback=True).

    Now, there are three modes:

    • Native: the default; uses a BigQuery Load Job in the
      case of BigQuery, and Snowflake COPY INTO
      using an external stage in the case of Snowflake.
    • Pandas: how datasets were previously loaded. To enable this mode, use the argument
      use_native_support=False in aql.load_file.
    • Hybrid: attempts the native strategy first to load a file to the database and, if the native strategy
      fails, falls back to Pandas with relevant log warnings. #557
  • Allow users to specify the table schema (column types) into which a file is loaded by using table.columns.
    If this table attribute is not set, the Astro SDK still tries to infer the schema by using Pandas
    (the previous behaviour). #532

  • Add an example DAG for Dynamic Task Mapping with the Astro SDK.
    #377, airflow-2.3.0
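The hybrid mode described above can be sketched generically: try the native load path first and fall back to Pandas with a warning. The function names below are hypothetical; this is not the SDK's internal implementation.

```python
import logging

logger = logging.getLogger(__name__)

def load_with_fallback(load_native, load_pandas, enable_native_fallback=True):
    """Generic sketch of the hybrid mode (hypothetical, not SDK internals)."""
    try:
        return load_native()
    except Exception as exc:
        if not enable_native_fallback:
            raise
        # Mirrors the "relevant log warnings" behaviour described above.
        logger.warning("Native load failed (%s); falling back to Pandas.", exc)
        return load_pandas()

# Usage: a native loader that fails triggers the Pandas path.
def failing_native():
    raise RuntimeError("missing native pre-requirements")

result = load_with_fallback(failing_native, lambda: "loaded via pandas")
print(result)  # → loaded via pandas
```

Setting `enable_native_fallback=False` instead re-raises the native error, matching the argument's documented default of `True`.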

Breaking Change

  • The aql.dataframe argument identifiers_as_lower (a boolean defaulting to False)
    was replaced by the argument columns_names_capitalization (a string with possible values
    ["upper", "lower", "original"]; the default is "lower"). #564
  • Previously, aql.load_file changed the capitalization of all column titles to uppercase by default;
    it now makes them lowercase by default. The old behaviour can be restored with the argument
    columns_names_capitalization="upper". #564
  • aql.load_file attempts to load files to BigQuery and Snowflake by using native methods, which may have
    pre-requirements to work. To disable this mode, use the argument use_native_support=False in aql.load_file.
    #557, #481
  • aql.dataframe will raise an exception if the default Airflow XCom backend is being used.
    To solve this, either use an external XCom backend, such as S3 or GCS,
    or set the configuration AIRFLOW__ASTRO_SDK__DATAFRAME_ALLOW_UNSAFE_STORAGE=True. #444
  • Change the declaration for the default Astro SDK temporary schema from using AIRFLOW__ASTRO__SQL_SCHEMA
    to AIRFLOW__ASTRO_SDK__SQL_SCHEMA #503
  • Renamed aql.truncate to aql.drop_table #554
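The columns_names_capitalization change above can be illustrated with plain strings; this is an illustrative sketch of the three modes, not the SDK's implementation.

```python
# Illustrative sketch of the columns_names_capitalization modes
# (plain strings, not SDK code).

def apply_capitalization(columns, mode="lower"):
    if mode == "lower":
        return [c.lower() for c in columns]
    if mode == "upper":
        return [c.upper() for c in columns]
    if mode == "original":
        return list(columns)
    raise ValueError(f"mode must be 'upper', 'lower' or 'original', got {mode!r}")

cols = ["Title", "Rating"]
print(apply_capitalization(cols))           # new default → ['title', 'rating']
print(apply_capitalization(cols, "upper"))  # old behaviour → ['TITLE', 'RATING']
```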

Bug fixes

  • Add missing Airflow task terminal states to CleanupOperator #525
  • Allow chaining aql.drop_table (previously truncate) tasks using the Task Flow API syntax. #554, #515

Enhancements

  • Improved the performance of aql.load_file for the following transfers:
    • From AWS S3 to Google BigQuery up to 94%. #429, #568
    • From Google Cloud Storage to Google BigQuery up to 93%. #429, #562
    • From AWS S3/Google Cloud Storage to Snowflake up to 76%. #430, #544
    • From GCS to Postgres in K8s up to 93%. #428, #531
  • Get configurations via Airflow Configuration manager. #503
  • Catch ValueError and AttributeError and re-raise them as DatabaseCustomError #595
  • Unpin pandas upperbound dependency #620
  • Remove markupsafe from dependencies #623
  • Added extend_existing to Sqla Table object #626
  • Move config to store DF in XCom to settings file #537
  • Make the operator names consistent #634
  • Use exc_info for exception logging #643
  • Update query for getting bigquery table schema #661
  • Use lazy evaluated Type Annotations from PEP 563 #650
  • Provide Google Cloud Credentials env var for bigquery #679
  • Handle breaking changes for Snowflake provider versions 3.2.0 and 3.1.0 #686
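The PEP 563 change above (#650) defers evaluation of annotations to strings, so forward references and optionally-imported types don't fail at import time. A minimal sketch of the pattern (the class name is illustrative):

```python
# PEP 563: annotations are stored as strings and only evaluated when
# introspected, so "Table" can be referenced before the class is complete.
from __future__ import annotations

class Table:
    def merge(self, other: Table) -> Table:
        # Forward-references its own class in the annotations above.
        return other

t1, t2 = Table(), Table()
assert t1.merge(t2) is t2
print(Table.merge.__annotations__["other"])  # → Table (an unevaluated string)
```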


0.11.1

17 Aug 11:27
dc989ce

Bug fix:

  • Pass operator kwargs to dataframe decorator #630

1.0.0b1

27 Jul 10:48
3cd86da

Feature:

  • Improved the performance of aql.load_file by supporting database-specific (native) load methods. This is now the default behaviour. Previously, the Astro SDK Python always used Pandas to load files to SQL databases, which routed the data through the worker node and slowed performance. #557, #481

    Introduced new arguments to aql.load_file:

    • use_native_support: use native data transfer if available on the destination (defaults to use_native_support=True)
    • native_support_kwargs: keyword arguments passed to the method used in the native support flow.
    • enable_native_fallback: fall back to the default transfer if the native one fails (defaults to enable_native_fallback=True).

    Now, there are three modes:

    • Native: the default; uses a BigQuery Load Job in the case of BigQuery, and Snowflake COPY INTO using an external stage in the case of Snowflake.
    • Pandas: how datasets were previously loaded. To enable this mode, use the argument use_native_support=False in aql.load_file.
    • Hybrid: attempts the native strategy first to load a file to the database and, if the native strategy fails, falls back to Pandas with relevant log warnings.
  • Allow users to specify the table schema (column types) into which a file is loaded by using table.columns. If this table attribute is not set, the Astro SDK still tries to infer the schema by using Pandas (the previous behaviour). #532

  • Implement a fallback mechanism: if native support fails, fall back to the default option with a log warning describing the native-support problem. #557

  • Add an example DAG for Dynamic Task Mapping with the Astro SDK. #377, airflow-2.3.0

Community:

  • Allow running tests on PRs from forks + label #179

Breaking Change:

  • The aql.dataframe argument identifiers_as_lower (a boolean defaulting to False) was replaced by the argument columns_names_capitalization (a string with possible values ["upper", "lower", "original"]; the default is "lower"). #564
  • Previously, aql.load_file changed the capitalization of all column titles to uppercase by default; it now makes them lowercase by default. The old behaviour can be restored with the argument columns_names_capitalization="upper". #564
  • aql.load_file attempts to load files to BigQuery and Snowflake by using native methods, which may have pre-requirements to work. To disable this mode, use the argument use_native_support=False in aql.load_file. #557, #481
  • aql.dataframe will raise an exception if the default Airflow XCom backend is being used. To solve this, either use an external XCom backend, such as S3 or GCS, or set the configuration AIRFLOW__ASTRO_SDK__DATAFRAME_ALLOW_UNSAFE_STORAGE=True. #444
  • Change the declaration for the default Astro SDK temporary schema from using AIRFLOW__ASTRO__SQL_SCHEMA to AIRFLOW__ASTRO_SDK__SQL_SCHEMA #503
  • Renamed aql.truncate to aql.drop_table #554

Bug fix:

  • Add missing Airflow task terminal states to CleanupOperator #525
  • Allow chaining aql.drop_table (previously truncate) tasks using the Task Flow API syntax. #554, #515

Enhancement:

  • Improved the performance of aql.load_file for files from AWS S3 to Google BigQuery up to 94%. #429, #568
  • Improved the performance of aql.load_file for files from Google Cloud Storage to Google BigQuery up to 93%. #429, #562
  • Improved the performance of aql.load_file for files from AWS S3/Google Cloud Storage to Snowflake up to 76%. #430, #544
  • Improved the performance of aql.load_file for files from GCS to Postgres in K8s up to 93%. #428, #531
  • Fix sphinx docs sidebar #472
  • Get configurations via Airflow Configuration manager. #503
  • Add CI job to check for dead links #526

@tatiana @kaxil @dimberman @utkarsharma2 @sunank200 @pankajastro @jlaneve @guohui-gao @mikeshwe @vikramkoka

0.11.0

05 Jul 10:39
be6280d

Feature:

  • Added Cleanup operator to clean temporary tables #187, #436

Internals:

  • Added a Pull Request template #205
  • Added sphinx documentation for readthedocs #276, #472

Enhancement:

  • Fail LoadFile operator when input_file does not exist #467
  • Create scripts to launch benchmark testing to Google cloud #432
  • Bump Google Provider for google extra #294

0.10.0

21 Jun 11:31
f9cd9c2

Feature:

  • Allow lists and tuples as column names in the Append & Merge operators #343, #435

Breaking Change:

  • The aql.merge interface changed: the argument merge_table was renamed to target_table; target_columns and merge_column were combined into the columns argument; merge_keys became target_conflict_columns; and conflict_strategy became if_conflicts. More details can be found at #422, #466
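The straightforward renames in the aql.merge change can be expressed as a small migration helper; this is an illustrative sketch for callers updating their code, not part of the SDK. (The combination of target_columns and merge_column into one columns argument needs manual attention and is not handled here.)

```python
# Illustrative migration helper for the aql.merge keyword renames
# (not part of the SDK; combining target_columns/merge_column into
# the new columns argument must be done by hand).
RENAMES = {
    "merge_table": "target_table",
    "merge_keys": "target_conflict_columns",
    "conflict_strategy": "if_conflicts",
}

def migrate_merge_kwargs(old_kwargs):
    return {RENAMES.get(key, key): value for key, value in old_kwargs.items()}

old = {"merge_table": "tbl", "merge_keys": ["id"], "conflict_strategy": "ignore"}
print(migrate_merge_kwargs(old))
# → {'target_table': 'tbl', 'target_conflict_columns': ['id'], 'if_conflicts': 'ignore'}
```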

Enhancement:

  • Document (new) load_file benchmark datasets #449
  • Made improvements to the benchmark scripts and configurations #458, #434, #461, #460, #437, #462
  • Performance evaluation for loading datasets with Astro Python SDK 0.9.2 into BigQuery #437

@tatiana @kaxil @utkarsharma2 @dimberman @sunank200 @mikeshwe @vikramkoka

0.9.2

13 Jun 14:18

Bug fix: