Releases: datahub-project/datahub

DataHub v0.12.1

08 Dec 23:44
159a013

Release Highlights

New Features

SQLAlchemy Source Enhancements: Support for view lineage across all SQLAlchemy sources (PR #9039).
Airflow Integration: Retry callback and support for ExternalTaskSensor subclasses added (PR #8514).
Kafka Enhancements: Increased Kafka message size and enabled compression (PR #9038).
JSONSchema Ingestion: Enabled schema-aware JsonSchemaTranslator (PR #8971).
Search Bar Improvements: Added a flag to hide/display the autocomplete query (PR #9104).
SQL Parser Performance: Enhancements and asyncio fixes (PR #9119).
MongoDB Ingestion: Support for stateful ingestion and improved schema inference for lists (PR #9118, PR #9145).
Policy Engine Updates: Refactoring and enhancements, including support for 10k+ policies (PR #9163, PR #9177).
UI Enhancements: Numerous improvements including command-k icons in the search bar, updated Apollo cache, and auto-complete debounce in the search bar (PR #9194, PR #9193, PR #9205).
Fivetran Integration: Connector integration for Fivetran (PR #9018).
Neo4j Database Support: Connection to specific Neo4j databases now supported (PR #9179).
Chart Subtypes in UI: Support for chart subtypes (PR #9186).

Fixes and Improvements

BigQuery Fixes: Resolved issues with lineage filter query, and fixed extracting comments from complex types (PR #9114, PR #8950).
MongoDB Refactoring: Platform instance addition to MongoDB (PR #8663).
Kafka Setup: Adjusted truststore settings for PEM files (PR #8656).
REST API Authorization: Fixed rollback failure when authorization is enabled (PR #9092).
Java Exception Handling: Addressed java.util.ConcurrentModificationException (PR #9090).
UI and Documentation: Fixed filtering logic in UI, corrected documentation errors, and added feature guides (PR #9116, PR #9125, PR #9124, PR #9126, PR #9134, PR #9137, PR #9122, PR #9068).
SQL Server and Snowflake Ingestion: Updated queries and fixed missing downstream column-level lineage (CLL) for views (PR #9127, PR #8966).
ClickHouse and DB2 Ingestion: Addressed column reflection regression and table properties handling (PR #9143, PR #9128).
Ingestion Improvements: Numerous fixes and enhancements across various ingestion sources (PR #9153, PR #9155, PR #9141, PR #9157, PR #9123).
CI and Build Process: Tweaked workflows, increased gradle retries, and addressed CI errors (PR #9052, PR #9091, PR #9160).
Security Updates: Addressed a zookeeper CVE and other security concerns (PR #9190).
UI Refactoring: Improved entity page loading indicators and renamed button texts (PR #9195, PR #9196).
Policy and Auth Enhancements: Refactored policy locking and added roles to policy engine validation logic (PR #9178).

Testing and Continuous Integration

API Testing: Added tests for the managing secrets and managing access token privileges, and fixed flaky tests (PR #9121, PR #9167, PR #9132, PR #9175).
Cypress Test Fixes: Addressed glossary navigation and download_lineage_results tests (PR #9175, PR #9132).

Cleanup and Refactoring

Ingestion Cleanup: Removed legacy memory_leak_detector and refactored ingestion sources (PR #9158, PR #9135, PR #9120, PR #9105).
PDL Refactoring: Refactored Assertion model enums (PR #9191).

Build and Deployment

Release Preparation: Updated files for the 0.12.0 release (PR #9130).

What's Changed

  • feat(ingest): support view lineage for all sqlalchemy sources by @mayurinehate in #9039
  • fix(ingest/bigquery): Fixing lineage filter query by @treff7es in #9114
  • refactor(ingestion/mongodb): Add platform_instance to mongodb by @nicholas-fwang in #8663
  • fix(kafka-setup): Don't set truststore pass for PEM files by @mmmeeedddsss in #8656
  • fix(ingest): Fix roll back failure when REST_API_AUTHORIZATION_ENABLED is set to true by @TonyOuyangGit in #9092
  • (fix): Avoid java.util.ConcurrentModificationException by @rtekal in #9090
  • Fix(ingest/bigquery): fix extracting comments from complex types by @maaaikoool in #8950
  • docs: add versions 0.12.0 by @yoonhyejin in #9125
  • fix(ui) Fix filtering logic for everwhere generating OR filters by @chriscollins3456 in #9116
  • build(release): Update files for 0.12.0 release by @pedro93 in #9130
  • fix(ingest/sql-server): update queries to use escaped procedure name by @mayurinehate in #9127
  • feat(airflow): retry callback, support ExternalTaskSensor subclasses by @richenc in #8514
  • docs: fix saasonly flags for some pages by @yoonhyejin in #9124
  • fix(ingest/snowflake): missing view downstream cll if platform instance is set by @mayurinehate in #8966
  • feat: Add flag to hide/display the autocomplete query for search bar by @kushagra-apptware in #9104
  • docs(timeline): correct markdown heading level by @mayurinehate in #9126
  • docs(graphql) Correct mutation -> query for searchAcrossLineage examples by @eboneil in #9134
  • feat(kafka): increase kafka message size and enable compression by @david-leifker in #9038
  • feat(ingest/jsonschema) enable schema-aware JsonSchemaTranslator by @KulykDmytro in #8971
  • fix(metadata-ingestion): adds default value to _resolved_domain_urn i… by @alexklavensnyt in #9115
  • ci: tweak to only run relevant workflows by @anshbansal in #9052
  • Fix for flaky download_lineage_results cypress test by @kkorchak in #9132
  • docs: Update updating-datahub.md by @pedro93 in #9131
  • fix(ingest/clickhouse): pin version to solve column reflection regression by @hsheth2 in #9143
  • feat(ingest/looker): cleanup error handling by @hsheth2 in #9135
  • feat(ingest): add entity_supports_aspect helper by @hsheth2 in #9120
  • feat(sqlparser): support more update syntaxes + fix bug with subqueries by @hsheth2 in #9105
  • docs: correct broken doc links by @sachinsaju in #9137
  • feat(ingest): sql parser perf + asyncio fixes by @hsheth2 in #9119
  • feat(quickstart): fix broker InconsistentClusterIdException issues by @hsheth2 in #9148
  • fix(policies): remove non-existent policies, fix name by @anshbansal in #9150
  • Fix for a test that passed on Oss and failed on Saas by @kkorchak in #9147
  • docs(teradata): teradata doc external link 404 fix by @sachinsaju in #9152
  • fix(datahub-client): Include relocation for snakeyaml dependency. by @jiateoh in #8911
  • fix(ingest): cleanup large images in CI by @hsheth2 in #9153
  • build: increase gradle retries by @hsheth2 in #9091
  • feat(ingest): bump sqlglot parser by @hsheth2 in #9155
  • feat(ingest/mongodb): support stateful ingestion by @TonyOuyangGit in #9118
  • API test for managing secrets privilege by @kkorchak in #9121
  • fix(ingest): handle exceptions in min, max, mean profiling by @mayurinehate in #9129
  • feat: rename Assets tab to Owner Of by @kushagra-apptware in #9141
  • fix(ingest/mongodb): fix schema inference for lists of values by @hsheth2 in #9145
  • fix(ingest/db2): fix handling for table properties by @deepgarg-visa in #9128
  • fix(ingest): fully support MCPs in urn_iter primitive by @hsheth2 in #9157
  • fix(ingest/bigquery): use correct row count in null count profiling c… by @mayurinehate in #9123
  • docs: add feature guides for subscriptions and notifications by @yoonhyejin in #9122
  • docs: unify oidc guides using tabs by @yoonhyejin in #9068
  • chore(ingest): remove legacy memory_leak_detector by @hsheth2 in #9158
  • feat(ingest/looker): support emitting unused explores by @hsheth2 in #9159
  • refactor(policy): refactor policy locking, no functional difference by @david-leifker in #9163
  • API test for managing access token privilege by @kkorchak in #9167
  • fix(mysql-setup): quote database name by @darnaut in #9169
  • fix(health): fix health check ...

v0.12.1rc2

28 Nov 14:22
ac7fa56
Pre-release

What's Changed

Full Changelog: v0.12.1...v0.12.1rc2

v0.12.0

26 Oct 10:26
2ebf33e

v0.12.0 Release Highlights

User Experience

Nested Domains

Nested Domains are here! This provides flexibility in organizing your entities within Domains to match the unique organizational structure of your company.

DataHub Chrome Extension Improvements

The Acryl DataHub Chrome extension now supports PowerBI! This is a super powerful way for your business users to gain DataHub-specific insights directly in the BI tools they use most. Additionally, we now support making edits back to DataHub Entities directly from the Chrome extension.

Access Management Tab for Datasets

Shoutout to @Ramendra761 from the PayPal Team for contributing a new Access Management tab in Dataset Entity pages! The aim of this feature is to enable users to view the required roles for accessing the Dataset, as defined by Roles and/or Policies in the organization’s Access Management System. It also introduces the ability to request access directly from the page.

Metadata Ingestion

Miscellaneous Improvements

  • Sampling-Based Profiling: You can now configure sampling-based profiling to address query performance concerns in Snowflake and BigQuery
  • Kafka Connect > Snowflake: We now support automatically defining lineage between the two platforms
  • Athena: Support for complex and nested schemas

Column-Level Lineage

We are incubating CLL support for the following:

  • Airflow plugin v2 now supports automatic extraction of CLL for certain operators, removing the need to annotate DAGs
  • dbt
  • Redshift
  • PowerBI (support for Column-Level Lineage for M-Query)

Incubating Sources

  • MLflow
  • Teradata
  • Unity Catalog Notebooks
  • DynamoDB

Developer Experience

  • Data Contracts: v0.12.0 introduces underlying models and CLI; UI support to follow
  • We now support creating custom models without requiring a fork of the main DataHub project
  • Updates to support OpenSearch 2.x and alternate Postgres db in postgres-setup

Other Notable Changes

  • Session token configuration has changed; all previously created session tokens will be invalid and users will be prompted to log in. Expiration time has also been shortened, which may result in more login prompts with the default settings.
    There should be no other interruption due to this change.

Breaking Changes

Find full details here

  • #9044 - GraphQL APIs for adding ownership now expect either an ownershipTypeUrn referencing a custom ownership type or a (deprecated) type. Where before adding an ownership without a concrete type was allowed, this is no longer the case. For simplicity, you can use the type parameter, which will get translated to a custom ownership type internally if one exists for the type being added.
  • #9010 - In the Redshift source's config, incremental_lineage is now off by default.
  • #8810 - Removed support for SQLAlchemy 1.3.x. Only SQLAlchemy 1.4.x is supported now.
  • #8942 - Removed urn:li:corpuser:datahub owner for the Measure, Dimension and Temporal tags emitted
    by Looker and LookML source connectors.
  • #8853 - The Airflow plugin no longer supports Airflow 2.0.x or Python 3.7. See the docs for more details.
  • #8853 - Introduced the Airflow plugin v2. If you're using Airflow 2.3+, the v2 plugin will be enabled by default, and so you'll need to switch your requirements to include pip install 'acryl-datahub-airflow-plugin[plugin-v2]'. To continue using the v1 plugin, set the DATAHUB_AIRFLOW_PLUGIN_USE_V1_PLUGIN environment variable to true.
  • #8943 - The Unity Catalog ingestion source has a new option include_metastore, which will cause all urns to be changed when disabled.
    This is currently enabled by default to preserve compatibility, but will be disabled by default and then removed in the future.
    If stateful ingestion is enabled, simply setting include_metastore: false will perform all required cleanup.
    Otherwise, we recommend soft deleting all databricks data via the DataHub CLI:
    datahub delete --platform databricks --soft and then reingesting with include_metastore: false.
  • #8846 - Changed enum values in resource filters used by policies. RESOURCE_TYPE became TYPE and RESOURCE_URN became URN.
    Any existing policies using these filters (i.e. defined for particular urns or types such as dataset) need to be upgraded
    manually, for example by retrieving their respective dataHubPolicyInfo aspect and changing the filter section, i.e. from
   "resources": {
     "filter": {
       "criteria": [
         {
           "field": "RESOURCE_TYPE",
           "condition": "EQUALS",
           "values": [
             "dataset"
           ]
         }
       ]
     }

into

   "resources": {
     "filter": {
       "criteria": [
         {
           "field": "TYPE",
           "condition": "EQUALS",
           "values": [
             "dataset"
           ]
         }
       ]
     }

for example, using the datahub put command. Policies can also be removed and re-created via the UI.
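
The rename described for #8846 can be sketched as a small transformation over an already-retrieved dataHubPolicyInfo aspect. This is a minimal illustration, not DataHub's own migration code: the helper name migrate_policy_info is hypothetical, the aspect is assumed to be loaded as a plain Python dict, and fetching or writing the aspect back (for example with datahub put) is left out.

```python
import copy
import json

# Old -> new enum values, as stated in the #8846 note.
FIELD_RENAMES = {"RESOURCE_TYPE": "TYPE", "RESOURCE_URN": "URN"}


def migrate_policy_info(policy_info: dict) -> dict:
    """Return a copy of a policy-info dict with renamed filter fields.

    Hypothetical helper: it only rewrites resources.filter.criteria[].field
    and leaves everything else untouched.
    """
    migrated = copy.deepcopy(policy_info)
    resources = migrated.get("resources") or {}
    resource_filter = resources.get("filter") or {}
    for criterion in resource_filter.get("criteria") or []:
        old_field = criterion.get("field")
        if old_field in FIELD_RENAMES:
            criterion["field"] = FIELD_RENAMES[old_field]
    return migrated


old_policy = {
    "resources": {
        "filter": {
            "criteria": [
                {
                    "field": "RESOURCE_TYPE",
                    "condition": "EQUALS",
                    "values": ["dataset"],
                }
            ]
        }
    }
}

new_policy = migrate_policy_info(old_policy)
print(json.dumps(new_policy, indent=2))
```

Working on a deep copy keeps the original aspect available for comparison before anything is written back.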

  • #9077 - The BigQuery ingestion source by default sets match_fully_qualified_names: true. This means that any dataset_pattern or schema_pattern specified will be matched on the fully qualified dataset name, i.e. <project_name>.<dataset_name>. We attempt to support the old pattern format by prepending .*\\. to dataset patterns lacking a period, so in most cases this should not cause any issues. However, if you have a complex dataset pattern, we recommend you manually convert it to the fully qualified format to avoid any potential issues.
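
To see why simple patterns keep working while complex ones may not, the compatibility behaviour described for #9077 can be approximated as below. This is a sketch under assumptions, not the connector's actual code: adjust_dataset_pattern is a hypothetical name, "lacking a period" is assumed to mean that the pattern string contains no literal "." character, and the project and dataset names are made up.

```python
import re


def adjust_dataset_pattern(pattern: str) -> str:
    """Prepend the '.*\\.' prefix to patterns that contain no period (assumed rule)."""
    if "." in pattern:
        return pattern
    return r".*\." + pattern


# Fully qualified name in the <project_name>.<dataset_name> form.
fq_name = "my-project.analytics"

legacy = "analytics"
adjusted = adjust_dataset_pattern(legacy)
print(adjusted)                            # .*\.analytics
print(bool(re.match(legacy, fq_name)))     # False
print(bool(re.match(adjusted, fq_name)))   # True

# A pattern that already contains a period (here from '.*') is left alone
# under this rule, so it no longer matches the fully qualified name.
complex_pattern = "analytics_.*"
print(adjust_dataset_pattern(complex_pattern) == complex_pattern)        # True
print(bool(re.match(complex_pattern, "my-project.analytics_eu")))        # False
```

The last two lines show the kind of case where converting the pattern to the fully qualified format by hand is the safer option.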

What's Changed

  • feat(UI): AccessManagement UI to access the role metadata for a dataset by @Ramendra761 in #8541
  • Glossary Navigation Cypress test by @kkorchak in #8804
  • ci: upgrade python to 3.10 for builds by @hsheth2 in #8808
  • feat(ingestion/looker): Add view file-path as option in view_naming_pattern config by @siddiquebagwan-gslab in #8713
  • feat(upgrade): add ability to provide a startingOffset for RestoreIndices by @ukayani in #8539
  • fix(index): Do not override the search analyzer for ngram fields by @iprentic in #8818
  • test(managed_ingestion): fix managed ingestion test by fixing actions… by @david-leifker in #8820
  • docs: add 0.11 docs to docs site by @hsheth2 in #8813
  • docs(release): Update updating-datahub.md for 0.11.0 release by @iprentic in #8821
  • fix(ingest/mssql): Add UNIQUEIDENTIFIER data type as String by @cjm98332 in #8642
  • build(ingest): upgrade to sqlalchemy 1.4, drop 1.3 support by @mayurinehate in #8810
  • fix(ingest): use epoch 1 for dev build versions by @hsheth2 in #8824
  • ci: make wheel builds more robust by @hsheth2 in #8815
  • feat(cli): fix upload ingest cli endpoint by @pedro93 in #8826
  • docs(transformer): fix names in sample code of 'pattern_add_dataset_domain' by @Starkie in #8755
  • fix(siblingsHook): check number of dbtUpstreams instead of all upStreams by @ethan-cartwright in #8817
  • fix(java) Update DataProductMapper to always return a name by @chriscollins3456 in #8832
  • build(ingest): Bump jsonschema for Python >= 3.8 by @asikowitz in #8836
  • feat(ingest/rest-emitter): Do not raise error on retry failure to get better error messages by @asikowitz in #8837
  • ci: add markdown-link-check by @yoonhyejin in #8771
  • docs(managed datahub): release notes 0.2.11 by @anshbansal in #8830
  • build(ingest): Remove constraint on jsonschema for Python >= 3.8 by @asikowitz in #8842
  • fix(build): clean task cleanup generated src by @anshbansal in #8844
  • feat(ci): disable ingestion smoke build by @anshbansal in #8845
  • fix: fix quickstart page by @yoonhyejin in #8784
  • feat(bigquery): add better timers around every API call by @mayurinehate in #8626
  • feat(ingestion/dynamodb): Add DynamoDB as new metadata ingestion source by @TonyOuyangGit in #8768
  • feat(ingest/bigquery): support bigquery profiling with sampling by @mayurinehate in #8794
  • Fix for edit_documentation and glossary_navigation cypress tests by @kkorchak in #8838
  • feat(ui/java) Update domains to be nested by @chriscollins3456 in #8841
  • dcs(ml-models): enhancing ml model documentation ...

v0.11.0

11 Sep 16:39
68ae3bf

Release Highlights

Potential Downtime

This release introduces substantial improvements to search ranking which require reindexing indices.

During the reindexing:

  • a system-update job will set indices to read-only and create a backup/clone of each index
  • new components will be prevented from start-up until the reindex completes
  • Helm deployments will go into read-only mode and new ingestion runs will fail

This process can take anywhere from 5 minutes to multiple hours; as a rough estimate, please expect it to take 1 hour for every 2.3 million entities. After the reindex is complete, please check your ingestion runs and re-run any that did not complete.
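
For planning a maintenance window, the rule of thumb above can be turned into a quick calculation. This is only a back-of-the-envelope sketch: the function name is hypothetical, the 5-minute floor is taken from the lower bound quoted above, and real durations will depend on your cluster.

```python
# Rule of thumb from the release notes: about 1 hour per 2.3 million entities.
ENTITIES_PER_HOUR = 2_300_000
# Lower bound quoted in the notes; treated here as a floor (an assumption).
MIN_MINUTES = 5.0


def estimate_reindex_minutes(entity_count: int) -> float:
    """Rough reindex duration in minutes for a given number of entities."""
    estimate = entity_count / ENTITIES_PER_HOUR * 60
    return max(MIN_MINUTES, estimate)


print(estimate_reindex_minutes(2_300_000))   # 60.0
print(estimate_reindex_minutes(11_500_000))  # 300.0
print(estimate_reindex_minutes(100_000))     # 5.0
```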

User Experience

New Search and Browse Experience

We have some really exciting improvements to the DataHub user experience in this release! The new search and browse experience, which was first made available in the previous release behind a feature flag, is now on by default. Check out our release notes for v0.10.5 to get more information and documentation on this new Browse experience.

Improvements to Search

In addition to the ranking changes mentioned above, this release improves the highlighting of search results to help you understand why they match your query. You can also sort your results alphabetically or by last updated time, in addition to relevance. We also now suggest a correction if your query has a typo in it.

Manage Home Page Posts

In this release we now enable you to create and delete pinned announcements on your DataHub homepage! If you have the “Manage Home Page Posts” platform privilege you’ll see a new section in settings called “Home Page Posts” where you can create and delete text posts and link posts that your users see on the home page.

OpenAPI Endpoints Expanded

The OpenAPI entity and aspect endpoints have been expanded to improve the developer experience when using this API, with additional aspects to be added in the near future.

Metadata ingestion

Added support for the Confluent S3 Sink Connector, extraction of stored procedures and jobs from MSSQL, and Snowflake shares. Additionally, the SQL parsing source now converts query logs into CLL and usage.

Developer Experience

The CLI now supports recursive deletes.

Versioned documentation

Starting from this release, we support versioned documentation on the datahub docs site! Select the version you’re on and browse docs specifically at that version.

Performance Improvements

  • Batching of default aspects on initial ingestion (SQL)
  • Improvements to multi-threading. Ingestion recipes, if previously reduced to 1 thread, can be restored to the 15 thread default.
  • Gradle 7 upgrade moderately improves build speed
  • DataHub Ingestion slim images reduced in size by 2GB+

Important Bug Fixes

  • Glue Schema Registry fixed

Deprecation Notice

  • MAE Events are no longer produced. MAE events have been deprecated for over a year.

What's Changed

  • feat(ingest/presto-on-hive): enable partition key for presto-on-hive by @zheyu001 in #8380
  • feat(classification): allow parallelisation to reduce time by @mayurinehate in #8368
  • feat(ingest): Add metabase database id to platform instance mapping by @k-popov in #8359
  • feat(ingest): add ability to read other method types than GET for OAS ingest recipes by @jsmilkstein in #8303
  • fix(ingest): fix data platform urn in dataset_urn_to_key and dataset_key_to_urn by @Masterchen09 in #8209
  • fix(ingest/s3): wrong sorting in case of multi-partition key by @anshbansal in #8536
  • fix(ingest/presto): fix presto on hive test failures by @hsheth2 in #8548
  • Cypress test for managing groups by @kkorchak in #8520
  • feat(ingest/kafka-connect): add support for Confluent S3 Sink Connector by @tusharm in #8298
  • Variable rename - Allows deselection of members in add members modal for a group by @Sukeerthi31 in #8529
  • fix(ingest/s3): catch no such bucket exception instead of failing by @anshbansal in #8549
  • fix(ingest): add tableau sqlglot dep by @hsheth2 in #8552
  • fix(ingetion/mssql): convert dataset urns to lowercase by @siddiquebagwan in #8551
  • Fix flaky add_user smoke test by @kkorchak in #8471
  • feat(ci): use docker registry cache by @hsheth2 in #8544
  • fix(glue): restore glue configurations by @RyanHolstien in #8533
  • build(release): Update files for 0.10.5 release by @iprentic in #8556
  • docs(release): Update updating-datahub.md for 0.10.5 release by @iprentic in #8557
  • feat(ingestion/snowflake): use user email-id in urn generation for top users stat by @siddiquebagwan in #8513
  • docs(development.md): Minor grammatical error by @PauloGoncalvesLima in #8558
  • fix(usage): Update index lifecycle policy to not delete old datahub usage events by @iprentic in #8565
  • fix(ui): Simplify background color for Entity Health Status popover by @jjoyce0510 in #8559
  • fix: add --write args on pre-commit prettier by @yoonhyejin in #8560
  • docs(observe): Add feature doc for Freshness Assertions by @jjoyce0510 in #8547
  • docs(updating): add details on Unified Search & Browse experience by @maggiehays in #8568
  • fix: fix features section by @yoonhyejin in #8571
  • feat(ingest): allow lower freq profiling based on date of month/day of week by @anshbansal in #8489
  • fix(stats): default to 3 months by @anshbansal in #8566
  • fix(aspect): count query only for relevant aspect index by @iprentic in #8569
  • feat(quickstart): bump quickstart start periods more by @hsheth2 in #8573
  • Origin/cypress test for managing policies by @kkorchak in #8554
  • feat(ui) Show source documentation when editing entity documentation by @chriscollins3456 in #8516
  • fix(ingest): handle redaction of configs with int keys by @hsheth2 in #8545
  • fix(ingest/snowflake): maintain qualified name casing, do not lowercase by @mayurinehate in #8574
  • feat(docs): add github repo links to readme and docs by @yoonhyejin in #8422
  • feat(ebean): Add metric in ebean aspect DAO for failed tries, as well as failed operation… by @iprentic in #8576
  • refactor(search) Use search across multiple-entities API, deprecate Aggregator classes by @iprentic in #8498
  • feat(siblings): dont show multiple platform icons if the siblings are ghost nodes by @gabe-lyons in #8543
  • docs(lineage): Add description to make_lineage_mce by @eboneil in #8596
  • doc(ingest/log): failure log at pipeline level document by @anshbansal in #8591
  • Dataset ownership test by @kkorchak in #8583
  • doc(release): release notes for 0.2.10 by @anshbansal in #8599
  • docs(release): fix typo by @anshbansal in #8600
  • feat(ui): apply views to: domains, containers, terms by @eboneil in #8572
  • feat(search): embedded view dropdown by @joshuaeilers in #8598
  • fix(ingest/file): remove entity_type_counts and aspect_counts by @hsheth2 in #8586
  • fix(ingest): use hive pure_sasl variant by @hsheth2 in #8570
  • Feat(ingest/ldap)fix list index out of range error by @alplatonov in https://githu...

v0.10.5

02 Aug 03:58
4f9fc67

Release Highlights

NEW: Unified Search and Browse Experience

It’s here, it’s here! We are incredibly excited to roll out our re-designed, streamlined Search and Browse experience. End-users now have a one-stop-shop to search for specific data entities and browse across systems, making it easier than ever to find the most relevant and meaningful resources within DataHub.

Check out the screenshot below and get a full walk-through in this video!

[Screenshot: Unified Search and Browse experience]

User Experience

  • Column-Level Lineage (CLL) visualization update: you can now visualize CLL relationships through DataJobs (i.e. Airflow DAGs)
  • Unique Glossary Terms: We now prevent creating duplicate Glossary Term names within a Term Group
  • Domains: You can now configure the Documentation tab to be the default landing page within a Domain
  • Formatting updates to Row Count to make large numbers more human readable (e.g. 3283337 > 3.2M)
  • Stats Tab: Y-axis scale now dynamically set to reflect the minimum & maximum values, improving readability
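
The Row Count example in the list above (3283337 shown as 3.2M) suggests truncation to one decimal place rather than rounding. A minimal sketch of that kind of formatting follows; it is not the UI's actual code, and the function name and the K and B suffixes are assumptions added for illustration.

```python
def humanize_count(value: int) -> str:
    """Abbreviate a non-negative count, truncating to one decimal place."""
    units = ((1_000_000_000, "B"), (1_000_000, "M"), (1_000, "K"))
    for threshold, suffix in units:
        if value >= threshold:
            # Integer arithmetic avoids float rounding: 3283337 -> 32 tenths.
            tenths = value * 10 // threshold
            whole, frac = divmod(tenths, 10)
            return f"{whole}.{frac}{suffix}"
    return str(value)


print(humanize_count(3283337))  # 3.2M
print(humanize_count(1500))     # 1.5K
print(humanize_count(999))      # 999
```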

Metadata ingestion

Ingestion Enhancements:

  • BigQuery: Set platform_instance using project_id
  • PowerBI: Ingest datasets not used in visualizations (tiles/pages)
  • Kafka Connect: Ability to set platform_instance
  • Nifi: Support for basic auth
  • Presto on Hive: Extract all table properties from Hive Metastore
  • Elasticsearch: Support for basic profiling
  • Add advanced configuration for LDAP manager ingestion

Lineage Improvements:

  • Schema-aware SQL parsing to derive column-level lineage
  • Column-level lineage support for BigQuery, Tableau, and Snowflake View definitions
  • Snowflake: Extract Snowpipe S3 lineage

Developer Experience

  • Fine-grained ownership policies
  • PATCH support for DataJob Inputs/Outputs
  • New endpoints to extract size of time-series indices and truncate/cleanup time-series indices in Elasticsearch; support for bulk-deletes
  • Initial support for exception reporting via Sentry
  • New OpenAPI endpoint to get Task Status
  • SDK: Easily generate container URNs

Docs

  • Improvements to our File-Based Lineage doc, specifically focused on Fine-Grained Lineage config components (link)
  • Code examples of how to manage Posts within DataHub (link)
  • Guide to generating custom browse paths for the new search experience (link)

What's Changed

  • refractor(classification): datahub classifier init by @mayurinehate in #8193
  • fix(glue): fix typo in reported warning, report with flow_urn by @mayurinehate in #8138
  • fix(ingest/delta-lake): fix CI issues due to delta lake version bump by @mayurinehate in #8215
  • Upgrade kafka and its dependencies to 3.4 in docker compose by @jinlintt in #8161
  • chore(release): update default cli for managed ingestion by @pedro93 in #8226
  • fix(ownership): Corrects graphQL resolver for entity operations by @pedro93 in #8219
  • fix(cli/quickstart): handle docker hangs gracefully by @hsheth2 in #8211
  • fix(cli): make quickstart robust to docker race conditions by @hsheth2 in #8233
  • fix(search): tag/term should filter for both entity and field level by @anshbansal in #7881
  • docs(tests): document test eval endpoint by @anshbansal in #8227
  • feat(ingest/bigquery_v2): enable platform instance using project id by @asikowitz in #8216
  • feat(stats): make rowcount more human readable by @joshuaeilers in #8232
  • docs(es): Update aws deploy docs to correct ElasticSearch version by @iprentic in #8240
  • feat(sdk): support patches as MCPs in file source by @hsheth2 in #8220
  • fix(apiAuth): add resources where applicable and update docs by @RyanHolstien in #8234
  • feat(patch): support datajob input output by @RyanHolstien in #8190
  • feat(ingest/unity): Set external url for containers and datasets by @asikowitz in #8238
  • docs(airflow): add docs on custom operators by @matthew-coudert-cko in #7913
  • chore(release): update datahub upgrade docs by @pedro93 in #8228
  • fix(ingestion/tableau): Remove unused field documentViewId by @mohdsiddique in #8225
  • feat(ui): create fast path for immediate processing of ui sourced changes by @RyanHolstien in #8200
  • fix(ingest/druid) Handling gracefully if no table returned in a schema by @treff7es in #8203
  • fix(kafka-setup): bump kafka version by @david-leifker in #8245
  • feat(ingestion/powerbi): Ingest datasets not used in PowerBI visualization(tiles/pages) by @mohdsiddique in #8212
  • fix(sdk/dataflow): deprecate cluster and use env and platform_instance instead by @shubhamjagtap639 in #8201
  • fix(ingest): pass platform correctly to browse path v2 helper by @asikowitz in #8244
  • feat(search): Supporting Aggregations for hasX fields by @jjoyce0510 in #8241
  • fix(ingest): Call validator on the base urn as well as aspect components when ingesting by @iprentic in #8250
  • docs(website): adjust markprompt z-index so it's not covered by nav by @jeffmerrick in #8255
  • fix(patch): Fix exception when using default patch for patching missing aspects by @jjoyce0510 in #8221
  • fix(custom-search): revert underscore as quoted by @david-leifker in #8163
  • chore(ci): add back optional static sleep for tests by @anshbansal in #8258
  • chore(checkbox): darken all checkboxes by @joshuaeilers in #8248
  • chore(assertions): catch any exception on assertion delete by @joshuaeilers in #8247
  • feat(opensearch): Rollover usage events at a file size rather than time-based manner by @iprentic in #8182
  • fix(ingest/okta): Set default of okta_profile_to_username_attr to email by @asikowitz in #8263
  • feat(ui) Update Search & Browse to be a unified experience by @chriscollins3456 in #8235
  • fix(ingest/tableau): split table columns query from datasources query by @mayurinehate in #8217
  • fix(ingest/okta): Set default of okta connector to match OIDC defaults by @anshbansal in #8272
  • feat(elasticsearch): Add endpoint for getting the size of timeseries indices by @iprentic in #8265
  • feat(ingest/delete-cli): Add configurable batch size; update docs by @asikowitz in #8274
  • fix aggregation sorting in browsev2 sidebar by @joshuaeilers in #8276
  • Support de-selecting browse paths by @joshuaeilers in #8242
  • feat(cli): Initial support for sending exceptions to Sentry by @treff7es in #7172
  • fix(ingestion/powerbi): use admin api resolver to fetch modified workspaces by @mohdsiddique in #8273
  • fix: dbt-athena types mapping for complex types by @svdimchenko in #8264
  • feat(graphql) Prevent duplicate glossary term names within a group by @chriscollins3456 in #8187
  • Add retries to JavaEntityClient:deleteReferencesTo by @joshuaeilers in #8268
  • feat(ingest): Create zero usage aspects by @asikowitz in #8205
  • fix(docs) Update Chrome extension docs to reflect current reality by @chriscollins3456 in #8284
  • refactor(validations): Add URL-based Routing to Dataset Validations Tab by @jjoyce0510 in #8254
  • fix(metadata-io): retry transactions on serialization errors when using a PostgreSQL database by @Masterchen09 in #8278
  • docs(ingest/lineage): Update fine grained file lineage docs by @eboneil in https://github...

v0.10.4

12 Jun 14:22
f2c66fd

Release Highlights

User Experience

  • You can now create and assign Custom Ownership types within DataHub; plus, we now display the owner type on an Entity Page

  • Various bug fixes to Column Level Lineage visualization

Metadata ingestion

  • You can now define column-level lineage (aka fine-grained lineage) via our file-based lineage source
  • Looker: Ingest Looks that are not part of a Dashboard
  • Glue: Error reporting now includes lineage failures
  • BigQuery: Now support deduplicating LogEntries based on insertId, timestamp, and logName
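
The BigQuery item above describes deduplicating LogEntries on the combination of insertId, timestamp, and logName. The general idea can be sketched as follows; this is an illustration rather than the connector's implementation, the function name is hypothetical, and entries are assumed to be plain dicts carrying those three keys.

```python
from typing import Iterable, Iterator


def dedupe_log_entries(entries: Iterable[dict]) -> Iterator[dict]:
    """Yield each entry once, keyed on (insertId, timestamp, logName)."""
    seen = set()
    for entry in entries:
        key = (entry.get("insertId"), entry.get("timestamp"), entry.get("logName"))
        if key in seen:
            continue  # same key already emitted, skip the duplicate
        seen.add(key)
        yield entry


# Made-up sample data: the second entry repeats the first one's key.
sample = [
    {"insertId": "a1", "timestamp": "2023-06-01T00:00:00Z", "logName": "audit", "n": 1},
    {"insertId": "a1", "timestamp": "2023-06-01T00:00:00Z", "logName": "audit", "n": 2},
    {"insertId": "a1", "timestamp": "2023-06-01T00:00:05Z", "logName": "audit", "n": 3},
]

unique = list(dedupe_log_entries(sample))
print([e["n"] for e in unique])  # [1, 3]
```

Keeping the first occurrence and streaming the rest through a generator means memory use grows only with the number of distinct keys.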

Docs

  • CSV Enricher: improvements to sample CSV and recipe
  • Guide for changing default DataHub credentials
  • Updated guide to apply time-based filters on Lineage

What's Changed

  • ci(ingest/kafka): improve kafka integration test reliability by @hsheth2 in #8085
  • fix(ingest/bigquery): Deduplicate LogEntries based on insertId, timestamp, logName by @asikowitz in #8132
  • feat(ingest/glue): report glue job lineage failures, update doc by @mayurinehate in #8126
  • feat(lineage source): add fine grained lineage support by @anshbansal in #7904
  • docs(glue): fix broken link by @mayurinehate in #8135
  • feat(custom ownership): Adds Custom ownership types as a top level entity by @pedro93 in #8045
  • Update updating-datahub.md for v0.10.3 release by @iprentic in #8139
  • feat: add dbt-athena adapter support for column types mapping by @svdimchenko in #8116
  • docs(csv-enricher): add example csv file & recipe by @gabe-lyons in #8141
  • chore(ci): update base requirements file by @anshbansal in #8144
  • fix(ingest/s3): Path spec aware folder traversal by @treff7es in #8095
  • fix(ui) Fix selecting columns in Lineage tab for CLL by @chriscollins3456 in #8129
  • feat(search): adding support for _entityType filter in the application layer + frontend by @gabe-lyons in #8102
  • docs(ingest/nifi): fix broken links by @mayurinehate in #8143
  • fix(scroll): fix scroll cache key for hazelcast by @RyanHolstien in #8149
  • chore(json): fix json vulnerability by @RyanHolstien in #8150
  • fix(ingest/json-schema): handle property inheritance in unions by @hsheth2 in #8121
  • chore(log): fix log as error instead of info by @anshbansal in #8146
  • fix(lineagecounts) Include entities that are filtered out due to sibling logic in the filtered count of lineage counts by @iprentic in #8152
  • fix(stats): display consistent query count on stats tab by @joshuaeilers in #8151
  • fix(ingest): remove original_table_name logic in sql source by @hsheth2 in #8130
  • feat(ingest): add more fail-safes to stateful ingestion by @hsheth2 in #8111
  • feat(ingest/snowflake): support for more operation types by @mayurinehate in #8158
  • fix(ui) Show Entities first on Domain pages again by @chriscollins3456 in #8159
  • fix(ingest/nifi): allow nifi site url with context path by @mayurinehate in #8156
  • feat(ingest): Create Browse Paths V2 under flag by @asikowitz in #8120
  • fix(ingestion/looker): set project-name for imported_projects views by @mohdsiddique in #8086
  • fix(docs): Fix ownership type typos by @pedro93 in #8155
  • docs(townhall) feb and march town hall agenda and recording by @maggiehays in #7676
  • feat(ingest/unity): Add qualified name to dataset properties by @asikowitz in #8164
  • feat(ingest/bigquery_v2): enable platform instance using project id by @Khurzak in #8142
  • feat(ingest/snowflake): Deprecate legacy lineage and optimize query history joins by @asikowitz in #8176
  • fix(ingest/kafka): Fixing error printing in Kafka properties get call by @treff7es in #8145
  • fix(ingest/snowflake): set use_quoted_name to profile lowercase tables by @mayurinehate in #8168
  • feat(classification): support for regex based custom infotypes by @mayurinehate in #8177
  • fix(restli): update base client retry logic by @david-leifker in #8172
  • fix(ingest): Fix modeldocgen; bump feast to relax pyarrow constraint by @asikowitz in #8178
  • refactor(ci): move from sleep to kafka lag based testing by @shirshanka in #8094
  • docs(lineage): document timestamp filtering in lineage feature by @iprentic in #8174
  • build(ingest/feast): Pin feast to minor version by @asikowitz in #8180
  • feat(ingest/snowflake): Okta OAuth support; update docs by @asikowitz in #8157
  • feat(ingest/presto-on-hive): add support for extra properties and merge property capabilities by @treff7es in #8147
  • docs(managed datahub): release notes for v0.2.8 by @anshbansal in #8185
  • fix(nocode): fix DeleteLegacyGraphRelationshipsStep for Elasticsearch by @david-leifker in #8181
  • feat(docker):Add the jattach tool to the docker container(#7538) by @yangjiandan in #8040
  • refactor: Return original exception as caused by by @Jorricks in #7722
  • docs(ingest) Add MetadataChangeProposalWrapper import to example code by @iprentic in #8175
  • fix(ingest/kafka): Better error handling around topic and topic description extraction by @asikowitz in #8183
  • fix(vulnerabilities)/vulnerabilities_fixes_datahub (#8075) by @david-leifker in #8189
  • fix: add dedicated guide on changing default credentials by @yoonhyejin in #8153
  • feat(classification): configurable minimum values threshold by @mayurinehate in #8186
  • fix(ingestion/looker): ingest looks not part of dashboard by @mohdsiddique in #8140
  • fix(ingest/profiling): only apply monkeypatches once when profiling by @hsheth2 in #8160
  • docs(tableau): site config is required for tableau cloud / tableau online by @mohdsiddique in #8041
  • fix(ingest/bigquery): Swap log order to avoid confusion by @asikowitz in #8197
  • fix(ingest/redshift): Adding env parameter where it was missing for urn generation by @treff7es in #8199
  • revert(ingest/bigquery): Do not emit DataPlatformInstance; remove references to platform_instance by @asikowitz in #8196
  • docs(managed datahub): add docs link to v0.2.8 by @anshbansal in #8202
  • Add combined health check endpoint which can check multiple components by @iprentic in #8191
  • chore(cp-schema-registry): bump minor version by @david-leifker in #8192
  • feat(ingest): Produce browse paths v2 on demand and with platform instance by @asikowitz in #8173

Full Changelog: v0.10.3...v0.10.4

v0.10.3

25 May 20:20
1478d70

Release Highlights

User Experience

  • Define Data Products via YAML and manage associated entities within a Domain
  • Search experience: quickly apply a filter at time of search
  • Form-based PowerBI ingestion

Developer Experience

  • Progress toward Removing Confluent Schema Registry requirement -- Helm & Quickstart simplifications to follow
    • NOTE: this only works for new deployments of DataHub; if you have already deployed DataHub with Confluent Schema Registry, you will not be able to disable it
  • Delete CLI - correctly handles deleting timeseries aspects
  • Ongoing improvements to Quickstart stability
  • Support entity types filter in get_urns_by_filter
  • Search customization
    • regex based query matching
    • full control over scoring functions (usable on any document field, e.g. tags, deprecated flags, etc.)
    • enable/disable fuzzy, prefix, exact match queries

Ingestion

  • BigQuery - Improve ingestion disk usage & speed; extract dataset usage from Views
  • Unity Catalog - Capture create/last modified timestamps; extract usage; data profiling support
  • PowerBI - Update workspace concept mapping; support modified_since, extract_dataset_schema, and more
  • Superset – support stateful ingestion
  • Business Glossary – Simplify ingestion source
  • Kafka – Add description in dataset properties
  • S3 – Support stateful ingestion & last_updated
  • CSV Enricher – Support updating more types
  • PII Classification - Configurable sample size
  • Nifi - Support Kerberos authentication

What's Changed

  • fix(ingest/bigquery): Add to lineage, not overwrite, when using sql parser by @asikowitz in #7814
  • fix(ingest/bigquery): Enable lineage and usage ingestion without tables by @asikowitz in #7820
  • fix(ingest/bigquery): Do not query columns when not ingesting tables or views by @asikowitz in #7823
  • fix(ingest/bigquery): update usage query, remove erroneous init by @mayurinehate in #7811
  • fix(ingest/bigquery): Handle null values from usage aggregation by @asikowitz in #7827
  • perf(ingest/bigquery): Improve bigquery usage disk usage and speed by @asikowitz in #7825
  • fix(cli): use correct ingestion image in script by @hsheth2 in #7826
  • fix(release): prevent republish of images on release edits by @RyanHolstien in #7828
  • feat(): finish populating the entity registry by @hsheth2 in #7818
  • fix(ui) Fix 404 page routing bug by @chriscollins3456 in #7824
  • feat(ui): Support PowerBI Ingestion via UI form by @jjoyce0510 in #7817
  • fix(ingest/snowflake): fix column name in snowflake optimised lineage by @mayurinehate in #7834
  • feat(ingest/unity): capture create/lastModified timestamps by @hsheth2 in #7819
  • fix(test): fix spark lineage test by @david-leifker in #7829
  • docs(): add markprompt help chat by @jeffmerrick in #7837
  • Update DataJobInputOutput.pdl to express that CLL fields are not shown in the UI right now by @gabe-lyons in #7830
  • feat(cli): improve quickstart stability by @hsheth2 in #7839
  • chore(ci): regular upgrade base requirements.txt by @anshbansal in #7821
  • feat(timeseries): Support sorting timeseries aspects by non-timestampMillis field + fix operations resolver by @jjoyce0510 in #7840
  • doc(ingestion/tableau): Fix rendering ingestion quickstart guide by @mohdsiddique in #7808
  • fix(ingest): pin sqlparse version by @hsheth2 in #7847
  • feat(urn): Add a validator when creating an URN that it is no longer than the li… by @iprentic in #7836
  • chore(ingest): bug fix in sqlparse pin by @hsheth2 in #7848
  • feat: enriching guide on creating dataset by @yoonhyejin in #7777
  • feat(docs): consolidate api guides by @yoonhyejin in #7857
  • fix(ingest/salesforce): use report timestamp for operations by @hsheth2 in #7838
  • chore(ci): fix CI failing due to lint by @anshbansal in #7863
  • fix(mcl): fix improper pass by reference by @RyanHolstien in #7860
  • feat(urn) Add validator to reject URNs which contain the character we plan to u… by @iprentic in #7859
  • feat(elasticsearch): Add servlet which provides an endpoint for a healthcheck on the ES cl… by @iprentic in #7799
  • fix(ui) Add UI fixes and design tweaks to AutoComplete by @chriscollins3456 in #7845
  • fix(ui) Get all entity assertions in chrome extension by @chriscollins3456 in #7849
  • refactor(platform): Refactoring ES Utils, adding EXISTS condition support to Filter Criterion by @jjoyce0510 in #7832
  • chore(ui): change background color to transparent for avatar with photoUrl by @hieunt-itfoss in #7527
  • refactor(ingest): Add helper DataHubGraph methods by @asikowitz in #7851
  • fix(ui) Disable cache on Domain and Glossary Related Entities pages by @chriscollins3456 in #7867
  • fix(cache): Fix cache key serialization in search service by @pedro93 in #7858
  • docs(ingest): update dbt and aws docs by @hsheth2 in #7870
  • docs(ingest): fix CorpGroup example by @hsheth2 in #7816
  • docs(ingest/powerbi): update workspace concept mapping by @eeepmb in #7835
  • feat(ingest/powerbi): support modified_since, extract_dataset_schema and many more by @aezomz in #7519
  • Remove usages of commons-text library lower than 1.10.0 by @iprentic in #7850
  • feat(glue): allow resource links to be ignored by @YusufMahtab in #7639
  • feat(ingestion): lookml refinement support by @mohdsiddique in #7781
  • feat(ingest/unity): Ingest ownership for containers; lookup service principal display names by @asikowitz in #7869
  • Logging and test models fixes by @david-leifker in #7884
  • feat(model) Add ContainerPath aspect model by @chriscollins3456 in #7774
  • bug(7882): run kafka-configs.sh on DataHubUpgradeHistory_v1 to make sure the retention.ms is set to infinite by @jinlintt in #7883
  • fix: refactor toc by @yoonhyejin in #7862
  • feat(cli): Modifies ingest-sample-data command to use DataHub url & token based on config by @pedro93 in #7896
  • feat(ingest/snowflake): optionally emit all upstreams irrespective of recipe pattern by @mayurinehate in #7842
  • fix(ingestion/tableau): backward compatibility with version 2021.1 an… by @mayurinehate in #7864
  • fix(ingest/dbt): ensure dbt shows view properties by @hsheth2 in #7872
  • docs(airflow): add debug guide on url generation by @hsheth2 in #7885
  • feat(sdk): support entity types filter in get_urns_by_filter by @hsheth2 in #7902
  • fix(ingest/snowflake): fix optimised lineage query, filter temporary … by @mayurinehate in #7894
  • fix(ingest/bigquery): fix handling of time decorator offset queries by @mayurinehate in #7843
  • fix(ingest): fix minor bug + protective dep requirements by @hsheth2 in #7861
  • fix(cli): remove duplicate labels from quickstart files by @hsheth2 in #7886
  • Revert "feat(cli): Modifies ingest-sample-data command to use DataHub… by @pedro93 in #7899
  • feat(sdk): add DataHubGraph.get_entity_semityped method by @hsheth2 in #7905
  • test(ingest/biz-glossary): add test for enable_auto_id by @hsheth2 in #7911
  • feat(ingest): add GCS ingestion source by @mayurinehate in #7903
  • [bugfix] Fix remote file ingestion...

DataHub v0.10.2

13 Apr 23:26
4ec280e

Known Issues

  • PostgreSQL: In release v0.10.1, the CLI default for max_threads was increased from 1 to 15. This causes problems with PostgreSQL transactions. If you run PostgreSQL as the GMS backend, the recommended workaround is to set max_threads to 1 in your ingestion recipes.
  • BigQuery: The BigQuery connector depends on a bad version of sqlparse, which manifests as a "str object is not callable" error. This has since been fixed in CLI release v0.10.2.2.
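
As a rough illustration of the PostgreSQL workaround, the sketch below patches a recipe held as a Python dict. It assumes max_threads lives under the sink's config block; confirm the exact location against the sink documentation for your version. The helper name limit_sink_threads and the sample recipe values are hypothetical.

```python
import copy
from typing import Any, Dict


def limit_sink_threads(recipe: Dict[str, Any], max_threads: int = 1) -> Dict[str, Any]:
    """Return a copy of the recipe with sink.config.max_threads set.

    Assumption: the setting belongs under the sink's config block.
    """
    patched = copy.deepcopy(recipe)  # leave the caller's recipe untouched
    sink_config = patched.setdefault("sink", {}).setdefault("config", {})
    sink_config["max_threads"] = max_threads
    return patched


# Made-up recipe for illustration only.
recipe = {
    "source": {"type": "example-source", "config": {}},
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
}
patched_recipe = limit_sink_threads(recipe)
```

The same change in a YAML recipe is a single max_threads: 1 line in the sink configuration.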

Release Highlights

Metadata Ingestion

New

  • [redshift] Redshift Combining Usage and Metadata Extraction
  • [bigquery] Cross-Project Usage Support (using File System)
  • [snowflake] Push down Lineage Extraction to Snowflake Access History API
  • [azure-ad] Support stateful ingestion - Automatically remove groups and users when they are removed in Azure.
  • [okta] Support stateful ingestion - Automatically remove groups and users when they are removed in Okta.
  • [tableau] Extract lineage from CSQL queries in Tableau ingestion
  • [snowflake] Better error message on key pair authentication
  • [sdk] Support executing GraphQL Queries via DataHubGraph
  • [unity] Support extracting ownership
  • [postgres] Support extracting metadata from all databases in a single recipe

Bug Fixes

  • [bigquery] Capture all operation types when ingesting operational stats
  • [bigquery] Fix and refactor exported audit logs query
  • [redshift] Fix SQL for extracting lineage from insert queries

User Experience

New

  • Auto-Complete UX Refresh - Quickly filter search results inside autocomplete experience
  • View: Support views on the Auto-Complete Search Bar

Bug Fixes

  • Fix bug where Tag names do not render properly in search previews
  • Fix bug where Tag color does not render properly in search autocomplete
  • Fix bug when adding Tags and Glossary Terms to nested schema fields
  • Fix bug where DataHub would redirect you when clicking to navigate back home
  • Fix bug where Metadata Tests results did not show if they were all passing

Documentation

Developer Experience

  • Add performance testing framework for BigQuery usage

What's Changed

  • fix(cli): allow usage without kafka by @hsheth2 in #7677
  • test(elasticsearch): Add unit test for timestamp-based lineage feature by @iprentic in #7661
  • feat(docs-website): add docs on creating users and groups by @yoonhyejin in #7574
  • chore(ci): add coverage code for python by @anshbansal in #7681
  • doc(release): managed datahub v0.2.4 release notes by @anshbansal in #7679
  • refactor(ingest/bigquery): add inline comments + refactor in table name parsing by @mayurinehate in #7609
  • fix(ingest/looker): skip empty user ids for usage by @hsheth2 in #7686
  • fix(ingest/dbt): enable incremental lineage by default by @hsheth2 in #7674
  • fix(ingest/bigquery): Fix BigQueryTableType enum accesses by @asikowitz in #7685
  • fix(ingest/looker): correct looker/lookml capability reports by @hsheth2 in #7683
  • feat(ingest/looker): enable looker usage ingestion by default by @hsheth2 in #7684
  • doc(freshness): add faq for dataset freshness by @anshbansal in #7693
  • chore(lint): fix lint in looker by @anshbansal in #7695
  • fix(ingest/bigquery): quote string constants in query by @mayurinehate in #7694
  • feat(ui) Update auto-complete functionality and design by @chriscollins3456 in #7515
  • fix(ui) Update Looker/Lookml forms to set client id and deploy key as Secrets by @chriscollins3456 in #7479
  • perf(ingest): Improve FileBackedDict iteration performance; minor refactoring by @asikowitz in #7689
  • feat(quickstart): move quickstart back to master by @hsheth2 in #7697
  • test(ingest/dbt): add test for column meta match by @hsheth2 in #7673
  • feat(ingest/postgres): support extracting metadata from all databases in single recipe by @mayurinehate in #7581
  • docs(): generate docs for our Python SDK by @hsheth2 in #7612
  • fix(ingest/redshift): Lineage query fix to work with the latest redshift by @treff7es in #7698
  • feat(ingestion): azure-ad stateful ingestion by @mohdsiddique in #7701
  • chore(ingest): formatting + cleanup MCPW usages by @hsheth2 in #7706
  • test(ingest/bigquery): Add performance testing framework for bigquery usage by @asikowitz in #7690
  • fix(docs): Fixing timeseries delete doc until code path is fixed by @jjoyce0510 in #7711
  • docs: add concept section by @yoonhyejin in #7655
  • JWT authenticator with asymmetric PublicKey verification for JWT token. by @syedzoherer in #6495
  • fix(ingestion): fix AssertionError in base_transformer by @sgomezvillamor in #7702
  • feat(docs): support inlining code snippets from files by @hsheth2 in #7712
  • feat(ingestion) Allow for ingestion to read files remotely by @xiphl in #7552
  • feat: add pre-commit by @yoonhyejin in #7680
  • docs(okta): add how to use email in urns by @anshbansal in #7708
  • feat(ingest/snowflake): hide host_port from snowflake docs by @hsheth2 in #7717
  • feat(ingest/bigquery): Capture all operation types when ingesting operational stats by @asikowitz in #7723
  • doc(redshift) - Adding Redshift ingestion quickstart guide by @treff7es in #7700
  • refactor(ingest): Minor cleanup of File, CsvEnricher, BusinessGlossary, and FileLineage sources by @asikowitz in #7718
  • feat(ingest/lookml): support views with derived_table.explore_source by @hsheth2 in #7704
  • fix(ci): Fixing broken Domains Test by @jjoyce0510 in #7746
  • feat(ingest/dbt): include dbt unique_id in properties by @hsheth2 in #7737
  • docs(airflow): update with information for new plugin by @anshbansal in #7732
  • chore(ingest): change kafka connect mapped ports by @hsheth2 in #7728
  • feat(docs): clear up source configs by @hsheth2 in #7720
  • feat(ingest): emit state payloads as soft-deleted by @hsheth2 in #7714
  • fix(sdk): remove rest emitter to graph cache in CorpGroup by @bossenti in #7743
  • refactor(ingest): Use sqlite.Row row_factory for FileBackedCollections by @asikowitz in #7739
  • refactor(ingest/bigquery): Standardize audit log parsing and make TopKDict a DefaultDict by @asikowitz in #7738
  • doc(ingestion): tableau quick ingestion guide by @mohdsiddique in #7682
  • docs(search): Add example search for finding tables without the name field by @iprentic in #7647
  • feat(ingest/dbt): update subtypes for dbt by @hsheth2 in #7750
  • feat(snowflake): better error message on key pair authentication by @anshbansal in #7734
  • feat(sdk): fix ownership emission for groups by @hsheth2 in #7751
  • fix(TestResults UI):show non-failing TestResult by @blankon123 in #7747
  • fix(ingest/bigquery): fix and refractor exported audit logs query by @mayurinehate in #7699
  • fix(ingest/demo-data): fix bug in path type by @hsheth2 in https://github.com/datahub-project...

DataHub v0.10.1

24 Mar 00:38
864ac2d

Known Issues

CLI

  • BigQuery: Table- and column-level profiling is broken due to a bad assumption introduced in this version. Please use an alternate version if you rely on the BigQuery profiling feature.

ElasticSearch

Clusters running Elasticsearch 7.9 and below are no longer supported as of this release, due to their lack of case sensitivity support in term queries.

Release Highlights

User Experience

  • The Queries Tab has a new look - supports manually adding and annotating queries directly from the UI, making it easier to share trusted SQL logic with others
  • Glossary Terms now show "Contained by" and "Inherited by" relationships
  • Resolved issues with Download to CSV for large volumes of entities
  • Update to the Analytics tab - view Monthly Active users to keep track of DataHub adoption and activity within your organization
  • Ongoing UI optimizations focused on improving the navigation experience

Metadata Ingestion

BigQuery

  • Improvements to memory usage during metadata extraction
  • Ingestion now captures Dataset Labels
  • Emit cross-project usage

PowerBI

  • Support for Platform Instance, to uniquely identify multiple instances of the same Platform
  • Support for PowerBI <> (Redshift, BigQuery) lineage extraction
  • Extract entity descriptions

Miscellaneous

  • DataHub Integrations Catalog to quickly filter and search for supported integrations
  • Kafka Connect - support for stateful ingestion & lowercasing URNs
  • Snowflake: improvements to memory usage during metadata extraction
  • Postgres: supports estimated row counts during profiling
  • Fix to dbt ingestion to address inconsistent upper/lower casing
  • S3 ingestion now supports path_specs of multiple buckets in the same recipe
  • Looker: Upgrade Looker API from 3.1 to 4.0
  • Great Expectations: support for lowercasing URNs
  • Tableau: Support for Project Path & Containers; ingestion more resilient to timeout exceptions

Developer Experience

Miscellaneous

  • Neo4j support for lineage time filter
  • Metadata model support for JSON schemas stored in Files, Directories, and Kafka Schema Registry
  • Timeline API now supports Glossary Terms
  • Improvements to startup time for DataHub CLI

API Docs & Guides

  • Table of contents to understand DataHub APIs at a glance
  • Guides:
    • Add Tags, Terms, Owners to entities
    • Create datasets
    • Manage Lineage

Search Improvements

  • searchAcrossEntities/Lineage improvements
  • support searchAfter
  • advanced query, identity autocomplete, exact match weight

Breaking Changes

Lineage Graph UI

  • Previously, DataHub displayed nodes in Lineage Viz even for URNs that do not technically exist (that is, URNs with no aspects defined). Those nodes are now filtered out, so lineage that previously appeared may no longer show up in the Lineage Graph. This change was made to improve the correctness and consistency of the DataHub experience. If you have feedback, feel free to reach out to the core team. To restore a missing node, produce a "DatasetKey" aspect for each URN that you'd like to show in the Lineage Graph.

What's Changed


DataHub v0.10.0

07 Feb 21:16
cf1e627

Release Highlights

Potential Downtime

This release introduces substantial improvements to search functionality which require reindexing indices.

During the reindexing:

  • a system-update job will set indices to read-only and create a backup/clone of each index
  • new components will be prevented from start-up until the reindex completes
  • Helm deployments will go into read-only mode and new ingestion runs will fail

This process can take anywhere from 5 minutes to multiple hours; as a rough estimate, expect it to take 1 hour for every 2.3 million entities. After the reindex is complete, check your ingestion runs and re-run any that did not complete.
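
The rough estimate above is simple arithmetic, sketched here for planning purposes. The function name estimate_reindex_hours is made up, and the 2.3 million entities per hour rate is only the ballpark figure quoted in these notes, so actual times will vary with cluster size and hardware.

```python
def estimate_reindex_hours(entity_count: int, entities_per_hour: float = 2_300_000) -> float:
    """Estimate reindex duration in hours from the quoted rough rate.

    Hypothetical helper: the default rate is the ballpark figure from
    the release notes, not a measured guarantee.
    """
    if entity_count < 0:
        raise ValueError("entity_count must be non-negative")
    if entities_per_hour <= 0:
        raise ValueError("entities_per_hour must be positive")
    return entity_count / entities_per_hour


# Example: a deployment with 4.6 million entities.
hours = estimate_reindex_hours(4_600_000)
print(f"Estimated reindex time: {hours:.1f} hours")
```

Very small deployments will not drop to zero in practice, since the notes give 5 minutes as the lower bound.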

If you are deploying containers yourself

If you're deploying the Docker containers yourself (without Helm or Docker-Compose Quickstart), you must first run the acryldata/datahub-upgrade Docker image (v0.10.0 tag) with the required environment variables set.

Then run the container with the following command:

docker run acryldata/datahub-upgrade:v0.10.0 -u SystemUpdate

For the full set of environment variables required, check out the default docker.env provided for Docker Compose deployments.

This runs the required reindex against your Elasticsearch instance, after which the other DataHub components should start correctly. If the datahub-upgrade container does not run successfully, other components in the stack will fail to start.

User Experience

We have some really exciting improvements to the DataHub user experience in this release!

Improved documentation editor, contributed by @ngamanda and the Grab Team.
This work provides a much more intuitive documentation editing experience within the UI, providing “what you see is what you get” formatting & removing the need for markdown expertise.

Additionally, you can easily:

  • Add links to other entities/users within DataHub
  • Embed and resize tables & images
  • Toggle between font sizes and formats
  • Embed syntax-highlighted code blocks

Filter lineage graphs based on time windows
You can now easily see the full lineage graph of an entity at a specific point in time. This makes it much easier to understand how interdependencies have evolved over time and to troubleshoot data issues in the past.

Improvements in Search
As noted above, we have rolled out substantial improvements to Search functionality, making it easier than ever for end users to find the entities that matter most. This release includes:

  • Stemming & Synonyms
  • Search by full or partial URN
  • Autocomplete improvements
  • Quoted search analyzer for exact & prefix match

Metadata Ingestion

Here are some of the most notable ingestion-related improvements:

  • Redshift: You can now extract lineage information from unload queries – thanks for the contrib, @mmmeeedddsss
  • PowerBI: Ingestion now maps Workspaces to DataHub Containers – thanks for the contrib, @looppi
  • BigQuery: You can now extract lineage metadata from the Catalog API – thanks for the contrib, @PatrickfBraz
  • Glue: Ingestion now uses table name as the human-readable name – thanks for the contrib, @danielcmessias

Developer Experience

  • This release introduces DataHub Lite, a new experimental lightweight implementation of DataHub. It is intended to enable local developer tooling use cases, such as simple access to metadata for scripts and other tools. DataHub Lite is compatible with the DataHub metadata format and all the ingestion connectors that DataHub supports. Check out the docs here.

Breaking Changes

#7103 This should only impact users who have configured explicit non-default names for DataHub's Kafka topics. The environment variables that configure Kafka topics in the kafka-setup docker image have been updated to be in line with other DataHub components; for more info, see our docs on Configuring Kafka in DataHub. These variables were previously suffixed with _TOPIC; the correct suffix is now _TOPIC_NAME. This change should not affect any user who is using the default Kafka topic names.
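
If you keep custom topic settings in a script or config map, the suffix change can be applied mechanically. The sketch below is one way to do that under stated assumptions: the helper name migrate_topic_env_vars and the sample variable names are hypothetical, and it renames every key ending in _TOPIC, so check the Kafka configuration docs for the variables that are actually affected.

```python
from typing import Dict


def migrate_topic_env_vars(env: Dict[str, str]) -> Dict[str, str]:
    """Rename keys ending in _TOPIC to end in _TOPIC_NAME.

    Hypothetical helper: it renames blindly by suffix, which is an
    assumption about which variables the change covers.
    """
    migrated: Dict[str, str] = {}
    for key, value in env.items():
        if key.endswith("_TOPIC"):
            migrated[key + "_NAME"] = value  # FOO_TOPIC -> FOO_TOPIC_NAME
        else:
            migrated[key] = value  # already migrated or unrelated
    return migrated


# Made-up variable names for illustration only.
old_env = {
    "EXAMPLE_EVENT_TOPIC": "my_custom_events",
    "EXAMPLE_AUDIT_TOPIC_NAME": "my_custom_audit",
    "KAFKA_BOOTSTRAP_SERVER": "broker:9092",
}
new_env = migrate_topic_env_vars(old_env)
```

Keys that already end in _TOPIC_NAME are left alone, so running the helper twice has no further effect.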

What's Changed

  • fix(ci): only scan on master branch by @anshbansal in #7047
  • fix(ci): use trivy offline scanning by @anshbansal in #7050
  • docs(get-started) Simplify copy on Get Started landing page by @maggiehays in #7043
  • fix(ingest/kafka): fix ResourceType import error for confluent_kafka<1.9.0 by @mayurinehate in #7046
  • docs(dbt): fix indentation in dbt meta mapping docs by @jx2lee in #7045
  • fix(ingest): temporarily disable vertica tests by @hsheth2 in #7059
  • feat(editor): improve documentation editor using Remirror by @ngamanda in #6631
  • fix(bootstrap): add EDIT_LINEAGE privilege to some default policies by @aditya-radhakrishnan in #7060
  • feat(ingest): add entity registry in codegen by @hsheth2 in #6984
  • feat(ingest): extract powerbi endorsements to tags by @looppi in #6638
  • feat(ingestion): pull metabase database, schema names from raw query and api by @remisalmon in #7039
  • fix(ingest): support multiple entity_registry sections by @hsheth2 in #7066
  • ci(ingest): add flag to skip tests but run codegen during release by @hsheth2 in #7067
  • fix(ingest): preserve dbt column name casing by @hsheth2 in #7063
  • fix(ingest/tableau): fix node limit exceeded error for workbooks query by @mayurinehate in #7068
  • fix(build/airflow): Fixing gradlew path by @treff7es in #7069
  • feat(ingest): support snapshots in dbt and dbt-cloud by @hsheth2 in #7062
  • fix(ui) Fix duplicate schema field rendering with siblings by @chriscollins3456 in #7057
  • refactor(ingest/athena): Replace s3_staging_dir parameter in Athena source with query_result_location by @bossenti in #7044
  • feat(ingest): fix handling of unions with aliases in post restli conversion by @hsheth2 in #7058
  • fix(ui) Make checkboxes in ingestion forms easier to see by @chriscollins3456 in #7061
  • fix(ingest): support git clone of non-github repos by @hsheth2 in #7065
  • feat(ingest): reporting revamp, part 1 by @hsheth2 in #7031
  • fix(secret-service): fix default encrypt key by @david-leifker in #7074
  • feat(datahub-lite): introduces a new experimental lightweight impleme… by @shirshanka in #7052
  • feat(datahub-lite): adding tab completion, small serialization fixes by @shirshanka in #7079
  • docs: add docs for managed DataHub v0.1.72 by @anshbansal in #7070
  • docs(readme): add inovex as adopter by @DSchmidtDev in #7077
  • docs: add warning about clearing cookies for login by @anshbansal in #7084
  • feat(cache): add hazelcast distributed cache option by @RyanHolstien in #6645
  • docs(datahub-lite): small improvement for zsh tab completion by @shirshanka in #7085
  • fix(ingest/bigquery): clear stateful ingestion correctly by @hsheth2 in #7075
  • fix(graphql): Return with appropriate status code instead of stacktrace by @szalai1 in #7086
  • fix(sso): Clear cookies on SSO redirect error by @aditya-radhakrishnan in #7088
  • fix(docs): add missing mutation literal by @ruedigerblock in #7082
  • fix(ui): display the correct access token expiry in AccessTokenModal by @ngamanda in #7078
  • fix(cli/lite): fix datahub lite serve command by @hsheth2 in #7089
  • fix(profiling): Fix syntax for APPROX_COUNT_DISTINCT on bigquery and snowflake by @feljen in #7087
  • fix(ingest): fix logic error of google protobuf wrapper type. by @wngus606 in #7076
  • feat(ui): Documentation Editor Improvements by @jjoyce0510 in #7072
  • fix(uri): marks uri field as deprecated, removes problem code, and adds coercer for usages of URI typeref by @RyanHolstien in #7093
  • fix(build): postgres docker secret by @david-leifker in https://github.com/datahub-pr...