
chore(tableau): set ingestion stage report and pertimers #12234

Open — wants to merge 4 commits into master

Conversation

sgomezvillamor (Contributor) commented on Dec 27, 2024

Locally tested:

...
{'ingestion_stage': 'End at 2024-12-27 12:56:56.379116+00:00',
 'ingestion_stage_durations': {'Ingesting Tableau Site: 9f087e55-dc7b-42cb-a5cb-08fd1b2f9e2c acryldatatableaupartnershipportal at 2024-12-27 12:55:43.160127+00:00': 73.22},
...

 'extract_usage_stats_timer': {'9f087e55-dc7b-42cb-a5cb-08fd1b2f9e2c': 0.39},
 'fetch_groups_timer': {},
 'populate_database_server_hostname_map_timer': {},
 'populate_projects_registry_timer': {'9f087e55-dc7b-42cb-a5cb-08fd1b2f9e2c': 1.51},
 'emit_workbooks_timer': {'9f087e55-dc7b-42cb-a5cb-08fd1b2f9e2c': 12.25},
 'emit_sheets_timer': {'9f087e55-dc7b-42cb-a5cb-08fd1b2f9e2c': 4.59},
 'emit_dashboards_timer': {'9f087e55-dc7b-42cb-a5cb-08fd1b2f9e2c': 1.42},
 'emit_embedded_datasources_timer': {'9f087e55-dc7b-42cb-a5cb-08fd1b2f9e2c': 38.28},
 'emit_published_datasources_timer': {'9f087e55-dc7b-42cb-a5cb-08fd1b2f9e2c': 9.37},
 'emit_custom_sql_datasources_timer': {'9f087e55-dc7b-42cb-a5cb-08fd1b2f9e2c': 0.49},
 'emit_upstream_tables_timer': {'9f087e55-dc7b-42cb-a5cb-08fd1b2f9e2c': 1.82},
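The per-site timer entries above are plain `{site_id: seconds}` dicts populated by wrapping each emit step in a timer. A minimal sketch of that pattern, assuming a simplified `PerfTimer`-like helper (the real DataHub `PerfTimer` has more features, and the site id here is a placeholder):

```python
import time
from typing import Dict


class PerfTimer:
    """Minimal context-manager timer (sketch; not DataHub's real PerfTimer)."""

    def __enter__(self) -> "PerfTimer":
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc) -> None:
        self._end = time.perf_counter()

    def elapsed_seconds(self) -> float:
        return self._end - self._start


emit_dashboards_timer: Dict[str, float] = {}
site_id = "example-site-id"  # the real report keys on the Tableau site id (a UUID)

with PerfTimer() as timer:
    time.sleep(0.01)  # stand-in for the actual emit_dashboards() work

# rounded to 2 decimals, matching the report output above
emit_dashboards_timer[site_id] = round(timer.elapsed_seconds(), 2)
```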

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a Usage Guide has been added for it.
  • For any breaking change, potential downtime, deprecation, or big change, an entry has been made in Updating DataHub

emit_custom_sql_datasources_timer: Dict[str, float] = dataclass_field(
    default_factory=TopKDict
)
emit_upstream_tables_timer: Dict[str, float] = dataclass_field(
    default_factory=TopKDict
)
Collaborator:
not a huge fan of this - does it make sense to have a SiteReport type, and then the TableauSourceReport has a dict[str (site id), SiteReport]?

how many sites are there reasonably going to be? the main limitation is that we don't want the report size to grow in an unbounded way
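The `SiteReport` alternative floated in this comment could look like the following sketch. All names here are hypothetical (this is the reviewer's suggestion, not the design the PR implements), and the bounded-size concern shows up as a comment:

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SiteReport:
    """Hypothetical per-site report, grouping all timers for one site."""

    emit_workbooks_seconds: float = 0.0
    emit_dashboards_seconds: float = 0.0


@dataclass
class TableauSourceReport:
    # site id -> per-site report; this grows linearly with the number of
    # sites, so it is only safe if the site count stays small and bounded
    sites: Dict[str, SiteReport] = field(default_factory=dict)


report = TableauSourceReport()
report.sites["site-1"] = SiteReport(emit_dashboards_seconds=1.42)
```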

Collaborator:

Actually maybe this is ok - we really only care about the ones that take a long time, right

@datahub-cyborg (bot) added the pending-submitter-response label (Issue/request has been reviewed but requires a response from the submitter) and removed needs-review (Label for PRs that need review from a maintainer) on Dec 27, 2024
yield from self.emit_dashboards()
with PerfTimer() as timer:
    yield from self.emit_dashboards()
self.report.emit_dashboards_timer[self.site_id] = round(
Collaborator:
PerfTimer objects can go directly into the report, and will get automatically formatted nicely - it implements as_obj, which is used here

if isinstance(some_val, SupportsAsObj):

so we could do something like with self.report.emit_dashboards_timer.setdefault(self.site_id, PerfTimer()):
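A sketch of that `setdefault` idea, assuming the report dict stores `PerfTimer` objects (rather than floats) and that the timer accumulates elapsed time across re-entries. The `PerfTimer` below is a simplified stand-in, not DataHub's real class (which also implements `as_obj` for report formatting):

```python
import time
from typing import Dict


class PerfTimer:
    """Simplified re-enterable timer sketch; elapsed time accumulates."""

    def __init__(self) -> None:
        self.elapsed = 0.0

    def __enter__(self) -> "PerfTimer":
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc) -> None:
        self.elapsed += time.perf_counter() - self._start


emit_dashboards_timer: Dict[str, PerfTimer] = {}
site_id = "site-1"  # placeholder for the Tableau site id

# setdefault stores a fresh timer on first use and returns the stored
# one on later calls, so repeated passes for the same site accumulate
with emit_dashboards_timer.setdefault(site_id, PerfTimer()):
    time.sleep(0.01)  # stand-in for emit_dashboards() work
```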

Collaborator:
although we might need to make some tweaks to TopKDict to make that work, so may not be worth it

@@ -3457,33 +3493,88 @@ def _create_workbook_properties(
        return {"permissions": json.dumps(groups)} if len(groups) > 0 else None

    def ingest_tableau_site(self):
        self.report.report_ingestion_stage_start(
Collaborator:
eventually I want to make these stages context managers - explicitly reporting start/end feels pretty error-prone
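The context-manager idea mentioned here could look like the sketch below: the stage end is reported automatically when the block exits, even on exceptions, removing the error-prone explicit start/end calls. All names are hypothetical, not the current DataHub API:

```python
import contextlib
import time
from typing import Dict, Iterator


class Report:
    """Sketch of stage reporting as a context manager."""

    def __init__(self) -> None:
        self.ingestion_stage_durations: Dict[str, float] = {}

    @contextlib.contextmanager
    def new_stage(self, stage: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # the end is recorded in finally, so it fires even if the
            # stage raises -- no forgotten report_ingestion_stage_start/end
            self.ingestion_stage_durations[stage] = round(
                time.perf_counter() - start, 2
            )


report = Report()
with report.new_stage("Ingesting Tableau Site"):
    time.sleep(0.01)  # stand-in for site ingestion work
```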


self._fetch_groups()
with PerfTimer() as timer:
    self._fetch_groups()
self.report.fetch_groups_timer[self.site_id] = round(
Collaborator:
might be nicer to have site names in here instead of site ids, assuming they're unique

Labels: ingestion (PR or Issue related to the ingestion of metadata), pending-submitter-response (Issue/request has been reviewed but requires a response from the submitter)
2 participants