
prototype platform connection #727

Merged: 14 commits into devel from d#/platform_connection on Nov 24, 2023
Conversation


@sh-rp sh-rp commented Oct 31, 2023

Adds support for sending traces to the dlthub platform. Implements:

  • Support for multiple tracking modules
  • Sending the full run trace to the platform on a separate thread after the pipeline has run
  • Adding the pipeline name to the pipeline trace
  • Adding schemas to LoadInfo
  • Adding the execution context to the pipeline trace
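The "multiple tracking modules" point can be sketched as follows (hypothetical names, not dlt's actual API): trace hooks fan out to a list of modules, and a failing tracker is suppressed so it can never break the pipeline run.

```python
from types import SimpleNamespace

# Hypothetical sketch: trace events are dispatched to every registered
# tracking module; tracker errors are swallowed so one broken tracker
# cannot fail the pipeline.
TRACKING_MODULES = []

def dispatch_end_trace(trace, pipeline):
    results = []
    for module in TRACKING_MODULES:
        try:
            results.append(module.on_end_trace(trace, pipeline))
        except Exception:
            results.append(None)  # tracker errors are suppressed
    return results

# usage: one well-behaved tracker, one that raises
def _broken_tracker(trace, pipeline):
    raise RuntimeError("tracker down")

TRACKING_MODULES.extend([
    SimpleNamespace(on_end_trace=lambda t, p: "sent"),
    SimpleNamespace(on_end_trace=_broken_tracker),
])
```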

@netlify

netlify bot commented Oct 31, 2023

Deploy Preview for dlt-hub-docs canceled.

Latest commit: d453db3
Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/655f48018c08e70008fc5e4a

@@ -87,6 +87,7 @@ class LoadPackageInfo(NamedTuple):
package_path: str
state: TLoadPackageState
schema_name: str
schema: Schema
sh-rp (author): FYI, I am adding the full schema to the load info here.

Collaborator: IMO it's way better if you add a TypedDict with the schema content, not the object itself. This is being pickled and dumped into the trace, so dicts obviously work better.
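The reviewer's point can be illustrated like this (`TSchemaDict` is a hypothetical shape, not dlt's actual stored-schema type): a plain dict serializes cleanly with json or pickle, unlike a live Schema object.

```python
import json
from typing import Any, Dict, TypedDict

# Hypothetical TypedDict for schema content: plain dicts round-trip
# through json/pickle without any custom encoder.
class TSchemaDict(TypedDict):
    name: str
    version: int
    tables: Dict[str, Any]

schema_dict: TSchemaDict = {
    "name": "events",
    "version": 2,
    "tables": {"users": {"columns": {"id": "bigint"}}},
}

# round-trips through JSON with no extra machinery
restored = json.loads(json.dumps(schema_dict))
```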

pass

def on_end_trace(trace: PipelineTrace, pipeline: SupportsPipeline) -> None:
_send_to_beacon(trace, None, pipeline, None)
sh-rp (author): For now we only send on end trace, but it would be nice to have the full progress available.

@sh-rp

sh-rp commented Oct 31, 2023

One thing that is a bit strange in the trace: there is an additional run step that duplicates the result of the LoadInfo as step info. It's not really a problem, but the run step is just a representation of the full trace, and we are sending too much data around, especially since the LoadInfo will always be the most verbose part.

@rudolfix rudolfix left a comment

Two suggestions :) Also, do you want to merge it or keep a branch for testing?

if pipeline.runtime_config.beacon_token and pipeline.runtime_config.beacon_url:
trace_dump = json.dumps(trace.asdict())
url = f"{pipeline.runtime_config.beacon_url}/pipeline/{pipeline.runtime_config.beacon_token}/traces"
# send the pre-serialized JSON as the request body; json= would re-encode the string
requests.put(url, data=trace_dump, headers={"Content-Type": "application/json"})
Collaborator: Maybe you could reuse our telemetry thread executor to send the messages in a thread, without blocking pipeline execution? The code is already there and I think it could be repurposed.
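The suggestion could be sketched like this (hypothetical helper, not dlt's actual telemetry code): one shared single-worker pool uploads traces off the main thread, and an atexit hook flushes it so queued messages still go out at interpreter exit.

```python
import atexit
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: a shared sending pool that never blocks the
# pipeline and is drained when the process exits.
_SEND_POOL = ThreadPoolExecutor(max_workers=1)
atexit.register(_SEND_POOL.shutdown, wait=True)

def send_trace_async(send_fn, trace_dict):
    payload = json.dumps(trace_dict)            # serialize on the caller's thread
    return _SEND_POOL.submit(send_fn, payload)  # upload on the pool
```

Usage: `send_trace_async(upload, trace.asdict())` returns a Future immediately; the pipeline does not wait for the HTTP call.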

@sh-rp sh-rp force-pushed the d#/platform_connection branch from cffa87d to 022d3f0 Compare November 1, 2023 09:25
@sh-rp sh-rp changed the title D#/platform connection prototype platform connection Nov 1, 2023
@sh-rp

sh-rp commented Nov 1, 2023

> two suggestions :) also do you want to merge it or keep a branch for testing?

For now this is only for testing, I would say.

return trace


def start_trace_step(trace: PipelineTrace, step: TPipelineStep, pipeline: SupportsPipeline) -> PipelineStepTrace:
trace_step = PipelineStepTrace(uniq_id(), step, pendulum.now())
with suppress_and_warn():
TRACKING_MODULE.on_start_trace_step(trace, step, pipeline)
trace.steps.append(trace_step)
sh-rp (author): Is there any downside to attaching the trace step on trace start? It would be nicer to have it like this when sent to the platform.

@sh-rp sh-rp force-pushed the d#/platform_connection branch from 022d3f0 to b65bb63 Compare November 1, 2023 11:43
@sh-rp sh-rp force-pushed the d#/platform_connection branch from b65bb63 to 11c8cc7 Compare November 20, 2023 10:47
@sh-rp sh-rp force-pushed the d#/platform_connection branch from 11c8cc7 to f08b795 Compare November 20, 2023 10:48
change config attribute to platform_dsn
add execution context info to pipeline trace
add pipeline name to pipeline trace
@sh-rp sh-rp marked this pull request as ready for review November 20, 2023 14:02
rudolfix replied:

> one thing that is a bit strange in the trace is, that there is this additional run step which duplicates the result of the loadinfo as stepinfo. […]

The run step is used to correlate several pipeline steps into one transaction, so IMO it makes sense to send some summary information. Yeah, I think we repeat the LoadInfo (or the last step info) in it? Then let's change it to a RunInfo with just basic information. We could also use it to send a full state sync (but not everyone uses the run method, so that's probably a bad idea).
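A minimal sketch of what such a slimmed-down RunInfo could look like (the fields are illustrative, not dlt's actual types): it correlates the steps of one run and carries a summary instead of repeating the full LoadInfo.

```python
from typing import List, Optional, TypedDict

# Illustrative RunInfo: correlates the steps of one run transaction and
# summarizes them, instead of duplicating the verbose LoadInfo.
class TRunInfo(TypedDict):
    transaction_id: str
    steps: List[str]
    failed_step: Optional[str]

def make_run_info(transaction_id, step_traces, failed_step=None):
    return TRunInfo(
        transaction_id=transaction_id,
        steps=[s["step"] for s in step_traces],
        failed_step=failed_step,
    )
```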

@rudolfix rudolfix left a comment

Good! Before we merge:

  1. your thread pool must register in at_exit; best if you could refactor the segment pool so that many pools can be registered there
  2. as usual, some naming suggestions

dlt/common/pipeline.py (resolved)
from typing import Any, Callable, Dict, List, Literal, Optional, Sequence, Set, Type, TypedDict, NewType, Union, get_args


TExecInfoNames = Literal["kubernetes", "docker", "codespaces", "github_actions", "airflow", "notebook", "colab","aws_lambda","gcp_cloud_function"]
Collaborator: Could you format it? Or maybe we should go back to the black formatter, initially in the same mode we have in verified sources?

sh-rp (author): I am also for the black formatter.

name: str
version: str

class TExecutionContext(TypedDict):
Collaborator: Looking good! For open telemetry we can collect way more information that is not anonymous, but in another PR.

@@ -87,6 +88,7 @@ class LoadPackageInfo(NamedTuple):
package_path: str
state: TLoadPackageState
schema_name: str
schema: TStoredSchema
Collaborator: Schema hash. Send a separate pipeline state message with all pipeline schemas.
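The suggestion could look roughly like this (names are illustrative): a content hash per schema travels in the package info, while the full schemas are shipped once in a separate pipeline state message.

```python
import hashlib
import json

# Illustrative sketch: a deterministic content hash of a schema dict,
# plus a separate state message carrying the full schemas.
def schema_content_hash(schema_dict):
    canonical = json.dumps(schema_dict, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

def make_pipeline_state_message(pipeline_name, schemas):
    return {
        "pipeline_name": pipeline_name,
        "schema_hashes": {n: schema_content_hash(s) for n, s in schemas.items()},
        "schemas": schemas,  # full content travels only in this message
    }
```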

from dlt.common import json
from dlt.common.runtime import logger

_THREAD_POOL: ThreadPoolExecutor = None
Collaborator: Is this temporary code? We should reuse the segment pool, or at least add this pool to the at_exit handler, otherwise messages will not go out at the end.

Maybe extract this sending pool to a common module that creates a pool and registers it in the exit handler?

Also, the implementation below does not have a method to stop the pool. Won't that be a problem when testing?

sh-rp (author): I have extracted a ManagedExecutor class. It does not seem that a missing stop method is a problem when testing...
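A sketch along the lines of the extracted class (hypothetical, not dlt's actual ManagedExecutor): the pool registers its own atexit flush, and an explicit, idempotent stop() lets tests shut it down deterministically.

```python
import atexit
from concurrent.futures import ThreadPoolExecutor

# Hypothetical managed pool: flushed at interpreter exit, stoppable
# on demand from tests.
class ManagedThreadPool:
    def __init__(self, max_workers: int = 1) -> None:
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        atexit.register(self.stop)

    def submit(self, fn, *args, **kwargs):
        return self._executor.submit(fn, *args, **kwargs)

    def stop(self) -> None:
        self._executor.shutdown(wait=True)  # idempotent: safe to call again at exit
```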

dlt/pipeline/trace.py (resolved)
@@ -27,6 +27,8 @@ class RunConfiguration(BaseConfiguration):
request_max_retry_delay: float = 300
"""Maximum delay between http request retries"""
config_files_storage_path: str = "/run/config/"
"""Platform connection"""
platform_dsn: Optional[str] = None
Collaborator: Is the workspace cookie part of platform_dsn? Also, this is a secret value, so use TSecretStrValue here. Maybe we could rename it to dlthub_dsn?
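The suggested change could be sketched like this (`TSecretStrValue` below is a stand-in for dlt's secret string type, so the value is treated as a secret; `dlthub_dsn` is the proposed rename of `platform_dsn`):

```python
from dataclasses import dataclass
from typing import NewType, Optional

# Stand-in for dlt's secret string type; values typed this way are
# treated as secrets (e.g. masked in logs).
TSecretStrValue = NewType("TSecretStrValue", str)

@dataclass
class RunConfigurationSketch:
    config_files_storage_path: str = "/run/config/"
    dlthub_dsn: Optional[TSecretStrValue] = None  # proposed rename of platform_dsn
```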

@sh-rp sh-rp force-pushed the d#/platform_connection branch from ecb1e22 to 4b9adae Compare November 21, 2023 16:16
@sh-rp

sh-rp commented Nov 21, 2023

Questions:

  • I am not quite sure I have implemented reporting the current state data to the platform the way you had in mind, let me know.
  • I should probably check the load packages and not sync schemas for failed load packages? I am not 100% sure on that atm.
  • Maybe the whole state/schema syncing should not be in the trace decorator at all. It works the way I implemented it, but somehow it does not feel quite right.

@rudolfix rudolfix left a comment

> I am not quite sure I have implemented reporting the current state data to the platform the way you had in mind, let me know.

LGTM. I think we will modify it a lot when we actually use it for something :) also see my comments.

> I should probably check the load packages and not sync schemas for failed load packages?

Heh, good question, and another nuance:

  1. you should not bump the schema revision in the dataset to which we were loading if the package failed. Soon load info will have information on whether the schema was upgraded in the dataset, even if the package failed (refactor NormalizeInfo and LoadInfo #757)
  2. you should bump the pipeline schema revision to the one that is in state.

The pipeline may be ahead of the dataset in terms of schema revision. It may produce load packages that are loaded somewhere else. So we have revisions materialized in the dataset and the current revision in the pipeline; they may be different.

> Maybe the whole state/schema syncing should not be in the trace decorator at all.

Depends on whether we think it has anything useful for an open telemetry collector. The traces are open-telemetry friendly; the state sync MAY BE something more specific. But for now, LGTM, and we can merge this branch soon.



TExecInfoNames = Literal[
"kubernetes",
Collaborator: Look also at this, maybe we can copy code from there to have even more CI envs: https://www.npmjs.com/package/ci-info

Also, there's a CI env flag which says that code runs in CI, so maybe we should add "generic_ci".
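The "generic_ci" fallback could look like this (illustrative sketch, with only a few of the environments from TExecInfoNames shown): check specific environments first, then the generic `CI` flag that most CI systems set.

```python
import os

# Illustrative detector: specific environments first, then the generic
# CI env var as a fallback, as suggested above.
def detect_exec_env(environ=None):
    env = os.environ if environ is None else environ
    if "KUBERNETES_SERVICE_HOST" in env:
        return "kubernetes"
    if env.get("GITHUB_ACTIONS") == "true":
        return "github_actions"
    if env.get("CI", "").lower() in ("true", "1"):
        return "generic_ci"
    return None
```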

@@ -245,8 +245,8 @@ def run(
)

# plug default tracking module
from dlt.pipeline import trace, track
trace.TRACKING_MODULE = track
from dlt.pipeline import trace, track, platform
Collaborator: OK for platform, but maybe we should just say opentelemetry?

sh-rp (author): At the moment it is not the opentelemetry format at all. I can rename it, but I would say we switch to opentelemetry when the prototype is out and then also rename this file.

dlt/pipeline/pipeline.py (resolved)
@@ -329,7 +333,7 @@ def normalize(self, workers: int = 1, loader_file_format: TLoaderFileFormat = No
except Exception as n_ex:
raise PipelineStepFailed(self, "normalize", n_ex, normalize.get_normalize_info()) from n_ex

@with_runtime_trace
@with_runtime_trace()
Collaborator: I'd send the state here.

@@ -382,7 +386,7 @@ def load(
except Exception as l_ex:
raise PipelineStepFailed(self, "load", l_ex, self._get_load_info(load)) from l_ex

@with_runtime_trace
@with_runtime_trace(send_state=True)
Collaborator: Not here.
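The change from `@with_runtime_trace` to `@with_runtime_trace(send_state=True)` follows the decorator-with-arguments pattern, which can be sketched as below (the actual tracing and state sending are stubbed with a recording list; names other than the decorator are illustrative):

```python
import functools

# Records what the stubbed decorator "sent", for illustration only.
SENT = []

def with_runtime_trace(send_state: bool = False):
    # outer call takes options and returns the real decorator
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            SENT.append(("trace", fn.__name__))
            if send_state:
                # per the review, only the load step should sync state
                SENT.append(("state", fn.__name__))
            return result
        return wrapper
    return decorator

@with_runtime_trace()
def normalize():
    return "normalize-info"

@with_runtime_trace(send_state=True)
def load():
    return "load-info"
```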

if not load_info:
return

payload = TSchemaSyncPayload(
Collaborator: Looks like PipelineStateSync. LGTM for now; we can add more data to it when we need it. We could also add the pipeline state, it will help with debugging. There's a method to retrieve it in the pipeline and methods to serialize it.

sh-rp (author): Renamed it.

# Conflicts:
#	dlt/common/pipeline.py
#	dlt/common/runtime/exec_info.py
#	dlt/common/runtime/segment.py
#	dlt/common/storages/load_storage.py
#	dlt/pipeline/__init__.py
#	dlt/pipeline/pipeline.py
#	dlt/pipeline/trace.py
#	dlt/pipeline/track.py
#	tests/common/configuration/test_configuration.py
#	tests/helpers/streamlit_tests/test_streamlit_show_resources.py
@sh-rp sh-rp force-pushed the d#/platform_connection branch from 5800e72 to 2d280e1 Compare November 23, 2023 09:58
@rudolfix rudolfix left a comment

LGTM!

@sh-rp sh-rp merged commit cfb6e66 into devel Nov 24, 2023
44 checks passed
@AstrakhantsevaAA AstrakhantsevaAA deleted the d#/platform_connection branch November 29, 2023 14:41
@AstrakhantsevaAA AstrakhantsevaAA restored the d#/platform_connection branch November 29, 2023 14:41
@rudolfix rudolfix deleted the d#/platform_connection branch December 6, 2023 15:58