prototype platform connection #727

sh-rp · 2023-10-31T09:33:36Z

Adds support for sending traces to the dlthub platform. Implements:

Support for multiple tracking modules
Sending full run trace after pipeline was run to platform on a separate thread
Adds pipeline name to pipeline trace
Adds Schemas to LoadInfo
Adds execution context to pipeline trace

netlify · 2023-10-31T09:33:40Z

✅ Deploy Preview for dlt-hub-docs canceled.

Name	Link
🔨 Latest commit	`d453db3`
🔍 Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/655f48018c08e70008fc5e4a

sh-rp · 2023-10-31T09:34:01Z

dlt/common/storages/load_storage.py

@@ -87,6 +87,7 @@ class LoadPackageInfo(NamedTuple):
    package_path: str
    state: TLoadPackageState
    schema_name: str
+    schema: Schema


fyi i am adding the full schema to the load info here

IMO way better if you add TypedDict with schema content, not the object itself. this is being pickled and dumped into trace so obviously dicts work better

sh-rp · 2023-10-31T09:34:27Z

dlt/pipeline/platform.py

+    pass
+
+def on_end_trace(trace: PipelineTrace, pipeline: SupportsPipeline) -> None:
+    _send_to_beacon(trace, None, pipeline, None)


for now only send on end trace, but it would be nice to have the full progress available

sh-rp · 2023-10-31T09:36:17Z

one thing that is a bit strange in the trace is, that there is this additional run step which duplicates the result of the loadinfo as stepinfo. It's not a problem really, but for one the run step is just a representation of the full trace really, and we are sending too much data around, especially given the fact that the loadinfo will always be the most verbose.

rudolfix

two suggestions :) also do you want to merge it or keep a branch for testing?

rudolfix · 2023-10-31T16:21:41Z

dlt/common/storages/load_storage.py

@@ -87,6 +87,7 @@ class LoadPackageInfo(NamedTuple):
    package_path: str
    state: TLoadPackageState
    schema_name: str
+    schema: Schema


IMO way better if you add TypedDict with schema content, not the object itself. this is being pickled and dumped into trace so obviously dicts work better

rudolfix · 2023-10-31T16:55:32Z

dlt/pipeline/platform.py

+    if pipeline.runtime_config.beacon_token and pipeline.runtime_config.beacon_url:
+        trace_dump = json.dumps(trace.asdict())
+        url = f"{pipeline.runtime_config.beacon_url}/pipeline/{pipeline.runtime_config.beacon_token}/traces"
+        requests.put(url, json=trace_dump)


maybe you could reuse our telemetry thread executor to send the messages in the thread, without blocking the pipeline execution? the code is already there and I think could be repurposed.

sh-rp · 2023-11-01T10:16:36Z

two suggestions :) also do you want to merge it or keep a branch for testing?

For now this is only for testing i would say.

dlt/common/pipeline.py

sh-rp · 2023-11-01T10:43:58Z

dlt/pipeline/trace.py

    return trace


 def start_trace_step(trace: PipelineTrace, step: TPipelineStep, pipeline: SupportsPipeline) -> PipelineStepTrace:
    trace_step = PipelineStepTrace(uniq_id(), step, pendulum.now())
-    with suppress_and_warn():
-        TRACKING_MODULE.on_start_trace_step(trace, step, pipeline)
+    trace.steps.append(trace_step)


is there any downside to attaching the trace step on trace start? it would be nicer to have it like this when sent to the platform

change config attribute to platform_dsn add exeuction context info to pipeline trace add pipeline name to pipeline trace

rudolfix · 2023-11-21T10:24:05Z

one thing that is a bit strange in the trace is, that there is this additional run step which duplicates the result of the loadinfo as stepinfo. It's not a problem really, but for one the run step is just a representation of the full trace really, and we are sending too much data around, especially given the fact that the loadinfo will always be the most verbose.

run step is used to correlate several pipeline steps into one transaction. it makes sense IMO to send some summary information. yeah I think we repeat the LoadInfo (or the last step info) in it? then let's change it to RunInfo with just basic information. we could also use it to send full state sync (but not everyone uses run method so probably bad idea

rudolfix

good! before we merge:

your thread pool must register in at_exit best if you could refactor segment pool to have many pools registered there
as usual some naming suggestions

dlt/common/pipeline.py

rudolfix · 2023-11-21T10:37:44Z

dlt/common/runtime/typing.py

+from typing import Any, Callable, Dict, List, Literal, Optional, Sequence, Set, Type, TypedDict, NewType, Union, get_args
+
+
+TExecInfoNames = Literal["kubernetes", "docker", "codespaces", "github_actions", "airflow", "notebook", "colab","aws_lambda","gcp_cloud_function"]


could you format it? or maybe we should go back to blake formatter initially in the same mode we have in verified sources?

i am also for black formatter..

rudolfix · 2023-11-21T10:38:14Z

dlt/common/runtime/typing.py

+    name: str
+    version: str
+
+class TExecutionContext(TypedDict):


Looking good! for open telemetry we can collect way more information that is not anonymous. but in other PR

rudolfix · 2023-11-21T10:38:54Z

dlt/common/storages/load_storage.py

@@ -87,6 +88,7 @@ class LoadPackageInfo(NamedTuple):
    package_path: str
    state: TLoadPackageState
    schema_name: str
+    schema: TStoredSchema


schema hash. send a separate pipeine state message with all pipeline schemas

rudolfix · 2023-11-21T10:42:09Z

dlt/pipeline/platform.py

+from dlt.common import json
+from dlt.common.runtime import logger
+
+_THREAD_POOL: ThreadPoolExecutor = None


is this temporary code? we should reuse the segment pool or at least somehow add this pool to at_exit handler otherwise messages will not go out at the end.

maybe extract this sending pool to common module that creates a pool and registers it in exit handler?

also implementation below does not have a method to stop the pool. won't that be a problem when testing?

I have extracted a managedexecutor class. it does not seem a missing stop method is a problem when testing...

dlt/pipeline/trace.py

rudolfix · 2023-11-21T10:46:43Z

dlt/common/configuration/specs/run_configuration.py

@@ -27,6 +27,8 @@ class RunConfiguration(BaseConfiguration):
    request_max_retry_delay: float = 300
    """Maximum delay between http request retries"""
    config_files_storage_path: str = "/run/config/"
+    """Platform connection"""
+    platform_dsn: Optional[str] = None


is workspace cookie part of platform_dsn? also this is a secret value so use TSecretStrValue here. maybe we could rename it to dlthub_dsn?

# Conflicts: # dlt/pipeline/pipeline.py

sh-rp · 2023-11-21T16:22:04Z

Questions:

I am not quite sure i have implemented reporting the current state data to the platform the way you had in mind, let me know.
I should probably check the load packages and not sync schemas for failed loadpackages? I am not 100 % sure on that atm.
Maybe the whole state/schema synching should not be in the trace decorator at all. It should work the way I implemented it, but somehow it does not feel quite right.

rudolfix

I am not quite sure i have implemented reporting the current state data to the platform the way you had in mind, let me know.

LGTM. I think we will modify it a lot when we actually use it for something :) also see my comments.

I should probably check the load packages and not sync schemas for failed loadpackages? I am not 100 % sure on that atm.
heh good question and another nuance

you should not bump schema revision in the dataset to which we were loading if the package failed. soon load info will have an information if schema was upgraded in dataset - even if the package failed (refactor NormalizeInfo and LoadInfo #757 )
you should bump pipeline schema revision to the one that is in state.

pipeline may be ahead of the dataset in terms of schema revision. it may produce load packages that are loaded somewhere else. so we have revisions materialized in dataset and current revision in the pipeline. they may be different

Maybe the whole state/schema synching should not be in the trace decorator at all. It should work the way I implemented it, but somehow it does not feel quite right.
depends if we think it has anything useful for open telemetry collector.

the traces are open telemetry friendly. the state sync MAY BE something more spcific. but for now - LGTM

and we can merge that branch soon

rudolfix · 2023-11-21T18:32:55Z

dlt/common/runtime/typing.py

+
+
+TExecInfoNames = Literal[
+    "kubernetes",


look also at this. maybe we can copy code form there to have even more CI envs?
https://www.npmjs.com/package/ci-info

also there's CI env flag which says that code runs in CI so maybe we should add "generic_ci"

rudolfix · 2023-11-21T18:34:18Z

dlt/pipeline/__init__.py

@@ -245,8 +245,8 @@ def run(
    )

 # plug default tracking module
-from dlt.pipeline import trace, track
-trace.TRACKING_MODULE = track
+from dlt.pipeline import trace, track, platform


ok for platform but maybe we should just say opentelemetry?

at the moment it is not opentelemetry format at all. i can rename it, but i would just say we switch to opentelemetry when the prototype is out and then also rename this file.

dlt/pipeline/pipeline.py

rudolfix · 2023-11-21T18:37:02Z

dlt/pipeline/pipeline.py

@@ -329,7 +333,7 @@ def normalize(self, workers: int = 1, loader_file_format: TLoaderFileFormat = No
            except Exception as n_ex:
                raise PipelineStepFailed(self, "normalize", n_ex, normalize.get_normalize_info()) from n_ex

-    @with_runtime_trace
+    @with_runtime_trace()


I'd send the state here

rudolfix · 2023-11-21T18:37:11Z

dlt/pipeline/pipeline.py

@@ -382,7 +386,7 @@ def load(
        except Exception as l_ex:
            raise PipelineStepFailed(self, "load", l_ex, self._get_load_info(load)) from l_ex

-    @with_runtime_trace
+    @with_runtime_trace(send_state=True)


rudolfix · 2023-11-21T18:45:05Z

dlt/pipeline/platform.py

+    if not load_info:
+            return
+
+    payload = TSchemaSyncPayload(


looks like PipelineStateSync. LGTM for now. we can add more data to it when we need it. we could also add the pipeline state. it will help with debugging. there's a method to retrieve it in the pipeline and methods to serialize it

# Conflicts: # dlt/common/pipeline.py # dlt/common/runtime/exec_info.py # dlt/common/runtime/segment.py # dlt/common/storages/load_storage.py # dlt/pipeline/__init__.py # dlt/pipeline/pipeline.py # dlt/pipeline/trace.py # dlt/pipeline/track.py # tests/common/configuration/test_configuration.py # tests/helpers/streamlit_tests/test_streamlit_show_resources.py

rudolfix

LGTM!

sh-rp commented Oct 31, 2023

View reviewed changes

rudolfix requested changes Oct 31, 2023

View reviewed changes

sh-rp force-pushed the d#/platform_connection branch from cffa87d to 022d3f0 Compare November 1, 2023 09:25

sh-rp changed the title ~~D#/platform connection~~ prototype platform connection Nov 1, 2023

sh-rp commented Nov 1, 2023

View reviewed changes

dlt/common/pipeline.py Show resolved Hide resolved

sh-rp commented Nov 1, 2023

View reviewed changes

sh-rp force-pushed the d#/platform_connection branch from 022d3f0 to b65bb63 Compare November 1, 2023 11:43

rudolfix added the sprint label Nov 17, 2023

sh-rp force-pushed the d#/platform_connection branch from b65bb63 to 11c8cc7 Compare November 20, 2023 10:47

sh-rp added 4 commits November 20, 2023 11:48

start implementation of platform trace support

1d4995d

add beacon integration

2a6bcb1

small changes

fea80fa

fix tests

f08b795

sh-rp force-pushed the d#/platform_connection branch from 11c8cc7 to f08b795 Compare November 20, 2023 10:48

sh-rp added 4 commits November 20, 2023 14:01

add sending on threads to platform connection

dc81f05

change config attribute to platform_dsn add exeuction context info to pipeline trace add pipeline name to pipeline trace

fix config test

d310949

revert to adding tracestep on end tracestep

5758e05

add test for platform connection

03fcc21

sh-rp marked this pull request as ready for review November 20, 2023 14:02

rudolfix requested changes Nov 21, 2023

View reviewed changes

pr fixes

4b9adae

sh-rp force-pushed the d#/platform_connection branch from ecb1e22 to 4b9adae Compare November 21, 2023 16:16

sh-rp added 2 commits November 21, 2023 17:20

Merge branch 'devel' into d#/platform_connection

5484e22

# Conflicts: # dlt/pipeline/pipeline.py

fix tests

4194284

fix linting

1770d0d

rudolfix requested changes Nov 21, 2023

View reviewed changes

sh-rp force-pushed the d#/platform_connection branch from 5800e72 to 2d280e1 Compare November 23, 2023 09:58

small pr changes

d453db3

rudolfix approved these changes Nov 23, 2023

View reviewed changes

sh-rp merged commit cfb6e66 into devel Nov 24, 2023
43 checks passed

AstrakhantsevaAA deleted the d#/platform_connection branch November 29, 2023 14:41

AstrakhantsevaAA restored the d#/platform_connection branch November 29, 2023 14:41

rudolfix deleted the d#/platform_connection branch December 6, 2023 15:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

prototype platform connection #727

prototype platform connection #727

sh-rp commented Oct 31, 2023 •

edited

Loading

netlify bot commented Oct 31, 2023 •

edited

Loading

sh-rp Oct 31, 2023

rudolfix Oct 31, 2023

sh-rp Oct 31, 2023

sh-rp commented Oct 31, 2023

rudolfix left a comment

rudolfix Oct 31, 2023

rudolfix Oct 31, 2023

sh-rp commented Nov 1, 2023

sh-rp Nov 1, 2023

rudolfix commented Nov 21, 2023

rudolfix left a comment

rudolfix Nov 21, 2023

sh-rp Nov 21, 2023

rudolfix Nov 21, 2023

rudolfix Nov 21, 2023

rudolfix Nov 21, 2023

sh-rp Nov 21, 2023

rudolfix Nov 21, 2023

sh-rp commented Nov 21, 2023 •

edited

Loading

rudolfix left a comment

rudolfix Nov 21, 2023

rudolfix Nov 21, 2023

sh-rp Nov 23, 2023

rudolfix Nov 21, 2023

rudolfix Nov 21, 2023

rudolfix Nov 21, 2023

sh-rp Nov 23, 2023

rudolfix left a comment

		from typing import Any, Callable, Dict, List, Literal, Optional, Sequence, Set, Type, TypedDict, NewType, Union, get_args


		TExecInfoNames = Literal["kubernetes", "docker", "codespaces", "github_actions", "airflow", "notebook", "colab","aws_lambda","gcp_cloud_function"]

prototype platform connection #727

prototype platform connection #727

Conversation

sh-rp commented Oct 31, 2023 • edited Loading

netlify bot commented Oct 31, 2023 • edited Loading

✅ Deploy Preview for dlt-hub-docs canceled.

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sh-rp commented Oct 31, 2023

rudolfix left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sh-rp commented Nov 1, 2023

Choose a reason for hiding this comment

rudolfix commented Nov 21, 2023

rudolfix left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sh-rp commented Nov 21, 2023 • edited Loading

rudolfix left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rudolfix left a comment

Choose a reason for hiding this comment

sh-rp commented Oct 31, 2023 •

edited

Loading

netlify bot commented Oct 31, 2023 •

edited

Loading

sh-rp commented Nov 21, 2023 •

edited

Loading