Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend layout placeholder params for filesystem destinations #1149

Closed
wants to merge 44 commits into from

Conversation

sultaniman
Copy link
Contributor

@sultaniman sultaniman commented Mar 26, 2024

This PR addresses #930

Requirements:

  • Make sure you can place something after the extension,
  • Should be able to pass custom placeholders dictionary to fs and merge w/ existing placeholders,
    • if placeholder is not known then show a warning and skip,
    • add validation for placeholders if they are defined with user friendly messages,
    • support custom current date formatting placeholders (interpolation options on top of pendulum or ISO standard)
  • Placeholders in dictionary can be values or callbacks,
  • Make configurable via config toml,
  • We would like to have overridable current date to filesystem arguments

@sh-rp recommends to test with staging destination
you’ll have to test it, write a test where you run the destination settings that use staging and add a map function for the filename

TODO

  • Tests,
  • Docs update

@sultaniman sultaniman added enhancement New feature or request destination Issue related to new destinations labels Mar 26, 2024
@sultaniman sultaniman self-assigned this Mar 26, 2024
Copy link

netlify bot commented Mar 26, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit a0cad82
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/660d6395d2b8d80009b75363

@sultaniman sultaniman force-pushed the issue-930 branch 2 times, most recently from e8b0c33 to be088ee Compare April 2, 2024 12:15
Comment on lines 61 to 67
destinatination_kwargs: t.Dict[str, t.Any] = {}
if current_datetime:
destinatination_kwargs["current_datetime"] = current_datetime

if datetime_format:
destinatination_kwargs["datetime_format"] = datetime_format

if extra_params:
destinatination_kwargs["extra_params"] = extra_params
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am passing these three arguments via kwargs so here on resolve I reconcile and can override the values from config.toml.
It doesn't feel right however is it fine @sh-rp @rudolfix ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK factory is tricky and its main goal is to delay instantiation, population of the configs until runtime (ie. load step) while the UX is like you can create destination at any moment.

look at base init

def __init__(self, **kwargs: Any) -> None:
        # Create initial unresolved destination config
        # Argument defaults are filtered out here because we only want arguments passed explicitly
        # to supersede config from the environment or pipeline args
        sig = inspect.signature(self.__class__.__init__)
        params = sig.parameters
        self.config_params = {
            k: v for k, v in kwargs.items() if k not in params or v != params[k].default
        }

it will store all unknown paramters in config_params
and then here

def configuration(self, initial_config: TDestinationConfig) -> TDestinationConfig:
        """Get a fully resolved destination config from the initial config"""
        config = resolve_configuration(
            initial_config,
            sections=(known_sections.DESTINATION, self.destination_name),
            # Already populated values will supersede resolved env config
            explicit_value=self.config_params,
        )
        return config

whenever configuration is created, we use those params as explicit initials

outcome: you do not need any special code to handle that. just pass additional stuff to super() init and it will be pushed to config during resolution

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@sultaniman sultaniman changed the title WIP: Add current_datetime, datetime_format and layout_params to filesystem WIP: Extend layout placeholder params for filesystem destinations Apr 2, 2024
@sultaniman sultaniman marked this pull request as ready for review April 3, 2024 10:04
@sultaniman sultaniman requested a review from Pipboyguy April 3, 2024 10:04
@sultaniman sultaniman changed the title WIP: Extend layout placeholder params for filesystem destinations Extend layout placeholder params for filesystem destinations Apr 3, 2024
dlt/common/storages/configuration.py Show resolved Hide resolved
Comment on lines 61 to 67
destinatination_kwargs: t.Dict[str, t.Any] = {}
if current_datetime:
destinatination_kwargs["current_datetime"] = current_datetime

if datetime_format:
destinatination_kwargs["datetime_format"] = datetime_format

if extra_params:
destinatination_kwargs["extra_params"] = extra_params
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK factory is tricky and its main goal is to delay instantiation, population of the configs until runtime (ie. load step) while the UX is like you can create destination at any moment.

look at base init

def __init__(self, **kwargs: Any) -> None:
        # Create initial unresolved destination config
        # Argument defaults are filtered out here because we only want arguments passed explicitly
        # to supersede config from the environment or pipeline args
        sig = inspect.signature(self.__class__.__init__)
        params = sig.parameters
        self.config_params = {
            k: v for k, v in kwargs.items() if k not in params or v != params[k].default
        }

it will store all unknown paramters in config_params
and then here

def configuration(self, initial_config: TDestinationConfig) -> TDestinationConfig:
        """Get a fully resolved destination config from the initial config"""
        config = resolve_configuration(
            initial_config,
            sections=(known_sections.DESTINATION, self.destination_name),
            # Already populated values will supersede resolved env config
            explicit_value=self.config_params,
        )
        return config

whenever configuration is created, we use those params as explicit initials

outcome: you do not need any special code to handle that. just pass additional stuff to super() init and it will be pushed to config during resolution

dlt/destinations/impl/filesystem/factory.py Outdated Show resolved Hide resolved
self.PROTOCOL_CREDENTIALS.get(self.protocol) or Optional[CredentialsConfiguration] # type: ignore[return-value]
)

def on_resolved(self) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you do not need this method at all. config will be populated automatically

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK you need it to validate the layout. at this moment you know all the config values. some come from providers some from factory,

so here you need to validate the layout. see if all placeholders are known. if not we need a really good explanation which are missing etc.

self.destination_file_name = make_filename(config, file_name, schema_name, load_id)

# Make sure directory exists before moving file
dir_path = self.make_remote_path(only_dir=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't do that. fsspec does that in put_file. this code is not needed. what was your reason to add it?
OFC this smells like race conditions that's why we try to "create" dirs in initialize. btw. buckets do not have dirs really

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rudolfix the strange thing was that ffspec didn't create directory that's why I added it, I will remove this and see if test pipeline will work out.



def get_table_prefix_layout(
config: FilesystemDestinationClientConfiguration,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hopefully this will merge into one nice path_utils or layout.py :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes layout.py is temporary thing to keep things clean and move things to path_utils before merging.

) -> None:
self.params = params
self.allowed_placeholders = allowed_placeholders.copy()
self.layout_placeholders = re.findall(r"\{(.*?)\}", path_layout)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's wrong with using path_utils method?

if self.config.extra_params:
for key, value in self.config.extra_params.items():
if callable(value):
self._params[key] = value(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here you must catch TypeError which indicates a wrong signature and explain what the correct signature is. otherwise people will be confused

ideally you could inspect all the signatures when validating placeholders

"{table_name}/{year}/{month}/{day}/{load_id}.{file_id}.{ext}",
"{table_name}/{day}/{hour}/{minute}/{load_id}.{file_id}.{ext}",
"{table_name}/{timestamp}/{load_id}.{file_id}.{ext}",
"{table_name}/{dow}/{load_id}.{file_id}.{ext}",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also add
{YY}/{MM}/{DD}/{table_name}/... and see where it fails

super().__init__(schema, config)
self.fs_client, self.fs_path = fsspec_from_config(config)
self.config: FilesystemDestinationClientConfiguration = config
# verify files layout. we need {table_name} and only allow {schema_name} before it, otherwise tables
# cannot be replaced and we cannot initialize folders consistently
self.table_prefix_layout = path_utils.get_table_prefix_layout(config.layout)
self.table_prefix_layout = get_table_prefix_layout(config)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll comment on that in next review. this is a bigger chunk. we want to support all paths in append-only and require old layout only when we replace data (we can't guarantee replacement otherwise)

@sultaniman
Copy link
Contributor Author

Closing this in favor of #1182

@sultaniman sultaniman closed this Apr 4, 2024
@sultaniman sultaniman deleted the issue-930 branch April 4, 2024 07:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
destination Issue related to new destinations enhancement New feature or request
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants