Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update PUDL to use Pydantic 2 #2842

Closed
20 of 21 tasks
Tracked by #2384
zaneselvans opened this issue Sep 10, 2023 · 6 comments · Fixed by #3051
Closed
20 of 21 tasks
Tracked by #2384

Update PUDL to use Pydantic 2 #2842

zaneselvans opened this issue Sep 10, 2023 · 6 comments · Fixed by #3051
Assignees
Labels
dependencies Pull requests that update a dependency file

Comments

@zaneselvans
Copy link
Member

zaneselvans commented Sep 10, 2023

We can't switch over to using Pydantic 2 until Dagster supports it but I wanted a place to collect tasks for once it's supported.

Tasks

Issues to Create

@zaneselvans zaneselvans added the dependencies Pull requests that update a dependency file label Sep 10, 2023
@zaneselvans zaneselvans self-assigned this Nov 16, 2023
@zaneselvans zaneselvans moved this from New to In progress in Catalyst Megaproject Nov 16, 2023
@zaneselvans zaneselvans linked a pull request Nov 16, 2023 that will close this issue
8 tasks
@zaneselvans
Copy link
Member Author

I'm currently getting an error in the conversion of our settings classes into Dagster configs, I think because of the taxonomy URL being part of our settings. But that element is getting a type of Any in the conversion right now, so not sure why Dagster is still finding it to be a URL:

TypeError: Object of type Url is not JSON serializable

I was trying to look at converting our settings classes over to being Dagster Configs directly, but it seems like there are only two high-level Dagster Configs that we're constructing, one for all of the FERC to SQLite conversions, and one of all the rest of the datasets. Is that what we want? Would that mean we need to consolidate all of our existing ETL settings together into a single class?

@zaneselvans
Copy link
Member Author

It seems like we just have to make all of the AnyHttpUrl attributes into strings to be compatible with the Dagster configs.

@bendnorman
Copy link
Member

bendnorman commented Nov 20, 2023

The current set up I created for sure feels awkward / not usable with Pydantic 2.0 and the new dagster Config based on pydantic. I made a DatasetsSettings object as a Resource so any asset can access it. I also wrote the create_dagster_config() function to convert the DatasetSettings class into a Legacy dagster run configuration so we could configure the dataset_settings resource at run time. While a bit janky, it felt like the only way to translate our existing ETL configuration into dagster world.

I think it makes sense to convert our pydnatic settings classes directly into dagster config now that dagster configs can be pydantic classes. No need to have legacy configuration passed to a pydantic resource to do validation dagster's legacy config system didn't support.

I think we can do this using nested dagster configurations. Something like this:

from dagster import asset, Config, RunConfig


class EpaCemsSettings(Config):
    years: list[int]
    states: list[str]
    

class DatasetsSettings(Config):
    eia: EiaSettings
    epacems: EpaCemsSettings
    ...


@asset
def extract_eia860(config: DatasetsSettings):
      years = config.eia.eia860.years
      ...

I wonder if it still make sense to have all of our dataset configs to live one DatasetsSettings class so that assets can access all dataset configurations. I think we did this originally because there was some existing functionality that expected all of the data source config. I'll take a look.

@bendnorman
Copy link
Member

bendnorman commented Nov 20, 2023

Ah but if we don't keep a configurable Resource around the syntax for configuring the assets gets weird:

assets:
   extract_eia860:
       config:
           years: [2020,2021]
   glue:
       config:
           years: [2020,2021]
   ...
   # every asset that depends on dataset configuration

@bendnorman
Copy link
Member

Ok I finally got through all of the new ConfigurableResource and new Config documentation. I also went through all of our assets that depend on the datasets_settings resource. Most of the assets depend on config specific to the data source it's processing. For example extract_ferc714 uses dataset_settings.ferc714 to figure out which years to process. However, there is configuration information that needs to be shared among multiple assets. For example, dataset_settings.eia.eia860.eia860m is referenced in multiple assets so it doesn't make sense to have asset-specific configuration.

I think we should keep all of our dataset configurations as a single resource. It seems like we can create nested ConfigurableResources. I was able to create a toy example of what this might look like in a dagster question I posted that I quickly answered myself 🤦

I need to play around with this a bit more but I must eat dinner now :)

@zaneselvans
Copy link
Member Author

Okay so... a Config is a per-asset setting, while a Resource or ConfigurableResource is globally accessible through the context? And because we have a few cases in which there's ETL setting information that pertains to one dataset, but that needs to be accessed by multiple assets, we need to use a ConfigurableResource?

Do we actually want these to be configurable at run time in Launchpad? Or should they always be read from the settings YAML files?

I've found that when I touch configurations in launchpad, the changes stick around and get messed up, with extra copies of individual years and other weirdness. Don't we pretty much always want to run the DAG based on either the fast or the full ETL settings? Having a GUI that each of us could have twiddled and thus end up running totally different configurations that aren't linked back to what's in the repo seems like a recipe for confusion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants