Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply naming convention to output assets #2788

Closed
9 tasks done
Tracked by #2765
bendnorman opened this issue Aug 14, 2023 · 15 comments · Fixed by #2818
Closed
9 tasks done
Tracked by #2765

Apply naming convention to output assets #2788

bendnorman opened this issue Aug 14, 2023 · 15 comments · Fixed by #2818
Assignees

Comments

@bendnorman
Copy link
Member

bendnorman commented Aug 14, 2023

Apply naming convention to output layer assets.

Tasks

Questions

  • How many assets in the output layer are not accessed via PudlTabl?
  • Which tables are available via PudlTabl but should probably be hidden from users going forward? Do we want to rename these tables using a preceding underscore and tell users they will be deprecated with PudlTabl?
@bendnorman
Copy link
Member Author

Hey @katie-lamb! I'm trying to decide which output tables should be exposed to users and which ones should be intermediate assets that don't get persisted to the database. Do you think all assets in the allocate_gen_fuel and mcoe groups should be persisted in the database? Are the assets in these groups working towards a single asset that users should interact with? For example, it looks like the assets in the mcoe group all flow towards two assets, mcoe_generators_monthly and mcoe_generators_yearly:

Image

Rows 353-369 list all of the allocate_gen_fuel and mcoe assets.

@katie-lamb
Copy link
Member

katie-lamb commented Aug 16, 2023

@bendnorman Yep, the idea behind the mcoe_generators_monthly and mcoe_generators_yearly tables is that they're the data mart counterparts of the mcoe_monthly and mcoe_yearly tables and have all the generator attributes merged onto them. We made that decision because Christina uses the data mart version of the MCOE, but the skinny version is much cleaner to read. I think I'd make a few changes to the MCOE process in an ideal world:

  • Make an intermediate asset that is upstream of MCOE and mcoe_generators that is not persisted to the DB
  • Clean up the skinny MCOE by dropping null MCOE values and intermediate columns (heat rate, fuel costs, etc)
  • Persist the MCOE and MCOE generators to the DB (is there a better way?)

As for the generation and fuel allocations - I didn't have much hand in this but i think @cmgosnell @zaneselvans wanted these outputs in the DB. Some of them take a while to generate.

The heat rate, fuel cost, and capacity factor tables do feel pretty intermediate. It was also expressed that the heat rate by generator table is a little misleading because heat rates don't really make sense at a generator level. I'm not sure if there were strong feelings about leaving this tables in the DB.

@bendnorman
Copy link
Member Author

Got it, thanks! So it sounds like everything upstream of mcoe_generators_monthly, mcoe_generators_yearly, mcoe_monthly, mcoe_yearly in the mcoe group don't need to be persisted to the database?

@zaneselvans and @cmgosnell which allocate_gen_fuel assets do think you think need to be in the db?

@cmgosnell
Copy link
Member

cmgosnell commented Aug 16, 2023

I'm personally pretty inclined to put all of the generator level heat rate, fuel cost, and capacity factor data into the mcoe_{freq} tables & not save any of the upstream tables to the db. This would mean that there are some nulls for some of the data fields but I don't think that is particularly untidy structurally because it is still well normalized - it'll just have more nulls.

The only table in this upstream mcoe stuff that is structurally different are the heat_rate_by_unit_{freq} tables so i think there is cause to db-ify those as well.

my 🪙🪙:

  • db heat_rate_by_unit_{freq} (mayybe)
  • db mcoe_{freq}
  • db mcoe_generators_{freq}

also light clarification on the heat rate by generators: the asset there is based on the unit-level heat rate which is fine! calculating heat rates from generator-level data without aggregating to units first is where things get weird. For combined cycle units in particular bc the gas turbine side is generally the side that burns the vast majority of the fuel where the steam side just uses the excess steam from the latter.

@bendnorman
Copy link
Member Author

Got it. Would we need to change the logic of these assets or can we just persist the three assets you listed and pickle everything else?

@cmgosnell
Copy link
Member

I don't think any logic would need changing... but @katie-lamb you were just in there recently and would know best.

@katie-lamb
Copy link
Member

katie-lamb commented Aug 16, 2023

Yep, I don't think any logic would need to be changed off the top of my head.

also light clarification on the heat rate by generators: the asset there is based on the unit-level heat rate which is fine! calculating heat rates from generator-level data without aggregating to units first is where things get weird.

Right right, I forgot that heat rate by generator is downstream of heat rate by unit and makes sense again. So if you did want to persist heat rate by unit then go for it (because heat rate by generator level information is in the MCOE table). Alternatively, you could put heat rate by unit as a column in the MCOE table and not persist the heat by unit table.

@bendnorman
Copy link
Member Author

bendnorman commented Aug 16, 2023

Alternatively, you could put heat rate by unit as a column in the MCOE table and not persist the heat by unit table.

I'm on a table pruning quest so I feel intrigued by this idea :) I'm not super familiar with units and heat rates so I'm not sure if a generator-level MCOE table would be a logical place to explore unit heat rates. What do y'all think?

Also, which assets in the allocate_gen_fuel group should be user facing?

@katie-lamb
Copy link
Member

katie-lamb commented Aug 16, 2023

I'm on a table pruning quest so I feel intrigued by this idea :) I'm not super familiar with units and heat rates so I'm not sure if a generator-level MCOE table would be a logical place to explore unit heat rates. What do y'all think?

Ya I agree, it's maybe not so logical. I'm not sure how important it is for unit level heat rates to be accessible.

Also, which assets in the allocate_gen_fuel group should be user facing?

I'm not entirely sure which ones are useful as outputs. My intuition is generation_fuel_by_generator_{agg}_eia923. The other ones feed into the MCOE and/or PPE process I believe.

@bendnorman bendnorman linked a pull request Aug 16, 2023 that will close this issue
8 tasks
@zaneselvans
Copy link
Member

zaneselvans commented Aug 18, 2023

heat_rate_by_gen is just heat_rate_by_unit merged onto the generators using their associated unit_id_pudl so if we have an output table that includes the plant, unit, and generator IDs, then just keeping a column that indicates the heat rate by unit seems like a great idea -- it'll be more obvious on what basis the heat rate actually makes physical sense, but it'll still be in a table associated with the generator IDs, if that's the entity someone needs to work with.

I think keeping all of the "intermediate" values that we calculate on the way to cost per MWh in the output table is important -- capacity factor and heat rate are really useful basic generator metrics to track over time and are used in a lot of modeling to describe the energy system (this is the primary value that both of the Gregs depend on from us). I think we should keep any record that has any non-null derived values, and make sure we null out any dependent values that relied upon out-of-bounds values upstream (maybe we're already doing this)

Also also... I think we should probably stop referring to this table MCOE. The only cost it includes is fuel, and mostly it's a compilation of useful unit or generator level derived values. If/when we bring non-fuel variable O&M, fixed O&M, and ongoing CapEx from FERC 1 into a table alongside the fuel costs, we'll have a real MCOE table -- that was where we thought we were headed when we started with this table. It's just taken looooooooonger than expected to get there.

@bendnorman bendnorman moved this from Backlog to In progress in Catalyst Megaproject Aug 28, 2023
@bendnorman bendnorman self-assigned this Aug 28, 2023
@bendnorman
Copy link
Member Author

Thanks for all of the input y'all! Here are a few remaining questions I have:

  • Should we rename the mcoe table given it’s missing some attributes (non-fuel variable O&M, fixed O&M, and ongoing CapEx)? Should we just keep it MCOE for now, assuming we add these attributes at some point?
  • Just to confirm, should we only persist the following assets in the MCOE group to the db?
    • heat_rate_by_unit_{freq}: Would adding a heat rate by unit column to the two following tables be an alternative?
    • db mcoe_{freq}
    • db mcoe_generators_{freq}
  • What should we do with assets in the allocate_gen_fuel group? It seems like they are all stepping stones to the final mcoe tables. However, @cmgosnell has mentioned that @arengel and @grgmiller might need access to some of these assets. In case you don't have dagster running, these are the assets in the allocate_gen_fuel group:
    • generation_fuel_by_generator_energy_source_monthly_eia923
    • generation_fuel_by_generator_energy_source_owner_yearly_eia923
    • generation_fuel_by_generator_energy_source_yearly_eia923
    • generation_fuel_by_generator_monthly_eia923
    • generation_fuel_by_generator_yearly_eia923

@arengel
Copy link
Collaborator

arengel commented Sep 11, 2023

Thanks Ben, we actually use all of those tables but I am guessing that if you just provided generation_fuel_by_generator_energy_source_monthly_eia923 and generation_fuel_by_generator_energy_source_owner_yearly_eia923, the remaining could be derived from those. If a simple SQL query can't faithfully reproduce the others then we'd want them all.

@zaneselvans
Copy link
Member

@bendnorman I'm pretty sure that @grgmiller uses some subset of those generation_fuel_by_generator tables as well. @arengel if you want to look at the asset definitions that are used to build these tables that should give a decent idea of whether a simple SQL query can reproduce them... but if you're using that information already, and we're producing it already, and it's important that the data being used definitely reflect the data as it existed in these intermediate states for internal consistency, it seems like we should probably keep them.

We've been assuming that we would add those other cost components to the MCOE table for like 5 years, and I think that there are really two distinct sets of information that's being compiled which probably each deserve their own tables. What we have right now is derived generator or unit level characteristics that ultimately impact the attribution of fuel costs, which an independent set of interesting variables from the actual electricity production cost components (which will include fuel costs and also all the other fixed & variable operating costs and capital expenditures from FERC Form 1). So I think we should rename our current MCOE table to reflect the derived generator/unit attributes or quantities that it contains, which I think are

  • Capacity Factor
  • Heat Rate
  • Fuel Cost per MMBTU of fuel consumed
  • Fuel Cost per net MWh of electricity generated

The only real heat rate is heat_rate_by_unit and I think that column should be in the output tables, along with both generator_id and unit_id_pudl which will make it clearer what "unit" the heat rate pertains to, and get rid of the false precision of having a "per generator" heat rate which is just per unit heat rate broadcast across all the generators in a unit (so I'm suggesting getting rid of heat_rate_by_generator entirely).

But over time I could imagine accumulating additional useful quantities that belong in a table like this. E.g. estimates of:

  • emissions intensity (e.g. CO2, SO2, & NOx per MWh).
  • water consumption per MWh
  • estimates of heat rate (thermal efficiency) at 20%, 40%, 60%, 80% and 100% of nameplate capacity output.
  • ramp rate in MW per unit time
  • minimum and maximum observed stable power output within a given month or year.
  • total number of hours of observed non-zero output operation within a given month or year.

What should a table like that be called? out__derived_generator_operational_characteristics? Kinda long. Seems like it would eventually contain aggregated values from CEMS as well as these derived values from EIA860/923. And maybe it does make sense for the many different cost components to end up in the same table when they're compiled. But a table with all of that information definitely wouldn't be "MCOE".

@bendnorman
Copy link
Member Author

Sounds good @arengel and @zaneselvans! We'll keep all of the assets in the allocate_gen_fuel group in the database. I generally think PUDL should provide the most granular data and useful aggregates so users don't have to expend extra time figuring out the proper way to aggregate tables.

So @zaneselvans, you're imagining having two tables, one for generator operational characteristics (what our current mcoe table is) and a future table for financial information? When we have both, we can create a proper mcoe table? How about out__{freq}_generator_operational_characteristics for the operational characteristics table? We've been pretty verbose with the names so far, which could cause issues later if we move to a database with column name length limits. We could also do something like out__{freq}_generator_ops.

@zaneselvans
Copy link
Member

An issue with out__{freq}_generator_operational_characteristics is that there are lots of operational characteristics that are reported directly, and already show up in the generator annual and entity tables. Do we want all of those characteristics to be in the same table regardless of whether they're directly reported or the result of some involved calculation the depends on reported time series data? The primary keys of the tables are likely to be the same, so structurally it seems like it would be fine. But many of the derived values will be very sparse, because they only apply to some types of generators, or there's only data available in the appropriate form for a small subset of the generators.

Would it make sense to append all of the cost components into that same table as we compile them? The generators table already has more than 100 columns, but maybe 200 columns is fine? It's not hard to just select the dozen columns you actually need for whatever calculation you're doing, and if/when we put the data in a columnar data structure it would be very efficient. And I guess that's a common pattern for data warehouses. But somehow it still feels unwieldy to me. I imagine users would end up curating their own thematic lists of columns that they use in different contexts.

@bendnorman bendnorman removed a link to a pull request Sep 21, 2023
8 tasks
@bendnorman bendnorman moved this from In progress to Backlog in Catalyst Megaproject Oct 30, 2023
@bendnorman bendnorman moved this from Backlog to In progress in Catalyst Megaproject Nov 6, 2023
@bendnorman bendnorman moved this from In progress to In review in Catalyst Megaproject Nov 6, 2023
@github-project-automation github-project-automation bot moved this from In review to Done in Catalyst Megaproject Nov 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants