
Refactor app with new predictor/prescriptor #87

Merged: 25 commits into main from refactor-app, Jun 6, 2024
Conversation

danyoungday (Collaborator):

Deleted the old demo predictor/prescriptor code and updated the app to use the current predictors/prescriptors. Additionally, refactored the torch prescriptor so that the prescriptor object is standalone from the trainer.

Still todo after this PR:

  • Automate data downloading for the app. Currently, spinning up the app requires running the process_data.py script, but that script downloads the entire dataset before processing it. We should upload a demo dataset so that the app doesn't have to do any processing.

  • The predictor list is currently hard-coded; we will need to make it easier to add user models.

  • Prescriptors still follow the ESP convention of a single directory holding many models saved from a training run. Since we want to be able to add heuristics and other users' trained prescriptors, we need to change the scope of the Prescriptor object from a group of prescriptors to a single model that takes context as input and produces actions as output.

@danyoungday danyoungday requested a review from ofrancon May 30, 2024 21:22
@danyoungday danyoungday self-assigned this May 30, 2024
df = pd.read_csv(constants.DATA_FILE_PATH, index_col=constants.INDEX_COLS)
countries_df = regionmask.defined_regions.natural_earth_v5_0_0.countries_110.to_dataframe()
df = pd.read_csv(app_constants.DATA_FILE_PATH, index_col=app_constants.INDEX_COLS)
df.rename(columns={col + ".1": col for col in app_constants.INDEX_COLS}, inplace=True)
danyoungday (Collaborator, Author):

This is a little hacky: since we need time/lat/lon both as context and as indices, we end up with two sets of time/lat/lon columns (time, lat, lon, time.1, lat.1, lon.1) that we need to separate.
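To illustrate where the `.1` columns come from, here is a minimal sketch (the CSV content is hypothetical): pandas de-duplicates repeated header names by appending a `.1` suffix, which the rename then undoes for the non-index copies.

```python
import io
import pandas as pd

# Hypothetical CSV where time/lat/lon appear twice: once for the index, once as context.
csv = io.StringIO(
    "time,lat,lon,time,lat,lon\n"
    "2020,10.0,20.0,2020,10.0,20.0\n"
)

# pandas mangles the repeated headers to time.1/lat.1/lon.1; index_col consumes
# the first set, leaving the ".1" copies as regular columns.
index_cols = ["time", "lat", "lon"]
df = pd.read_csv(csv, index_col=index_cols)

# Rename the ".1" copies back so the context columns have their real names.
df = df.rename(columns={col + ".1": col for col in index_cols})
```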

"hidden_size": 16,
"out_size": len(constants.RECO_COLS)
}
prescriptor = TorchPrescriptor(None, encoder, None, 1, candidate_params)
danyoungday (Collaborator, Author):

This is currently how we handle the prescriptor loading. We save a snapshot of the encoder and hard-code the candidate params (which could also theoretically be snapshotted after training later).
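One way the snapshotting could look, as a hedged sketch (file name and values here are illustrative, not the project's actual code): the trainer writes the candidate params to JSON at the end of a run, and the app reloads them instead of hard-coding.

```python
import json
import os
import tempfile

# Illustrative candidate params; the real values come from the training config.
candidate_params = {"in_size": 12, "hidden_size": 16, "out_size": 8}

def snapshot_params(params: dict, path: str) -> None:
    """Save candidate params next to the model weights after training."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(params, f, indent=2)

def load_params(path: str) -> dict:
    """Reload the snapshot so the app doesn't need to hard-code the params."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Round-trip through a temporary file.
path = os.path.join(tempfile.mkdtemp(), "candidate_params.json")
snapshot_params(candidate_params, path)
restored = load_params(path)
```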

# If we have no prescription just return an empty chart
if all(slider == 0 for slider in sliders):
return utils.create_treemap(pd.Series([]), type_context=False, year=year)
return utils.create_treemap(pd.Series([], dtype=float), type_context=False, year=year)
danyoungday (Collaborator, Author):

Fixed a warning that the default dtype for an empty Series will change; we now pass dtype=float explicitly.
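A minimal sketch of the fix (the warning itself is pandas-version-dependent; older versions warn that the empty-Series default dtype will change from float64 to object):

```python
import pandas as pd

# Passing dtype explicitly pins the dtype and avoids the deprecation warning
# that older pandas versions emit for pd.Series([]) with no dtype.
empty = pd.Series([], dtype=float)
```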

danyoungday (Collaborator, Author):

Extremely simple data processing script. In the future we may want to handle time/lat/lon being an index and context in a more elegant fashion.

from . import Predictor


class Encoder:
danyoungday (Collaborator, Author):

We don't need this anymore because we have data.ELUCEncoder now

@@ -106,27 +72,17 @@ def create_check_options(values: list) -> list:
"value": val})
return options


def compute_percent_change(context: pd.Series, presc: pd.Series) -> float:
def context_presc_to_df(context: pd.Series, presc: pd.Series) -> pd.DataFrame:
danyoungday (Collaborator, Author):

This is a new function to get the demo working with the new prescriptors. Since the prescriptor wants a context_actions dataframe, we convert the context (from our dataset) and the prescriptions (from the sliders) into a context_actions_df by computing the diff and processing the zero-change columns.
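A rough sketch of that conversion (the column names here are hypothetical placeholders, not the project's actual RECO_COLS/DIFF_RECO_COLS, and the real function may differ in detail):

```python
import pandas as pd

# Hypothetical column names standing in for RECO_COLS / DIFF_RECO_COLS.
RECO_COLS = ["crop", "pastr", "range"]
DIFF_RECO_COLS = [f"{col}_diff" for col in RECO_COLS]

def context_presc_to_df(context: pd.Series, presc: pd.Series) -> pd.DataFrame:
    """Combine a context row and slider prescriptions into a context/actions row."""
    row = context.to_dict()
    for col, diff_col in zip(RECO_COLS, DIFF_RECO_COLS):
        # The action is the prescribed amount minus the current amount;
        # zero-change columns naturally come out as 0.0.
        row[diff_col] = presc[col] - context[col]
    return pd.DataFrame([row])

context = pd.Series({"crop": 0.4, "pastr": 0.3, "range": 0.3})
presc = pd.Series({"crop": 0.2, "pastr": 0.5, "range": 0.3})
context_actions_df = context_presc_to_df(context, presc)
```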

"Primary Vegetation", "primf", "primn",
"Secondary Vegetation", "secdf", "secdn",
"Urban",
"Fields", "pastr", "range"]
parents = ["", title,
title, "Crops", "Crops", "C3", "C3", "C3", "C4", "C4",
title,
danyoungday (Collaborator, Author):

Had to redo the treemap code because we no longer have specific crop types.

plo = px.colors.qualitative.Plotly
dar = px.colors.qualitative.Dark24
# ['crop', 'pastr', 'primf', 'primn', 'range', 'secdf', 'secdn', 'urban', 'nonland']
colors = [plo[4], plo[0], plo[2], dar[14], plo[5], plo[7], dar[2], plo[3], plo[1]]
danyoungday (Collaborator, Author):

Redid colors for treemap because there are no longer crop types

if type_context:
fig.update_layout(showlegend=False)
# To make up for the hidden legend
fig.update_layout(margin={"t": 50, "b": 50, "l": 50, "r": 50})
danyoungday (Collaborator, Author):

Had to remove this code that added margins to the left pie chart: with the extra crop sliders gone, the pie charts are smaller and the margins are no longer necessary.

predictors["Global Random Forest"] = global_rf

return predictors
danyoungday (Collaborator, Author):

This function is currently hard-coded to load these predictors. We could handle this more elegantly, but for now users aren't submitting models, and since we ultimately want to control which models get added rather than adding them automatically, this should be fine.

ofrancon (Member):

Yes it's fine

"""
with open(path, "r", encoding="utf-8") as file:
fields = json.load(file)
return cls(fields)
danyoungday (Collaborator, Author):

Allows us to load an encoder from a fields file. This enables us to use a pretrained one in the demo.
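The classmethod pattern in context, as a self-contained sketch (the class name stands in for data.ELUCEncoder, and the fields structure here is an assumption):

```python
import json
import os
import tempfile

class ELUCEncoder:  # hypothetical stand-in for data.ELUCEncoder
    def __init__(self, fields: dict):
        # fields would hold per-column stats (e.g. min/max) fitted during training
        self.fields = fields

    @classmethod
    def from_json(cls, path: str) -> "ELUCEncoder":
        """Load a pretrained encoder from a saved fields file."""
        with open(path, "r", encoding="utf-8") as file:
            fields = json.load(file)
        return cls(fields)

# Round-trip: write a fields file, then load it like the demo would.
path = os.path.join(tempfile.mkdtemp(), "fields.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump({"crop": {"min": 0.0, "max": 1.0}}, f)
encoder = ELUCEncoder.from_json(path)
```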

ofrancon (Member):

It's important we use the same encoder as the one that was used for training indeed.

danyoungday (Collaborator, Author):

Just some linting changes

danyoungday (Collaborator, Author):

Mostly linting changes but there are some modifications to how candidate ids are saved to align with ESP's method

"parents": self.parents,
"NSGA-II_rank": self.rank, # Named this to match ESP
"distance": self.distance,
}
danyoungday (Collaborator, Author):

Now we save our candidate id as "id" in our logs. I'm not a huge fan of this change, since "id" is a Python builtin so we can't name our variables that, but it matches ESP. It's still nicer than what I had before, which was using gen and id as a double index.

ofrancon (Member):

I don't think we have to match ESP anymore.
distance is the NSGA-II distance though, right?
Maybe "id" could become "cid" for candidate id if needed. But dictionary keys don't conflict with Python builtins, so as long as we don't call our variables id we're fine.
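To illustrate the point about keys versus variable names, a trivial sketch:

```python
# A dict key named "id" never conflicts with the builtin id().
record = {"id": 42, "NSGA-II_rank": 1, "distance": 0.5}
cid = record["id"]  # binding to `cid` (not `id`) keeps the builtin usable

# The builtin is only shadowed if we assign to a variable literally named id,
# e.g. `id = record["id"]`, which we simply avoid.
builtin_still_works = id(record)
```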

danyoungday (Collaborator, Author):

Moved all the training code out of this class into prescriptors/nsga2/trainer.py

@@ -30,7 +30,7 @@
print("Initializing prescription...")
if "seed_dir" in config["evolution_params"].keys():
config["evolution_params"]["seed_dir"] = Path(config["evolution_params"]["seed_dir"])
tp = TorchPrescriptor(
tp = TorchTrainer(
danyoungday (Collaborator, Author):

Uses new trainer object instead of directly training a TorchPrescriptor

danyoungday (Collaborator, Author):

Just cut and pasted from TorchPrescriptor.

danyoungday (Collaborator, Author):

Unit tests for some functions in the app. Still not nearly robust enough.

danyoungday (Collaborator, Author):

Old tests for the demo's compute_percent_change function, now adapted to the new compute_percent_change in the Prescriptor class.

ofrancon (Member) left a review:

lgtm


recursive=y

fail-under=9.0
fail-under=9.65
ofrancon (Member):

Thanks 👍
Little by little we can get to 10.

"""
Main function that loads the data and saves it.
"""
dataset = ELUCData(APP_START_YEAR-1, APP_START_YEAR, 2022)
ofrancon (Member):

It's hard to know, without looking at the code, that the params are start_year, test_year, and end_year. Maybe you can name the params here.
Should 2022 be a constant too? It's hard to know what we're loading here.

danyoungday (Collaborator, Author):

Updated this to be clearer, with comments, and to pass the args in as kwargs.
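The readability fix, sketched with a stub (the parameter names are taken from the review comment, ELUCData here is a stand-in for the real class, and APP_START_YEAR is an illustrative value):

```python
from dataclasses import dataclass

@dataclass
class ELUCData:  # stub standing in for the real dataset class
    start_year: int
    test_year: int
    end_year: int

APP_START_YEAR = 2012  # illustrative; the app defines its own constant

# Passing the years as keyword arguments makes the call self-documenting.
dataset = ELUCData(
    start_year=APP_START_YEAR - 1,
    test_year=APP_START_YEAR,
    end_year=2022,
)
```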

:param presc: Prescribed land use data
:return: Percent land use change
Takes a context with all columns and a presc with RECO_COLS and returns an updated context actions df.
This df takes the difference between the RECO_COLS in presc and context and sets the DIFF_RECO_COLS to that.
ofrancon (Member):

The parameters should still be documented, but we can take care of that later

predictors = {}
nn_path = "danyoung/eluc-global-nn"
ofrancon (Member):

Could move to constants


@@ -17,7 +17,7 @@
"RUS": 643,
"N": 578,
"F": 250,
"J": 388,
# "J": 388,
ofrancon (Member):

What was wrong with "J"?

danyoungday (Collaborator, Author):

There are multiple countries matching "J" in the converter I found. At some point the country-code-to-name conversion scheme may have to be overhauled, because it doesn't work for some countries.



@danyoungday danyoungday merged commit 8a6cfa5 into main Jun 6, 2024
1 check passed
@danyoungday danyoungday deleted the refactor-app branch June 6, 2024 22:45