Simplify model launcher configs and add script input checks #90
I think it is possible we are trying too hard to have the configs be "the same" or "similar" between tabularization and modeling. I think separating them out more would be good, because then you avoid ambiguities such as the "output cohort dir" being an output during tabularization and an input during modeling. I'm not sure exactly what that would look like, but something along those lines would probably be smart.

Ahhh I see, @teyaberg proposed we call it
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,26 +3,25 @@ defaults: | |
- tabularization: default | ||
- imputer: default | ||
- normalization: default | ||
- model_launcher: autogluon | ||
- _self_ | ||
|
||
task_name: task | ||
task_name: ??? | ||
|
||
# Task cached data dir | ||
input_dir: ${output_cohort_dir}/${task_name}/task_cache | ||
# Directory with task labels | ||
input_label_dir: ${output_cohort_dir}/${task_name}/labels/ | ||
# Where to output the model and cached data | ||
model_dir: ${output_cohort_dir}/autogluon/autogluon_${now:%Y-%m-%d_%H-%M-%S} | ||
model_log_dir: ${model_dir}/.logs/ | ||
output_filepath: ${model_dir} | ||
|
||
# Model parameters | ||
model_params: | ||
iterator: | ||
keep_data_in_memory: True | ||
binarize_task: True | ||
|
||
log_dir: ${model_dir}/.logs/ | ||
log_filepath: ${log_dir}/log.txt | ||
output_dir: ??? | ||
|
||
name: launch_autogluon | ||
|
||
hydra: | ||
verbose: False | ||
job: | ||
name: MEDS_TAB_${name}_${worker}_${now:%Y-%m-%d_%H-%M-%S} | ||
Oufattole marked this conversation as resolved.
|
||
sweep: | ||
dir: ${model_log_dir} | ||
Ok, my first comment here was right. I don't see where this is defined at the top level. It might be defined nested within a sub-config, but I don't think this will work in that case.
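To illustrate the concern, assuming the comment refers to `${model_log_dir}`: a plain `${key}` interpolation is looked up from the root of the composed config, so a key that exists only inside a nested group is not found. The sketch below is a simplified stdlib stand-in for that lookup rule, not OmegaConf itself; the `path` group layout and the `/tmp/model/.logs/` value are hypothetical.

```python
import re

# Simplified stand-in for OmegaConf-style absolute interpolation:
# "${a.b}" is always looked up starting from the root of the config.
_PATTERN = re.compile(r"\$\{([\w.]+)\}")


def lookup(root: dict, dotted_key: str):
    """Walk a dotted key from the config root; raise KeyError if any part is missing."""
    node = root
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(f"interpolation key '{dotted_key}' not found from the root")
        node = node[part]
    return node


def resolve(root: dict, value: str) -> str:
    """Replace every ${key} in value with its root-level lookup, recursively."""
    return _PATTERN.sub(lambda m: resolve(root, str(lookup(root, m.group(1)))), value)


# Hypothetical config: model_log_dir lives only inside a nested group.
nested_only = {
    "path": {"model_log_dir": "/tmp/model/.logs/"},
    "hydra": {"run": {"dir": "${model_log_dir}"}},
}

# Same config with the key also exposed at the top level.
top_level = {
    "model_log_dir": "${path.model_log_dir}",
    **nested_only,
}

try:
    resolve(nested_only, nested_only["hydra"]["run"]["dir"])
except KeyError as err:
    print(f"unresolved: {err}")

print(resolve(top_level, top_level["hydra"]["run"]["dir"]))
```

Under this lookup rule, either exposing the key at the top level or referencing it by its full nested path (here `${path.model_log_dir}`) would let the `hydra.run.dir` and `hydra.sweep.dir` entries resolve.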
||
run: | ||
dir: ${model_log_dir} |
Maybe the whole

And some other params, actually...

I think this YAML can be used for all model launchers (autogluon, sklearn, and xgboost), and we can add a stage check for autogluon that makes sure users do not apply multirun, which would apply the overrides for hydra/sweeper, hydra/callbacks, and hydra/launcher.
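A minimal sketch of what such a stage check could look like. The function name, the `SINGLE_RUN_ONLY` set, and the `is_multirun` flag are all hypothetical and not taken from the PR; how the flag would be derived from Hydra at runtime is deliberately left out.

```python
# Hypothetical names throughout: neither the function nor the launcher labels
# are taken from the PR; they only illustrate the proposed check.
SINGLE_RUN_ONLY = {"autogluon"}


def check_run_mode(model_launcher: str, is_multirun: bool) -> None:
    """Raise if a single-run-only launcher is started as a multirun sweep."""
    if is_multirun and model_launcher in SINGLE_RUN_ONLY:
        raise ValueError(
            f"'{model_launcher}' does not support multirun; "
            "launch it as a single run instead."
        )


check_run_mode("xgboost", is_multirun=True)     # allowed: sweeps are permitted
check_run_mode("autogluon", is_multirun=False)  # allowed: single run

try:
    check_run_mode("autogluon", is_multirun=True)
except ValueError as err:
    print(err)
```

Running the check before the launcher starts would let a shared config keep the sweeper, callback, and launcher overrides for sklearn and xgboost while failing fast for autogluon.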
This file was deleted.
This file was deleted.
This file was deleted.
This file was deleted.
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
defaults: | ||
- default | ||
- _self_ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
keep_data_in_memory: True | ||
binarize_task: True |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
defaults: | ||
- imputer: default | ||
- normalization: default |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
defaults: | ||
- path: default | ||
- data_processing_params: default | ||
- data_loading_params: default | ||
- _self_ | ||
|
||
tabularization: ${tabularization} | ||
mmcdermott marked this conversation as resolved.
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
hydra: | ||
sweeper: | ||
direction: maximize | ||
n_trials: 250 | ||
n_jobs: 25 |
Is `MEDS_cohort_dir` used for anything else? If not, do we need to specify it and `input_dir` separately? Can we just have one parameter? That would help avoid the confusion that arises depending on whether or not you are using a resharding stage (because when you are using a resharding stage, the raw `MEDS_cohort_dir` is only the input to that first resharding stage).