-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Several fixes after running a full dataset #24
base: dev
Are you sure you want to change the base?
Conversation
|
conf/modules.config
Outdated
withName: "RUN_DYNAMITE" { | ||
errorStrategy = "ignore" | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what situations is this relevant? By setting the errorStrategy
to ignore
, we also prevent the pipeline from trying again if it fails due to too little RAM etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dynamite fails with exitStatus
139 when running with too little data, so I adapted the errorStrategy
accordingly.
rndselect=sample(x=nrow(M),size=as.numeric(argsL$testsize)*nrow(M)) | ||
# Test on a single example if dataset size is too small | ||
rndselect=sample(x=nrow(M),size=ifelse(as.numeric(argsL$testsize)*nrow(M) < 1, 1, as.numeric(argsL$testsize)*nrow(M))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is a copy from here and I would like to keep it identical if possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could open a PR to their repository, however, the tool does not appear to be actively maintained. The last open PR is from 2020 and still unanswered.
The changes improve the tool's robustness when handling small dataset sizes, making it essential for a reliable pipeline and also for the run on our lactation data.
df_lengths = df_lengths.loc[df_counts.index] | ||
df_lengths = df_lengths.loc[df_lengths.index.isin(df_counts.index)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you check if it can happen that the dataframes have different orders?
If we cannot be entirely sure of this, it might be better to build the intersection of both indices and then subset both to the intersection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched to the index intersection you proposed. Seems not necessary for our use case, since both data frames have the same order, but it's definitely more robust this way.
@@ -14,38 +15,34 @@ def format_yaml_like(data: dict, indent: int = 0) -> str: | |||
""" | |||
yaml_str = "" | |||
for key, value in data.items(): | |||
spaces = " " * indent | |||
spaces = " " * indent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this was not on purpose
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think four spaces are the default for nf-core modules.
Changes:
binarizeBams
andLearnModel
DYNAMITE:PREPROCESS
TF_TG_SCORE