
If there is only one converged run, the sivs will break #2

Open
mmahmoudian opened this issue Oct 20, 2022 · 1 comment
Labels
bug Something isn't working

Comments

mmahmoudian (Owner) commented Oct 20, 2022

There is one very rare condition that sivs() does not catch, causing it to break without a proper error message. In this post/issue I'm going to first dissect the way this issue can happen, then list how it can be addressed, and ultimately elaborate on my preference for how it should be addressed.

The reasons I'm writing this are:

  • to be transparent
  • to give users and the community the possibility to engage, express their opinions, and discuss solutions
  • to have good and comprehensive documentation of this extremely rare issue (one single commit would not do justice to the amount of time I put into investigating it. I only received an email with an error and I didn't have access to the input data, so I had to go through many hypotheses and run parts of the code in my head with imaginary data to test them).

Dissection of the problem

In the following line we remove the runs that have failed to converge (for whatever reason is relevant to the chosen method):

sivs/R/sivs.R

Lines 1010 to 1011 in 4a57625

# remove the runs that could not converge (and as the result have NA instead of coefficients' dataframe)
clean.iterative.res <- iterative.res[!sapply(lapply(iterative.res, "[[", "coef"), is.logical)]
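To illustrate the idiom with made-up run names and coefficients (not real sivs output): runs that failed to converge carry NA, which is a logical, in their coef slot, while converged runs carry a data.frame, so is.logical() on the extracted slot flags the failures.

```r
# Hypothetical per-run results: converged runs hold a coefficient data.frame,
# a failed run holds NA (a logical) in its "coef" slot
iterative.res <- list(
    run1 = list(coef = data.frame(names = c("f1", "f2"), s1 = c(0.3, 0))),
    run2 = list(coef = NA),  # a run that failed to converge
    run3 = list(coef = data.frame(names = c("f1", "f2"), s1 = c(0, 0.7)))
)

# extract every "coef" slot, keep only the entries that are not logical
clean.iterative.res <- iterative.res[!sapply(lapply(iterative.res, "[[", "coef"),
                                             is.logical)]

names(clean.iterative.res)  # "run1" "run3"
```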

and in the following lines we are making sure that we have at least something to proceed with:

sivs/R/sivs.R

Lines 1013 to 1019 in 4a57625

# if there were not any runs that have converged
if(length(clean.iterative.res) == 0){
    stop("In the Iterative step, none of the ", length(iterative.res),
         " runs got converged. This means that sivs cannot progress to",
         " subsequent steps, and you should use different",
         " 'method' or normalize your data differently.")
}

and if we have at least one good run, we proceed with this beautiful and compact part to extract the coefficients in a data.frame format:

sivs/R/sivs.R

Lines 1021 to 1030 in 4a57625

coef.df <- Reduce(function(...){ merge(...,
                                       by = "names",
                                       all = TRUE) },
                  sapply(names(clean.iterative.res),
                         FUN = function(item) {
                             temp <- clean.iterative.res[[item]]$coef
                             colnames(temp)[2] <- paste0("coef.", item)
                             return(temp)
                         },
                         simplify = FALSE))
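A minimal sketch of what this produces, with two hypothetical converged runs: each run's coefficient column gets a "coef.<run>" name, and Reduce() folds the per-run data.frames into one full outer join keyed on "names".

```r
# Two hypothetical converged runs (made-up features and coefficients)
clean.iterative.res <- list(
    run1 = list(coef = data.frame(names = c("f1", "f2"), s1 = c(0.3, 0))),
    run2 = list(coef = data.frame(names = c("f2", "f3"), s1 = c(0.5, -0.1)))
)

# rename each run's coefficient column, then merge all runs on "names"
coef.df <- Reduce(function(...) merge(..., by = "names", all = TRUE),
                  sapply(names(clean.iterative.res),
                         FUN = function(item) {
                             temp <- clean.iterative.res[[item]]$coef
                             colnames(temp)[2] <- paste0("coef.", item)
                             return(temp)
                         },
                         simplify = FALSE))

coef.df
#   names coef.run1 coef.run2
# 1    f1       0.3        NA
# 2    f2       0.0       0.5
# 3    f3        NA      -0.1
```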

The issue arises when clean.iterative.res has a length of 1, which would cause coef.df to be a data.frame with only two columns. When we move the feature names to row.names, the data.frame is left with only one column, which consequently turns it into a vector:

sivs/R/sivs.R

Lines 1032 to 1034 in 4a57625

## set the feature names to be the rownames
row.names(coef.df) <- coef.df[, 1]
coef.df <- coef.df[, -1]

This can be reproduced with this small demo:

# only keeping two columns
a <- iris[, 1:2]
# convert the first column to some unique character values
a[, 1] <- paste0("a", 1:nrow(a))
# the following two lines are exactly what we do in the code chunk above (sivs.R#L1032-L1034)
row.names(a) <- a[, 1]
a <- a[, -1]
> a
  [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1
 [36] 3.2 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 2.2 2.5
 [71] 3.2 2.8 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0 3.4 3.1 2.3 3.0 2.5 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7 3.0 2.9 3.0
[106] 3.0 2.5 2.9 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6 2.2 3.2 2.8 2.8 2.7 3.3 3.2 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0 3.4 3.1 3.0 3.1
[141] 3.1 3.1 2.7 3.2 3.3 3.0 2.5 3.0 3.4 3.0
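Checking the class confirms the silent drop (continuing the demo above):

```r
# Same demo as above: single-bracket subsetting down to one column silently
# drops the data.frame class, leaving a bare numeric vector
a <- iris[, 1:2]
a[, 1] <- paste0("a", 1:nrow(a))
row.names(a) <- a[, 1]
a <- a[, -1]

class(a)         # "numeric" -- no longer a data.frame
is.null(dim(a))  # TRUE, which is what later breaks apply()
```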

And this is the problem: the following apply() expects a tabular object, but what it receives is just a vector:

sivs/R/sivs.R

Lines 1036 to 1039 in 4a57625

# calculate how many times each feature has been given a coefficient other than 0 and then sort them decreasingly
selection.freq <- sort(x = apply(coef.df, 1, function(f){
                           sum(f != 0)
                       }), decreasing = TRUE)
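A small sketch of this step with made-up coefficients: per feature (row), count in how many runs the coefficient was non-zero.

```r
# Hypothetical coefficient table: rows are features, columns are runs
coef.df <- data.frame(coef.run1 = c(0.3, 0, 0.9),
                      coef.run2 = c(0.5, 0, 0),
                      row.names = c("f1", "f2", "f3"))

# per row, count the non-zero coefficients, then sort decreasingly
selection.freq <- sort(x = apply(coef.df, 1, function(f) sum(f != 0)),
                       decreasing = TRUE)

selection.freq  # f1 = 2, f3 = 1, f2 = 0
```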


Possible solutions and my thoughts

There are many ways to solve this programmatically, like converting coef.df to a matrix or even using the tibble package for all tabular data we have in the SIVS package; but the question is more fundamental imho (when nfolds > 1):

  • If we end up with one single "converged" run, should we even consider moving forward?
  • Is this one single run a reliable run? (because it has worked in only one particular sample configuration)
  • Should we do the Recursive Feature Elimination step using this one "lucky" run?
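For reference, the purely programmatic fixes mentioned above are small; a sketch on the demo data (not implying either is the chosen solution):

```r
# Rebuild the demo data from the reproduction above
a <- iris[, 1:2]
a[, 1] <- paste0("a", 1:nrow(a))
row.names(a) <- a[, 1]

# 1) drop = FALSE keeps the one-column data.frame instead of a bare vector
a.df <- a[, -1, drop = FALSE]
class(a.df)  # "data.frame"

# 2) converting to a matrix also keeps the two-dimensional shape,
#    so a later apply() over rows still works
a.mat <- as.matrix(a[, -1, drop = FALSE])
dim(a.mat)   # 150 1
```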

I would argue that the RFE should not be run, because having one single lucky run goes against reproducible research and everything "Stable Iterative Variable Selection" is about.

If we decide not to move to the next step in SIVS (i.e. recursive feature elimination), then the question is how this should be reported to the user and what we should return. Possible options are:

  1. Throw an error and return nothing to the user (not even that one lucky run)
  2. Throw a warning and return whatever we have so far

I would argue that we should throw an error and return nothing to the user. This is because, despite the article stating that SIVS should not be treated as a black-box method, many would still naively run the code and expect magic, and on the off chance of hitting this "lucky" run, they would just go with what SIVS provides and move forward without thinking twice about what happened and why. Perhaps when option 1 is implemented, the user can have access to the list of coefficients and can do some sort of RFE themselves, at their own risk, if they want to.

Reports

This issue has also been reported on Stackoverflow.

Acknowledgment

I would like to thank Alexander Biehl, who reported this issue. I will edit this post when/if I get their consent to put their real name or GitHub ID here. As I mentioned, this is a very rare case, and Alex responsibly and kindly reported the issue to me via email. The time he invested in writing that email should also be factored into the resolution of this issue. Alex also pointed me to the Stackoverflow post mentioned above.

mmahmoudian (Owner, Author) commented:

Alex and I exchanged a few emails, but I thought this part of his last email should be posted here, as it directly contributes to addressing this issue:

I think I agree with your conclusion that it should probably not proceed to rfe, as there was only one successful run.
However, on the other hand, that also raises the question of how many runs should be successful before moving on to the rfe step, as even with only 2 successful runs it would also fall under "against having a reproducible research and what "Stable Iterative Variable Selection" is all about", as you wrote.
But that would mean deciding on either a hard number or a percentage of overall runs that should always complete before moving on to rfe.
This could also be an argument option (with a default of e.g. 50% or similar); then the users can still change it (and hopefully are aware of the meaning).
Like my suggestion below, this is, however, the most work-intensive option for you regarding implementation.

Either way, regarding how you report this, arguments can be made for both sides. Assuming a "naive user", an error might lead them to assume SIVS does not work and move to a different feature elimination method, not solving their problem. Meanwhile, a warning might just be ignored (or even missed, depending on the implementation; I, for one, used SIVS in leave-pair-out cross-validation with many thousand runs of sivs, and I would not even know if I would see the warnings then).

One approach that is maybe a good middle ground, but also possibly the most work for you to implement, would be to throw a custom error explaining the problem, while also giving an argument option to instead just throw a warning. The error could reference that argument option, so people could still run SIVS if they insist on it, and you would at least have done everything possible to incentivize them to acknowledge the problem, while still not necessarily driving them to other feature selection methods that generally are not as thorough as SIVS in their design towards reproducible research.

Both these suggestions are of course questionable regarding the implementation-work vs. usefulness balance, as this could very well be a rare edge case as you mentioned; plus, at least the first suggestion might change previous SIVS results (where, unknowingly, some results might have had only 2 successful runs, although I would expect this also happens quite rarely; I ofc don't have data to back that up though).

Example pseudo-code would be (including both suggestions):

sivs::sivs(x = data, [...], converged_run_perc = 0.5, error_when_too_few_runs = TRUE)

[...]
if ((converged_runs / attempted_runs) < converged_run_perc):
    if (error_when_too_few_runs):
        raise_error() # explains source and the implications, but mentions error_when_too_few_runs option still
    else:
        raise_warning() # explains source and the implications
else:
    continue with RFE step
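Translated into R for illustration (the argument names converged_run_perc and error_when_too_few_runs are hypothetical, matching the pseudo-code above, not existing sivs arguments):

```r
# Hypothetical guard for the fraction of converged runs; argument names are
# illustrative only and do not exist in sivs
check_converged_runs <- function(converged_runs, attempted_runs,
                                 converged_run_perc = 0.5,
                                 error_when_too_few_runs = TRUE) {
    if ((converged_runs / attempted_runs) < converged_run_perc) {
        msg <- paste0("Only ", converged_runs, " of ", attempted_runs,
                      " runs converged, which is too few to reliably proceed",
                      " to the RFE step. See the 'error_when_too_few_runs'",
                      " argument.")
        if (error_when_too_few_runs) {
            stop(msg)  # explains the source and implications
        } else {
            warning(msg)
            return(FALSE)  # caller should skip the RFE step
        }
    }
    return(TRUE)  # enough converged runs; proceed with RFE
}

check_converged_runs(60, 100)  # TRUE: 60% converged, above the 50% default
```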

@mmahmoudian mmahmoudian added the bug Something isn't working label Dec 12, 2022