mlxtend fpgrowth and association rules with the existence of missing values #1004
Comments
Hi, I just don't have the time to read through the paper and identify the needed changes. Thanks.
I can see that dealing with itemsets that have missing values may be a common problem, and I am open to modifications. I currently don't have time to look at the paper in detail, but based on your description, it sounds like the general fpgrowth algorithm remains the same, and the change is primarily in how the existing metrics are computed, plus the new "representativity" metric? Before getting to that, I think a key consideration is also how to represent missing values. Given that the current version works with Bool data types, we probably have to change certain things. I don't think that pandas currently supports NaN values for Bool data, so maybe we would have to change it to an integer representation, which could have performance implications. I think the first step here is to think about what the input datatype and interface would look like. Tricky 🤔
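On the representation question, here is a minimal sketch of two options, offered as an illustration rather than a proposal for the mlxtend interface. It assumes pandas 1.0 or later, which provides a nullable `"boolean"` extension dtype; whether fpgrowth would accept either representation is not established here. The column names and the `-1` sentinel are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical binary basket data with missing entries stored as NaN.
raw = pd.DataFrame({"milk": [1, 0, np.nan], "bread": [1, np.nan, 1]})

# Option 1: pandas' nullable boolean dtype keeps True/False and marks gaps as <NA>.
nullable = raw.astype("boolean")

# Option 2: an integer encoding with a sentinel (-1 here, an arbitrary choice)
# for "missing"; int8 keeps the memory footprint small.
encoded = raw.fillna(-1).astype("int8")

print(nullable.dtypes)
print(encoded)
```

Option 1 keeps the missing-value semantics explicit in the dtype, while option 2 requires every downstream computation to treat the sentinel specially.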
Yes, exactly: the algorithm remains the same except for the modification of the metrics that are computed. And yes, it includes the representativity metric as well. As I mentioned above, I commented out some lines in the valid_input_check(df) function where it checks whether the data in the df are of boolean or binary type. I believe that commenting those out should be enough? Still, as the author of the code you know this better, and maybe your suggestion is the better one.
Describe the workflow you want to enable
I am trying to run the fpgrowth algorithm on a dataset that I am working on. I know that fpgrowth only accepts input in binary format (e.g. True/False or 1s/0s). My dataset is in binary format, but it contains some missing values. I can't simply drop the rows with missing values, because that would also discard the useful binary data points in those rows.
Describe your proposed solution
My idea is to compute the support and confidence metrics from the frequency of the items that are present in the binary map while ignoring the missing values. I believe this is doable, and I found how it is done in this paper (https://weber.itn.liu.se/~aidvi/courses/06/dm/papers/AssociationRules/RAR(98).pdf). The formulas for support and confidence differ from the usual ones mainly in that the count of missing values is subtracted from the denominator. The paper also introduces a new metric called 'representativity', an additional criterion that helps determine which association rules are the strongest.
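A minimal sketch of this idea, based only on the description above and not checked term by term against the paper's definitions. It assumes missing values are stored as NaN in float columns of 0s and 1s. The function names are hypothetical and not part of mlxtend, and "representativity" is taken here to mean the fraction of rows in which every item of the itemset is observed, which is my reading and should be verified against the paper.

```python
import numpy as np
import pandas as pd

def support_ignoring_missing(df, items):
    """Return (support, representativity) for an itemset, skipping rows
    where any item of the itemset is missing."""
    sub = df[list(items)]
    valid = ~sub.isna().any(axis=1)          # rows with all items observed
    n_valid = int(valid.sum())
    if n_valid == 0:
        return float("nan"), 0.0
    present = (sub[valid] == 1).all(axis=1)  # rows containing the whole itemset
    return float(present.sum() / n_valid), n_valid / len(df)

def confidence_ignoring_missing(df, antecedent, consequent):
    """Confidence of antecedent -> consequent; rows where the consequent is
    missing are removed from the denominator."""
    # NaN == 1 evaluates to False, so rows with a missing antecedent drop out.
    a_true = (df[list(antecedent)] == 1).all(axis=1)
    c_missing = df[list(consequent)].isna().any(axis=1)
    c_true = (df[list(consequent)] == 1).all(axis=1)
    denom = int((a_true & ~c_missing).sum())
    if denom == 0:
        return float("nan")
    return float((a_true & c_true).sum() / denom)

# Hypothetical toy data: five transactions, two items, two missing entries.
toy = pd.DataFrame({"a": [1, 1, 0, np.nan, 1], "b": [1, np.nan, 0, 1, 0]})
print(support_ignoring_missing(toy, ["a", "b"]))
print(confidence_ignoring_missing(toy, ["a"], ["b"]))
```

These helpers only recompute the metrics for a given itemset; the frequent-itemset search itself would still need to handle NaN inputs, which is the interface question raised in the comments.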
Describe alternatives you've considered, if relevant
I tried to make a few edits to the open-source code you have on GitHub, but I got lost while reading it. To be honest, the code is quite abstract, and it also relies heavily on recursion.
If you could help me make these edits, I would very much appreciate it.
Additional context