
Apriori - I'm only interested in one product / consequent. Can I speed up the apriori algorithm? #658

Open
soliverc opened this issue Jan 22, 2020 · 6 comments


@soliverc

For example, if I'm only interested in items frequently bought with eggs, is there a way to avoid generating itemsets for every other possible combination?
This takes a lot of time, and with the generated rule set I end up filtering the huge dataframe to find only rows where consequents == eggs, discarding everything else.
Can I speed this process up if I'm only interested in one consequent?
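For reference, the post-hoc filtering described above can be sketched as follows. This uses a plain-Python stand-in for the mined rule set (with mlxtend you would filter the `association_rules()` DataFrame, whose `consequents` column holds frozensets, in the same way):

```python
# Each mined rule maps an antecedent itemset to a consequent itemset.
# Toy stand-in for rows of a mined rule set.
rules = [
    {"antecedents": frozenset({"bread"}), "consequents": frozenset({"eggs"}), "confidence": 0.8},
    {"antecedents": frozenset({"milk"}), "consequents": frozenset({"butter"}), "confidence": 0.6},
    {"antecedents": frozenset({"bacon", "bread"}), "consequents": frozenset({"eggs"}), "confidence": 0.9},
]

# Keep only rules whose consequent is exactly {"eggs"}; every other
# rule was mined and then discarded, which is the wasted work at issue.
egg_rules = [r for r in rules if r["consequents"] == frozenset({"eggs"})]
```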

@rasbt (Owner) commented Jan 22, 2020

Sorry, there's currently no option for that. A "must_contain=('some item', 'some other item', ...)" feature could be a useful enhancement though.
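A minimal sketch of what such a (hypothetical, not-yet-implemented) `must_contain` filter could do, applied after the itemsets are generated — note this only prunes the output, not the search itself:

```python
def filter_itemsets(frequent_itemsets, must_contain):
    """Keep only itemsets that include every item in must_contain.
    Hypothetical post-processing step; not part of the mlxtend API."""
    required = frozenset(must_contain)
    return [s for s in frequent_itemsets if required <= s]

itemsets = [frozenset({"eggs"}), frozenset({"eggs", "bread"}), frozenset({"milk"})]
kept = filter_itemsets(itemsets, ("eggs",))  # only "eggs"-containing itemsets survive
```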

@harenbergsd (Contributor) commented

There are two things here. If you are only interested in frequent itemsets containing some item, a "must_contain"-type feature would be useful, and fairly straightforward to implement.

If you are interested in filtering on the consequent of the rules, that is more challenging. For example, say you are only interested in rules with consequent == eggs. You cannot mine only the frequent patterns containing "eggs", because the rule metrics rely on the supports of other itemsets — and to get those, you need patterns that may not contain "eggs".
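To make the dependency concrete: for a rule A → {eggs}, confidence = support(A ∪ {eggs}) / support(A), and the denominator is the support of an itemset that does not contain "eggs". A small illustration:

```python
# Support table as it would come out of a frequent-itemset miner.
support = {
    frozenset({"bread"}): 0.40,            # no "eggs" in this itemset...
    frozenset({"bread", "eggs"}): 0.30,
}

# confidence(bread -> eggs) = support({bread, eggs}) / support({bread})
conf = support[frozenset({"bread", "eggs"})] / support[frozenset({"bread"})]
# ...yet its support is needed in the denominator, which is why mining
# cannot be restricted to "eggs"-containing itemsets only.
```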

@rasbt (Owner) commented May 20, 2020

I agree with you, @harenbergsd. In addition, while adding a check for whether frequent itemsets contain a particular item may look useful in practice, I am not sure it would actually speed things up compared to pruning the resulting DataFrame. Since it is really just a check of whether an already-computed itemset gets added to the DataFrame or not, the main thing it would accomplish is reducing the memory footprint of the result.

@harenbergsd (Contributor) commented

I found this blog post which I believe talks about this sort of thing: https://wimleers.com/article/fp-growth-powered-association-rule-mining-with-support-for-constraints

Like you, I do wonder how much of a performance improvement you could get by doing this sort of thing. It's certainly an interesting idea: if you are only interested in a few consequents, you end up doing a lot of extra work and tossing results away. But it's tricky to know exactly what counts as the extra work. It could make a nice research project, depending on the existing literature in this area :)

@kno10 commented May 26, 2020

The performance of MLXtend's "apriori" is pretty poor.
It does not implement the optimizations of the original Apriori algorithm, such as the prefix join to avoid generating unnecessary candidates, or the hash tree; instead it is a fairly naive version that generates far too many candidates. See #644 — the runtime of MLXtend was 2 minutes versus just 150 ms for pyfim, so MLXtend was about 750x slower.
Hence, if performance is a problem for you, consider using the well-known, much faster "fim" module by Christian Borgelt (http://www.borgelt.net/pyfim.html) — or improve MLXtend's Apriori by first adding the missing optimizations that are part of the original algorithm.
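For reference, the prefix join mentioned above (from the original Apriori paper) generates k-candidates only from pairs of frequent (k-1)-itemsets that share their first k-2 items, followed by subset pruning. A minimal stdlib-only sketch of that candidate-generation step:

```python
def apriori_gen(frequent_km1):
    """Classic Apriori candidate generation: prefix join + subset prune.
    frequent_km1: collection of frozensets, all of the same size k-1."""
    frequent = set(frequent_km1)
    # Work on sorted tuples so "same (k-2)-prefix" is well defined.
    sorted_sets = sorted(tuple(sorted(s)) for s in frequent)
    candidates = set()
    for i, a in enumerate(sorted_sets):
        for b in sorted_sets[i + 1:]:
            if a[:-1] != b[:-1]:   # prefixes diverge; later b's diverge too (sorted)
                break
            cand = frozenset(a) | frozenset(b)
            # Prune: every (k-1)-subset of the candidate must itself be frequent.
            if all(cand - {item} in frequent for item in cand):
                candidates.add(cand)
    return candidates

cands = apriori_gen({frozenset({"a", "b"}), frozenset({"a", "c"}),
                     frozenset({"b", "c"}), frozenset({"b", "d"})})
# {a,b,c} survives; {b,c,d} is pruned because {c,d} is not frequent
```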

@rasbt (Owner) commented May 26, 2020

Yes, #646 should address that.
