Commit

FPGrowth/FPMax and Association Rules with the existence of missing values (#1004) (#1106)

* Updated FPGrowth/FPMax and Association Rules with the existence of missing values

* Re-structure and document code

* Update unit tests

* Update CHANGELOG.md

* Modify the corresponding documentation in Jupyter notebooks

* Final modifications
zazass8 authored Oct 23, 2024
1 parent d9713ea commit 11a295e
Showing 10 changed files with 1,405 additions and 106 deletions.
11 changes: 9 additions & 2 deletions docs/sources/CHANGELOG.md
@@ -18,11 +18,18 @@ The CHANGELOG for the current development version is available at

##### New Features and Enhancements

- [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) Implemented three new metrics: Jaccard, Certainty, and Kulczynski. ([#1096](https://github.com/rasbt/mlxtend/issues/1096))
- Integrated scikit-learn's `set_output` method into `TransactionEncoder` ([#1087](https://github.com/rasbt/mlxtend/issues/1087) via [it176131](https://github.com/it176131))
- Extended the FP-Growth and FP-Max algorithms to handle missing values in the input dataset, and added a new metric, Representativity, for the generated association rules ([#1004](https://github.com/rasbt/mlxtend/issues/1004) via [zazass8](https://github.com/zazass8)).
  Files updated:
  - [`mlxtend.frequent_patterns.fpcommon`]
  - [`mlxtend.frequent_patterns.fpgrowth`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/fpgrowth/)
  - [`mlxtend.frequent_patterns.fpmax`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/fpmax/)
  - [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/)
- [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) Implemented three new metrics: Jaccard, Certainty, and Kulczynski. ([#1096](https://github.com/rasbt/mlxtend/issues/1096))
- Integrated scikit-learn's `set_output` method into `TransactionEncoder` ([#1087](https://github.com/rasbt/mlxtend/issues/1087) via [it176131](https://github.com/it176131))

##### Changes

- [`mlxtend.frequent_patterns.fpcommon`] Added the `null_values` parameter to the `valid_input_check` signature so that input containing null values can be validated. Changed the return statement of `setup_fptree` to also return the disabled array it creates, and changed the signature of `generated_itemsets` to accept that array as a parameter. Added code in [`mlxtend.frequent_patterns.fpcommon`] and [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) to handle null values in the algorithms when `null_values` is True.
- [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) Added the optional parameter `return_metrics` to return only a given list of metrics rather than every available metric.

- Add `n_classes_` attribute to stacking classifiers for compatibility with scikit-learn 1.3 ([#1091](https://github.com/rasbt/mlxtend/issues/1091))
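
The null-value handling described in the entries above can be illustrated with a small, library-free sketch. It assumes the support rule stated in this commit's FP-Growth notebook text: transactions in which any item of the itemset is missing are excluded from the denominator. The helper name `support_with_missing` and the use of `None` for a missing entry are illustrative assumptions, not part of mlxtend; the data reproduce the Eggs, Kidney Beans, and Milk columns of the notebook example.

```python
# Sketch only: a plain-Python illustration of support with missing values.
# `support_with_missing` is a hypothetical helper, not an mlxtend function.

# True = item present, False = item absent, None = value missing (NaN).
transactions = [
    {"Eggs": True, "Kidney Beans": True, "Milk": True},
    {"Eggs": True, "Kidney Beans": True, "Milk": False},
    {"Eggs": True, "Kidney Beans": True, "Milk": True},
    {"Eggs": False, "Kidney Beans": None, "Milk": True},
    {"Eggs": True, "Kidney Beans": True, "Milk": False},
]


def support_with_missing(rows, itemset):
    """Return the support of `itemset`, skipping rows with a missing item."""
    # Rows where at least one item of the itemset is missing are "disabled".
    disabled = sum(1 for row in rows if any(row[item] is None for item in itemset))
    valid = len(rows) - disabled
    if valid == 0:
        return 0.0  # no usable rows; treated as zero support in this sketch
    # Rows where every item of the itemset is observed and present.
    hits = sum(1 for row in rows if all(row[item] is True for item in itemset))
    return hits / valid


for itemset in [("Kidney Beans",), ("Eggs",), ("Milk",), ("Eggs", "Kidney Beans")]:
    print(itemset, support_with_missing(transactions, itemset))
```

Under these assumptions the four supports come out as 1.0, 0.8, 0.6, and 1.0, which agrees with the `fpgrowth(df, min_support=0.6, null_values=True, use_colnames=True)` output in the notebook diff. Kidney Beans reaches 1.0 because the one transaction with a missing value is dropped from the denominator (4 of 4 rather than 4 of 5).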
360 changes: 352 additions & 8 deletions docs/sources/user_guide/frequent_patterns/association_rules.ipynb

Large diffs are not rendered by default.

265 changes: 262 additions & 3 deletions docs/sources/user_guide/frequent_patterns/fpgrowth.ipynb
@@ -36,7 +36,9 @@
"\n",
"In general, the algorithm has been designed to operate on databases containing transactions, such as purchases by customers of a store. An itemset is considered as \"frequent\" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of items that occur together in at least 50% of all transactions in the database.\n",
"\n",
"In particular, and what makes it different from the Apriori frequent pattern mining algorithm, FP-Growth is a frequent pattern mining algorithm that does not require candidate generation. Internally, it uses a so-called FP-tree (frequent pattern tree) data structure without generating the candidate sets explicitly, which makes it particularly attractive for large datasets."
"In particular, and what makes it different from the Apriori frequent pattern mining algorithm, FP-Growth is a frequent pattern mining algorithm that does not require candidate generation. Internally, it uses a so-called FP-tree (frequent pattern tree) data structure without generating the candidate sets explicitly, which makes it particularly attractive for large datasets.\n",
"\n",
"This implementation also supports input that contains missing values [3]. The structure and logic of the algorithm are unchanged, but missing values are \"ignored\" when the support is computed, which gives a more realistic indication of how frequently the generated items and itemsets occur. For a single item, the number of transactions in which that item is missing is subtracted from the total number of transactions in the database. For an itemset with more than one item, the number of transactions in which at least one item of the itemset is missing is subtracted from the total number of transactions. The resulting count is used as the denominator of the support."
]
},
{
@@ -49,6 +51,8 @@
"\n",
"[2] Agrawal, Rakesh, and Ramakrishnan Srikant. \"[Fast algorithms for mining association rules](https://www.it.uu.se/edu/course/homepage/infoutv/ht08/vldb94_rj.pdf).\" Proc. 20th int. conf. very large data bases, VLDB. Vol. 1215. 1994.\n",
"\n",
"[3] Ragel, A. and Crémilleux, B., 1998. \"[Treatment of missing values for association rules](https://link.springer.com/chapter/10.1007/3-540-64383-4_22)\". In Research and Development in Knowledge Discovery and Data Mining: Second Pacific-Asia Conference, PAKDD-98 Melbourne, Australia, April 15–17, 1998 Proceedings 2 (pp. 258-270). Springer Berlin Heidelberg.\n",
"\n",
"## Related\n",
"\n",
"- [FP-Max](./fpmax.md)\n",
@@ -479,6 +483,261 @@
"fpgrowth(df, min_support=0.6, use_colnames=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The example below applies the algorithm to data with missing values, which are created by replacing randomly chosen entries of the original dataset with NaN."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
" df.iloc[idx[i], col[i]] = np.nan\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Apple</th>\n",
" <th>Corn</th>\n",
" <th>Dill</th>\n",
" <th>Eggs</th>\n",
" <th>Ice cream</th>\n",
" <th>Kidney Beans</th>\n",
" <th>Milk</th>\n",
" <th>Nutmeg</th>\n",
" <th>Onion</th>\n",
" <th>Unicorn</th>\n",
" <th>Yogurt</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>True</td>\n",
" <td>NaN</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>NaN</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Apple Corn Dill Eggs Ice cream Kidney Beans Milk Nutmeg Onion \\\n",
"0 False False False True False True True True True \n",
"1 False NaN True True False True False True True \n",
"2 True False False True False True True False False \n",
"3 False True False False NaN NaN True NaN False \n",
"4 False True False True NaN True False False NaN \n",
"\n",
" Unicorn Yogurt \n",
"0 NaN NaN \n",
"1 False True \n",
"2 False False \n",
"3 NaN True \n",
"4 False False "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"from mlxtend.frequent_patterns import fpgrowth\n",
"\n",
"rows, columns = df.shape\n",
"idx = np.random.randint(0, rows, 10)\n",
"col = np.random.randint(0, columns, 10)\n",
"\n",
"for i in range(10):\n",
" df.iloc[idx[i], col[i]] = np.nan\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same function as above is now applied with `null_values=True` and a minimum support of 60%:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>support</th>\n",
" <th>itemsets</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>(Kidney Beans)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.8</td>\n",
" <td>(Eggs)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.6</td>\n",
" <td>(Milk)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.0</td>\n",
" <td>(Eggs, Kidney Beans)</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" support itemsets\n",
"0 1.0 (Kidney Beans)\n",
"1 0.8 (Eggs)\n",
"2 0.6 (Milk)\n",
"3 1.0 (Eggs, Kidney Beans)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fpgrowth(df, min_support=0.6, null_values=True, use_colnames=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -677,7 +936,7 @@
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
@@ -691,7 +950,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
"version": "3.12.7"
},
"toc": {
"nav_menu": {},