FPGrowth/FPMax and Association Rules with the existence of missing va…

…lues (#1004) (#1106) * Updated FPGrowth/FPMax and Association Rules with the existence of missing values * Re-structure and document code * Update unit tests * Update CHANGELOG.md * Modify the corresponding documentation in Jupyter notebooks * Final modifications
rasbt · Oct 23, 2024 · 11a295e · 11a295e
1 parent d9713ea
commit 11a295e
Show file tree

Hide file tree

Showing 10 changed files with 1,405 additions and 106 deletions.
diff --git a/docs/sources/CHANGELOG.md b/docs/sources/CHANGELOG.md
@@ -18,11 +18,18 @@ The CHANGELOG for the current development version is available at
 
 ##### New Features and Enhancements
 
-- [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) Implemented three new metrics: Jaccard, Certainty, and Kulczynski. ([#1096](https://github.com/rasbt/mlxtend/issues/1096))
-- Integrated scikit-learn's `set_output` method into `TransactionEncoder` ([#1087](https://github.com/rasbt/mlxtend/issues/1087) via [it176131](https://github.com/it176131))
+-  Implement the FP-Growth and FP-Max algorithms with the possibility of missing values in the input dataset. Added a new metric Representativity for the association rules generated ([#1004](https://github.com/rasbt/mlxtend/issues/1004) via [zazass8](https://github.com/zazass8)).
+Files updated:
+  - ['mlxtend.frequent_patterns.fpcommon']
+  - ['mlxtend.frequent_patterns.fpgrowth'](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/fpgrowth/)
+  - ['mlxtend.frequent_patterns.fpmax'](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/fpmax/)
+  - [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/)
+- [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/)Implemented three new metrics: Jaccard, Certainty, and Kulczynski. ([#1096](https://github.com/rasbt/mlxtend/issues/1096))
+- Integrated scikit-learn's `set_output` method into `TransactionEncoder` ([#1087](https://github.com/rasbt/mlxtend/issues/1087) via[it176131](https://github.com/it176131))
 
 ##### Changes
 
+- [`mlxtend.frequent_patterns.fpcommon`] Added the null_values parameter in valid_input_check signature to check in case the input also includes null values. Changes the returns statements and function signatures for setup_fptree and generated_itemsets respectively to return the disabled array created and to include it as a parameter. Added code in [`mlxtend.frequent_patterns.fpcommon`] and [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) to implement the algorithms in case null values exist when null_values is True.
 - [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) Added optional parameter 'return_metrics' to only return a given list of metrics, rather than every possible metric.
 
 - Add `n_classes_` attribute to stacking classifiers for compatibility with scikit-learn 1.3 ([#1091](https://github.com/rasbt/mlxtend/issues/1091))

diff --git a/docs/sources/user_guide/frequent_patterns/association_rules.ipynb b/docs/sources/user_guide/frequent_patterns/association_rules.ipynb
diff --git a/docs/sources/user_guide/frequent_patterns/fpgrowth.ipynb b/docs/sources/user_guide/frequent_patterns/fpgrowth.ipynb
@@ -36,7 +36,9 @@
     "\n",
     "In general, the algorithm has been designed to operate on databases containing transactions, such as purchases by customers of a store. An itemset is considered as \"frequent\" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of items that occur together in at least 50% of all transactions in the database.\n",
     "\n",
-    "In particular, and what makes it different from the Apriori frequent pattern mining algorithm, FP-Growth is an frequent pattern mining algorithm that does not require candidate generation. Internally, it uses a so-called FP-tree (frequent pattern tree) datastrucure without generating the candidate sets explicitly, which makes it particularly attractive for large datasets."
+    "In particular, and what makes it different from the Apriori frequent pattern mining algorithm, FP-Growth is an frequent pattern mining algorithm that does not require candidate generation. Internally, it uses a so-called FP-tree (frequent pattern tree) datastrucure without generating the candidate sets explicitly, which makes it particularly attractive for large datasets.\n",
+    "\n",
+    "A new feature is implemented in this algorithm, which is the sub-case when the input contains missing information [3]. The same structure and logic of the algorithm is kept, while \"ignoring\" the missing values in the data. That gives a more realistic indication of the frequency of existence in the items/itemsets that are generated from the algorithm. The support is computed differently where for a single item, the cardinality of null values is deducted from the cardinality of all transactions in the database. For the case of an itemset, of more than one elements, the cardinality of null values in at least one item in them itemset is deducted from the cardinality of all transactions in the database. "
    ]
   },
   {
@@ -49,6 +51,8 @@
     "\n",
     "[2] Agrawal, Rakesh, and Ramakrishnan Srikant. \"[Fast algorithms for mining association rules](https://www.it.uu.se/edu/course/homepage/infoutv/ht08/vldb94_rj.pdf).\" Proc. 20th int. conf. very large data bases, VLDB. Vol. 1215. 1994.\n",
     "\n",
+    "[3] Ragel, A. and Crémilleux, B., 1998. \"[Treatment of missing values for association rules](https://link.springer.com/chapter/10.1007/3-540-64383-4_22)\". In Research and Development in Knowledge Discovery and Data Mining: Second Pacific-Asia Conference, PAKDD-98 Melbourne, Australia, April 15–17, 1998 Proceedings 2 (pp. 258-270). Springer Berlin Heidelberg.\n",
+    "\n",
     "## Related\n",
     "\n",
     "- [FP-Max](./fpmax.md)\n",
@@ -479,6 +483,261 @@
     "fpgrowth(df, min_support=0.6, use_colnames=True)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The example below implements the algorithm when there is missing information from the data, by arbitrarily removing datapoints from the original dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
+      "  df.iloc[idx[i], col[i]] = np.nan\n",
+      "C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
+      "  df.iloc[idx[i], col[i]] = np.nan\n",
+      "C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
+      "  df.iloc[idx[i], col[i]] = np.nan\n",
+      "C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
+      "  df.iloc[idx[i], col[i]] = np.nan\n",
+      "C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
+      "  df.iloc[idx[i], col[i]] = np.nan\n",
+      "C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
+      "  df.iloc[idx[i], col[i]] = np.nan\n",
+      "C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
+      "  df.iloc[idx[i], col[i]] = np.nan\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>Apple</th>\n",
+       "      <th>Corn</th>\n",
+       "      <th>Dill</th>\n",
+       "      <th>Eggs</th>\n",
+       "      <th>Ice cream</th>\n",
+       "      <th>Kidney Beans</th>\n",
+       "      <th>Milk</th>\n",
+       "      <th>Nutmeg</th>\n",
+       "      <th>Onion</th>\n",
+       "      <th>Unicorn</th>\n",
+       "      <th>Yogurt</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "      <td>True</td>\n",
+       "      <td>True</td>\n",
+       "      <td>True</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>False</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>True</td>\n",
+       "      <td>True</td>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "      <td>True</td>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>True</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "      <td>True</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>True</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>False</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>True</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>True</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   Apple   Corn   Dill   Eggs Ice cream Kidney Beans   Milk Nutmeg  Onion  \\\n",
+       "0  False  False  False   True     False         True   True   True   True   \n",
+       "1  False    NaN   True   True     False         True  False   True   True   \n",
+       "2   True  False  False   True     False         True   True  False  False   \n",
+       "3  False   True  False  False       NaN          NaN   True    NaN  False   \n",
+       "4  False   True  False   True       NaN         True  False  False    NaN   \n",
+       "\n",
+       "  Unicorn Yogurt  \n",
+       "0     NaN    NaN  \n",
+       "1   False   True  \n",
+       "2   False  False  \n",
+       "3     NaN   True  \n",
+       "4   False  False  "
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import numpy as np\n",
+    "from mlxtend.frequent_patterns import fpgrowth\n",
+    "\n",
+    "rows, columns = df.shape\n",
+    "idx = np.random.randint(0, rows, 10)\n",
+    "col = np.random.randint(0, columns, 10)\n",
+    "\n",
+    "for i in range(10):\n",
+    "    df.iloc[idx[i], col[i]] = np.nan\n",
+    "\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The same function as above is applied by setting `null_values=True` with at least 60% support:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>support</th>\n",
+       "      <th>itemsets</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>1.0</td>\n",
+       "      <td>(Kidney Beans)</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>0.8</td>\n",
+       "      <td>(Eggs)</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>0.6</td>\n",
+       "      <td>(Milk)</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>1.0</td>\n",
+       "      <td>(Eggs, Kidney Beans)</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   support              itemsets\n",
+       "0      1.0        (Kidney Beans)\n",
+       "1      0.8                (Eggs)\n",
+       "2      0.6                (Milk)\n",
+       "3      1.0  (Eggs, Kidney Beans)"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "fpgrowth(df, min_support=0.6, null_values = True, use_colnames=True)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -677,7 +936,7 @@
  "metadata": {
   "anaconda-cloud": {},
   "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
+   "display_name": "Python 3",
    "language": "python",
    "name": "python3"
   },
@@ -691,7 +950,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.10"
+   "version": "3.12.7"
   },
   "toc": {
    "nav_menu": {},