Merge pull request #33 from JaredSchwartz/Overhaul-Documentation

Refine Documentation

JaredSchwartz authored Aug 9, 2024
2 parents 5f62d8b + 8a817ed commit 10726ba

Showing 15 changed files with 156 additions and 52 deletions.
4 changes: 4 additions & 0 deletions README.md
@@ -1,3 +1,7 @@
<p align="center">
<img width="400px" src="./docs/src/assets/logo.svg" title="RuleMiner.jl logo"/>
</p>

# RuleMiner.jl - Association Rule Mining in Julia
[![Build Status](https://github.com/JaredSchwartz/RuleMiner.jl/actions/workflows/CI.yml/badge.svg?branch=main)](https://github.com/JaredSchwartz/RuleMiner.jl/actions/workflows/CI.yml?query=branch%3Amain)
[![codecov](https://codecov.io/github/JaredSchwartz/RuleMiner.jl/graph/badge.svg?token=KDAVR32F6S)](https://codecov.io/github/JaredSchwartz/RuleMiner.jl)
File renamed without changes
File renamed without changes
98 changes: 98 additions & 0 deletions docs/src/assets/logo.svg
File renamed without changes
7 changes: 4 additions & 3 deletions docs/src/association_rules.md
@@ -2,9 +2,9 @@

## Description

Association rule mining is a fundamental technique in data mining and machine learning that aims to uncover interesting relationships, correlations, or patterns within large datasets. Originally developed for market basket analysis in retail, it has since found applications in various fields such as web usage mining, intrusion detection, and bioinformatics. The primary goal of association rule mining is to identify strong rules discovered in databases using different measures of interestingness.
Association rule mining is a fundamental technique in data mining and machine learning that aims to uncover interesting relationships, correlations, or patterns within large datasets. Originally developed for market basket analysis in retail, it has since found applications in various fields such as web usage mining, intrusion detection, and bioinformatics. The primary goal of association rule mining is to identify strong association rules discovered in databases using different measures of interestingness like support, confidence, coverage, and lift.

At its core, association rule mining works by examining frequent if-then patterns in transactional databases. These patterns, known as association rules, take the form "if A, then B," where A and B are sets of items. For example, in a supermarket context, a rule might be "if a customer buys bread and butter, they are likely to buy milk." The strength of these rules is typically measured by support (how frequently the items appear together), confidence (how often the rule is found to be true), and lift (the ratio of observed support to expected support if A and B were independent). By setting minimum thresholds for these metrics, analysts can filter out weak or uninteresting rules and focus on those that are most likely to provide valuable insights or actionable information.
At its core, association rule mining works by examining frequent if-then patterns in transactional databases. These patterns, known as association rules, take the form "if A, then B," where A and B are sets of items. For example, in a supermarket context, a rule might be "if a customer buys bread and butter, they are likely to buy milk." The strength of these rules is typically measured by support (how frequently the items appear together), confidence (how often the rule is found to be true), coverage (how often B occurs in the database, with or without A), and lift (the ratio of observed support to expected support if A and B were independent). By setting thresholds on these metrics, analysts can filter out weak or uninteresting rules and focus on those that are most likely to provide valuable insights or actionable information.

## Formal Definition
Let:
@@ -22,7 +22,8 @@ For a given rule ``A \Rightarrow B``, these measures are defined:

- Support: ``\sigma(A \Rightarrow B) = \frac{|{T_j \in D : A \cup B \subseteq T_j}|}{|D|}``
- Confidence: ``\chi(A \Rightarrow B) = \frac{\sigma(A \cup B)}{\sigma(A)}``
- Lift: ``\gamma(A \Rightarrow B) = \frac{\sigma(A \cup B)}{\sigma(A) \cdot \sigma(B)}``
- Coverage: ``\gamma(A \Rightarrow B) = \sigma(B) = \frac{|{T_j \in D : B \subseteq T_j}|}{|D|}``
- Lift: ``L(A \Rightarrow B) = \frac{\sigma(A \cup B)}{\sigma(A) \cdot \sigma(B)}``

Let ``\sigma_{min}`` and ``\chi_{min}`` be user-defined minimum thresholds for support and confidence, respectively.
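
As a worked example with illustrative numbers (not drawn from a real dataset): if ``|D| = 100``, ``A`` appears in 40 transactions, ``B`` in 50, and ``A \cup B`` in 30, then support ``= 30/100 = 0.3``, confidence ``= 0.3/0.4 = 0.75``, coverage ``= 50/100 = 0.5``, and lift ``= 0.3/(0.4 \cdot 0.5) = 1.5``, meaning ``A`` and ``B`` co-occur 1.5 times more often than expected if they were independent.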

21 changes: 9 additions & 12 deletions docs/src/closed_itemsets.md
@@ -1,11 +1,11 @@
# Closed Itemset Mining

![image](./images/Closed.png)
![Diagram showing maximal itemsets as a subset of closed itemsets which are a subset of frequent itemsets](./assets/closed.png)
## Description

Closed itemset mining is a set of techniques focused on discovering closed itemsets in a transactional dataset. A closed itemset is one which appears frequently in the data (above the minimum support threshold) and which has no superset with the same support. In other words, closed itemsets are the largest possible combinations of items that occur in exactly the same transactions. They represent a lossless compression of the set of all frequent itemsets, as the support of any frequent itemset can be derived from the closed itemsets.
Closed itemset mining is a set of techniques focused on discovering closed itemsets in a transactional dataset. A closed itemset is one which appears frequently in the data (above the minimum support threshold) and which has no superset with the same support. In other words, closed itemsets are the largest possible combinations of items that share the same transactions. They represent a lossless compression of the set of all frequent itemsets, as the support of any frequent itemset can be derived from the closed itemsets.

The key advantage of mining closed itemsets is that it provides a compact yet complete representation of all frequent patterns in the data. By identifying only the closed frequent itemsets, the number of patterns generated is significantly reduced compared to mining all frequent itemsets while still retaining all support information. This approach strikes a balance between the compactness of maximal itemsets and the completeness of all frequent itemsets. It is particularly useful in scenarios where both the frequency and the exact composition of itemsets are important, such as in certain types of association rule mining or when analyzing dense datasets.
The key advantage of mining closed itemsets is that it provides a compact yet complete representation of all frequent patterns in the data. By identifying only the closed frequent itemsets, the number of patterns generated is significantly reduced compared to mining all frequent itemsets while still retaining all support information. This approach strikes a balance between the compactness of maximal itemsets and the completeness of all frequent itemsets. Closed itemset mining is particularly useful in scenarios where both the frequency and the exact composition of itemsets are important, but compression of the results is desired.

## Formal Definition
Let:
@@ -37,39 +37,36 @@ CFI = {X \mid X \subseteq I \wedge \sigma(X) \geq \sigma_{min} \wedge \nexists Y
Closed itemsets can be used to recover all frequent itemsets by generating combinations from the mined itemsets along with their supports. This can be accomplished through the levelwise algorithm proposed by Pasquier et al. in 1999.


The `levelwise` function implements the levelwise algorithm for recovering frequent itemsets from closed itemsets. This algorithm generates all subsets of the closed itemsets, derives their supports, and then returns the results. This particular implementation is designed to take a result datafrom from the various closed itemset mining algorithms in this package as its input and thus can only return absolute support (`N`), rather than both relative support and absolute support.
The `levelwise` function implements the levelwise algorithm for recovering frequent itemsets from closed itemsets. This algorithm generates all subsets of the closed itemsets, derives their supports, and then returns the results. This particular implementation is designed to take an output `DataFrame` from the various closed itemset mining algorithms in this package. Without the original transactions dataset, its input and return values can only handle absolute support (`N`), rather than both relative support and absolute support.

```@docs
levelwise(df::DataFrame, min_n::Int)
```
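
For instance, here is a minimal sketch of the intended two-step workflow. It assumes `txns` is a pre-constructed `Transactions` object and that an integer minimum support is interpreted as an absolute transaction count:

```julia
using RuleMiner

# Mine closed itemsets at an absolute minimum support of 30 transactions,
# then recover every frequent itemset and its absolute support (N)
# from the closed itemsets alone, without a second pass over the raw data.
closed = fpclose(txns, 30)
frequent = levelwise(closed, 30)
```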
## Algorithms

### CHARM

The `charm` function implements the CHARM ([C]losed, [H]ash-based [A]ssociation [R]ule [M]ining) algorithm for mining closed itemsets proposed by Mohammad Zaki and Ching-Jui Hsiao in 2002. This algorithm uses a depth-first search with hash-based approaches to pruning non-closed itemsets and is particularly efficient for dense datasets.
The `charm` function implements the CHARM ([C]losed, [H]ash-based [A]ssociation [R]ule [M]ining) algorithm for mining closed itemsets, proposed by Mohammed Zaki and Ching-Jui Hsiao in 2002. This algorithm uses a depth-first search with hash-based pruning of non-closed itemsets and is particularly efficient for sparse datasets.

```@docs
charm(txns::Transactions, min_support::Union{Int,Float64})
```
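
A hedged usage sketch follows. It assumes, as the `Union{Int,Float64}` signature suggests (an assumption, not confirmed on this page), that an `Int` threshold is an absolute transaction count while a `Float64` is a relative frequency, and that `txns` is a pre-loaded `Transactions` object:

```julia
using RuleMiner

# Assumption: Int thresholds are absolute counts; Float64 thresholds
# are relative frequencies (fractions of the total transaction count).
closed_abs = charm(txns, 50)    # itemsets in at least 50 transactions
closed_rel = charm(txns, 0.05)  # itemsets in at least 5% of transactions
```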

### FPClose
The `fpclose` function implements the FPClose ([F]requent [P]attern Close) algorithm for mining closed itemsets. This algorithm, proposed by Gösta Grahne and Jianfei Zhu in 2005, builds on the FP-Growth alogrithm to discover closed itemsets in a dataset without candidate generation.
The `fpclose` function implements the FPClose ([F]requent [P]attern Close) algorithm for mining closed itemsets. This algorithm, proposed by Gösta Grahne and Jianfei Zhu in 2005, builds on the FP-Growth algorithm to discover closed itemsets in a dataset without candidate generation. It inherits many of the advantages of FP-Growth when it comes to dense datasets.

```@docs
fpclose(txns::Transactions, min_support::Union{Int,Float64})
```
### LCM

The `LCM` function implements the LCM ([L]inear-time [C]losed [M]iner) algorithm for mining frequent closed itemsets first proposed by Uno et al. in 2004. This is an efficient method for discovering closed itemsets in a dataset with a linear time complexity.

### LCM
The `LCM` function implements the LCM ([L]inear-time [C]losed [M]iner) algorithm for mining frequent closed itemsets, first proposed by Uno et al. in 2004. This is an efficient method for discovering closed itemsets with time complexity linear in the number of closed itemsets. It is typically faster than other algorithms and has a more balanced profile, achieving fast mining on both sparse and dense datasets.

```@docs
LCM(txns::Transactions, min_support::Union{Int,Float64})
```

### CARPENTER

The `carpenter` function implements the CARPENTER ([C]losed [P]att[e]r[n] Discovery by [T]ransposing Tabl[e]s that a[r]e Extremely Long) algorithm for mining closed itemsets proposed by Pan et al. in 2003. This algorithm uses a transposed structure to optimize for datasets that have far more items than transactions, such as those found in genetic research and bioinformatics. It may not be the best choice if your data does not fit that format.
The `carpenter` function implements the CARPENTER ([C]losed [P]att[e]r[n] Discovery by [T]ransposing Tabl[e]s that a[r]e Extremely Long) algorithm for mining closed itemsets proposed by Pan et al. in 2003. This algorithm uses a transposed structure to optimize for datasets that have far more items than transactions, such as those found in genetic research and bioinformatics. It is not well suited to datasets in the more standard transaction-major format.

```@docs
carpenter(txns::Transactions, min_support::Union{Int,Float64})
16 changes: 8 additions & 8 deletions docs/src/frequent_itemsets.md
@@ -1,11 +1,11 @@
# Frequent Itemset Mining

![image](./images/Frequent.png)
![Diagram showing maximal itemsets as a subset of closed itemsets which are a subset of frequent itemsets](./assets/frequent.png)
## Description

Frequent itemset mining is a fundamental technique in data mining focused on discovering itemsets that appear frequently in a transactional dataset. A frequent itemset is a set of items that occurs together in the data with a frequency no less than a specified minimum support threshold. Frequent itemsets form the basis of various data mining tasks, including association rule mining, sequential pattern mining, and correlation analysis.

The one caveat is that depenind on the support parameter and the structure of the data, these mining techniques can yield large numbers of patterns, especially in dense datasets or with low support thresholds. This challenge has led to the development of more concise representations like closed and maximal itemset mining.
The one caveat with frequent itemset mining is that depending on the support parameter and the structure of the data, these mining techniques can yield large numbers of patterns, especially in dense datasets or with low support thresholds. This challenge has led to the development of more concise representations like closed and maximal itemset mining.

## Formal Definition
Let:
@@ -30,18 +30,18 @@ FI = {X \mid X \subseteq I \wedge \sigma(X) \geq \sigma_{min}}

## Algorithms

### FP-Growth
### ECLAT

The `fpgrowth` function implements the FP-Growth ([F]requent [P]attern Growth) algorithm for mining frequent itemsets. This algorithm, proposed by Han et al. in 2000, is an efficient method for discovering frequent itemsets in a dataset without candidate generation. It is generally more efficient than other algorithms when transactions have large numbers of items
The `eclat` function implements the [E]quivalence [CLA]ss [T]ransformation algorithm for frequent itemset mining, proposed by Mohammed Zaki in 2000. This algorithm identifies frequent itemsets in a dataset using a column-first search and a supplied minimum support.

```@docs
fpgrowth(txns::Transactions, min_support::Union{Int,Float64})
eclat(txns::Transactions, min_support::Union{Int,Float64})
```

### ECLAT
### FP-Growth

The `eclat` function implements the [E]quivalence [CLA]ss [T]ransformation algorithm for frequent itemset mining proposed by Mohammad Zaki in 2000. This algorithm identifies frequent itemsets in a dataset utilizing a column-first search and supplied minimum support.
The `fpgrowth` function implements the FP-Growth ([F]requent [P]attern Growth) algorithm for mining frequent itemsets. This algorithm, proposed by Han et al. in 2000, is an efficient method for discovering frequent itemsets in a dataset without candidate generation. It is generally more efficient than other algorithms on dense datasets, as the internal FP-tree data structure it builds compactly summarizes the relationships and supports of the itemsets.

```@docs
eclat(txns::Transactions, min_support::Union{Int,Float64})
fpgrowth(txns::Transactions, min_support::Union{Int,Float64})
```
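
Since both functions return the same frequent itemsets, a reasonable way to choose between them is to try both on a sample of your data. A minimal sketch, assuming `txns` is a pre-constructed `Transactions` object and a `Float64` threshold is a relative minimum support:

```julia
using RuleMiner

# Both calls should produce the same itemsets; per the notes above,
# fpgrowth tends to do better on dense data, so benchmark on your own dataset.
fi_eclat = eclat(txns, 0.05)
fi_fpg   = fpgrowth(txns, 0.05)
```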