Merge pull request #33 from JaredSchwartz/Overhaul-Documentation

Refine Documentation

JaredSchwartz authored Aug 9, 2024
2 parents 5f62d8b + 8a817ed commit 10726ba

Showing 15 changed files with 156 additions and 52 deletions.
4 changes: 4 additions & 0 deletions README.md
@@ -1,3 +1,7 @@
<p align="center">
<img width="400px" src="./docs/src/assets/logo.svg" title="RuleMiner.jl logo"/>
</p>

# RuleMiner.jl - Association Rule Mining in Julia
[![Build Status](https://github.com/JaredSchwartz/RuleMiner.jl/actions/workflows/CI.yml/badge.svg?branch=main)](https://github.com/JaredSchwartz/RuleMiner.jl/actions/workflows/CI.yml?query=branch%3Amain)
[![codecov](https://codecov.io/github/JaredSchwartz/RuleMiner.jl/graph/badge.svg?token=KDAVR32F6S)](https://codecov.io/github/JaredSchwartz/RuleMiner.jl)
File renamed without changes
File renamed without changes
98 changes: 98 additions & 0 deletions docs/src/assets/logo.svg
File renamed without changes
7 changes: 4 additions & 3 deletions docs/src/association_rules.md
@@ -2,9 +2,9 @@

## Description

Association rule mining is a fundamental technique in data mining and machine learning that aims to uncover interesting relationships, correlations, or patterns within large datasets. Originally developed for market basket analysis in retail, it has since found applications in various fields such as web usage mining, intrusion detection, and bioinformatics. The primary goal of association rule mining is to identify strong rules discovered in databases using different measures of interestingness.
Association rule mining is a fundamental technique in data mining and machine learning that aims to uncover interesting relationships, correlations, or patterns within large datasets. Originally developed for market basket analysis in retail, it has since found applications in various fields such as web usage mining, intrusion detection, and bioinformatics. The primary goal of association rule mining is to identify strong association rules discovered in databases using different measures of interestingness like support, confidence, coverage, and lift.

At its core, association rule mining works by examining frequent if-then patterns in transactional databases. These patterns, known as association rules, take the form "if A, then B," where A and B are sets of items. For example, in a supermarket context, a rule might be "if a customer buys bread and butter, they are likely to buy milk." The strength of these rules is typically measured by support (how frequently the items appear together), confidence (how often the rule is found to be true), and lift (the ratio of observed support to expected support if A and B were independent). By setting minimum thresholds for these metrics, analysts can filter out weak or uninteresting rules and focus on those that are most likely to provide valuable insights or actionable information.
At its core, association rule mining works by examining frequent if-then patterns in transactional databases. These patterns, known as association rules, take the form "if A, then B," where A and B are sets of items. For example, in a supermarket context, a rule might be "if a customer buys bread and butter, they are likely to buy milk." The strength of these rules is typically measured by support (how frequently the items appear together), confidence (how often the rule is found to be true), coverage (how often B occurs in the database, with or without A), and lift (the ratio of observed support to expected support if A and B were independent). By setting thresholds on these metrics, analysts can filter out weak or uninteresting rules and focus on those that are most likely to provide valuable insights or actionable information.

## Formal Definition
Let:
@@ -22,7 +22,8 @@ For a given rule ``A \Rightarrow B``, these measures are defined:

- Support: ``\sigma(A \Rightarrow B) = \frac{|{T_j \in D : A \cup B \subseteq T_j}|}{|D|}``
- Confidence: ``\chi(A \Rightarrow B) = \frac{\sigma(A \cup B)}{\sigma(A)}``
- Lift: ``\gamma(A \Rightarrow B) = \frac{\sigma(A \cup B)}{\sigma(A) \cdot \sigma(B)}``
- Coverage: ``\gamma(A \Rightarrow B) = \sigma(B) = \frac{|{T_j \in D : B \subseteq T_j}|}{|D|}``
- Lift: ``L(A \Rightarrow B) = \frac{\sigma(A \cup B)}{\sigma(A) \cdot \sigma(B)}``

Let ``\sigma_{min}`` and ``\chi_{min}`` be user-defined minimum thresholds for support and confidence, respectively.
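
As a worked example with illustrative numbers (not drawn from a real dataset): if ``|D| = 100``, ``A`` appears in 40 transactions, ``B`` in 50, and ``A \cup B`` in 30, then support ``= 30/100 = 0.3``, confidence ``= 0.3/0.4 = 0.75``, coverage ``= 50/100 = 0.5``, and lift ``= 0.3/(0.4 \cdot 0.5) = 1.5``, meaning ``A`` and ``B`` co-occur 1.5 times more often than expected if they were independent.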

21 changes: 9 additions & 12 deletions docs/src/closed_itemsets.md
@@ -1,11 +1,11 @@
# Closed Itemset Mining

![image](./images/Closed.png)
![Diagram showing maximal itemsets as a subset of closed itemsets which are a subset of frequent itemsets](./assets/closed.png)
## Description

Closed itemset mining is a set of techniques focused on discovering closed itemsets in a transactional dataset. A closed itemset is one which appears frequently in the data (above the minimum support threshold) and which has no superset with the same support. In other words, closed itemsets are the largest possible combinations of items that occur in exactly the same transactions. They represent a lossless compression of the set of all frequent itemsets, as the support of any frequent itemset can be derived from the closed itemsets.
Closed itemset mining is a set of techniques focused on discovering closed itemsets in a transactional dataset. A closed itemset is one which appears frequently in the data (above the minimum support threshold) and which has no superset with the same support. In other words, closed itemsets are the largest possible combinations of items that share the same transactions. They represent a lossless compression of the set of all frequent itemsets, as the support of any frequent itemset can be derived from the closed itemsets.

The key advantage of mining closed itemsets is that it provides a compact yet complete representation of all frequent patterns in the data. By identifying only the closed frequent itemsets, the number of patterns generated is significantly reduced compared to mining all frequent itemsets while still retaining all support information. This approach strikes a balance between the compactness of maximal itemsets and the completeness of all frequent itemsets. It is particularly useful in scenarios where both the frequency and the exact composition of itemsets are important, such as in certain types of association rule mining or when analyzing dense datasets.
The key advantage of mining closed itemsets is that it provides a compact yet complete representation of all frequent patterns in the data. By identifying only the closed frequent itemsets, the number of patterns generated is significantly reduced compared to mining all frequent itemsets while still retaining all support information. This approach strikes a balance between the compactness of maximal itemsets and the completeness of all frequent itemsets. Closed itemset mining is particularly useful in scenarios where both the frequency and the exact composition of itemsets are important, but compression of the results is desired.

## Formal Definition
Let:
@@ -37,39 +37,36 @@ CFI = {X \mid X \subseteq I \wedge \sigma(X) \geq \sigma_{min} \wedge \nexists Y
Closed itemsets can be used to recover all frequent itemsets by generating combinations from the mined itemsets along with their supports. This can be accomplished through the levelwise algorithm proposed by Pasquier et al. in 1999.


The `levelwise` function implements the levelwise algorithm for recovering frequent itemsets from closed itemsets. This algorithm generates all subsets of the closed itemsets, derives their supports, and then returns the results. This particular implementation is designed to take a result datafrom from the various closed itemset mining algorithms in this package as its input and thus can only return absolute support (`N`), rather than both relative support and absolute support.
The `levelwise` function implements the levelwise algorithm for recovering frequent itemsets from closed itemsets. This algorithm generates all subsets of the closed itemsets, derives their supports, and then returns the results. This particular implementation is designed to take an output `DataFrame` from the various closed itemset mining algorithms in this package. Without the original transactions dataset, its input and return values can only handle absolute support (`N`), rather than both relative support and absolute support.

```@docs
levelwise(df::DataFrame, min_n::Int)
```
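
For instance, here is a minimal sketch of the intended two-step workflow. It assumes `txns` is a pre-constructed `Transactions` object and that an integer minimum support is interpreted as an absolute transaction count:

```julia
using RuleMiner

# Mine closed itemsets at an absolute minimum support of 30 transactions,
# then recover every frequent itemset and its absolute support (N)
# from the closed itemsets alone, without a second pass over the raw data.
closed = fpclose(txns, 30)
frequent = levelwise(closed, 30)
```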
## Algorithms

### CHARM

The `charm` function implements the CHARM ([C]losed, [H]ash-based [A]ssociation [R]ule [M]ining) algorithm for mining closed itemsets proposed by Mohammad Zaki and Ching-Jui Hsiao in 2002. This algorithm uses a depth-first search with hash-based approaches to pruning non-closed itemsets and is particularly efficient for dense datasets.
The `charm` function implements the CHARM ([C]losed, [H]ash-based [A]ssociation [R]ule [M]ining) algorithm for mining closed itemsets, proposed by Mohammed Zaki and Ching-Jui Hsiao in 2002. This algorithm uses a depth-first search with hash-based pruning of non-closed itemsets and is particularly efficient for sparse datasets.

```@docs
charm(txns::Transactions, min_support::Union{Int,Float64})
```
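
A hedged usage sketch follows. It assumes, as the `Union{Int,Float64}` signature suggests (an assumption, not confirmed on this page), that an `Int` threshold is an absolute transaction count while a `Float64` is a relative frequency, and that `txns` is a pre-loaded `Transactions` object:

```julia
using RuleMiner

# Assumption: Int thresholds are absolute counts; Float64 thresholds
# are relative frequencies (fractions of the total transaction count).
closed_abs = charm(txns, 50)    # itemsets in at least 50 transactions
closed_rel = charm(txns, 0.05)  # itemsets in at least 5% of transactions
```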

### FPClose
The `fpclose` function implements the FPClose ([F]requent [P]attern Close) algorithm for mining closed itemsets. This algorithm, proposed by Gösta Grahne and Jianfei Zhu in 2005, builds on the FP-Growth alogrithm to discover closed itemsets in a dataset without candidate generation.
The `fpclose` function implements the FPClose ([F]requent [P]attern Close) algorithm for mining closed itemsets. This algorithm, proposed by Gösta Grahne and Jianfei Zhu in 2005, builds on the FP-Growth algorithm to discover closed itemsets in a dataset without candidate generation. It inherits many of the advantages of FP-Growth when it comes to dense datasets.

```@docs
fpclose(txns::Transactions, min_support::Union{Int,Float64})
```
### LCM

The `LCM` function implements the LCM ([L]inear-time [C]losed [M]iner) algorithm for mining frequent closed itemsets first proposed by Uno et al. in 2004. This is an efficient method for discovering closed itemsets in a dataset with a linear time complexity.

### LCM
The `LCM` function implements the LCM ([L]inear-time [C]losed [M]iner) algorithm for mining frequent closed itemsets, first proposed by Uno et al. in 2004. This is an efficient method for discovering closed itemsets with time complexity linear in the number of closed itemsets. It is typically faster than other algorithms and has a more balanced profile, achieving fast mining on both sparse and dense datasets.

```@docs
LCM(txns::Transactions, min_support::Union{Int,Float64})
```

### CARPENTER

The `carpenter` function implements the CARPENTER ([C]losed [P]att[e]r[n] Discovery by [T]ransposing Tabl[e]s that a[r]e Extremely Long) algorithm for mining closed itemsets proposed by Pan et al. in 2003. This algorithm uses a transposed structure to optimize for datasets that have far more items than transactions, such as those found in genetic research and bioinformatics. It may not be the best choice if your data does not fit that format.
The `carpenter` function implements the CARPENTER ([C]losed [P]att[e]r[n] Discovery by [T]ransposing Tabl[e]s that a[r]e Extremely Long) algorithm for mining closed itemsets proposed by Pan et al. in 2003. This algorithm uses a transposed structure to optimize for datasets that have far more items than transactions, such as those found in genetic research and bioinformatics. It is not well suited to datasets in the more standard transaction-major format.

```@docs
carpenter(txns::Transactions, min_support::Union{Int,Float64})
16 changes: 8 additions & 8 deletions docs/src/frequent_itemsets.md
@@ -1,11 +1,11 @@
# Frequent Itemset Mining

![image](./images/Frequent.png)
![Diagram showing maximal itemsets as a subset of closed itemsets which are a subset of frequent itemsets](./assets/frequent.png)
## Description

Frequent itemset mining is a fundamental technique in data mining focused on discovering itemsets that appear frequently in a transactional dataset. A frequent itemset is a set of items that occurs together in the data with a frequency no less than a specified minimum support threshold. Frequent itemsets form the basis of various data mining tasks, including association rule mining, sequential pattern mining, and correlation analysis.

The one caveat is that depenind on the support parameter and the structure of the data, these mining techniques can yield large numbers of patterns, especially in dense datasets or with low support thresholds. This challenge has led to the development of more concise representations like closed and maximal itemset mining.
The one caveat with frequent itemset mining is that depending on the support parameter and the structure of the data, these mining techniques can yield large numbers of patterns, especially in dense datasets or with low support thresholds. This challenge has led to the development of more concise representations like closed and maximal itemset mining.

## Formal Definition
Let:
@@ -30,18 +30,18 @@ FI = {X \mid X \subseteq I \wedge \sigma(X) \geq \sigma_{min}}

## Algorithms

### FP-Growth
### ECLAT

The `fpgrowth` function implements the FP-Growth ([F]requent [P]attern Growth) algorithm for mining frequent itemsets. This algorithm, proposed by Han et al. in 2000, is an efficient method for discovering frequent itemsets in a dataset without candidate generation. It is generally more efficient than other algorithms when transactions have large numbers of items
The `eclat` function implements the [E]quivalence [CLA]ss [T]ransformation algorithm for frequent itemset mining, proposed by Mohammed Zaki in 2000. This algorithm identifies frequent itemsets in a dataset using a column-first search and a supplied minimum support.

```@docs
fpgrowth(txns::Transactions, min_support::Union{Int,Float64})
eclat(txns::Transactions, min_support::Union{Int,Float64})
```

### ECLAT
### FP-Growth

The `eclat` function implements the [E]quivalence [CLA]ss [T]ransformation algorithm for frequent itemset mining proposed by Mohammad Zaki in 2000. This algorithm identifies frequent itemsets in a dataset utilizing a column-first search and supplied minimum support.
The `fpgrowth` function implements the FP-Growth ([F]requent [P]attern Growth) algorithm for mining frequent itemsets. This algorithm, proposed by Han et al. in 2000, is an efficient method for discovering frequent itemsets in a dataset without candidate generation. It is generally more efficient than other algorithms on dense datasets, as the internal FP-tree data structure it builds compactly summarizes the relationships and supports of the itemsets.

```@docs
eclat(txns::Transactions, min_support::Union{Int,Float64})
fpgrowth(txns::Transactions, min_support::Union{Int,Float64})
```
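
Since both functions return the same frequent itemsets, a reasonable way to choose between them is to try both on a sample of your data. A minimal sketch, assuming `txns` is a pre-constructed `Transactions` object and a `Float64` threshold is a relative minimum support:

```julia
using RuleMiner

# Both calls should produce the same itemsets; per the notes above,
# fpgrowth tends to do better on dense data, so benchmark on your own dataset.
fi_eclat = eclat(txns, 0.05)
fi_fpg   = fpgrowth(txns, 0.05)
```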