This package for R implements the Classification based on Associations algorithm (CBA):
Liu, B. Hsu, W. and Ma, Y (1998). Integrating Classification and Association Rule Mining. Proceedings KDD-98, New York, 27-31 August. AAAI Press. pp 80-86.
The arules package is used for the rule generation step.
The package is also available in R CRAN repository as Association Rule Classification (arc) package.
This package for R is described in an R Journal article
Hahsler, M., Johnson, I., Kliegr, T., & Kuchar, J. (2019). Associative Classification in R: arc, arulesCBA, and rCBA. R Journal, 9(2).
- Automatic discretization of predictor attributes
- Automatic tuning of support and confidence thresholds
- Pure R package
The package can be installed directly from CRAN using the following command executed from the R environment:
install.packages("arc")
Development version can be installed from github from the R environment using the devtools package.
devtools::install_github("kliegr/arc")
library(arc)
set.seed(101)
# dataset setup
iris_shuffled <- datasets::iris[sample(nrow(datasets::iris)),]
train <- iris_shuffled[1:100,]
test <- iris_shuffled[101:nrow(iris_shuffled),]
classatt <- "Species"
# learn, apply and evaluate the CBA classifier
rm <- cba(train, classatt)
prediction <- predict(rm, test)
acc <- CBARuleModelAccuracy(prediction, test[[classatt]])
print(acc)
# interpret by listing the rules in the classifier
inspect(rm@rules)
Association rule learning often generates large number of rules. This shows how to use the arc package to reduce the size of the rule set.
library(arc)
data(Adult)
classitems <- c("income=small","income=large") #define target attribute (consequent)
rules <- apriori(Adult, parameter = list(supp = 0.05, conf = 0.5, target = "rules"), appearance=list(rhs=classitems, default="lhs"))
# now we have 1266 rules
pruned <- prune(rules,Adult,classitems)
inspect(pruned)
# only 174 after pruning with arc
Additional reduction of the size of the rule set can be achieved by setting greedy_pruning=TRUE
.
pruned <- prune(rules, Adult, classitems, greedy_pruning=TRUE)
inspect(pruned)
# produces 141 rules
The resulting rule list can also be used as a classifier.
In some cases, pruning does not produce sufficiently concise rule list. Function topRules
allows the user to set the target number of rules that will be used as an input for classifier building, thus serving as the upper bound on rule count.
The arules documentation gives the following example:
data("Adult")
rules <- topRules(Adult, target_rule_count = 100, init_support = 0.5, init_conf = 0.9, minlen = 1, init_maxlen = 10)
summary(rules)
This will return exactly 100 rules. These can then be passed to CBA for pruning:
pruned <- prune(rules, Adult, classitems, greedy_pruning=TRUE)
The resulting classifier stored in pruned
has 33 rules.
First, let's consider a classifier similar to the one learnt in Use case 1, which in prediction
contains predicted classes for each instance in test
:
Consider test instance 1:
test[1,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
51 7.7 3.8 6.7 2.2 virginica
The prediction is
prediction[1]
[1] virginica
Levels: setosa versicolor virginica
Consider test instance 1:
firingRuleIDs <- predict(rm, testFold, outputFiringRuleIDs=TRUE)
inspect(rm@rules[firingRuleIDs[1]])
confidence_scores <- predict(rm, testFold, outputConfidenceScores=TRUE)
For a particular instance:
rm@rules[firingRuleIDs[1]]@quality$confidence
rm@rules[firingRuleIDs[1]]@quality$orderedConf
rm@rules[firingRuleIDs[1]]@quality$cumulativeConf
Explanation:
-
rule confidence is computed as
$a/(a+b)$ , where$a$ is the number of instances matching both the antecedent and consequent (available in slotsupport
) and$b$ is the number of instances matching the antecedent but not matching the consequent of the given rule.
The arc package provides two alternative measures:
- order-sensitive confidence is computed only from instances reaching the given rule. Note that CBA generates ordered rule lists.
- cumulative confidence is an experimental measure computed as the accuracy of the rule list comprising the given rule and all higher priority rules (rules with lower index) with uncovered instances excluded from the computation.
library(ROCR)
set.seed(101)
classitems <- c("income=small","income=large")
adult <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
sep = ',', fill = F, strip.white = T, col.names = c('age', 'workclass', 'fnlwgt', 'educatoin',
'educatoin_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex',
'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income'))
split = sample(c(TRUE, FALSE), nrow(adult), replace=TRUE, prob=c(0.75, 0.25))
trainFold <- adult[split,]
testFold <- adult[!split,]
classAtt <- "income"
positiveClass<-">50K"
rm <- cba(trainFold, classAtt, list(target_rule_count = 1000))
confidence_scores <- predict(rm, testFold, outputConfidenceScores=TRUE,positiveClass=positiveClass)
pred_cba <- ROCR::prediction(confidence_scores, factor(testFold[[classAtt]]))
roc_cba <- ROCR::performance(pred_cba, "tpr", "fpr")
ROCR::plot(roc_cba, lwd=2, colorize=TRUE)
lines(x=c(0, 1), y=c(0, 1), col="black", lwd=1)
auc <- ROCR::performance(pred_cba, "auc")
auc <- unlist(auc@y.values)
auc
> auc
[1] 0.8946532
- When invoking
topRules
, setinit_maxlen
parameter to a low value:
data("Adult")
classitems <- c("income=small","income=large")
rules <- topRules(Adult, target_rule_count = 100, init_support = 0.05, init_conf = 0.5, minlen = 1, init_maxlen = 2, appearance=list(rhs=classitems, default="lhs"))
inspect(rules)
- Experiment with the value of the
rule_window
parameter. This has no effect on the quality of the classifier. - Set
greedy_pruning
to TRUE. This will have generally slightly adverse impact on the quality of the classifier, but it will decrease the size of the rule set and reduce the time required for pruning. Greedy pruning is not part of the CBA algorithm as published by Liu et al (1998).