\begin{center}
\large
Response Letter
\end{center}
We thank the reviewers for their useful comments and feedback, which helped improve the manuscript and its presentation. In the following, we summarize the changes reflected in the revised manuscript.
\paragraph{Distribution shifts in life-long learning.} We propose Hamming distance heuristics in {\imli} under the assumption that the distribution remains the same across batches. One way to account for distribution shifts is to consider the last $ p $ ($ p > 1 $) batches, instead of only the previous batch, in the objective function of mini-batch learning. For a feature variable $ B^i_j $, we consider its majority assignment in the last $ p $ classification rules and encode it as a soft clause to retain the majority assignment in the current batch. Moreover, we can reweigh the soft clause by prioritizing assignments of $ B^i_j $ in more recent batches. We mention this in Section~\ref{interpretability_imli_sec:incremental_learning}, page 28.
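For illustration, one possible reweighing scheme (a sketch; the decay factor $ \gamma $ and indicator notation are ours here, not fixed in the manuscript) assigns the soft clause retaining the majority assignment $ b^i_j $ of $ B^i_j $ the weight
\begin{equation*}
	w(B^i_j) = \sum_{k=1}^{p} \gamma^{\,k-1} \cdot \mathbf{1}\!\left[ B^i_j = b^i_j \text{ in batch } t-k \right], \qquad \gamma \in (0,1],
\end{equation*}
so that agreement with more recent batches contributes more weight; $ \gamma = 1 $ recovers a plain majority vote over the last $ p $ batches.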
\paragraph{Discussions of related works in experiments.} {\imli} scales better than existing interpretable rule-based classifiers because of its incremental solving approach. Most existing works rely on mining potential classification rules followed by an optimization algorithm, such as Bayesian optimization or branch-and-bound. In the experiments, we observe the incremental learning of {\imli} to be more scalable than the state-of-the-art methods. We elaborate on this in Section~\ref{interpretability_imli_sec:experiments_scalability}, page 39.
\paragraph{How does {\justicia} outperform a direct approach by learning a Bayesian Network on limited samples?} We agree with the reviewer that learning a Bayesian network on limited samples does not always make {\justicia} more robust (e.g., yield a lower standard deviation in the estimation of fairness metrics) than the direct approach of estimating fairness on the dataset.
More precisely, in Chapter~\ref{chapter:justicia}, Figure~\ref{fairness_justicia_fig:sample-size}, {\justicia} demonstrates higher robustness than the direct approach, where we consider a specific distribution of non-sensitive features conditioned only on the sensitive features. Thus, Figure~\ref{fairness_justicia_fig:sample-size} does not involve experiments with a Bayesian network capturing the correlations of all features, which we indeed introduce later in Chapter~\ref{chapter:fvgm}. In our revised experiment, we observe that {\justicia} with a Bayesian network, called {\fvgm}, exhibits lower robustness due to the Bayesian network learning step.
\paragraph{Clarification on Fairness Influence Function (FIF).} Our additive axiom for FIFs is based on the idea of decomposing the total unfairness of the classifier among different subsets of features~\cite{begley2020explainability,lundberg2020explaining}. We require that the sum of the FIFs of all subsets of non-sensitive features equals \textit{the resultant unfairness of the classifier}, where unfairness is a real number in $ [0,1] $, such as statistical parity, equalized odds, or predictive parity.
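To state the additive axiom compactly (a sketch; the symbols $ w_{\mathbf{S}} $ for the FIF of a subset $ \mathbf{S} $ and $ \epsilon $ for the resultant unfairness are our notation here):
\begin{equation*}
	\sum_{\mathbf{S} \subseteq \mathbf{X}} w_{\mathbf{S}} = \epsilon, \qquad \epsilon \in [0,1],
\end{equation*}
where $ \mathbf{X} $ denotes the set of non-sensitive features and $ \epsilon $ instantiates, for example, to statistical parity.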
\paragraph{Clarification on Bayesian Network.} We consider a Bayesian network as an input distribution to express the conditional dependencies and independencies among features. In Chapter~\ref{chapter:CNF_feature_correlation}, we demonstrate the encoding of probabilistic inference into SSAT via additional Boolean variables and clauses~\cite{chavira2008probabilistic}. We also discuss the complexity of the SSAT encoding in terms of the complexity of the Bayesian network (Section~\ref{chapter_fairness_preliminaries_BN}, page 15).
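As a simplified illustration of such an encoding (a sketch only; the exact construction appears in Chapter~\ref{chapter:CNF_feature_correlation}): for an edge $ A \rightarrow B $ with $ \Pr[B = 1 \mid A = 1] = \theta_1 $ and $ \Pr[B = 1 \mid A = 0] = \theta_0 $, one can introduce two randomized-quantified auxiliary variables $ r_1 $ and $ r_0 $, true with probabilities $ \theta_1 $ and $ \theta_0 $ respectively, and add the clauses encoding
\begin{equation*}
	B \leftrightarrow \big( (A \wedge r_1) \vee (\neg A \wedge r_0) \big),
\end{equation*}
so that, conditioned on $ A $, the variable $ B $ is true with the intended conditional probability.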