Neural networks created and deployed in mission-critical domains must demonstrate high confidence in their predictions. However, without appropriate training data, it is impossible to evaluate the accuracy of the model. In addition, unless edge cases and other special scenarios, which may or may not be infrequent, are represented during training, the neural network cannot learn to handle them and will make inaccurate predictions when it encounters them. In applications with high-dimensional input data, such as image classification, the available training dataset cannot easily capture all of the required patterns, so such situations arise quite easily.
Manifold-based test generation~\cite{byun2020manifoldtestgen,byun2020manifoldassurance,byun2021black} is a technique that captures the necessary patterns in a low-dimensional manifold space so that the NN can learn them. This is achieved by projecting the data points from the high-dimensional input space to a low-dimensional space. The approach uses a Conditional Variational Autoencoder (CVAE) to capture the manifold space, which is then utilized to generate novel fault-revealing test cases. A unique feature of this approach is that the test cases are generated along with their labels. The resulting fault-revealing test cases can be utilized in two ways: 1) to create a test suite that evaluates the performance of the neural network, and 2) to augment the training set so that the performance and accuracy of the neural network can be improved.
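To make the CVAE's role concrete, the sketch below shows a minimal conditional VAE in PyTorch. The architecture, layer sizes, and latent dimension are illustrative assumptions following the standard CVAE formulation; they are not the implementation used by the tool.
\begin{verbatim}
# Minimal CVAE sketch (PyTorch); sizes are illustrative assumptions.
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, x_dim=784, y_dim=10, z_dim=8, h_dim=400):
        super().__init__()
        # Encoder q(z|x,y): image + one-hot label -> latent Gaussian.
        self.enc = nn.Sequential(nn.Linear(x_dim + y_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        # Decoder p(x|z,y): latent code + label -> reconstructed image.
        self.dec = nn.Sequential(
            nn.Linear(z_dim + y_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def encode(self, x, y):
        h = self.enc(torch.cat([x, y], dim=1))
        return self.mu(h), self.logvar(h)

    def decode(self, z, y):
        return self.dec(torch.cat([z, y], dim=1))

    def forward(self, x, y):
        mu, logvar = self.encode(x, y)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        return self.decode(z, y), mu, logvar

def cvae_loss(x_hat, x, mu, logvar):
    # Negative ELBO: reconstruction error + KL divergence to N(0, I).
    bce = nn.functional.binary_cross_entropy(x_hat, x, reduction='sum')
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kld
\end{verbatim}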
\begin{figure}[h]
% \vspace{-5mm}
\includegraphics[width=\linewidth]{Fig/manifold_workflow.pdf}
% \vspace{-20mm}
\caption{Manifold-based test generation workflow.}
\label{fig:manifold_workflow}
\end{figure}
Fig~\ref{fig:manifold_workflow} shows the workflow of the manifold-based test generation tool. First, the NN is trained using the dataset, as shown in Fig~\ref{fig:manifold_workflow} (step 1). The trained NN can be evaluated for its performance using a separate testing dataset, which is not shown in this workflow. Next, a VAE is trained using the same training dataset used for the NN (step 2). Once the trained VAE model is developed, it can then be used to train the Latent Space Classifier (LSC), as shown in Fig~\ref{fig:manifold_workflow} (step 4). Note that, if needed, the quality of the trained VAE can be evaluated by measuring its Fr\'echet Inception Distance (FID) score (step 3); the closer the FID score is to zero, the better the trained VAE is assumed to be. This step can also be performed before training the LSC. The LSC is responsible for learning the manifold space. To generate fault-revealing test cases from the manifold space, the trained NN and the LSC are passed as inputs to the test generation algorithm (step 5). The algorithm selects test cases, each of which includes a label, from the manifold space and evaluates them using the NN. If the NN mispredicts the output, the test case is regarded as fault-revealing. The algorithm generates the desired number of fault-revealing test cases, which can be utilized as part of the training dataset to retrain the original NN (step 6). Note that in the figure this step is represented with a dashed line because this feature is currently not provided by the tool.
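For reference, the FID score in step 3 is the standard Fr\'echet Inception Distance between the Gaussian statistics $(\mu_r, \Sigma_r)$ of Inception features of real images and the statistics $(\mu_g, \Sigma_g)$ of VAE-generated images:
\[
\mathit{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right).
\]
The core of step 5 can be pictured as the loop sketched below, reusing the CVAE sketch above; the \texttt{lsc} and \texttt{nn\_model} callables are hypothetical stand-ins, not the tool's actual API.
\begin{verbatim}
# Sketch of step 5: sample the manifold, keep mispredicted inputs.
import torch
import torch.nn.functional as F

def generate_fault_revealing(cvae, lsc, nn_model, n_tests, z_dim=8):
    tests = []
    while len(tests) < n_tests:
        z = torch.randn(1, z_dim)            # sample the manifold space
        label = lsc(z).argmax(dim=1)         # expected label from the LSC
        x = cvae.decode(z, F.one_hot(label, 10).float())
        pred = nn_model(x).argmax(dim=1)
        if pred.item() != label.item():      # misprediction: fault-revealing
            tests.append((x.detach(), label.item()))
    return tests
\end{verbatim}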
\subsection{Tool Evaluation:} We evaluated the tool on the MNIST dataset and followed the workflow described in Fig~\ref{fig:manifold_workflow}. Fault-revealing test cases can be generated by various algorithms; the two utilized in our evaluation are the random test generation and the search-based test generation algorithms. We show the results for both below.
\begin{itemize}
\item \textit{Random Test Case Generation:} Using this test generation algorithm, the desired number of test cases is generated; however, not all of them may be fault-revealing. Fig~\ref{fig:random} represents the generation of 50 random test cases, of which only five were identified as fault-revealing. The test cases are shown with different color boxes. From Fig~\ref{fig:random} alone, it is impossible to understand why these particular test cases were selected as fault-revealing, meaning one cannot tell whether the tool identified them correctly. A visual representation provides better insight, so we have generated visual representations for each of the test cases in Fig~\ref{fig:random}, as shown in Fig~\ref{fig:random_visual}. %Again, each fault-revealing test case is color-coded for ease of understanding.
\begin{figure}[h]
% \vspace{-5mm}
\includegraphics[width=\linewidth]{Fig/random.png}
% \vspace{-20mm}
\caption{Randomly generated test cases.}
\label{fig:random}
\end{figure}
\begin{figure}[h]
% \vspace{-5mm}
\includegraphics[width=\linewidth]{Fig/random_visual.png}
% \vspace{-20mm}
\caption{Visual representation of randomly generated test cases.}
\label{fig:random_visual}
\end{figure}
\item \textit{Search-Based Test Case Generation:} Using this test generation algorithm, all of the generated test cases are fault-revealing; a hedged sketch of one possible search loop is given after this list. Fig~\ref{fig:search-based} represents the generation of 15 search-based test cases. As shown in the figure, all 15 test cases were fault-revealing, and they are shown in different color boxes for clarity. Fig~\ref{fig:search_visual} shows the visual representation of each of these fault-revealing test cases. %Again, each of these fault-revealing test cases are color coded for ease of understanding.
\begin{figure}[h]
% \vspace{-5mm}
\includegraphics[width=\linewidth]{Fig/search-based.png}
% \vspace{-20mm}
\caption{Search-based generated test cases.}
\label{fig:search-based}
\end{figure}
\begin{figure}[h]
% \vspace{-5mm}
\includegraphics[width=\linewidth]{Fig/search_visual.png}
% \vspace{-20mm}
\caption{Visual representation of search-based generated test cases.}
\label{fig:search_visual}
\end{figure}
\end{itemize}
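To illustrate how a search-based generator can guarantee that every returned test is fault-revealing, the sketch below hill-climbs in the latent space, keeping random perturbations that reduce the NN's confidence in the LSC-expected class until a misprediction occurs. The perturbation step and acceptance rule are our own illustrative assumptions, not the algorithm published with the tool.
\begin{verbatim}
# Hedged sketch of a search-based generator: perturb the latent code
# until the NN mispredicts the LSC-expected label.
import torch
import torch.nn.functional as F

def search_fault_revealing(cvae, lsc, nn_model, z_dim=8,
                           steps=200, sigma=0.1):
    def evaluate(z):
        label = lsc(z).argmax(dim=1)
        x = cvae.decode(z, F.one_hot(label, 10).float())
        probs = torch.softmax(nn_model(x), dim=1)
        return (x, label.item(), probs[0, label].item(),
                probs.argmax(dim=1).item())

    z = torch.randn(1, z_dim)
    x, label, conf, pred = evaluate(z)
    for _ in range(steps):
        if pred != label:                       # fault-revealing test found
            return x.detach(), label
        cand = z + sigma * torch.randn_like(z)  # random local perturbation
        if evaluate(cand)[2] < conf:            # keep confidence-reducing moves
            z = cand
            x, label, conf, pred = evaluate(z)
    return None                                 # step budget exhausted
\end{verbatim}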
\subsection{Tool Limitations:} Based on our evaluation, the tool provides a unique mechanism for generating fault-revealing test cases for high-dimensional datasets. However, the tool still has the following limitations:
\begin{itemize}
\item The generated fault-revealing test cases require manual inspection before they can be included in the training test suite to improve prediction accuracy. As seen in Figures~\ref{fig:random_visual} and \ref{fig:search_visual}, some of the predictions and their visual representations do not match, and hence a manual step is required to analyze the generated test cases. This could become a significant bottleneck when generating a large number of fault-revealing test cases. An automated approach to address this issue would make the tool significantly more useful.
\item The tool was developed as a prototype to demonstrate a proof of concept and still requires several updates to become fully usable in this space. For instance, the algorithm for search-based test generation was modified slightly to capture only fault-revealing test cases.
\item The tool is currently capable of working with MNIST, CIFAR, FASHION, and EMNIST datasets. To use other datasets as inputs, minor tool modifications are required.
\item The approach works on high-dimensional datasets; however, the tool cannot currently handle low-dimensional classification datasets appropriately. We believe the approach itself is sound and should be readily applicable to low-dimensional classification problems, which could enable neural networks in other classification domains to achieve better accuracy, but the tool requires modification to support such datasets. The extent of the needed modifications was not evaluated as part of our process.
\end{itemize}