\chapter{Standard Machine Learning Language} \label{Chapter:SML}
\section{Introduction}
\label{Introduction}
\blfootnote{
This chapter contains material from the following working paper:
\nobibliography{thesisbib}
\begin{itemize}
\item\bibentry{SML}
\end{itemize}
}
Machine learning has simplified the process of solving complicated problems in a variety of fields (see \cite{ML-UseCase1} or \cite{Monahan} for examples). However, \cite{pedros:fewUsefulThings} noted several challenges to consider when developing machine learning pipelines. If one does not consider these challenges, one may receive unsatisfactory results. Thus, we introduce the Standard Machine Learning Language (SML), a simplified representation of the machine learning process targeted at domain experts who want to utilize machine learning to solve their research questions without needing to learn the intricacies of coding and deploying machine learning pipelines.
The overall objective of SML is to provide a level of abstraction that simplifies the development process of machine learning pipelines. Consequently, this will enable students, researchers, and industry professionals who lack a background in developing machine learning pipelines to solve problems in different domains by using machine learning approaches (see Listing \ref{lst:sml-ex-1} for an example). In the subsequent sections, we first discuss related works and then define the grammar used to create SML queries. Next, we describe the architecture of SML. Finally, SML is applied to several use cases to demonstrate how our approach reduces the complexity of solving problems that utilize machine learning.
\section{Prior Works}
\label{SML:PriorWorks}
Several prior works attempt to provide a level of abstraction for developing machine learning models. \cite{RizzoloRo10} created a tool called LBJava based on a programming paradigm called Learning Based Programming (see \cite{Roth05}). Learning Based Programming is an extension of conventional programming that creates functions using data-driven approaches. LBJava utilizes machine learning to create these functions and abstracts the details from the user. What differentiates SML from LBJava is that SML offers a higher level of abstraction by providing a query-like language, allowing people who are not experienced programmers to use SML.
TPOT (see \cite{TPOT}) is a tool implemented in Python that creates and optimizes machine learning pipelines using genetic programming. Given cleaned data, TPOT preprocesses the data, performs feature selection, and constructs machine learning models. Given the task (classification, regression, or clustering), TPOT uses genetic programming to tune model parameters and select features in order to determine the best model to use. Similar to TPOT, \cite{kotthoff_auto_2019} developed Auto-WEKA, which automates the selection of learning algorithms and the tuning of hyper-parameters for models implemented in WEKA (see \cite{frank2005weka}).
Subsequently, \cite{komer_hyperopt_2019} created Hyperopt-Sklearn, which provides automated algorithm selection from models in the Scikit-learn machine learning library\footnote{see \cite{scikit-learn} for an introduction to Scikit-learn}, in a similar manner to Auto-WEKA. \cite{feurer_auto_2018} introduced improvements upon Hyperopt-Sklearn by taking into account past performance on similar datasets and constructing ensembles from optimized models. What differentiates SML from these prior works is that it provides an agnostic language that reduces the amount of code one must write, and it offers a visualization framework to assess the models' performance.
\section{Grammar}
\label{grammar}
SML is a domain-specific language whose grammar is defined in Backus-Naur form (BNF: see \cite{Backus59}). Each expression has a rule and can be expanded into additional terms. Listing \ref{lst:sml-ex-1} is an example of how one would perform classification on a dataset using SML. The query in Listing \ref{lst:sml-ex-1} reads from a dataset, performs an 80/20 split into training and testing data, respectively, and performs classification on the fifth column of the hypothetical dataset using columns 1, 2, 3, and 4 as predictors. In the subsequent subsections, we define SML's grammar in BNF and describe its keywords.
\subsection{Grammar Structure}
This subsection is dedicated to defining the grammar of SML in BNF. A \(Query\) can be defined by a delimited list of actions where the delimiter is an \(AND\) statement; in the BNF this is defined as:
\begin{equation} \label{BNF:Query}
<Query> ::= <Action> | <Action> AND <Query>
\end{equation}
An \(Action\) in (\ref{BNF:Query}) takes one of the structures defined in (\ref{BNF:Action}), where a \(Keyword\) is required and is followed by an \(Argument\), an \(OptionList\), or both.
\begin{equation} \label{BNF:Action}
\begin{split}
<Action> ::= <Keyword> <Argument> \\
| <Keyword> <Argument> (<Option List>) \\
| <Keyword> (<Option List>)
\end{split}
\end{equation}
A \(Keyword\) is a predefined term associating an \(Action\) with a particular string. An \(Argument\) is generally a single string surrounded by quotes that specifies a path to a file. Lastly, an \(Action\) can accept a multitude of options (\ref{BNF:Option}), where an \(Option\) consists of an \(OptionName\) with either an \(OptionValue\) or an \(OptionValueList\). An \(OptionName\) and an \(OptionValue\) each consist of a single string. An \(OptionList\) (\ref{BNF:OptionList}) consists of a comma-delimited list of \(Option\)s, and an \(OptionValueList\) (\ref{BNF:OptionValueList}) consists of a comma-delimited list of \(OptionValue\)s.
\begin{equation} \label{BNF:Option}
\begin{split}
<Option> ::= <Option Name> = <Option Value> \\
| <Option Name> = [<Option Value List>]
\end{split}
\end{equation}
\begin{equation} \label{BNF:OptionList}
\begin{split}
<Option List> ::= <Option> | <Option>, <Option List>
\end{split}
\end{equation}
\begin{equation} \label{BNF:OptionValueList}
\begin{split}
<Option Value List> ::= <Option Value> \\
| <Option Value> , <Option Value List>
\end{split}
\end{equation}
To put the grammar into perspective, the example \(Query\) in Listing \ref{lst:sml-ex-1} has been transcribed into BNF format and can be found in Listing \ref{lst:SML:BNFComp}. In Listing \ref{lst:SML:BNFComp}, the first \(Keyword\) is \(READ\), followed by an \(Argument\) that specifies the path to the dataset. Next, an \(OptionList\) contains information about the delimiter and the header of the dataset. We include the \(AND\) delimiter to specify an additional \(Keyword\), \(SPLIT\), with an \(OptionList\) that gives the size of the training and testing partitions for the dataset specified with the \(READ\) \(Keyword\). Finally, we use the \(AND\) delimiter to specify another \(Keyword\), \(CLASSIFY\), which performs classification using the training and testing data produced by the \(SPLIT\) \(Keyword\). It is followed by an \(OptionList\) that tells SML which features to use (columns 1-4), which label to predict (column 5), and which algorithm to use for classification. The next subsection describes the functionality of all \(Keyword\)s in SML.
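The grammar above is small enough to be recognized by a short recursive-descent parser. The following listing is a minimal sketch of such a parser and is not SML's actual implementation: the function names, the token pattern, and the dictionary layout of the result are assumptions made for illustration. All option values are kept as strings, and a bare \(Keyword\) such as \(PLOT\) is accepted because later examples in this chapter use it without an \(Argument\) or \(OptionList\).

\begin{lstlisting}[language=python, caption={A sketch of a recursive-descent parser for the SML grammar (hypothetical helper names).}]
import re

# Tokens: quoted strings, single punctuation marks, or bare words/numbers.
TOKEN = re.compile(r'"[^"]*"|[()\[\],=]|[^\s()\[\],="]+')


def parse_query(query):
    """<Query> ::= <Action> | <Action> AND <Query>"""
    tokens = TOKEN.findall(query)
    actions, pos = [], 0
    while True:
        action, pos = parse_action(tokens, pos)
        actions.append(action)
        if pos == len(tokens):
            return actions
        if tokens[pos] != "AND":
            raise SyntaxError(f"expected AND, found {tokens[pos]!r}")
        pos += 1


def parse_action(tokens, pos):
    """<Action> ::= <Keyword>, then an optional argument and option list."""
    if pos >= len(tokens):
        raise SyntaxError("expected a keyword")
    action = {"keyword": tokens[pos], "argument": None, "options": {}}
    pos += 1
    if pos < len(tokens) and tokens[pos].startswith('"'):
        action["argument"] = tokens[pos].strip('"')
        pos += 1
    if pos < len(tokens) and tokens[pos] == "(":
        action["options"], pos = parse_option_list(tokens, pos + 1)
    return action, pos


def parse_option_list(tokens, pos):
    """<Option List> ::= <Option> | <Option>, <Option List>"""
    options = {}
    while True:
        name = tokens[pos]
        if tokens[pos + 1] != "=":
            raise SyntaxError(f"expected '=' after option {name!r}")
        pos += 2
        if tokens[pos] == "[":
            # <Option Value List>: collect values up to the closing bracket.
            values = []
            pos += 1
            while tokens[pos] != "]":
                if tokens[pos] != ",":
                    values.append(tokens[pos])
                pos += 1
            options[name] = values
        else:
            options[name] = tokens[pos].strip('"')
        pos += 1  # step past the value or the closing bracket
        if tokens[pos] == ")":
            return options, pos + 1
        if tokens[pos] != ",":
            raise SyntaxError(f"expected ',' or ')', found {tokens[pos]!r}")
        pos += 1
\end{lstlisting}

Calling \texttt{parse\_query} on the \(Query\) in Listing \ref{lst:sml-ex-1} yields one dictionary per \(Action\), in the order \(READ\), \(SPLIT\), \(CLASSIFY\). Truncated input, such as an unclosed option list, surfaces as an \texttt{IndexError} in this sketch rather than as a descriptive error.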
\subsection{Keywords}
Currently, there are eight \(Keyword\)s in SML\footnote{Detailed documentation providing examples and describing all of the keywords of SML is publicly available on GitHub: https://github.com/lcdm-uiuc/sml/tree/master/dataflows \label{SML:Dataflow}}. These \(Keyword\)s can be chained together to perform a variety of actions. In the subsequent subsections, we describe the functionality of each \(Keyword\).
\subsubsection{Reading Datasets}
When reading data in SML, one must use the \(READ\) \(Keyword\) followed by an \(Argument\) containing a path to the dataset. \(READ\) also accepts a variety of \(Option\)s. The first \(Query\) in Listing \ref{lst:SML:READ} consists of only a \(Keyword\) and an \(Argument\). This \(Query\) reads data from ``/path/to/data''. The second \(Query\) adds an \(OptionList\): in addition to reading data from the specified path, it specifies that the dataset is delimited with semicolons and does not include a header row.
\subsubsection{Cleaning Data}
When NaNs, NAs, or other missing values are present in the dataset, we handle (or impute) these instances in SML by using the \(REPLACE\) \(Keyword\). Listing \ref{lst:SML:REPLACE} shows an example of the \(REPLACE\) \(Keyword\) in practice. In this \(Query\), we use the \(REPLACE\) \(Keyword\) in conjunction with the \(READ\) \(Keyword\). SML reads a semicolon-delimited dataset with no header from the path ``/path/to/data''. Next, we replace any instance of ``NaN'' with the mode of that column in the dataset.
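To make the mode strategy concrete, the following listing sketches column-wise mode imputation using only the Python standard library. It illustrates the idea and is not SML's implementation: the function name and the representation of a dataset as a list of rows are assumptions, and the tie-breaking behavior (the first mode encountered) relies on \texttt{statistics.mode} as of Python 3.8.

\begin{lstlisting}[language=python, caption={A sketch of mode imputation for missing values (hypothetical helper).}]
from statistics import mode


def replace_missing(rows, missing="NaN"):
    """Replace each missing marker with the mode of its column."""
    fills = []
    for column in zip(*rows):
        observed = [value for value in column if value != missing]
        # Leave the marker in place if the whole column is missing.
        fills.append(mode(observed) if observed else missing)
    return [
        [fills[i] if value == missing else value
         for i, value in enumerate(row)]
        for row in rows
    ]
\end{lstlisting}

The mode of each column is computed from the observed values only, so the missing marker itself can never become the fill value unless the entire column is missing.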
\subsubsection{Partitioning Datasets}
For most machine learning tasks, a common practice is to split a dataset into training and testing datasets. Splitting a dataset can be achieved in SML by using the \(SPLIT\) \(Keyword\). Listing \ref{lst:SML:SPLIT} shows an example of an SML \(Query\) that reads in data and then uses the \(SPLIT\) \(Keyword\) to perform an 80/20 split into training and testing data, respectively.
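As a point of reference, an 80/20 partition can be sketched in a few lines of standard-library Python. The helper below is hypothetical: it assumes that rows are shuffled before partitioning and that the testing fraction is simply the remainder, neither of which is confirmed by SML's documentation.

\begin{lstlisting}[language=python, caption={A sketch of a train/test partition (hypothetical helper).}]
import random


def split_rows(rows, train=0.8, seed=0):
    """Shuffle the rows and return (training, testing) partitions."""
    shuffled = list(rows)
    # A fixed seed keeps the partition reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train)
    return shuffled[:cut], shuffled[cut:]
\end{lstlisting}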
\subsubsection{Creating Models}
In SML, one can create a model to perform classification, regression, or clustering. To use a classification model in SML, one would use the \(CLASSIFY\) \(Keyword\). SML implements the following classification models: Support Vector Machines, Na\"ive Bayes, Random Forest, Logistic Regression, and K-Nearest Neighbors. Listing \ref{lst:SML:CLASSIFY} demonstrates how to use the \(CLASSIFY\) \(Keyword\) in a \(Query\). Clustering models can be utilized by using the \(CLUSTER\) \(Keyword\). Currently, SML implements only K-Means clustering. Listing \ref{lst:SML:CLUSTER} demonstrates how to use the \(CLUSTER\) \(Keyword\) in a \(Query\). Regression models use the \(REGRESS\) \(Keyword\). SML currently implements the following regression algorithms: Simple Linear Regression, Ridge Regression, Lasso Regression, and Elastic Net Regression. Listing \ref{lst:SML:REGRESS} demonstrates how to use the \(REGRESS\) \(Keyword\) in a \(Query\).
\subsubsection{Saving/Loading Models}
The approach adopted by SML allows a user to easily save, share, and reuse models. To save a model in SML, one would use the \(SAVE\) \(Keyword\) in a \(Query\). To load an existing model within SML, one would use the \(LOAD\) \(Keyword\) in a \(Query\). Listing \ref{lst:SML:SAVE_LOAD} shows the syntax required to save and load a model by using SML. For any \(Query\) that uses the \(REGRESS\), \(CLUSTER\), or \(CLASSIFY\) \(Keyword\)s, attaching \(SAVE\) to the \(Query\) will save the model.
\subsubsection{Visualizing Datasets and Metrics of Algorithms}
When using SML, it is possible to visualize datasets or the performance of models (such as learning curves or ROC curves). To do this, the \(PLOT\) \(Keyword\) must be specified in a \(Query\). Listing \ref{lst:SML:PLOT} shows an example of how to use the \(PLOT\) \(Keyword\) in a \(Query\). We apply the same operations used to perform clustering in Listing \ref{lst:SML:CLUSTER}, but we also utilize the \(PLOT\) \(Keyword\) to visualize the results.
\section{SML's Architecture}
\label{sml-architecture}
With SML's grammar defined, we now turn to SML's architecture. When SML receives a \(Query\) in the form of a string, it is passed to the parser. As the string is parsed, the pre-defined grammar is used to determine which actions to perform. These actions are stored in a dictionary and given to one of the following SML phases: Model Phase, Apply Phase, or Metrics Phase. Figure \ref{fig:SML:Architecture} provides a block diagram of this process.
The model phase is used to construct a model. The \(Keyword\)s that generally invoke the model phase are: \(READ\), \(REPLACE\), \(CLASSIFY\), \(REGRESS\), \(CLUSTER\), and \(SAVE\). The apply phase is used to apply a preexisting model to new data. The \(Keyword\) that invokes the apply phase is \(LOAD\), which is often useful to visualize new data and model performance metrics. By default, specifying the \(PLOT\) \(Keyword\) in a \(Query\) will force SML to execute the metrics phase.
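The mapping from \(Keyword\)s to phases described above can be summarized as a lookup table. The following listing is a sketch of that mapping and not SML's implementation: the names are hypothetical, and placing \(SPLIT\) in the model phase is an assumption, since the text above does not assign it to a phase explicitly.

\begin{lstlisting}[language=python, caption={A sketch of the mapping from Keywords to phases (hypothetical names).}]
# Keywords listed in the text as invoking the model phase.
# SPLIT is included here as an assumption.
MODEL_KEYWORDS = {
    "READ", "REPLACE", "SPLIT", "CLASSIFY", "REGRESS", "CLUSTER", "SAVE",
}

PHASE_OF = {keyword: "model" for keyword in MODEL_KEYWORDS}
PHASE_OF["LOAD"] = "apply"    # apply a preexisting model to new data
PHASE_OF["PLOT"] = "metrics"  # PLOT forces the metrics phase


def phases_for(keywords):
    """Return the phases a query invokes, in first-use order."""
    phases = []
    for keyword in keywords:
        phase = PHASE_OF[keyword]  # raises KeyError for unknown keywords
        if phase not in phases:
            phases.append(phase)
    return phases
\end{lstlisting}

Under these assumptions, the \(Keyword\) sequence \(READ\), \(SPLIT\), \(CLASSIFY\), \(PLOT\) invokes the model phase followed by the metrics phase, whereas \(LOAD\) followed by \(PLOT\) invokes the apply phase followed by the metrics phase.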
The last significant component of SML's architecture is the connector. The connector connects drivers from different libraries and languages to achieve the action a user wants during a particular phase (see Figure \ref{fig:SML:Connector}). If one considers applying linear regression to a dataset, SML calls the connector to retrieve the linear regression library during the model phase. In this case, SML uses scikit-learn's implementation. However, if we wanted to use an algorithm not available in scikit-learn, such as a Hidden Markov Model (HMM), SML would use the connector to call another library, potentially in another programming language, that supports HMMs.
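One way to picture the connector is as a registry that maps an algorithm name to the library class that implements it, importing the library only when it is requested. The listing below is a sketch under that assumption and does not reproduce SML's connector: the registry and function names are hypothetical, the algorithm names are taken from the listings in this chapter, and the pairing of each name with a scikit-learn class is our own guess at a reasonable mapping.

\begin{lstlisting}[language=python, caption={A sketch of a connector as a lazy driver registry (hypothetical names).}]
from importlib import import_module

# Algorithm name -> (module path, class name). Illustrative only.
DRIVERS = {
    "svm": ("sklearn.svm", "SVC"),
    "simple": ("sklearn.linear_model", "LinearRegression"),
    "ridge": ("sklearn.linear_model", "Ridge"),
    "kmeans": ("sklearn.cluster", "KMeans"),
}


def resolve(algorithm, drivers=DRIVERS):
    """Import and return the class registered for an algorithm name."""
    try:
        module_name, class_name = drivers[algorithm]
    except KeyError:
        raise ValueError(f"no driver for {algorithm!r}") from None
    # The library is imported only when the algorithm is requested.
    return getattr(import_module(module_name), class_name)
\end{lstlisting}

Supporting an additional library in this sketch amounts to adding an entry to the registry, which mirrors the role the connector plays when an algorithm such as an HMM is not available in scikit-learn.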
\section{Interface}
\label{interface}
There are multiple interfaces available for working with SML. We have developed an alpha version of a web tool that allows users to write queries and to retrieve results from SML through a web interface (see Figure \ref{fig:SML:website}). There is also a REPL environment available that allows the user to write queries and display results from the appropriate phases of SML interactively. Lastly, users can import SML into an existing pipeline to simplify the development process of applying machine learning to specific problems.
\section{Use Cases}
\label{use-cases}
We tested SML's framework against ten popular machine learning problems with publicly available datasets. We applied SML to the following datasets: Iris Dataset\footnote{https://archive.ics.uci.edu/ml/datasets/Iris}, Auto-MPG Dataset\footnote{https://archive.ics.uci.edu/ml/datasets/Auto+MPG}, Seeds Dataset\footnote{https://archive.ics.uci.edu/ml/datasets/seeds}, Computer Hardware Dataset\footnote{https://archive.ics.uci.edu/ml/datasets/Computer+Hardware}, Boston Housing Dataset\footnote{https://archive.ics.uci.edu/ml/datasets/Housing}, Wine Recognition Dataset\footnote{https://archive.ics.uci.edu/ml/datasets/Wine}, US Census Dataset\footnote{https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990)}, Chronic Kidney Disease Dataset\footnote{https://archive.ics.uci.edu/ml/datasets/Chronic\_Kidney\_Disease}, and the Spam Detection Dataset\footnote{https://archive.ics.uci.edu/ml/datasets/Spambase}, which were all obtained from the UCI Machine Learning Repository (see \cite{Lichman:2013}). We also applied SML to the Titanic Dataset\footnote{https://www.kaggle.com/c/titanic}.
As mentioned in footnote \ref{SML:Dataflow}, there are detailed examples and explanations for all ten datasets. In this section, we discuss the process of applying SML to the Iris Dataset and the Auto-MPG Dataset. For each dataset, we compare solving the problem with SML against solving it with handwritten code. We do not compare SML to the prior works mentioned in Section \ref{SML:PriorWorks}, as the different approaches used by these researchers complicate a direct comparison. For both datasets, the handwritten code uses the same libraries and programming language that SML uses.
\subsubsection{Iris Dataset}
Listing \ref{lst:SML:IrisQuery} shows the code required to perform classification on the Iris dataset by using an SML \(Query\) embedded in Python. In Listing \ref{lst:SML:IrisQuery}, we read data from a file named ``iris.csv'' in a subdirectory called ``data'', perform an 80/20 split into training and testing subsets, use the first four columns to predict the fifth column, employ the support vector machine algorithm to perform classification, and finally plot the distributions of features from the dataset and the performance metrics of the classification model. Appendix \ref{Appendix:Iris} illustrates what is required to perform the same operations by using Python and scikit-learn. The \(Query\) in Listing \ref{lst:SML:IrisQuery} and the code in Appendix \ref{Appendix:Iris} use the same third-party libraries implicitly or explicitly. It is also worth noting that the code in Appendix \ref{Appendix:Iris} is publicly available and well documented\footnote{For detailed documentation describing this code visit: https://github.com/lcdm-uiuc/sml/blob/master/dataflows/plot/iris\_svm-READ-SPLIT-CLASSIFY-PLOT.ipynb \label{lab:iris:git}}. Rather than delve into the intricacies of the code itself, we instead outline the complexities required to produce such results with and without SML. The results of both snippets of code are the same and are shown in Figure \ref{fig:IrisResults}.
\subsubsection{Auto-MPG Dataset}
Listing \ref{lst:SML:AutoMPGQuery} shows the SML \(Query\) required to perform regression on the Auto-MPG dataset in Python. In Listing \ref{lst:SML:AutoMPGQuery}, we read data from a specified path, indicate that fixed-width spaces separate columns in the dataset, and specify that there is no header for the dataset. Next, we replace all occurrences of ``?'' with the column's mode and perform an 80/20 split. We then perform linear regression using columns 2-8 to predict the first column, which serves as the label. Lastly, we visualize the distributions of the features from the dataset and the performance metrics of our algorithm. Appendix \ref{Appendix:Auto} demonstrates solving this problem directly by performing the same operations with scikit-learn\footnote{For detailed documentation describing this code visit: https://github.com/lcdm-uiuc/sml/blob/master/dataflows/plot/autompg\_linear\_regression-READ-SPLIT-REGRESS-PLOT.ipynb \label{lab:SML:AUTO}}. The outcomes of both processes are the same and can be seen in Figure \ref{fig:AutoMPG:Results}.
\subsection{Discussion}
For the Iris and Auto-MPG use cases, the same libraries and programming language were used to perform regression and classification. The amount of work required to perform a task and produce the results in Figure \ref{fig:AutoMPG:Results} and Figure \ref{fig:IrisResults} decreases significantly when using SML. Constructing each SML query used fewer than 10 lines of code; however, implementing the same procedures without SML, using the same programming language and libraries, needed more than 70 lines of code. This provides evidence that SML simplifies the development process of solving problems with machine learning, especially for individuals who do not know how to write code in one of the supported programming languages.
\section{Future Work}
While we have formally introduced an agnostic framework, much work remains to improve SML. As one example, we could extend the connector to support more machine learning libraries and additional programming languages. We could also extend SML's web application to include additional functionality to make the overall approach even easier. We could also implement additional machine learning tasks such as feature selection, model selection, and parameter optimization. In addition to improving SML, we could directly compare SML to the other approaches outlined in Section \ref{SML:PriorWorks} to determine how SML performs relative to alternative frameworks.
\section{Conclusion}
\label{conclusion}
To summarize, we introduced a language-agnostic framework that employs a query-like language to simplify the development of machine learning pipelines. We provided a high-level overview of its architecture and its grammar. We applied SML to several machine learning problems and demonstrated how the amount of code one has to write decreases significantly when SML is used. The source code and detailed documentation for SML are open source and publicly available on GitHub\footnote{https://github.com/lcdm-uiuc/sml \label{SML:Github}}.
SML provides a new method to rapidly develop machine learning pipelines that has a low barrier to adoption and that employs a language-agnostic approach to solve problems. This attractive aspect can boost the productivity of researchers who utilize machine learning, since abstracting machine learning complexities with a tool like SML can foster new research and solve problems in different disciplines faster.
\clearpage
\section{Figures and Listings}
\rotatebox{90}{\begin{minipage}{0.95\textheight}
\centerline{ \includegraphics[width=\textwidth]{figures/SML/architecture.png}}
\captionof{figure}{This figure shows a Block Diagram of SML's Architecture.}
\label{fig:SML:Architecture}
\end{minipage}}
%\begin{sidewaysfigure}[!h]
%\includegraphics[width=.9\textwidth]{figures/SML/architecture.png}
%\centering
%\caption{Block Diagram of SML's Architecture\\}
%\label{fig:SML:Architecture}
%\end{sidewaysfigure}
\begin{sidewaysfigure}[!h]
\includegraphics[width=1\textwidth]{figures/SML/connector.png}
\centering
\caption{This figure shows a Block Diagram of SML's Connector.}
\label{fig:SML:Connector}
\end{sidewaysfigure}
\begin{sidewaysfigure}[!h]
\includegraphics[width=1\textwidth]{figures/SML/sml-web-site.png}
\centering
\caption{This figure shows the interface of SML's webapp. Currently, instructions and examples of how to use SML are in the left pane. In the middle pane, users can type an SML \(Query\) and then hit the execute button. The results of executing an SML \(Query\) are in the right pane.}
\label{fig:SML:website}
\end{sidewaysfigure}
\begin{sidewaysfigure}[!h]
\includegraphics[width=1\textwidth]{figures/SML/iris_results.png}
\centering
\caption{The SML \(Query\) in Listing \ref{lst:SML:IrisQuery} and the code in Appendix \ref{Appendix:Iris} produce these results. The subgraph on the left is a lattice plot showing the density estimates of each feature used. The graph on the right shows the ROC curves for each class of the iris dataset.}
\label{fig:IrisResults}
\end{sidewaysfigure}
\begin{sidewaysfigure}[!h]
\includegraphics[width=1\textwidth]{figures/SML/auto-mpg-results.png}
\centering
\caption{The SML \(Query\) in Listing \ref{lst:SML:AutoMPGQuery} and the code in Appendix \ref{Appendix:Auto} produce these results. The subgraph on the left is a lattice plot showing the density estimates of each feature used. The top right graph shows the model's learning curve and the graph on the lower right shows the validation curve.}
\label{fig:AutoMPG:Results}
\end{sidewaysfigure}
\clearpage
\begin{lstlisting}[language=python, caption={Example of an SML Query Performing Classification.}, label={lst:sml-ex-1}]
READ "/path/to/data" (separator = ";", header = None)
AND SPLIT (train = 0.8, test=0.2) AND CLASSIFY
(predictors =[1,2,3,4], label = 5, algorithm = svm)
\end{lstlisting}
\begin{lstlisting}[language=python, caption={Here the example \(Query\) in listing \ref{lst:sml-ex-1} is defined in BNF format.}, label={lst:SML:BNFComp}]
<Keyword> <Argument> (<OptionList>)
AND <Keyword> (<OptionList>) AND <Keyword>
(<OptionList>)
\end{lstlisting}
\begin{lstlisting}[language=python, caption={Examples using the \(READ\) \(Keyword\) in SML.}, label={lst:SML:READ}]
READ "/path/to/data"
READ "/path/to/data" (separator = ";", header = None)
\end{lstlisting}
\begin{lstlisting}[language=python, caption={An example utilizing the \(REPLACE\) \(Keyword\) in SML.}, label={lst:SML:REPLACE}]
READ "/path/to/data" (separator = ";", header = None)
AND REPLACE (missing = "NaN", strategy = "mode")
\end{lstlisting}
\begin{lstlisting}[language=python, caption={Example using the \(SPLIT\) \(Keyword\) in SML.}, label={lst:SML:SPLIT}]
READ "/path/to/data" (separator = ";", header = None)
AND SPLIT (train = 0.8, test = 0.2)
\end{lstlisting}
\begin{lstlisting}[language=python, caption={Example using the \(CLASSIFY\) \(Keyword\) in SML. Here we read in data and create training and testing datasets using the \(READ\) and \(SPLIT\) \(Keyword\)s, respectively. We then use the \(CLASSIFY\) \(Keyword\) with the first four columns as features and the fifth column as the label to perform classification using a support vector machine.}, label={lst:SML:CLASSIFY}]
READ "/path/to/data" (separator = ";", header = None)
AND SPLIT (train = 0.8, test = 0.2) AND CLASSIFY
(predictors = [1,2,3,4], label=5, algorithm=svm)
\end{lstlisting}
\begin{lstlisting}[language=python, caption={Example using the \(CLUSTER\) \(Keyword\) in SML. Here we read in data and create training and testing datasets using the \(READ\) and \(SPLIT\) \(Keyword\)s, respectively. We then use the \(CLUSTER\) \(Keyword\) with the first four columns as features and perform unsupervised clustering with the K-Means algorithm.}, label={lst:SML:CLUSTER}]
READ "/path/to/data" (separator = ";", header = None)
AND SPLIT (train = 0.8, test = 0.2) AND CLUSTER
(predictors = [1,2,3,4], algorithm=kmeans)
\end{lstlisting}
\begin{lstlisting}[language=python, caption={Example using the \(REGRESS\) \(Keyword\) in SML. Here we read in data and create training and testing datasets using the \(READ\) and \(SPLIT\) \(Keyword\)s, respectively. We then use the \(REGRESS\) \(Keyword\) with the first four columns as features and the fifth column as the label to perform regression using ridge regression.}, label={lst:SML:REGRESS}]
READ "/path/to/data" (separator = ";", header = None)
AND SPLIT (train = 0.8, test = 0.2) AND REGRESS
(predictors = [1,2,3,4], label=5, algorithm=ridge)
\end{lstlisting}
\begin{lstlisting}[language=python, caption={Example using the \(LOAD\) and \(SAVE\) \(Keyword\)s in SML.}, label={lst:SML:SAVE_LOAD}]
SAVE "/path/to/save/model"
LOAD "/path/to/save/model"
\end{lstlisting}
\clearpage
\begin{lstlisting}[language=python, caption={Example using the \(PLOT\) \(Keyword\) in SML.}, label={lst:SML:PLOT}]
READ "/path/to/data" (separator = ";", header = None)
AND SPLIT (train = 0.8, test = 0.2) AND CLUSTER
(predictors = [1,2,3,4], algorithm=kmeans)
AND PLOT
\end{lstlisting}
\begin{lstlisting}[language=python, caption={SML \(Query\) that performs classification on the iris dataset using support vector machines. Detailed documentation is publicly available in \textsuperscript{\ref{lab:iris:git}}; the purpose of this listing is to highlight the level of complexity of an SML query.}, label={lst:SML:IrisQuery}]
from sml import execute
query = 'READ "../data/iris.csv" AND \
SPLIT (train = 0.8, test = 0.2) AND \
CLASSIFY (predictors = [1,2,3,4], label = 5, algorithm=svm) AND \
PLOT'
execute(query, verbose=True)
\end{lstlisting}
\begin{lstlisting}[language=python, caption={SML \(Query\) that performs regression on the Auto-MPG dataset using Linear Regression.}, label={lst:SML:AutoMPGQuery}]
from sml import execute
query = 'READ "../data/auto-mpg.csv" AND \
REPLACE (missing = "?", strategy = "mode") AND \
SPLIT (train = 0.8, test = 0.2) AND \
REGRESS (predictors = [2,3,4,5,6,7,8], label = 1, algorithm=simple) AND \
PLOT'
execute(query, verbose=True)
\end{lstlisting}
%\clearpage
%\appendix
%\clearpage
%\bibliographystyle{plainnat}
%\bibliography{thesisbib}
\bibliographystyle{plainnat}
\nobibliography{thesisbib}