\documentclass[twocolumn]{article}
\usepackage{flushend}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[ngerman,english]{babel}
\usepackage{graphicx}
\usepackage{tikz}
\usepackage{hyperref}
\usepackage{cleveref}
\usepackage{framed}
\usepackage{listings}
\usepackage{paralist}
\usepackage{xcolor}
%\usepackage{lineno}
\pagestyle{empty}
\newcommand{\stt}{Soft\-ware\-tech\-nik-Trends}
\raggedbottom
\input{trendsstyle.tex}
\newcommand{\todo}[1]{} %{{\small \textcolor{red}{TODO: #1}}}
\newcommand{\ATDLLMD}{ATD$^{LLM}$D}%
\usepackage{etoolbox}
\patchcmd\thebibliography
{\labelsep}
{\labelsep\itemsep=-5pt\relax}
{}
{\typeout{Couldn't patch the command}}
\begin{document}
\setlength{\belowcaptionskip}{-10pt}
\twocolumn[{
\Large
\center{\ATDLLMD{}: Acceptance test-driven LLM development}
\vspace{3mm}
\normalsize
\center{David Faragó, Mediform / QPR Technologies, Karlsruhe}
\vspace{3mm}
\slshape
}]
\section{Abstract}
Since the capabilities of Large Language Models (LLMs) have increased massively
in recent years [Brown20, Kaplan20], many new LLM-based applications are possible,
for instance natural task-oriented phone dialogues such as booking an appointment at your doctor's office.
However, these new applications also pose new challenges in LLM development:
\begin{compactenum}
\item how to merge best practices from distant fields [Studer21, CPMAI],
\item what products and features to develop within and around the LLM, which aspects to validate,
and how [Github23, Wang24, Röttger24],
\item what and how to verify and evaluate the LLM [Chang24], especially for safety-critical applications, i.e. tasks that can harm individual users or even lead to disasters and catastrophes for the population [Anthropic23, Dalrymple24].
\end{compactenum}
We tackle the three challenges above by adopting an acceptance test-driven development (ATDD) style for LLM development,
which we call \ATDLLMD{}, where the LLM's training and test sets are extended in each iteration
by user data coming from the validation of the previous iteration's LLM and of the system around the LLM.
To the best of our knowledge, this has not been done before.
The validation phase supplies the additional or updated (anonymized) data for training and verification of the LLM.
\ATDLLMD{} is made possible by two major solutions: applying the CPMAI process [CPMAI]
and applying our own verification tool, LM-Eval, leading to a red-train-green cycle for LLM development,
which resembles ATDD but integrates data science best practices.
We rate the safety-criticality of our application and the strength of \ATDLLMD{}'s safety claims,
i.e. its applicability for safety-critical applications, with the help of the Guaranteed Safe AI framework (GS AI) [Dalrymple24].
For this, we list the world model, safety specification, and verification capability of \ATDLLMD{},
and of our ensemble of methods accompanying \ATDLLMD{}.
Contributions:
\begin{compactitem}
\item Introduction of our novel evaluation and testing tool LM-Eval, which enables customer-centric evaluation metrics for LLMs.
\item The adaptation of CPMAI and LM-Eval to a test-first approach in LLM development, bringing ATDD methodology to LLM development.
\item A case study in the development of an LLM for a product with high societal importance:
the phone bot MediVoice [MediVoice], which autonomously manages patient services by phone. We describe the Cognitive Project Management for AI (CPMAI) framework,
which we apply, and show an exemplary roundtrip in this paper. CPMAI intertwines data-centricity, iterations, and agility with a focus on business value.
Furthermore, we incorporate GS AI in a lightweight variant.
\end{compactitem}
Keywords: {\em Large Language Model, LLM, development process, test-first, LLM evaluation, LLM testing, data-centric AI, business-centric AI, Guaranteed Safe AI, GS AI}
\section{Introduction}
The capabilities of LLMs are progressing at a high speed [Minaee24],
enabling better and better natural language understanding, reasoning, and downstream tasks [Kaplan20].
However, due to this fast progress, best practices, processes, and tools in developing and fine-tuning those LLMs are lagging behind,
especially tools for verifying and evaluating the LLMs.
As [Gra23] puts it: LLMs are a technology gifted by aliens without a manual.
This causes three challenges in LLM development and fine-tuning\footnote{Besides the challenge of better model architectures and training techniques, which are well researched in academia and industry, e.g. continual pre-training [Ke23], parameter-efficient fine-tuning [Liu22], quantization [Dettmers22], and model merging [Akiba24].}:
\begin{compactenum}
\item {\bfseries Merge best practices} from different fields: Can best practices from both software development (e.g. agile and test-driven) and data science (covering domain knowledge, statistics, ML, and visualizations) be integrated into a cohesive, modern, and effective development process for LLMs that is iterative and agile?
\item {\bfseries Validate product and LLM characteristics}: How far can LLMs and their novel capabilities in Natural Language Understanding (NLU) and reasoning be pushed to create
completely new applications with unprecedented characteristics, and thus offer disruptive features that break new ground? How can the system and product around the LLM get the most out of the LLM, for instance through agentic behavior? How do users behave when they use these completely new applications, for instance users with preconceived skepticism due to previous bad experiences with extremely weak phone bots and horrible UX? Given the vast range of characteristics and behaviors, it is impossible to cover all scenarios and potential features in development and testing, so validation must not only yield use cases for the desired application and the LLM characteristics, but also priorities [Röttger24]. Thus validation should answer which products, features, and LLM characteristics to develop, and which aspects to validate.
\item What and how to {\bfseries verify and evaluate LLMs}: LLMs have unprecedented NLU capabilities and can thus generalize well to various situations and corner cases [Brown20]. Driven by validation, which use cases and LLM characteristics should evaluation of the LLM focus on, and what kinds of metrics aid in customer-centric evaluation? As neural networks reveal little internal, helpful information and are considered black boxes [Chang24], it is challenging to verify correctness based on further artifacts beyond the text output, e.g. the neural network weights or hidden activations. Furthermore, evaluating task-oriented applications automated by LLMs in terms of business value presents significant challenges, as the LLM output is non-deterministic and in natural language, whose nuanced semantics need to be understood for evaluation. The AI behavior needs to be aligned with the business goal, taking user experience and perception into account. Depending on the safety-criticality of the application, a very high level of trust in the application's correctness and further desired attributes might be required. So for high safety-criticality, the verification approach needs to provide certain guarantees, and in an auditable way if certification is desired or required.
\end{compactenum}
\section{The MediVoice Use-Case}
The startup Mediform develops the phone bot MediVoice [MediVoice], which autonomously manages patient services by phone, e.g. patients booking appointments, getting prescriptions or referral slips, or asking general questions in a natural dialogue, just as with a human medical assistant (see Fig. \ref{fig:dialogues} left). MediVoice replaces a medical assistant in natural task-oriented phone dialogues [Hosseini20] and thus strongly improves patients' access to medical services, as patients will finally be able to reach medical practices by phone again.
Since MediVoice achieves a high degree of autonomy, medical assistants regain time for treating patients in the practice.
\begin{figure}[hbt!]
\begin{center}
\fbox{\includegraphics[width=0.2\textwidth]{figures/Dialog_HappyPath_En}}\fbox{\includegraphics[width=0.26\textwidth]{figures/Dialog_UnhappyPath_En}}
\caption{Patient dialogues from happy paths (left) and unhappy paths (right)}
\label{fig:dialogues}
\end{center}
\end{figure}
The following subsections describe how the startup Mediform tackles the three challenges listed in the introduction.
\subsection{Merge best practices}
To develop and fine-tune the MediVoice LLM, Mediform needs to merge best practices from separate fields:
\begin{compactitem}
\item Data Science and Data Engineering, to process thousands of medical practice dialogues: anonymized patient dialogues, non-AI-generated dialogues, and AI-generated dialogues
\item Machine Learning, covering prompt engineering, pre-training and fine-tuning LLMs, retrieval, and agentic behavior
\item Agile development, to cope with new applications in uncharted territory, as described in the next subsection.
\item Safety engineering, to raise trust by incorporating safety specifications and their verification into the development process,
as described in the Guaranteed Safe AI framework [Dalrymple24].
\end{compactitem}
\subsection{Validate product and LLM characteristics}
Since stakeholders (medics, medical assistants, health centers, call centers, patients) do not know in advance what is possible and what they want from the disruptive MediVoice application, their expectations evolve quickly, and Mediform needs to be able to quickly update its requirements and targets correspondingly. Thus Mediform needs short iteration cycles with frequent feedback from the stakeholders.
For instance, through short feedback cycles, Mediform soon discovered interesting findings on how patients behave on the phone:
\begin{compactitem}
\item due to previous bad experiences with extremely weak phone bots with horrible UX, patients are often very skeptical about MediVoice’s capabilities, or quickly become angry or impatient (see Fig. \ref{fig:dialogues} top right)
\item a dialogue considered successful by a medic is not necessarily successful for the patient (see Fig. \ref{fig:dialogues} bottom right)
\item unexpectedly, patients aged 80 and above interact more effectively with MediVoice than younger patients [Mediform24].
\end{compactitem}
\subsection{Verify and evaluate LLMs}
\label{sec:verifyandevaluatellms}
Verifying and evaluating LLMs becomes a challenge due to the following reasons:
\begin{compactitem}
\item LLMs learn to handle very broad capabilities and tasks. Gaining trust in those AI systems requires additional effort and new approaches, especially if they are applied in safety-critical settings. For instance, the Guaranteed Safe AI framework [Dalrymple24] requires three components: firstly, a world model that describes how the AI system affects the outside world; secondly, a safety specification that specifies which effects are acceptable; thirdly, a verifier that provides an auditable proof certificate that the AI satisfies the safety specification relative to the world model.
\item Using standard metrics and verification tools [Chang24] is too generic because we need to verify our LLMs in a business-centric way, to be able to evaluate that the model behaves correctly with regard to the desired characteristics and features.
This means we need: Firstly, a suitable world model [Dalrymple24], especially a suitable test driver (e.g. we need to trigger individual business features and processes for each medical practice). Secondly, a suitable safety specification, especially the right test oracles (e.g. ones that depend on the individual business process). Thirdly, business-centric metrics (e.g. for defining correctness or priorities).
\item The LLM's capabilities with respect to NLU and reasoning are hard to evaluate because multiple answers that are semantically, but not syntactically, close can be correct. For instance, we need to determine whether the LLM can cope with broken language, different languages, and speech-to-text errors. Furthermore, the model needs to understand practice-specific business processes specified in natural language, to generalize sufficiently, and to handle domain-specific corner cases.
\item The LLM is nondeterministic and a black box. Since the model is a black box (except for some models offering top-k log-probabilities [Vaswani17]),
the verification tool can only measure via input-output pairs, and needs to handle variations in the output for a fixed input,
for instance with the help of semantic similarity and statistical testing, as sketched below.
\end{compactitem}
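To make this concrete, the following minimal sketch combines a semantic-similarity oracle with a statistical pass-rate check over repeated samples. The embedding library and model, the thresholds, and the \texttt{generate} callable are illustrative assumptions, not the actual MediVoice test setup.
\begin{lstlisting}[language=Python, basicstyle=\footnotesize\ttfamily, breaklines=true]
# Minimal sketch: oracle for a nondeterministic, black-box LLM.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def semantically_ok(output, references, threshold=0.8):
    # Accept the output if it is close enough to any reference answer.
    out_emb = embedder.encode(output, convert_to_tensor=True)
    ref_emb = embedder.encode(references, convert_to_tensor=True)
    return util.cos_sim(out_emb, ref_emb).max().item() >= threshold

def statistical_pass(generate, prompt, references, n=20, required_rate=0.9):
    # "generate" is any callable that queries the LLM under test.
    # Sample repeatedly to cope with nondeterministic outputs.
    passes = sum(semantically_ok(generate(prompt), references)
                 for _ in range(n))
    return passes / n >= required_rate
\end{lstlisting}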
To estimate what efforts and safety guarantees are required for a specific application,
Anthropic introduced their AI Safety Levels (ASL), which are a form of Safety Integrity Level (SIL),
as is common in other areas like electrical/electronic/programmable electronic systems (IEC 61508) and automotive (ISO 26262).
The ASL determines the level of precaution appropriate for a given AI system, depending on the capabilities of that system
[Anthropic23, Anthropic23b].
Since MediVoice does not provide medical advice, but focuses on patient management like appointment booking,
it does not pose any significant catastrophic risk and is not expected to exhibit dangerous autonomous behaviors.
Thus MediVoice falls under the lowest level, ASL-1.
Operational flaws, such as failing to detect urgent conditions in a healthcare setting,
do not elevate MediVoice to ASL-2, since it does not demonstrate autonomous capabilities with a higher potential for misuse or harm.
Being ASL-1, the GS AI framework does not enforce safety guarantees for MediVoice [Dalrymple24].
Since a failure to detect urgent conditions in a healthcare setting can result in serious harm for individual patients,
we still try to achieve the highest possible safety guarantee using GS AI.
\section{The Solution at Mediform}
Having seen specific instances of the three challenges on the use-case MediVoice in the last section, this section shows Mediform’s solutions to each challenge.
\subsection{Merge best practices}
Most processes in the realm of data science and machine learning are an extension of CRISP-DM [CRISP99], the Cross-Industry Standard Process for Data Mining. For instance, [Farago23] is an extension that focuses only on prompt engineering, while CRISP-ML [Studer21] focuses on automotive and on waterfall processes, ignoring agility, which is necessary for short iterations with customer feedback for business-centric validation, and for modern software development best practices. To the best of our knowledge, there is only one process and framework that enables a data-centric yet business-centric and agile approach and that is modern, AI-specific, and vendor-neutral: the Cognitive Project Management for AI (CPMAI) framework from Cognilytica, see Fig. \ref{fig:cpmai}. It has data at its core, which is read, modified, or extended in each of the six phases:
\begin{compactitem}
\item {\bfseries Business Understanding}, which gathers an understanding of the business and organizational requirements, and maps them to an AI solution
\item {\bfseries Data Understanding}, which identifies data (quality) requirements and data sources to address the problem
\item {\bfseries Data Preparation}, which prepares the training and test datasets by data cleansing, data aggregation, data augmentation, data labeling, and data transformation
\item {\bfseries Data Modeling}, which produces an AI solution, focusing on developing the ML model by choosing the model architecture, training technique, and hyperparameters, and by model training and optimization
\item {\bfseries Model Evaluation}, which analyzes the confusion matrix and errors, and measures metrics like precision, recall, accuracy, and more business-centric KPIs with respect to quality, performance, and other ``-ilities''. This helps to decide whether the AI solution meets the goals of the iteration, which are derived from the business or organizational requirements.
\item {\bfseries Model Operationalization}, which performs model versioning, model deployment, model monitoring, and model staging, to deploy the iteration's AI solution to the stakeholders.
\end{compactitem}
CPMAI contains more specific details and up-to-date data science and AI best practices, e.g.
\begin{compactitem}
\item which of the common seven patterns of AI (recognition, conversation \& human interaction, predictive analytics \& decisions, goal-driven systems, autonomous systems, patterns \& anomalies, hyper-personalization, see [Cognilytica]) are applied in the current AI solution
\item how to train and optimize ML models
\item data science, big data, and visualization best practices.
\end{compactitem}
In contrast to CRISP-DM [CRISP99], CRISP-ML [Studer21], and other extensions, CPMAI does not merely offer a step from the Model Evaluation phase back to the Business Understanding phase, but has fully adopted modern agile practices:
\begin{compactitem}
\item it incorporates Model Operationalization with modern DevOps practices, including staging
\item it integrates continuous customer feedback, enabled through staging
\item it offers high flexibility by allowing movement between the six phases, bound together by data-centricity.
\end{compactitem}
\begin{figure}[hbt!]
\begin{center}
\vspace{-4mm}
\includegraphics[width=0.4\textwidth]{figures/cpmai}
\vspace{-4mm}
\caption{Cognilytica’s CPMAI process/framework}
\label{fig:cpmai}
\end{center}
\end{figure}
This flexibility and adoption of modern agile practices allows an easy embedding of up-to-date development best practices, for instance Test-Driven Development (TDD). TDD is a test-first software development methodology, i.e. tests are written before the actual software, the system under test (SUT), is developed. In TDD, a red-green-refactor cycle is followed: a newly written test first fails, which motivates updating the SUT, but with just enough development to make the new test pass. The new and all old tests offer a safety net that enables refactoring of both the tests and the SUT. Finally, we can iterate through the next loop of the red-green-refactor cycle.
CPMAI can easily be adapted to embed TDD (see the sketch below):
\begin{compactitem}
\item The Data Preparation phase is the test-first phase, where a failing test is written
\item The Data Modeling phase is the development phase where the SUT is updated
\item The Model Evaluation phase is the subsequent test execution. If a test fails or we want to refactor the SUT, we move back to the Data Modeling phase. If we want to add another test or refactor tests, we move back to the Data Preparation phase.
\end{compactitem}
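A schematic sketch of this TDD embedding follows; all callables are placeholders for the concrete Mediform pipeline, not its actual interfaces.
\begin{lstlisting}[language=Python, basicstyle=\footnotesize\ttfamily, breaklines=true]
# Schematic sketch of one TDD-style pass through the CPMAI phases.
# All callables are placeholders, not the actual Mediform pipeline.
def tdd_in_cpmai(train_set, tests, new_test, train, run_tests,
                 refine_data, max_rounds=5):
    tests = tests + [new_test]                     # Data Preparation: red test
    for _ in range(max_rounds):
        model = train(train_set)                   # Data Modeling
        if all(run_tests(model, tests)):           # Model Evaluation
            return model, tests                    # green: operationalize or refactor
        train_set = refine_data(train_set, tests)  # back to Data Preparation
    return None, tests                             # still red: rethink the iteration
\end{lstlisting}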
CPMAI can also embed the Guaranteed Safe AI framework [Dalrymple24]:
\begin{compactitem}
\item The Data Preparation phase updates the safety specification and world model based on the insights gained from the Business Understanding and Data Understanding phases.
\item The Model Evaluation phase applies the verification tool to evaluate the model, ideally with quantifiable safety guarantees.
\item The Model Operationalization phase deploys runtime monitors in case verification should also cover such monitoring.
\end{compactitem}
\subsection{Validate product and LLM characteristics}
CPMAI's loop from Model Operationalization back to Business Understanding and Data Understanding allows frequent feedback from the stakeholders. For MediVoice, Mediform developed a backend for Interaction Analytics on anonymized patient dialogues: it rates dialogues according to their length, degree of autonomy, and cost. All dialogues are classified according to the patient's intent and insurance type, and listed together with a summary for a drill-down, i.e. with a link to a more detailed view for further inspection. This evaluation (see Fig. \ref{fig:interactionanalytics}) helps in Business Understanding to find failing or unsatisfying dialogues. In the Data Understanding phase, we perform error analyses to understand and prioritize use cases and dialogues with regard to business requirements and business processes. The results of the error analysis help in the Data Preparation phase to update and extend the training and test sets by formulating dialogues that focus on the high-priority business requirements and business processes. Thus, all data used within Mediform's CPMAI is unified to be some form of (anonymized) patient dialogue with metadata: a dialogue, possibly generalized by containing some template variables, based on the desired conversational behavior deduced from the anonymized patient dialogues. The training and test sets cover a wide range of inputs and expected outputs, and are crucial for evaluating the LLM's performance.
\begin{figure}[hbt!]
\begin{center}
\includegraphics[width=0.5\textwidth]{figures/InteractionAnalytics}
\vspace{-8mm}
\caption{Mediform’s Interaction Analytics on anonymized patient dialogues (in German)}
\label{fig:interactionanalytics}
\end{center}
\end{figure}
We extend CPMAI by an ATDD-style LLM development: through the above process, the dialogues or dialogue templates within the test set are business-centric and can be considered acceptance tests (ATs), i.e. tests that are written in a customer-centric way, based on the customer's requirements. Thus ATs define the criteria for the software to be accepted, ensuring that the final product meets the user's needs. Using a TDD-like process (see previous section) within our CPMAI iterations leads to \ATDLLMD{}. As opposed to classical ATDD, \ATDLLMD{} does not require all ATs to pass for overall acceptance of the LLM into production; rather, a threshold for a metric like accuracy must be exceeded, as is usual for ML evaluations. When customers know that the LLM has been fine-tuned and tested against customer-centric test cases with a suitable metric, their trust in the system also increases.
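As a minimal illustration, the acceptance decision then reduces to a simple gate over the AT results; the 90\,\% default below is only an example value, the actual threshold is set per iteration.
\begin{lstlisting}[language=Python, basicstyle=\footnotesize\ttfamily, breaklines=true]
# Sketch of the acceptance gate: unlike classical ATDD, a share of
# failing acceptance tests is tolerated up to a metric threshold.
def accept_for_production(test_results, threshold=0.9):
    # test_results: list of booleans, one per acceptance test
    accuracy = sum(test_results) / len(test_results)
    return accuracy >= threshold
\end{lstlisting}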
\subsection{Verify and evaluate LLMs}
To tackle the challenges of verifying and evaluating LLMs, we develop our own novel testing tool {\bfseries LM-Eval}, enhancing EleutherAI's Language Model Evaluation Harness [Gao23]. EleutherAI's tool is known mainly for backing [HF24]. It contains many benchmarks, prompts, and metrics out of the box, but also offers the possibility to extend them and to evaluate custom models.
For LM-Eval (see Fig.~\ref{fig:lmeval}), we added, amongst others:
\begin{compactitem}
\item our own test driver to enable test execution of natural dialogues specified in our own format, with template variables and metadata that may contain fields for business processes specific to the considered medical practice
\item our own business-centric test oracles and metrics to evaluate natural dialogues with respect to NLU and the domain. For instance, we aggregate statistically, weighting individual results based on their semantic and business relevance (see Fig. \ref{fig:lmeval})
\item features to perform non-deterministic black-box testing. For instance, we do semantic comparisons (e.g. stricter for pre calls than msg calls, see Fig. \ref{fig:lmeval}) and allow specifying multiple alternative outputs.
\end{compactitem}
With our templated dialogue test sets as custom benchmarks, LM-Eval can evaluate our MediVoice models and cope with the challenges described in Sec.~\ref{sec:verifyandevaluatellms}.
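The following sketch illustrates the ingredients of such a templated dialogue test case: template variables, practice-specific metadata, a business-relevance weight, and multiple acceptable assistant replies. The field names and format are hypothetical, not LM-Eval's actual schema.
\begin{lstlisting}[language=Python, basicstyle=\footnotesize\ttfamily, breaklines=true]
# Hypothetical shape of a templated dialogue test case; it only
# illustrates the ingredients named above, not LM-Eval's real format.
test_case = {
    "metadata": {
        "practice_id": "example-practice",  # practice-specific business process
        "intent": "book_appointment",
        "weight": 2.0,                      # business relevance for aggregation
        "semantic_threshold": 0.85,         # stricter or looser comparison
    },
    "dialogue": [
        {"role": "patient",
         "text": "Hello, I need an appointment on {day}."},
        {"role": "assistant",               # multiple alternative outputs
         "expected_any_of": [
             "We have a free slot on {day} at {time}.",
             "On {day} I can offer you {time}. Does that work for you?",
         ]},
    ],
    "template_values": {"day": ["Monday", "Friday"],
                        "time": ["9:00", "14:30"]},
}
\end{lstlisting}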
In terms of the Guaranteed Safe AI framework [Dalrymple24], our testing with LM-Eval uses
\begin{compactitem}
\item templated dialogues as world model, which is between world model level W0 and W1 (on a scale from W0 to W5).
\item multiple assistant messages in templated dialogues as safety specification, which is on safety specification level S3 (on a scale from S0 to S7).
\item LM-Eval as verifier, which is on verifier level V2 for fixed template variable substitutions and on level V3 if substitution is randomized over the full value sets (on a scale from V0 to V10), as sketched below.
\end{compactitem}
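The difference between these two verifier levels can be sketched as follows, assuming the hypothetical test-case format from above: fixed substitution checks one instantiation per template variable, while randomized substitution samples over the full value sets.
\begin{lstlisting}[language=Python, basicstyle=\footnotesize\ttfamily, breaklines=true]
import random

# Sketch: instantiating template variables of a dialogue turn.
def substitute_fixed(text, values):
    # one deterministic instantiation per variable (verifier level V2)
    return text.format(**{k: v[0] for k, v in values.items()})

def substitute_randomized(text, values, n=10):
    # n random instantiations over the full value sets (verifier level V3)
    return [text.format(**{k: random.choice(v) for k, v in values.items()})
            for _ in range(n)]
\end{lstlisting}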
Mediform uses supplementary testing methods accompanying \ATDLLMD{} in an ensemble. This gives further insights and adds further trust.
GS AI also suggests ``defence in depth'', deploying multiple layers of protection to defend against complex threats.
One additional testing method performs smoke tests, where a certain scenario and patient behavior are specified for an LLM to interpret and execute against MediVoice. In terms of GS AI, this has
\begin{compactitem}
\item an LLM with in-context learned patient behavior as world model, on level W1.
\item the specified scenarios with desired control flow as safety specification, on level S3.
\item the LLM focusing verification only on the specified scenarios, which is on level V2.
\end{compactitem}
Mediform also monitors real patient dialogues (anonymized), with a runtime monitor for hallucination detection and with manual inspections, to catch remaining bugs and gain further insights; the monitor is sketched below.
In terms of GS AI, the hallucination detection has
\begin{compactitem}
\item a hallucination detecting LLM on level W1 as world model
\item a specification on intrinsic and extrinsic hallucination [Ji23] as safety specification, focusing on contradictions between the system message and the dialogue, which is on level S2
\item the set of all anonymized real patient dialogues so far as standardised set of tests, which is on level V2.
\end{compactitem}
The manual inspections have humans with their knowledge and assumptions as world model (level W0) and safety specification (level S1), again on all anonymized real patient dialogues so far (level V2).
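As a rough illustration of the hallucination monitor, the contradiction check can be phrased as an LLM-as-judge call; the prompt wording and the \texttt{judge} callable are illustrative assumptions, not Mediform's production monitor.
\begin{lstlisting}[language=Python, basicstyle=\footnotesize\ttfamily, breaklines=true]
# Sketch of the hallucination monitor as an LLM judge; the prompt and
# the "judge" callable are illustrative assumptions.
JUDGE_PROMPT = (
    "System message of the phone bot:\n{system}\n\n"
    "Anonymized patient dialogue:\n{dialogue}\n\n"
    "Does any assistant turn contradict the system message (intrinsic) or "
    "state facts not supported by it (extrinsic hallucination)? "
    "Answer YES or NO."
)

def flags_hallucination(system, dialogue, judge):
    # "judge" is any callable that sends a prompt to the detector LLM
    # and returns its text answer.
    answer = judge(JUDGE_PROMPT.format(system=system, dialogue=dialogue))
    return answer.strip().upper().startswith("YES")
\end{lstlisting}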
\begin{figure}[hbt!]
\begin{center}
\includegraphics[width=0.5\textwidth]{figures/LMEval2}
\vspace{-8mm}
\caption{Mediform’s LMEval Testing Framework}
\label{fig:lmeval}
\end{center}
\end{figure}
\subsection{Combining All Solutions}
With CPMAI and LM-Eval in place, we can perform a CPMAI roundtrip with \ATDLLMD{}:
we iteratively collect dialogues from real patients in the Model Operationalization phase,
i.e. the patients’ interactions with our novel phone bot.
We anonymize them in a GDPR-compliant way and then perform Business and Data Understanding on them
with the help of our Interaction Analytics dashboard as well as manual exploration.
Using the collected dialogues, especially those with undesired behavior, and insights from error analyses on them,
we update dialogues and possibly the world models and safety specifications,
and we create new dialogues with the corresponding desired behavior in the Data Preparation phase.
The dialogues are specified as training data or acceptance tests that drive the current CPMAI iteration
from red (failing acceptance tests) during Data Preparation,
over Data Modeling (training), to green (passing acceptance tests) during Model Evaluation and Operationalization.
Test execution and evaluation is performed by LM-Eval and supplementary testing methods.
As opposed to classical ATDD, not all acceptance tests need to become green, only enough to achieve the demanded metric, e.g. 90\% accuracy.
This {\bfseries red-train-green cycle} in \ATDLLMD{} resembles ATDD, but integrates data science best practices.
It also resembles CPMAI, but integrates further modern software development, testing, and safety engineering best practices.
\section{Conclusion}
In conclusion, applying ATDD to fine-tuning LLMs represents a significant step forward in developing more reliable, accurate, and user-centric AI language systems. This \ATDLLMD{} approach not only enhances the quality of outputs but also builds user confidence in AI technologies.
Although our application is ASL-1, we give safety guarantees with an ensemble of testing methods, with world models up to level W1, safety specifications up to level S3, and verifiers up to level V3.
As future work, we are exploring ways to automate parts of this process, using AI itself to generate and update test scenarios and specifications.
Furthermore, we are trying to directly incorporate feedback from various users to improve our model (as opposed to deriving feedback ourselves, and using the feedback to manually derive new test and training data).
\section{Acknowledgments}
Thanks go to Jochen Krause, CEO of Mediform GmbH, for offering the opportunity to work on this interesting task, and to my colleagues for the great work environment.
Thanks go to Cognilytica, LLC, which holds the copyright and trademark for the CPMAI methodology and images (used here with written permission).
\section{Bibliography}
\bibliographystyle{unsrt}
\begin{thebibliography}{10}
\footnotesize
%\bibitem{AlSabbagh22} Al-Sabbagh, Khaled Walid et al. {\em Improving test case selection by handling class and attribute noise.} Journal of Systems and Software 183. 2022.
\bibitem{Anthropic23} Anthropic. {\em Anthropic's Responsible Scaling Policy} report. September 19, 2023. https://www-cdn.anthropic.com/1adf000c8f675958c2ee23805d91aaade1cd4613/responsible-scaling-policy.pdf
\bibitem{Anthropic23b} Anthropic. {\em Anthropic's Responsible Scaling Policy} blogpost. September 19, 2023. https://www.anthropic.com/news/anthropics-responsible-scaling-policy.
%\bibitem{Baviskar21} Baviskar, Dipali et al. {\em Multi-Layout Unstructured Invoice Documents Dataset: A Dataset for Template-Free Invoice Processing and Its Evaluation Using AI Approaches.}. IEEE Access, vol. 9. 2021.
%\bibitem{Breck19} Breck, Eric, et al. {\em Data Validation for Machine Learning.} MLSys. 2019.
%\bibitem{Brodley96} Brodley, Carla, and Friedl, Mark. {\em Identifying and eliminating mislabeled training instances.} Proceedings of the National Conference on Artificial Intelligence. 1996.
%\bibitem{Budach22} Budach, Lukas, et al. {\em The Effects of Data Quality on Machine Learning Performance.} arXiv preprint. 2022.
%\bibitem{Cohen60} Cohen, Jacob. {\em A coefficient of agreement for nominal scales.} EPM. 1960.
%\bibitem{DAI} Datacentric AI Resource Hub: https://datacentricai.org.
\bibitem{Dalrymple24} David ``davidad'' Dalrymple et al. {\em Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems}. July 2024.
%\bibitem{DMT} https://huggingface.co/blog/data-measurements-tool.
%\bibitem{Felderer19} Foidl, Harald, and Michael Felderer. {\em Risk-based data validation in machine learning-based software systems.} Proceedings of the 3rd ACM SIGSOFT. 2019.
%\bibitem{GE} Great Expectations Blog Post. {\em You Are What You Eat: Why Data Quality Matters for Machine Learning.} https://greatexpectations.io/blog/why-data-quality-matters-for-machine-learning.
%\bibitem{InvoiceLG} {\em Concise Invoice Labeling Guide}. http://tinyurl.com/InvoiceLabelingGuide.
\bibitem{Ji23} Ji et al. {\em Survey of hallucination in natural language generation}. ACM Computing Surveys 55.12 (2023).
%\bibitem{Neves21} Mariana Neves, Jurica Seva. {\em An extensive review of tools for manual annotation of documents.} Briefings in Bioinformatics, Volume 22, Issue 1, January 2021, Pages 146–163, https://doi.org/10.1093/bib/bbz130. 2021
%\bibitem{Northcutt21} Northcutt, Curtis, Lu Jiang, and Isaac Chuang. {\em Confident learning: Estimating uncertainty in dataset labels.} Journal of Artificial Intelligence Research 70. 2021.
%\bibitem{Paullada20} Paullada, Amandalynne, et al. {\em Data and its (dis) contents: A survey of dataset development and use in machine learning research.} Patterns 2.11. 2021.
%\bibitem{CLIP} Radford, Alec, et al. {\em Learning transferable visual models from natural language supervision.} International Conference on Machine Learning. PMLR. 2021.
%\bibitem{Sambasivan21} Sambasivan, Nithya, et al. {\em ``Everyone wants to do the model work, not the data work'': Data Cascades in High-Stakes AI.} Proceedings of the CHI Conference on Human Factors in Computing Systems. 2021.
%\bibitem{Batini06} Scannapieco, Monica. {\em Data Quality: Concepts, Methodologies and Techniques. Data-Centric Systems and Applications.} Springer. 2006.
%\bibitem{Scheuerman21} Scheuerman, Morgan Klaus et al. {\em Do datasets have politics? Disciplinary values in computer vision dataset development.} Proceedings of the ACM on HCI. 2021.
%\bibitem{Sluban10} Sluban, Borut, Gamberger, Dragan, and Lavra, Nada. {\em Advances in class noise detection.} ECAI. 2010.
%\bibitem{Tagliabue21} Tagliabue, Jacopo. {\em You do not need a bigger boat: recommendations at reasonable scale in a (mostly) serverless and open stack.} Fifteenth ACM Conference on Recommender Systems. 2021.
%\bibitem{Tseng20} Tseng, Tina, Amanda Stent, and Domenic Maida. {\em Best Practices for Managing Data Annotation Projects.} arXiv preprint arXiv:2009.11654. 2020.
%\bibitem{Barrett19} Barrett, Leslie, and Michael W. Sherman. {\em Improving ML Training Data with Gold-Standard Quality Metrics}. 2019.
\end{thebibliography}
\end{document}