feat: change to sidecite, add derivations
siemdejong committed Jun 15, 2023
1 parent 407b94f commit 37a457e
Showing 24 changed files with 136 additions and 79 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -247,3 +247,4 @@ TSWLatexianTemp*

# generated if using elsarticle.cls
*.spl
draft-2_mscthesis.pdf
Binary file modified ANN/images/hhg-jablonski.png
Binary file modified ANN/images/hhg-jablonski.pptx
Binary file not shown.
30 changes: 15 additions & 15 deletions ANN/theory.tex
@@ -47,7 +47,7 @@ \subsection{Artificial neural networks}
During training, the network adjusts the strengths of connections between neurons, known as weights, based on the patterns and relationships in the input data.
This process allows the network to recognize and generalize from examples, making it capable of solving complex problems and making predictions.

Optimizing ANNs often relies on backpropagation (from backward propagation of errors)~\cite{Rumelhart1986}.
Optimizing ANNs often relies on backpropagation (from backward propagation of errors)~\sidecite{Rumelhart1986}.
Mathematically, an ANN $g$ with $L$ layers and activation function $f$ can be described as
\begin{equation}
\hat{y} = g(x) = f^L \left\{W^L f^{L-1} \left[W^{L-1} \cdots f^1 \left(W^1 x\right) \cdots \right] \right\},
@@ -60,7 +60,7 @@ \subsection{Artificial neural networks}
\end{equation}
where $\eta$ is the learning rate.
This optimization algorithm is called stochastic gradient descent (SGD).
Backpropagation and SGD form the basis of neural network optimization, but there are other optimization algorithms available such as Adam~\cite{Kingma2014AdamAM}.
Backpropagation and SGD form the basis of neural network optimization, but there are other optimization algorithms available such as Adam~\sidecite{Kingma2014AdamAM}.
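A minimal NumPy sketch of a single SGD step for one linear layer with a squared-error loss (the variable names, shapes, and learning rate are illustrative, not taken from the code used in this work):

import numpy as np

# One SGD step for a single linear layer with MSE loss.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))            # weights of one layer
x = rng.normal(size=(5,))              # one input sample
y = rng.normal(size=(3,))              # target
eta = 0.01                             # learning rate

y_hat = W @ x                          # forward pass (identity activation)
grad_W = 2 * np.outer(y_hat - y, x)    # backpropagated dL/dW for L = ||y_hat - y||^2
W -= eta * grad_W                      # SGD update: W <- W - eta * grad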

\subsection{Convolutional layers}
For a neural network to qualify as a convolutional neural network (CNN), at least one of its layers must be a convolution.
@@ -215,7 +215,7 @@ \subsubsection{Regression}
\end{equation}

\paragraph{Focal MSE}
To give even more focus on the hard targets, giving them more importance than easy targets can be done through the focal MSE loss (FL)~\cite{Lu2022}.
To place even more emphasis on hard targets, they can be weighted more heavily than easy targets through the focal MSE loss (FL)~\sidecite{Lu2022}.
To give less importance to the easier targets, FL follows
\begin{equation}
FL = \left(\frac{2}{1 + e^{-\beta |y_i - y'_i|}} - 1 \right)^\gamma (y_i - y'_i)^2,
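A hedged NumPy sketch of this loss (the values of beta and gamma are illustrative, not those used in this work):

import numpy as np

# Focal MSE as in the equation above; beta and gamma are example values.
def focal_mse(y_true, y_pred, beta=1.0, gamma=2.0):
    err = np.abs(y_true - y_pred)
    weight = (2.0 / (1.0 + np.exp(-beta * err)) - 1.0) ** gamma  # near 0 for easy targets
    return np.mean(weight * (y_true - y_pred) ** 2)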
@@ -272,11 +272,11 @@ \subsection{Training}\label{Training}
\subsubsection{Dropout}\label{sec:dropout}
Overfitting can be reduced by applying methods of regularization.
One regularization method is dropout.
It prevents neurons from co-adapting, which would otherwise reduce the chance of the model to perform well on external validation sets \cite{Srivastava2014}.
It prevents neurons from co-adapting, which would otherwise reduce the chance that the model performs well on external validation sets \sidecite{Srivastava2014}.
With dropout, individual neurons are activated with probability $p$, effectively dropping neurons randomly.
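A minimal sketch of the corresponding masking step; the rescaling by 1/p is the common "inverted dropout" implementation detail, an assumption not stated in the text:

import numpy as np

# Keep each neuron with probability p during training; rescale so the
# expected activation is unchanged (inverted dropout).
def dropout(activations, p=0.8, training=True, rng=np.random.default_rng(0)):
    if not training:
        return activations
    mask = rng.random(activations.shape) < p
    return activations * mask / p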

\subsubsection{Batch normalization}\label{sec:bn}
Batch normalization (BN)~\cite{Ioffe2015} is a technique to shift and scale batches akin to standardization.
Batch normalization (BN)~\sidecite{Ioffe2015} is a technique to shift and scale batches akin to standardization.
It can be implemented as a layer in any neural network.
Per minibatch and per dimension, the mean and standard deviation of the input are calculated.
Then, the input is standardized with
@@ -292,7 +292,7 @@ \subsubsection{Batch normalization}\label{sec:bn}

When batch normalization is applied after a convolutional layer, the bias term of the convolution becomes redundant and can be set to zero to avoid unnecessary operations.
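A sketch of the standardization step for a minibatch of feature vectors; the learnable scale and shift (gamma, beta) and the stability constant eps are the usual assumptions rather than values taken from this work:

import numpy as np

# Batch normalization over a minibatch of shape (batch, features).
def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                    # per-dimension mean over the minibatch
    var = x.var(axis=0)                      # per-dimension variance over the minibatch
    x_hat = (x - mean) / np.sqrt(var + eps)  # standardize
    return gamma * x_hat + beta              # learnable shift and scale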

BN has been shown to have a regularizing effect~\cite{Bjorck2018}, although combining it with dropout is disputed.
BN has been shown to have a regularizing effect~\sidecite{Bjorck2018}, although combining it with dropout is disputed.
More often than not, using both BN and dropout leads to worse results on the test set.

\subsubsection{Model ensembling}\label{subsec:model_ensembling}
@@ -339,7 +339,7 @@ \subsubsection{Grid search and random search}
With grid search, parameters are sampled exhaustively using equidistant spacing in each dimension.

A drawback of grid search is that optima can reside outside the hyperparameter set that grid search produces.
Random search \cite{Bergstra2012} aims to find optima in the gaps using random search.
Random search~\sidecite{Bergstra2012} aims to find optima in these gaps by sampling hyperparameters at random.
With the same number of trials, random search has a higher probability of finding the global optimum.
This is because trials explore the whole distribution as opposed to just a few points in individual dimensions.
\Cref{fig:gridrandsearch} shows the differences between grid search and random search and advocates the use of the latter.
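A toy numerical illustration of this argument with nine trials in a two-dimensional search space (all values are arbitrary):

import numpy as np

# Grid search probes 3 distinct values per dimension; random search probes 9.
grid = np.array([(a, b) for a in (0.1, 0.5, 0.9) for b in (0.1, 0.5, 0.9)])
rng = np.random.default_rng(0)
rand = rng.uniform(0.0, 1.0, size=(9, 2))
print(len(set(grid[:, 0])))   # 3 distinct values explored per dimension
print(len(set(rand[:, 0])))   # 9 distinct values explored per dimension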
@@ -356,7 +356,7 @@ \subsubsection{Grid search and random search}

\subsubsection{Tree Parzen estimator}
Still, random search requires trials in regions that are unpromising, which is inefficient.
A tree-structured Parzen estimator (TPE)~\cite{Bergstra2011} approach aims to model the probability of a hyperparameter\footnote{Or a set of hyperparameters in the case of multivariate TPE~\cite{Falkner2018}}, given a loss value.
A tree-structured Parzen estimator (TPE)~\sidecite{Bergstra2011} approach aims to model the probability of a hyperparameter\sidenote{Or a set of hyperparameters in the case of multivariate TPE~\cite{Falkner2018}}, given a loss value.
That probability consists of two distributions, describing the good and bad values:
\begin{equation}
p(c|L) =
@@ -374,14 +374,14 @@ \subsubsection{Tree Parzen estimator}
\mathrm{promisingness}(c) \propto p(c|\mathrm{good}) / p(c|\mathrm{bad})
\end{equation}
is high.
Ref.~\cite{Bergstra2011} shows that this ratio is proportional to the expected improvement~\cite{Jones2001}.
Ref.~\sidecite{Bergstra2011} shows that this ratio is proportional to the expected improvement~\sidecite{Jones2001}.
The configuration responsible for the maximum of $\mathrm{promisingness}(c)$ is used as the next trial.
Results of that trial are now categorized as good or bad, and the iterative process continues.
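A rough sketch of one such iteration for a single hyperparameter, modelling the two distributions with kernel density estimates; the good/bad quantile, the candidate grid, and the toy objective are illustrative assumptions:

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
configs = rng.uniform(0, 1, size=50)                     # past trials of one hyperparameter
losses = (configs - 0.3) ** 2 + rng.normal(0, 0.01, 50)  # toy objective, optimum near 0.3

threshold = np.quantile(losses, 0.25)                    # best 25% count as "good"
p_good = gaussian_kde(configs[losses <= threshold])
p_bad = gaussian_kde(configs[losses > threshold])

candidates = np.linspace(0.01, 0.99, 99)
promisingness = p_good(candidates) / p_bad(candidates)   # ratio proportional to expected improvement
next_trial = candidates[np.argmax(promisingness)]        # proposed configuration, near 0.3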

\subsubsection{Successive Halving and Hyperband}
Although $\mathcal{C}$ can be sampled more efficiently with TPE, trials still use the full computational budget, even if it is apparent early on that a trial is unpromising.
Terminating (or pruning) these underperforming trials early speeds up hyperparameter optimization.
Pruning trials can be done using Successive Halving (SH)~\cite{Jamieson2016}.
Pruning trials can be done using Successive Halving (SH)~\sidecite{Jamieson2016}.
Given a computational budget $B$, \eg number of epochs, the number of trials $T$, and the halving rate $\gamma$, SH performs $\log_\gamma(T)$ rounds.
The budget is distributed uniformly over the trials.
Every round, only the best $\qty{100}{\percent} \times 1/\gamma$ of the trials are retained; the rest are discarded based on their performance.
@@ -395,7 +395,7 @@ \subsubsection{Successive Halving and Hyperband}
There is a trade-off between $T$ and $B$.
If $T$ is large, each trial receives only a small budget, but many configurations are explored.
Conversely, if $T$ is small, each trial receives a large budget, at the cost of exploring fewer configurations.
This $T/B$ trade-off is addressed by Hyperband (HB)~\cite{Li2016} by performing a grid search over feasible values of $T$.
This $T/B$ trade-off is addressed by Hyperband (HB)~\sidecite{Li2016} by performing a grid search over feasible values of $T$.
HB invokes SH multiple times.
Every invocation of SH is called a bracket.
In the end, HB returns the best configuration found, just like SH, but with a reduced dependence on manually choosing a good $T$.
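A compact sketch of one SH bracket; the toy evaluation function and the budget split are illustrative, and real implementations resume trials rather than retraining them:

import math, random

def successive_halving(trials, budget, gamma=2, evaluate=lambda cfg, b: random.random()):
    rounds = round(math.log(len(trials), gamma))
    for _ in range(rounds):
        per_trial = budget // (rounds * len(trials))     # uniform share of the budget
        scores = {cfg: evaluate(cfg, per_trial) for cfg in trials}
        keep = max(1, len(trials) // gamma)              # retain the best 1/gamma
        trials = sorted(trials, key=scores.get)[:keep]   # lower score = better
    return trials[0]

best = successive_halving(trials=list(range(16)), budget=160)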
@@ -414,7 +414,7 @@ \subsection{Image quality}\label{subsec:imq}
A convolutional neural network receives a stream of input images with varying quality.
For example, microscopy images from deep inside tissue are presumably noisier and/or less bright than images taken near the surface.
Neural networks have trouble learning from low-quality images, as these lack the structures that trigger neurons to output predictions close to the targets.
Excluding noisy images might increase performance \cite{Blokker2022}.
Excluding noisy images might increase performance \sidecite{Blokker2022}.
\textcite{Koho2016} suggest some measures to quantify image quality.
Here, the entropy and kurtosis are discussed.

@@ -433,7 +433,7 @@ \subsubsection{Shannon entropy}
\begin{equation}
H_I = -\sum_{i}^n P_i \log_2 P_i,
\end{equation}
where $P_i$ is the normalized image histogram at bin index $i$ \cite{Koho2016}.
where $P_i$ is the normalized image histogram at bin index $i$ \sidecite{Koho2016}.
The base of the logarithm is chosen to be two, such that the entropy is in units of bits.

For images containing many different intensities, the entropy is high, because each intensity value occurs with a lower probability and therefore carries more information.
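A short sketch of this measure for an 8-bit image; the bin count of 256 matches the intensity levels and is an assumption, since the histogram details are not spelled out here:

import numpy as np

def image_entropy(image, bins=256):
    hist, _ = np.histogram(image, bins=bins, range=(0, 256))
    p = hist / hist.sum()              # normalized image histogram P_i
    p = p[p > 0]                       # empty bins contribute nothing
    return -np.sum(p * np.log2(p))     # entropy in bits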
@@ -458,7 +458,7 @@ \subsubsection{Kurtosis}
Negative kurtosis means that the distribution is platykurtic.
Platykurtic distributions have thinner tails, such as the Bernoulli distribution.

Kurtosis can be calculated on the upper part of the power spectrum of an image \cite{Koho2016,Blokker2022}.
Kurtosis can be calculated on the upper part of the power spectrum of an image \sidecite{Koho2016,Blokker2022}.
If the upper part of the power spectrum is very leptokurtic compared to other images in the dataset, it may indicate that the image is an outlier and is significantly different from the mean.
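A hedged sketch of this measure; the high-frequency cut-off fraction is an illustrative choice and not necessarily the band used in the cited works:

import numpy as np
from scipy.stats import kurtosis

def highfreq_kurtosis(image, cutoff=0.5):
    power = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2   # power spectrum
    fy = np.fft.fftshift(np.fft.fftfreq(image.shape[0]))
    fx = np.fft.fftshift(np.fft.fftfreq(image.shape[1]))
    radius = np.hypot(*np.meshgrid(fy, fx, indexing="ij"))     # distance from zero frequency
    upper = power[radius > cutoff * radius.max()]              # upper part of the spectrum
    return kurtosis(upper)                                      # Fisher (excess) kurtosis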

\subsection{Explainable AI}
@@ -475,7 +475,7 @@ \subsection{Explainable AI}
XAI gives users and patients more confidence in the prediction so that specialists can proceed with treatment.

\subsubsection{Occlusion}\label{subsec:occlusion}
Occlusion \cite{Zeiler2013} is an XAI pertubation technique.
Occlusion \sidecite{Zeiler2013} is an XAI perturbation technique.
The method replaces patches of the input with a baseline value.
For images, \eg, a patch can have any shape and the baseline value can be 0, practically making the patch black and removing all information at the patch's location and its connections with neighbouring pixels.
In the original paper, occlusion is used to systematically cover parts of foreground objects, to gain confidence that the AI uses foreground objects to predict the output.
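A minimal sketch of the procedure; the patch size, stride, baseline value, and the scalar-scoring model callable are assumptions for illustration:

import numpy as np

def occlusion_map(model, image, patch=16, stride=16, baseline=0.0):
    h, w = image.shape
    heatmap = np.zeros((h // stride, w // stride))
    reference = model(image)                              # score on the intact image
    for i in range(0, h - patch + 1, stride):
        for j in range(0, w - patch + 1, stride):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = baseline  # remove local information
            heatmap[i // stride, j // stride] = reference - model(occluded)
    return heatmap                                         # large drop = important region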
38 changes: 38 additions & 0 deletions frontbackmatter/appendix-sclicom.tex
@@ -1,3 +1,39 @@
\section{Derivation of ROC and PR curve baselines}
\subsection{ROC curve baseline}\label{subsec:roc-curve-baseline-derivation}
The accuracy of a binary classifier is
\begin{align}
\text{acc} &= \pi \cdot\text{TPR} + (1 - \pi) (1 - \text{FPR}),
\end{align}
where $\pi$ is the fraction of positive samples.
This can be rewritten to express $\text{TPR}$ in terms of $\text{FPR}$:
\begin{align}
\text{TPR} &= \frac{\text{acc}}{\pi} - \frac{(1-\pi)}{\pi}(1 - \text{FPR}) \\
&= \frac{\text{acc}}{\pi} - \frac{(1-\pi)}{\pi} + \frac{1-\pi}{\pi}\text{FPR} \\
&= \frac{\text{acc} - 1 + \pi}{\pi} + \frac{1-\pi}{\pi}\text{FPR}.
\end{align}
In the case of an always-positive classifier, which is correct for a fraction $\pi$ of the samples (so that $\text{acc} = \pi$),
\begin{align}
\text{TPR}(\text{acc} = \pi) = \frac{1-\pi}{\pi}\text{FPR} + 2 - \frac{1}{\pi}.
\end{align}
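A quick numerical check of this baseline ($\pi$ and the FPR grid are arbitrary example values): every point on the derived line should give an accuracy of exactly $\pi$.

import numpy as np

pi = 0.3
fpr = np.linspace(0.0, 1.0, 5)
tpr = (1 - pi) / pi * fpr + 2 - 1 / pi        # derived ROC baseline
acc = pi * tpr + (1 - pi) * (1 - fpr)
print(np.allclose(acc, pi))                   # True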

\subsection{PR curve baseline}\label{subsec:pr-curve-baseline-derivation}
The $F_1$-score is defined as
\begin{align}
F_1 &= \frac{2}{\text{prec}^{-1} + \text{rec}^{-1}},
\end{align}
which can be rewritten as
\begin{align}\label{eq:prec-in-terms-of-rec}
\text{prec} = \frac{F_1 \cdot \text{rec}}{2\cdot \text{rec} - F_1}.
\end{align}
The $F_1$-score of an always-positive classifier is
\begin{align}
F_{1, +} &= \frac{2 \cdot \text{prec}}{\text{prec} + 1},
\end{align}
since in this case $\text{prec} = \pi$ and $\text{rec} = 1$.
Substituting $F_{1, +}$ for $F_1$ in \cref{eq:prec-in-terms-of-rec}, we get
\begin{align}
\text{prec} = \frac{F_{1, +} \cdot \text{rec}}{2\cdot \text{rec} - F_{1, +}}.
\end{align}
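A similar numerical check ($\pi$ and the recall grid are arbitrary example values): the baseline curve should pass through $(\text{rec}, \text{prec}) = (1, \pi)$, the operating point of the always-positive classifier.

import numpy as np

pi = 0.3
f1_plus = 2 * pi / (pi + 1)                   # F1 of the always-positive classifier
rec = np.linspace(0.5, 1.0, 6)
prec = f1_plus * rec / (2 * rec - f1_plus)    # derived PR baseline
print(np.isclose(prec[-1], pi))               # True at rec = 1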

\section{Flow of images to splits}\label{app:folds-splits-viz}
The splits are created as described in \cref{subsubsec:slicom-folds}.
The process is visualized in \cref{fig:folds-splits-viz}.
@@ -15,3 +15,5 @@ \section{Flow of images to splits}\label{app:folds-splits-viz}
}
\label{fig:folds-splits-viz}
\end{figure*}

\
6 changes: 3 additions & 3 deletions general_introduction.tex
@@ -25,7 +25,7 @@ \section{Deep learning for higher harmonic microscopy}

\section{Reporting of clinical artificial intelligence}
The prediction models described in this work may eventually aid health care providers in acquiring clinically relevant parameters or estimating the probability of risk that an outcome is present.
The Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) Initiative developed guidelines to report on such diagnostic models \cite{Collins2015, Moons2015, Heus2020}.
The Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) Initiative developed guidelines to report on such diagnostic models \sidecite{Collins2015, Moons2015, Heus2020}.
Recent advances in artificial intelligence (AI) have led to AI being applied as black-box predictive models in health care, often without sufficient reporting.
Transparent reporting on these black box models builds confidence in using and further developing the models.
This is especially important in health care, where there is a need for automation while trust in AI is yet to be earned.
@@ -34,5 +34,5 @@ \section{Reporting of clinical artificial intelligence}
Unlike machine learning (ML) models, AI models learn by recognizing patterns.
These patterns are then used in inference to make a prediction, possibly of clinical value.
It should then be explained to the clinician how the model came to its conclusion, along with its confidence.
To account for these challenges, an extension for the TRIPOD statement, TRIPOD-AI is currently being developed \cite{Collins2021,Collins2020}.
Reports on the diagnostic models developed in this study aim to adhere to TRIPOD-AI as well as possible\footnote{The reader is invited to use the TRIPOD-AI accompanying PROBAST-AI \cite{Wolff2019a, Wolff2019b, Collins2021} checklist to assess the risk of bias of the predictive models.}.
To account for these challenges, an extension of the TRIPOD statement, TRIPOD-AI, is currently being developed \sidecite{Collins2021,Collins2020}.
Reports on the diagnostic models developed in this study aim to adhere to TRIPOD-AI as closely as possible\sidenote{The reader is invited to use the PROBAST-AI checklist \cite{Wolff2019a, Wolff2019b, Collins2021}, which accompanies TRIPOD-AI, to assess the risk of bias of the predictive models.}.
18 changes: 18 additions & 0 deletions library.bib
@@ -1674,6 +1674,24 @@ @article{Li2016
bibsource = {dblp computer science bibliography, https://dblp.org}
}

@article{Ma2018,
author = {Ningning Ma and
Xiangyu Zhang and
Hai{-}Tao Zheng and
Jian Sun},
title = {ShuffleNet {V2:} Practical Guidelines for Efficient {CNN} Architecture
Design},
journal = {CoRR},
volume = {abs/1807.11164},
year = {2018},
url = {http://arxiv.org/abs/1807.11164},
eprinttype = {arXiv},
eprint = {1807.11164},
timestamp = {Tue, 21 Dec 2021 10:11:08 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-1807-11164.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

@article{Heus2020, title={Transparent reporting of multivariable prediction models in Journal and conference abstracts: Tripod for abstracts}, volume={173}, DOI={10.7326/m20-0193}, number={1}, journal={Annals of Internal Medicine}, author={Heus, Pauline and Reitsma, Johannes B. and Collins, Gary S. and Damen, Johanna A.A.G. and Scholten, Rob J.P.M. and Altman, Douglas G. and Moons, Karel G.M. and Hooft, Lotty}, year={2020}, pages={42–47}}
@article{Litjens2017,
@@ -12,5 +12,5 @@ \subsubsection{Attention weighted images might be used for visual guidance}

\subsubsection{Prediction may be consulted as validation}
When working with time constraints, such as in an intraoperative setting, human mistakes may occur more frequently.
Moreover, pathology using HHG microscopy is not well-established yet, so pathologists would need to be trained on HHG images~\cite{Spies2023}, and therefore they make more errors in the beginning of using this modality.
Moreover, pathology using HHG microscopy is not well-established yet, so pathologists would need to be trained on HHG images~\sidecite{Spies2023}, and may therefore make more errors when first using this modality.
If the model is further improved until a desired performance is reached, the prediction may serve as a (non-binding) diagnosis validation.
@@ -14,11 +14,11 @@ \subsection{SimCLR clusters features that represent similar structures}
Ideally, SimCLR maps tiles to features in as many clusters as there are target classes, such that the classes can be easily separated.
In the t-SNE projections (\cref{fig:tsne-features}), the classes seem reasonably well separated, which is more apparent at higher t-SNE perplexities.
However, feature projections of multiple images with the same diagnosis are rarely packed together.
This can likely be improved by using larger feature extraction backbones or more sophisticated self-supervised training scheme like SwAV~\cite{Caron2020}.
This can likely be improved by using larger feature extraction backbones or a more sophisticated self-supervised training scheme such as SwAV~\sidecite{Caron2020}.

\subsection{The PMC-HHG dataset is too small to make distinguishing statements on model performance}
The errors on the test AUPRG are \qty{122 \pm 106}{\percent}, which is most probably the result of a small dataset.
As shown in \cite{Schirris2022}, performance metrics are expected to drastically increase when increasing the number of samples.
As shown in \sidecite{Schirris2022}, performance metrics are expected to increase drastically as the number of samples increases.
With a larger number of samples, there is a higher probability that features of training and testing data are similar, which ought to improve performance.

\subsection{Domain-specific and ImageNet pretrained feature extractors have comparable performance}
4 changes: 2 additions & 2 deletions pediatric-brain-tumours/sections/discussion/limitations.tex
@@ -14,9 +14,9 @@ \subsubsection{Models of fold 1 were overfit}
\subsubsection{Tile size was not varied}
The HHG microscope can image tissue with a resolution of \qty{0.2}{mpp}.
The tiles that are presented to the model are \qty{44.8}{\micro\meter}$\times$\qty{44.8}{\micro\meter}.
Medulloblastomas are characterized by the absence of increased cell size (max.~$\sim\qty{32}{\micro\meter}$~\cite{Orr2020}), among others.
Medulloblastomas are characterized, among other features, by the absence of increased cell size (max.~$\sim\qty{32}{\micro\meter}$~\sidecite{Orr2020}).
This is smaller than the tile size.
Pilocytic astrocytomas develop from astrocytes and their processes are about \qty{97.9}{\micro\meter}~\cite{Vasile2017}, which is larger than the tile size.
Pilocytic astrocytomas develop from astrocytes and their processes are about \qty{97.9}{\micro\meter}~\sidecite{Vasile2017}, which is larger than the tile size.
It might be beneficial for the model to work with tile sizes larger than key disease features; otherwise, the model may have more difficulty recognizing specific disease patterns.
In a future study, the effect of using tiles that are about the size of disease features should be studied.

