predictor pre-selection

siemdejong · Dec 16, 2022 · commit dccd15c
1 parent 2fced3b
Showing 2 changed files with 29 additions and 3 deletions.
diff --git a/README.md b/README.md
@@ -211,7 +211,7 @@ Yet to be adapted to this study.
         - [ ] Statistical analysis methods
             - [ ] Diagram of analytical process
             - [ ] handling of predictors
-            - [ ] Pre-selection of predictors prior to model building (results for exp/pca/logistic)
+            - [x] Pre-selection of predictors prior to model building (results for exp/pca/logistic)
             - [ ] rescaling/transformation on predictors (LDS + reweighting)
             - [ ] type of model, building model + predictor selection + internal validation
             - [x] model ensembling techniques (if used)

diff --git a/skinstression/chapters/methods.tex b/skinstression/chapters/methods.tex
@@ -93,7 +93,7 @@ \section{Outcome}
 The measurement is done mechanically by an experimentalist.
 The mechanical measurement itself is blind to clinical information.
 
-\section{Predictors}
+\section{Predictors}\label{sec:skin_predictors}
 
 % --------------------------------------------------
 % SEARCHING FOR A SIMPLE SKIN STRAIN-STRESS MODEL
@@ -152,7 +152,7 @@ \subsubsection{Exponential}
 \subsubsection{Principal component analysis}
 In an earlier study (ref A.\ Soylu), principal component analysis (PCA) is used to reduce the dimensionality of the strain-stress data.
 In summary, after PCA, every measurement $Y$ can be approximated by
-\begin{equation}
+\begin{equation}\label{eq:pca}
   Y \approx Y_\mathrm{PCA} = \mathbf{A} \mathbf{V} + \bar{Y},
 \end{equation}
 where $\mathbf{A}$ and $\mathbf{V}$ are matrices containing the eigenvalues and eigenvectors, respectively, of the measurement data.
@@ -195,6 +195,32 @@ \section{Missing data}
 
 \section{Statistical analysis methods}
 
+\subsection{Predictor pre-selection}
+As discussed in \cref{sec:skin_predictors}, there are three candidates to be used as neural network predictors.
+These candidates are tested against the raw strain-stress curves.
+
+\subsubsection{Exponential and logistic curve}
+The exponential and logistic models are fitted to all raw strain-stress curves.
+The goodness of fit is assessed by eye.\marginnote{Whyyyy by eyeeee, simply calculate r2 also for exp :) yo}
+A fit is considered good if it passes reasonably through all data points.
+Moreover, the exponential regime of the fit should describe the leg part of the curve.
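+A quantitative complement to visual inspection, sketched here as a suggestion and not as the criterion applied in this study, is the coefficient of determination
+\begin{equation}
+  R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2},
+\end{equation}
+where $y_i$ denotes the measured stress at data point $i$, $\hat{y}_i$ the fitted stress and $\bar{y}$ the mean measured stress of the curve; this notation is assumed for illustration only.
+Values close to one indicate that the fit explains most of the variance in the curve.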
+
+\subsubsection{Principal component analysis}
+PCA requires that all curves are sampled at the same points along at least one axis.
+The first step to achieve this is to exclude all stretch values above the maximum stretch of the shortest curve.
+\textcite{Soylu2022} linearly interpolated the curves and restricted both stretch and stress to the minimum peak value.
+PCA on two variables requires only one shared set of points.
+Moreover, the results of \citeauthor{Soylu2022} show kinks in the PCA reconstructions near the end of the curves, which could originate from a limited number of data points or from the linear interpolation.
+Therefore, in this study, a non-uniform, univariate, interpolating spline was fitted to all points, and the stress was evaluated from the spline at the stretch values of the curve with the lowest maximum stretch.
+After PCA on the complete dataset, the explained variance per component was calculated and used to choose an appropriate number of principal components.
+From these principal components, the curves were reconstructed using \cref{eq:pca}.
+The goodness of fit was determined by eye.
+A fit is considered good if it passes reasonably through all data points and has few inflection points.
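+
+As a sketch of the explained-variance criterion, assume the principal components are ordered by decreasing eigenvalue $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p$ of the covariance matrix of the resampled curves; the symbols $\lambda_k$, $p$, $K$ and $\tau$ are notation assumed here.
+The cumulative fraction of variance explained by the first $K$ components is then
+\begin{equation}
+  \mathrm{EV}(K) = \frac{\sum_{k=1}^{K} \lambda_k}{\sum_{k=1}^{p} \lambda_k},
+\end{equation}
+and one possible choice is the smallest $K$ for which $\mathrm{EV}(K) \geq \tau$, with $\tau$ a threshold that is not specified in this study.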
+
+Only if PCA on the full dataset works reasonably well is it possible to fit PCA on one subset and use it to reconstruct another subset.
+This would be useful if PCA were used to construct predictors, because using the PCA results of the full dataset leaks information from the test sets into the training set: the components describe data from both subsets.
+This is unlike Ref.~\cite{Soylu2022}, where information leakage was not considered.\marginnote{Where to put PCA bias study?}
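+
+A minimal sketch of such a leakage-free reconstruction, in the notation of \cref{eq:pca} and under the assumption that the rows of $\mathbf{V}$ are orthonormal, reads
+\begin{equation}
+  Y_\mathrm{test} \approx \left(Y_\mathrm{test} - \bar{Y}_\mathrm{train}\right) \mathbf{V}_\mathrm{train}^\top \mathbf{V}_\mathrm{train} + \bar{Y}_\mathrm{train},
+\end{equation}
+where $\bar{Y}_\mathrm{train}$ and $\mathbf{V}_\mathrm{train}$ are estimated from the training subset only; the subscripts are introduced here for illustration.
+The coefficients $\mathbf{A}_\mathrm{test} = (Y_\mathrm{test} - \bar{Y}_\mathrm{train}) \mathbf{V}_\mathrm{train}^\top$ then do not depend on the mean or the components of the test subset.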
+
 \subsection{Convolutional neural network}
 The basis of the model originates from Liang \emph{et al.} \cite{Liang2017} and is adapted by Soylu \cite{Soylu2022}.
 The model, a convolutional neural network, consists of five blocks.