inconsistency between Poisson loglikelihood value, gradient (when using multiplicative factors) and Hessian due to thresholding #1472

Open
KrisThielemans opened this issue Jul 12, 2024 · 12 comments · May be fixed by #1482
KrisThielemans commented Jul 12, 2024

Continuing from #1461...

We attempt to avoid zero or negatives in the log (or division by zero or negatives in the gradient) by thresholding the quotient. Unfortunately, the code is not consistent, even after many attempts.

First, introducing notation:
$T(e, y) = \max(e, y/Q)$
(i.e. a lower threshold on $e$), and the (full) forward projection $e = P\lambda+b$.

Formulas

A strategy could be

$ \log L = \sum_i y_i \log ( T(e_i,y_i) ) - T(e_i,y_i)$ [1]

This function has a gradient w.r.t. $\lambda$ (with some element-wise multiplications etc., and ignoring the non-differentiable points)

$ { P^T {\partial T \over \partial e} (e, y) ( {y \over T(e,y) } - 1) }$ [2]

with

$ {\partial T \over \partial e} (e,y) = 1$ if $e \geq y/Q$ and $0$ otherwise [3]

The quotient part of this gradient is then equivalent to back-projecting

e >= y/Q ? y/e : 0 [4]
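
For concreteness, a minimal per-bin sketch of [1] and [4] (not STIR code; the names T, logL_term and grad_quotient are hypothetical, and Q is hard-wired to 10000 as in the current implementation):

#include <algorithm>
#include <cmath>
#include <cstdio>

constexpr double Q = 10000.;

// T(e, y) = max(e, y/Q), i.e. a lower threshold on the expected counts e
double T(double e, double y) { return std::max(e, y / Q); }

// per-bin contribution to the logL value [1]; the log term is dropped when y == 0
double logL_term(double e, double y)
{
  return (y > 0 ? y * std::log(T(e, y)) : 0.) - T(e, y);
}

// per-bin quotient [4] that gets back-projected for the gradient
double grad_quotient(double e, double y)
{
  return (e >= y / Q) ? y / e : 0.;
}

int main()
{
  // e chosen small enough that both thresholds kick in
  std::printf("logL term %g, quotient %g\n", logL_term(1e-5, 3.), grad_quotient(1e-5, 3.));
}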

Log-likelihood code

The function [1] is what is computed by PoissonLogLikelihoodWithLinearModelForMeanAndProjData for the value via

// new_estimate is T(e, y) from above, with Q = max_quotient
const float new_estimate = max(estimated_projections[r][b], projection_data[r][b] / max_quotient);
if (projection_data[r][b] <= small_value)
  sub_result += -double(new_estimate);
else
  sub_result += projection_data[r][b] * log(double(new_estimate)) - double(new_estimate);

where the arguments are computed
forward_projector_sptr->forward_project(estimated_viewgrams);
if (additive_binwise_correction_ptr != NULL)
  {
    estimated_viewgrams += (*additive_binwise_correction_ptr);
  };
if (mult_viewgrams_ptr != NULL)
  {
    estimated_viewgrams *= (*mult_viewgrams_ptr);
  }
RelatedViewgrams<float>::iterator meas_viewgrams_iter = measured_viewgrams_ptr->begin();
RelatedViewgrams<float>::const_iterator est_viewgrams_iter = estimated_viewgrams.begin();
// call function that does the actual work, it sits in recon_array_funtions.cxx (TODO)
for (; meas_viewgrams_iter != measured_viewgrams_ptr->end(); ++meas_viewgrams_iter, ++est_viewgrams_iter)
  accumulate_loglikelihood(*meas_viewgrams_iter, *est_viewgrams_iter, rim_truncation_sino, log_likelihood_ptr);

Gradient code

For the gradient we use the optimisation that any multiplicative factors drop out.

forward_projector_sptr->forward_project(estimated_viewgrams);
if (additive_binwise_correction_ptr != NULL)
  estimated_viewgrams += (*additive_binwise_correction_ptr);
// for sinogram division
divide_and_truncate(*measured_viewgrams_ptr, estimated_viewgrams, rim_truncation_sino, count, count2, log_likelihood_ptr);
// adding the sensitivity: backproj[y/ybar] *
// not adding the sensitivity computes the gradient: backproj[y/ybar - 1] *
// * ignoring normalisation *
if (!add_sensitivity)
  {
    if (mult_viewgrams_ptr)
      {
        // subtract normalised ones from the data [y/ybar - 1/N]
        *measured_viewgrams_ptr -= *mult_viewgrams_ptr;
      }
    else
      {
        // No mult_viewgrams_ptr, subtract ones [y/ybar - 1]
        *measured_viewgrams_ptr -= 1;
      }
  }
// back project
back_projector_sptr->back_project(*measured_viewgrams_ptr);

Unfortunately, the divide_and_truncate method (which implements the division [4]) does not know about the multiplicative factors, and therefore uses a different threshold from the one used for the logL value.

Gradient formulas

Writing the full forward projection as $e = P\lambda + b = m (G\lambda+a)$, and the forward projection without $m$ as $f = G\lambda+a$, the current gradient computation is

G.back_project((f >= y/Q ? y/f : 0) - m)

This is equivalent to

P.back_project((e >= m*y/Q ? y/e : 0) - 1)
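
To make the mismatch concrete, a small numerical sketch (hypothetical values, not STIR code) of a bin in the window $m y/Q \leq e < y/Q$, where the value is thresholded but the gradient quotient is not:

#include <algorithm>
#include <cmath>
#include <cstdio>

int main()
{
  const double Q = 10000.;
  const double m = 0.01;  // multiplicative factor (attenuation * efficiencies)
  const double y = 5.;    // measured counts
  const double f = 0.01;  // f = G*lambda + a
  const double e = m * f; // e = 1e-4, which lies in [m*y/Q, y/Q) = [5e-6, 5e-4)

  // value: thresholds on e (accumulate_loglikelihood), so the threshold is hit here
  const double T_e = std::max(e, y / Q); // = 5e-4, not e
  const double value_term = y * std::log(T_e) - T_e;

  // gradient: thresholds on f (divide_and_truncate), so no truncation here
  const double quotient = (f >= y / Q) ? y / f : 0.; // = 500

  std::printf("value term %g (thresholded), quotient %g (not thresholded)\n", value_term, quotient);
}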

Conclusion

There are two inconsistencies:

  • In both divide_and_truncate and accumulate_loglikelihood, $Q=10000$, so the quotient in the gradient computation is not consistent with the log term in the logL value wherever only one of the two reaches its threshold, i.e. for $m<1$ where $m y/Q < e < y/Q$.
  • The "sensitivity" term uses a threshold in the current logL computation, but not in the gradient computation.
KrisThielemans self-assigned this Jul 12, 2024
KrisThielemans commented Jul 12, 2024

Note that in most cases, $m$ is the product of attenuation factors (of the order 0.001 to 1) and "efficiencies", which for most scanners (without calibration) are around 1, so $m$ is likely less than 1. This also means that the threshold (on $e$) in the gradient calculation is currently lower than $y/10000$, and will probably never be reached, which is good. (The logL calculation will then not correspond to the gradient for such $y$ at slightly larger $e$, but these are still small values.) However, as soon as we include duration (or calibration factors, whose values I'm not sure about), this will no longer be the case.

Side note: the code has separate handling for the $y=0$ case (or actually, $y \leq \max(\mathrm{viewgram}) \cdot 0.000001$), in which case the log term gets dropped, as required.

Current impact estimation:

  • any algorithm that doesn't use the value of the log-likelihood is not affected by this discrepancy
  • as long as $m \leq 1$ (i.e. "normalisation factors" are around 1 or higher), the gradient threshold is unlikely to be reached, and definitely not when $b>0$. In any case, it only matters where $y>0$.

Nevertheless, these thresholds should be consistent, and in any case are probably surprising to most people (as they are not documented). Also, the current gradient strategy is sensitive to the scale of $m$, which is dangerous.

KrisThielemans commented:

After discussion with @mehrhardt, the proposed strategy in the first instance is:

  • make the threshold configurable
    • allow switching it off, in which case logL or its gradient could be undefined, NaN or infinite
    • or increase $Q$ such that it is less likely to be hit
  • use the threshold in terms of $e$, not $f$ (this implies passing $m$ to divide_and_truncate)
  • document the behaviour

Things to consider:

  • use soft thresholding to keep things differentiable
  • move thresholding to the algorithm, as opposed to the objective function.

KrisThielemans changed the title from "inconsistency between Poisson loglikelihood value and gradient when using multplicative factors due to thresholding" to "inconsistency between Poisson loglikelihood value and gradient when using multiplicative factors due to thresholding" on Jul 12, 2024
KrisThielemans commented Jul 12, 2024

There's another possible choice for the objective function:

$ \log L = \sum_i y_i \log ( T(e_i,y_i) ) - e_i$ [5]

This choice has a few advantages (and is also preferred by @mehrhardt):

  • It could be more natural to only handle problems in the log term
  • It allows computing the latter term via the sensitivity (it is equivalent to $\sum_v \mathrm{sensitivity}_v \lambda_v$). Otherwise, at least in principle, we would have to take into account the bins where the threshold was applied. Aside from being very messy, [1] is also virtually impossible to implement in the list-mode objective function.

After more discussions with @NicoleJurjew, two other things came up:

  • if the threshold depends on $y$, it cannot be reproduced in the list-mode objective function
  • to save time/memory, the gradient calculation can be done without the "sensitivity" term (this is called add_sensitivity=true, as it computes the gradient + sensitivity). In that case, $m$ is not passed to the above function (see here). That means that we cannot use a threshold strategy that depends on $m$ for the gradient.

Proposal

For $f= G\lambda+a$, implement

$ \log L = \sum_i y_i \log ( \max(f_i,\epsilon) ) - m_i f_i$ [6]

with $\epsilon$ a user parameter, hopefully defaulting to 0. Note that this will NOT be backwards compatible for ProjData with either the current log-likelihood or the gradient computation, whenever the thresholds are reached (but this should happen really infrequently). Note that even without thresholds, [6] differs from [1] by a global constant (depending on $m$), but it is compatible with the list-mode data objective function value.
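
A minimal per-bin sketch of [6] (hypothetical names, not the actual STIR interface), with the log term thresholded on $f$ only, so that the linear term can still be obtained from the sensitivity:

#include <algorithm>
#include <cmath>
#include <cstdio>

// per-bin contribution to [6]: y*log(max(f, eps)) - m*f, dropping the log term for y == 0
double logL_term_proposal(double f, double m, double y, double eps)
{
  return (y > 0 ? y * std::log(std::max(f, eps)) : 0.) - m * f;
}

int main()
{
  std::printf("%g\n", logL_term_proposal(0.01, 0.5, 3., 1e-6));
}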

@mehrhardt @gschramm any comments?

KrisThielemans commented Jul 12, 2024

Other things noted while investigating all this:

  • divide_and_truncate allows accumulating the "log-likelihood", but it is wrong unless the measured data contain $m$.
  • the list-mode objective function calculation seems to miss an else for the non-threshold case.

gschramm commented:

To me, [5] / [6] with a default eps = 0 feels most "natural".

How often does the "log(0)" problem occur in real PET/SPECT systems, where there should always be a finite amount of contamination (except for SPECT scans without scatter)?

KrisThielemans commented Jul 22, 2024

How often does the "log(0)" problem occur in real PET/SPECT systems, where there should always be a finite amount of contamination (except for SPECT scans without scatter)?

In "traditional" algorithms without precorrections, and the background model is somewhat accurate, then I don't think it should ever occur. However, there are a few cases where it could occur

  • people do not always do the "proper" things, for example:
    • NAC reconstructions without background
    • reconstructions with a noisy "background" estimate
    • reconstruction of precorrected data (still very often done in SPECT where scatter is precorrected (badly))
    • initialisation with an image that has too small support (i.e. is zero where it shouldn't). This can also happen with OSEM (where a subset sets a voxel to zero when it shouldn't be).
  • step-size "overshoot", e.g. in (somewhat careless) line-searches
  • algorithms that explicitly allow negative voxel values, see some refs in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5854483/. Depending on algorithm/implementation, these thresholds might be useful or not.

The challenge for STIR is that most people expect it to "just work" even if they throw weird stuff at it...

gschramm commented Jul 22, 2024

I see. I think some negative pixels, as long as the forward model stays positive, are ok.

Wouldn't it make more sense to return infinity as soon as a single bin in the expectation is <= 0?
Of course that would require checks slowing things down and we would lose differentiability.

KrisThielemans commented:

I think some negative pixels, as long as the forward model stays positive, are ok.

I don't disagree, but that's a different optimisation problem from what most people will expect.

Wouldn't it make more sense to return infinity as soon as a single bin in the expectation is <= 0? Of course that would require checks slowing things down and we would lose differentiability.

Not all architectures can represent infinity (although these days, that's possibly less of a worry, as the IEEE floating point standard is now very widely adopted and I see that even CUDA represents infinity).

Clearly, the above thresholding strategies all require checks and will slow things down, and indeed the proposed function is not differentiable. An alternative is to just not do any checks, and let the result be undefined. In my experience, that throws up problems very quickly though.

KrisThielemans commented Jul 22, 2024

Summarising the proposal: for a user-configurable eps (a number)

  • compute f = G.forward_project(x) + a and S=G.back_project(m)
  • value: logL = sum((y>0 ? y*log(max(f,eps)) : 0) - m*f) = sum(y>0 ? y*log(max(f,eps)) : 0) - x · S
  • gradient: G.back_project((y>0 && f >= eps ? y/f : 0) - m) = G.back_project((y>0 && f >= eps ? y/f : 0)) - S
  • Hessian * z: G.back_project( (y>0 && f >= eps ? -y/f^2 : 0) * G.forward(z))
  • approximate Hessian * z (legacy code that uses $P^T {1 \over y} P z$): G.back_project( (y>0 ? 1/y : 0) * G.forward(z))

All of these can be computed by summing over list-mode events, aside from the "approximate" version, which is essentially not available (unless you histogram first).

There's a corner case where $y_i>0$ and $f_i \leq eps$: this will be undefined if $eps \leq 0$, although IEEE-fp-compatible architectures will give infinity for value, gradient and Hessian when $eps=0$.

Obviously, once infinity is returned, ugly things will happen. The only generic solution for that is to set eps>0.
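
A per-bin sketch of the quantities in the list above (hypothetical names, not STIR code; the back-projections and the sensitivity image S are left out):

#include <cmath>
#include <cstdio>

// factor that is back-projected with G for the gradient; the -m (i.e. -S) part is handled separately
double grad_factor(double f, double y, double eps)
{
  return (y > 0 && f >= eps) ? y / f : 0.;
}

// factor multiplying G.forward(z) before back-projection for Hessian * z
double hessian_factor(double f, double y, double eps)
{
  return (y > 0 && f >= eps) ? -y / (f * f) : 0.;
}

// factor for the "approximate" Hessian * z (only available after histogramming)
double approx_hessian_factor(double y)
{
  return (y > 0) ? 1. / y : 0.;
}

int main()
{
  std::printf("%g %g %g\n", grad_factor(0.01, 3., 0.), hessian_factor(0.01, 3., 0.), approx_hessian_factor(3.));
}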

KrisThielemans commented:

Actually, I see now that eps=0 has essentially no benefit, except that it changes behaviour when $f_i<0$ (which will not happen in most circumstances). I suppose this means we should only do the checks on f when eps>0 (or when verbosity>0).

KrisThielemans changed the title from "inconsistency between Poisson loglikelihood value and gradient when using multiplicative factors due to thresholding" to "inconsistency between Poisson loglikelihood value, gradient (when using multiplicative factors) and Hessian due to thresholding" on Jul 25, 2024
gschramm commented:

@KrisThielemans another possibility that came to my mind yesterday.

$$ \log L = \sum_i (y_i + \epsilon) \log (e_i + \epsilon) - (e_i + \epsilon) $$

with $\epsilon$ a small positive number.

I think the only problem is that this cannot be evaluated in listmode :-(
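
For reference, a per-bin sketch of this alternative (hypothetical name, not STIR code); it stays finite for any $e \geq 0$ as long as $\epsilon > 0$, but it needs the bin-wise $e_i$, hence the list-mode problem:

#include <cmath>
#include <cstdio>

double logL_term_shifted(double e, double y, double eps)
{
  return (y + eps) * std::log(e + eps) - (e + eps);
}

int main()
{
  std::printf("%g\n", logL_term_shifted(0., 3., 1e-6));
}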

KrisThielemans commented:

I think the only problem is that this cannot be evaluated in listmode :-(

yeah... I prefer a strategy where results are identical in both, at least in principle. It simplifies testing as well!
