inconsistency between Poisson loglikelihood value, gradient (when using multiplicative factors) and Hessian due to thresholding #1472

Open
KrisThielemans opened this issue Jul 12, 2024 · 12 comments · May be fixed by #1482
KrisThielemans commented Jul 12, 2024

Continuing from #1461...

We attempt to avoid zero or negatives in the log (or division by zero or negatives in the gradient) by thresholding the quotient. Unfortunately, the code is not consistent, even after many attempts.

First, introducing notation:
$T(e, y) = \max(e, y/Q)$
(i.e. a lower threshold on $e$), and the (full) forward projection $e = P\lambda+b$.

Formulas

A strategy could be

$ \log L = \sum_i y_i \log ( T(e_i,y_i) ) - T(e_i,y_i)$ [1]

This function has a gradient w.r.t. $\lambda$ (with some element-wise multiplications etc., and ignoring the non-differentiable points)

$ { P^T {\partial T \over \partial e} (e, y) ( {y \over T(e,y) } - 1) }$ [2]

with

$ {\partial T \over \partial e} (e,y) = 1$ if $e \geq y/Q$ and $0$ otherwise [3]

The quotient part of this gradient is then equivalent to back-projecting

e >= y/Q ? y/e : 0 [4]
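
For concreteness, a minimal per-bin sketch of [1] and [4] (not STIR code; the names T, logL_term and grad_quotient are hypothetical, and Q is hard-wired to 10000 as in the current implementation):

#include <algorithm>
#include <cmath>
#include <cstdio>

constexpr double Q = 10000.;

// T(e, y) = max(e, y/Q), i.e. a lower threshold on the expected counts e
double T(double e, double y) { return std::max(e, y / Q); }

// per-bin contribution to the logL value [1]; the log term is dropped when y == 0
double logL_term(double e, double y)
{
  return (y > 0 ? y * std::log(T(e, y)) : 0.) - T(e, y);
}

// per-bin quotient [4] that gets back-projected for the gradient
double grad_quotient(double e, double y)
{
  return (e >= y / Q) ? y / e : 0.;
}

int main()
{
  // e chosen small enough that both thresholds kick in
  std::printf("logL term %g, quotient %g\n", logL_term(1e-5, 3.), grad_quotient(1e-5, 3.));
}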

Log-likelihood code

The function [1] is what is computed by PoissonLogLikelihoodWithLinearModelForMeanAndProjData for the value via

// new_estimate is T(e, y) from above, with Q = max_quotient
const float new_estimate = max(estimated_projections[r][b], projection_data[r][b] / max_quotient);
if (projection_data[r][b] <= small_value)
  sub_result += -double(new_estimate);
else
  sub_result += projection_data[r][b] * log(double(new_estimate)) - double(new_estimate);

where the arguments are computed
forward_projector_sptr->forward_project(estimated_viewgrams);
if (additive_binwise_correction_ptr != NULL)
  {
    estimated_viewgrams += (*additive_binwise_correction_ptr);
  };
if (mult_viewgrams_ptr != NULL)
  {
    estimated_viewgrams *= (*mult_viewgrams_ptr);
  }
RelatedViewgrams<float>::iterator meas_viewgrams_iter = measured_viewgrams_ptr->begin();
RelatedViewgrams<float>::const_iterator est_viewgrams_iter = estimated_viewgrams.begin();
// call function that does the actual work, it sits in recon_array_funtions.cxx (TODO)
for (; meas_viewgrams_iter != measured_viewgrams_ptr->end(); ++meas_viewgrams_iter, ++est_viewgrams_iter)
  accumulate_loglikelihood(*meas_viewgrams_iter, *est_viewgrams_iter, rim_truncation_sino, log_likelihood_ptr);

Gradient code

For the gradient we use the optimisation that any multiplicative factors drop out.

forward_projector_sptr->forward_project(estimated_viewgrams);
if (additive_binwise_correction_ptr != NULL)
  estimated_viewgrams += (*additive_binwise_correction_ptr);
// for sinogram division
divide_and_truncate(*measured_viewgrams_ptr, estimated_viewgrams, rim_truncation_sino, count, count2, log_likelihood_ptr);
// adding the sensitivity: backproj[y/ybar] *
// not adding the sensitivity computes the gradient: backproj[y/ybar - 1] *
// * ignoring normalisation *
if (!add_sensitivity)
  {
    if (mult_viewgrams_ptr)
      {
        // subtract normalised ones from the data [y/ybar - 1/N]
        *measured_viewgrams_ptr -= *mult_viewgrams_ptr;
      }
    else
      {
        // No mult_viewgrams_ptr, subtract ones [y/ybar - 1]
        *measured_viewgrams_ptr -= 1;
      }
  }
// back project
back_projector_sptr->back_project(*measured_viewgrams_ptr);

Unfortunately, the divide_and_truncate method (which implements the division [4]) does not know about the multiplicative factors, and therefore uses a different threshold from the one used for the logL value.

Gradient formulas

Writing the full forward projection as $e = P\lambda + b = m (G\lambda+a)$, and the forward projection without $m$ as $f = G\lambda+a$, the current gradient computation is

G.back_project((f >= y/Q ? y/f : 0) - m)

This is equivalent to

P.back_project((e >= m*y/Q ? y/e : 0) - 1)
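
To make the mismatch concrete, a small numerical sketch (hypothetical values, not STIR code) of a bin in the window $m y/Q \leq e < y/Q$, where the value is thresholded but the gradient quotient is not:

#include <algorithm>
#include <cmath>
#include <cstdio>

int main()
{
  const double Q = 10000.;
  const double m = 0.01;  // multiplicative factor (attenuation * efficiencies)
  const double y = 5.;    // measured counts
  const double f = 0.01;  // f = G*lambda + a
  const double e = m * f; // e = 1e-4, which lies in [m*y/Q, y/Q) = [5e-6, 5e-4)

  // value: thresholds on e (accumulate_loglikelihood), so the threshold is hit here
  const double T_e = std::max(e, y / Q); // = 5e-4, not e
  const double value_term = y * std::log(T_e) - T_e;

  // gradient: thresholds on f (divide_and_truncate), so no truncation here
  const double quotient = (f >= y / Q) ? y / f : 0.; // = 500

  std::printf("value term %g (thresholded), quotient %g (not thresholded)\n", value_term, quotient);
}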

Conclusion

There are two inconsistencies:

  • In both divide_and_truncate and accumulate_loglikelihood, $Q=10000$, so the quotient in the gradient computation is not consistent with the log term in the logL value wherever only one of the two reaches its threshold, i.e. for $m<1$ where $m y/Q < e < y/Q$.
  • The "sensitivity" term uses a threshold in the current logL computation, but not in the gradient computation.
KrisThielemans self-assigned this Jul 12, 2024
KrisThielemans commented Jul 12, 2024

Note that in most cases, $m$ is the product of attenuation factors (of the order 0.001 to 1) and "efficiencies", which for most scanners (without calibration) are around 1, so $m$ is likely less than 1. This also means that the threshold (on $e$) in the gradient calculation is currently lower than $y/10000$, and will probably never be reached, which is good. (The logL calculation will then not correspond to the gradient for such $y$ at slightly larger $e$, but these are still small values.) However, as soon as we include duration (or calibration factors, whose values I'm not sure about), this will no longer be the case.

Side note: the code has separate handling for the $y=0$ case (or actually, $y \leq \max(\mathrm{viewgram}) \cdot 0.000001$), in which case the log term gets dropped, as required.

Current impact estimation:

  • any algorithm that doesn't use the value of the log-likelihood is not affected by this discrepancy
  • as long as $m \leq 1$ (i.e. "normalisation factors" are around 1 or higher), the gradient threshold is unlikely to be reached, and definitely not when $b>0$. In any case, it only matters where $y>0$.

Nevertheless, these thresholds should be consistent, and in any case are probably surprising to most people (as they are not documented). Also, the current gradient strategy is sensitive to the scale of $m$, which is dangerous.

KrisThielemans commented:

After discussion with @mehrhardt, the proposed strategy in the first instance is:

  • make the threshold configurable
    • allow switching it off, in which case logL or its gradient could be undefined, NaN or infinite
    • or increase $Q$ such that it is less likely to be hit
  • use the threshold in terms of $e$, not $f$ (this implies passing $m$ to divide_and_truncate)
  • document the behaviour

Things to consider:

  • use soft thresholding to keep things differentiable
  • move thresholding to the algorithm, as opposed to the objective function.

KrisThielemans changed the title from "inconsistency between Poisson loglikelihood value and gradient when using multplicative factors due to thresholding" to "inconsistency between Poisson loglikelihood value and gradient when using multiplicative factors due to thresholding" on Jul 12, 2024
KrisThielemans commented Jul 12, 2024

There's another possible choice for the objective function:

$ \log L = \sum_i y_i \log ( T(e_i,y_i) ) - e_i$ [5]

This choice has a few advantages (and is also preferred by @mehrhardt):

  • It could be more natural to only handle problems in the log term
  • It allows computing the latter term via the sensitivity (it is equivalent to $\sum_v \mathrm{sensitivity}_v \lambda_v$). Otherwise, at least in principle, we would have to take into account the bins where the threshold was applied. Aside from being very messy, [1] is also virtually impossible to implement in the list-mode objective function.

After more discussions with @NicoleJurjew, two other things came up:

  • if the threshold depends on $y$, it cannot be reproduced in the list-mode objective function
  • to save time/memory, the gradient calculation can be done without the "sensitivity" term (this is called add_sensitivity=true, as it computes the gradient + sensitivity). In that case, $m$ is not passed to the above function (see here). That means that we cannot use a threshold strategy that depends on $m$ for the gradient.

Proposal

For $f= G\lambda+a$, implement

$ \log L = \sum_i y_i \log ( \max(f_i,\epsilon) ) - m_i f_i$ [6]

with $\epsilon$ a user parameter, hopefully defaulting to 0. Note that this will NOT be backwards compatible for ProjData with either the current log-likelihood or the gradient computation, whenever the thresholds are reached (but this should happen really infrequently). Note that even without thresholds, [6] differs from [1] by a global constant (depending on $m$), but it is compatible with the list-mode data objective function value.
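
A minimal per-bin sketch of [6] (hypothetical names, not the actual STIR interface), with the log term thresholded on $f$ only, so that the linear term can still be obtained from the sensitivity:

#include <algorithm>
#include <cmath>
#include <cstdio>

// per-bin contribution to [6]: y*log(max(f, eps)) - m*f, dropping the log term for y == 0
double logL_term_proposal(double f, double m, double y, double eps)
{
  return (y > 0 ? y * std::log(std::max(f, eps)) : 0.) - m * f;
}

int main()
{
  std::printf("%g\n", logL_term_proposal(0.01, 0.5, 3., 1e-6));
}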

@mehrhardt @gschramm any comments?

KrisThielemans commented Jul 12, 2024

Other things noted while investigating all this:

  • divide_and_truncate allows accumulating the "log-likelihood", but it is wrong unless the measured data contain $m$.
  • the list-mode objective function calculation seems to miss an else for the non-threshold case.

gschramm commented:

To me, [5] / [6] with a default eps = 0 feels most "natural".

How often does the "log(0)" problem occur in real PET/SPECT systems, where there should always be a finite amount of contamination (except for SPECT scans without scatter)?

KrisThielemans commented Jul 22, 2024

How often does the "log(0)" problem occur in real PET/SPECT systems, where there should always be a finite amount of contamination (except for SPECT scans without scatter)?

In "traditional" algorithms without precorrections, and the background model is somewhat accurate, then I don't think it should ever occur. However, there are a few cases where it could occur

  • people do not always do the "proper" things, for example:
    • NAC reconstructions without background
    • reconstructions with a noisy "background" estimate
    • reconstruction of precorrected data (still very often done in SPECT where scatter is precorrected (badly))
    • initialisation with an image that has too small support (i.e. is zero where it shouldn't). This can also happen with OSEM (where a subset sets a voxel to zero when it shouldn't be).
  • step-size "overshoot", e.g. in (somewhat careless) line-searches
  • algorithms that explicitly allow negative voxel values, see some refs in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5854483/. Depending on algorithm/implementation, these thresholds might be useful or not.

The challenge for STIR is that most people expect it to "just work" even if they throw weird stuff at it...

gschramm commented Jul 22, 2024

I see. I think some negative pixels, as long as the forward model stays positive, are ok.

Wouldn't it make more sense to return infinity as soon as a single bin in the expectation is <= 0?
Of course that would require checks slowing things down and we would lose differentiability.

KrisThielemans commented:

I think some negative pixels, as long as the forward model stays positive, are ok.

I don't disagree, but that's a different optimisation problem from what most people will expect.

Wouldn't it make more sense to return infinity as soon as a single bin in the expectation is <= 0? Of course that would require checks slowing things down and we would lose differentiability.

Not all architectures can represent infinity (although these days, that's possibly less of a worry, as the IEEE floating point standard is now very widely adopted and I see that even CUDA represents infinity).

Clearly, the above thresholding strategies all require checks and will slow things down, and indeed the proposed function is not differentiable. An alternative is to just not do any checks, and let the result be undefined. In my experience, that throws up problems very quickly though.

KrisThielemans commented Jul 22, 2024

Summarising the proposal: for a user-configurable eps (a number)

  • compute f = G.forward_project(x) + a and S=G.back_project(m)
  • value: logL = sum((y>0 ? y*log(max(f,eps)) : 0) - m*f) = sum(y>0 ? y*log(max(f,eps)) : 0) - x · S
  • gradient: G.back_project((y>0 && f >= eps ? y/f : 0) - m) = G.back_project((y>0 && f >= eps ? y/f : 0)) - S
  • Hessian * z: G.back_project( (y>0 && f >= eps ? -y/f^2 : 0) * G.forward(z))
  • approximate Hessian * z (legacy code that uses $P^T {1 \over y} P z$): G.back_project( (y>0 ? 1/y : 0) * G.forward(z))

All of these can be computed by summing over list-mode events, aside from the "approximate" version, which is essentially not available (unless you histogram first).

There's a corner case where $y_i>0$ and $f_i \leq eps$: this will be undefined if $eps \leq 0$, although IEEE-fp-compatible architectures will give infinity for value, gradient and Hessian when $eps=0$.

Obviously, once infinity is returned, ugly things will happen. The only generic solution for that is to set eps>0.
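
A per-bin sketch of the quantities in the list above (hypothetical names, not STIR code; the back-projections and the sensitivity image S are left out):

#include <cmath>
#include <cstdio>

// factor that is back-projected with G for the gradient; the -m (i.e. -S) part is handled separately
double grad_factor(double f, double y, double eps)
{
  return (y > 0 && f >= eps) ? y / f : 0.;
}

// factor multiplying G.forward(z) before back-projection for Hessian * z
double hessian_factor(double f, double y, double eps)
{
  return (y > 0 && f >= eps) ? -y / (f * f) : 0.;
}

// factor for the "approximate" Hessian * z (only available after histogramming)
double approx_hessian_factor(double y)
{
  return (y > 0) ? 1. / y : 0.;
}

int main()
{
  std::printf("%g %g %g\n", grad_factor(0.01, 3., 0.), hessian_factor(0.01, 3., 0.), approx_hessian_factor(3.));
}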

KrisThielemans commented:

Actually, I see now that eps=0 has essentially no benefit, except that it changes behaviour when $f_i<0$ (which will not happen in most circumstances). I suppose this means we should only do the checks on f when eps>0 (or when verbosity>0).

KrisThielemans changed the title from "inconsistency between Poisson loglikelihood value and gradient when using multiplicative factors due to thresholding" to "inconsistency between Poisson loglikelihood value, gradient (when using multiplicative factors) and Hessian due to thresholding" on Jul 25, 2024
gschramm commented:

@KrisThielemans another possibility that came to my mind yesterday.

$$ \log L = \sum_i (y_i + \epsilon) \log (e_i + \epsilon) - (e_i + \epsilon) $$

with $\epsilon$ a small positive number.

I think the only problem is that this cannot be evaluated in listmode :-(
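
For reference, a per-bin sketch of this alternative (hypothetical name, not STIR code); it stays finite for any $e \geq 0$ as long as $\epsilon > 0$, but it needs the bin-wise $e_i$, hence the list-mode problem:

#include <cmath>
#include <cstdio>

double logL_term_shifted(double e, double y, double eps)
{
  return (y + eps) * std::log(e + eps) - (e + eps);
}

int main()
{
  std::printf("%g\n", logL_term_shifted(0., 3., 1e-6));
}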

KrisThielemans commented:

I think the only problem is that this cannot be evaluated in listmode :-(

yeah... I prefer a strategy where results are identical in both, at least in principle. It simplifies testing as well!
