
Request for Guidance on Discrepancy in Evaluation Results for Trackastra #23

Open
1595813520 opened this issue Jan 15, 2025 · 1 comment

@1595813520

Dear Author,

I hope this message finds you well! First of all, thank you very much for the valuable contribution this paper makes. I have encountered a problem while reproducing the experiments and would appreciate your guidance.

Specifically, when I evaluated Trackastra on DynamicNuclearNet, I noticed discrepancies between my results and the data presented in your paper. In particular, I trained a Trackastra model on ground-truth segmentation with a 91:27:12 train/validation/test split, but its evaluation score was significantly lower than the result in Table 2 of your paper ("Tracking results on DeepCell"). Below is a screenshot of the evaluation result I obtained:

[Screenshot: CTC evaluation results on the DeepCell test set]

The first two rows show the testing results of the model weights I trained on DeepCell, while the last two rows show the testing results using the pre-trained general-2d weights.

To investigate further, I suspected there might be an issue with my training setup. Therefore, I also ran the out-of-domain evaluation on Fluo-N2DL-HeLa (since the Fluo-N2DL-HeLa test set does not have ground-truth annotations, I used the training set of the same size for evaluation). The results are shown below:

[Screenshot: CTC evaluation results on Fluo-N2DL-HeLa]

Again, the first two rows show the results of my own trained weights, while the last two rows show the results using the pre-trained general-2d weights. Although the scores from the two sets of weights are similar, they still differ slightly from the results presented in Table 3 of your paper ("Out-of-domain results on Hela"). This suggests that my training process may not be the issue, since I followed the methodology presented in the paper.

However, despite extensive investigation, I have been unable to pinpoint the cause of the discrepancy. Could you kindly offer any insights into what might be causing this difference in evaluation results compared to the tables in your paper?

Below is the evaluation code I used for your reference:

import os
from pathlib import Path

import numpy as np
import tifffile
import torch

from trackastra.model import Trackastra
from trackastra.tracking import graph_to_ctc

# device was not defined in the original snippet; assuming GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"

root = Path("/data/trackastra/data/deepcell/test")
idx = "001"

image_dir = root / idx
seg_dir = root / f"{idx}_GT" / "SEG"
output_dir = root / f"{idx}_RES"
os.makedirs(output_dir, exist_ok=True)

image_files = sorted(image_dir.glob("*.tif"))    # t000.tif, t001.tif, ...
seg_files = sorted(seg_dir.glob("*.tif"))        # man_seg000.tif, man_seg001.tif, ...

# Load raw images and ground-truth segmentation masks as (T, Y, X) stacks
imgs = np.array([tifffile.imread(str(f)) for f in image_files])
masks = np.array([tifffile.imread(str(f)) for f in seg_files])

# Model weights I trained on DeepCell
model_path = "/data/trackastra/scripts/runs/2024-12-28_14-01-30_example/"
model = Trackastra.from_folder(model_path, device=device)

# Greedy linking, then export of the tracks in CTC format
track_graph = model.track(imgs, masks, mode="greedy")

ctc_tracks, masks_tracked = graph_to_ctc(
    track_graph,
    masks,
    outdir=output_dir,
)

And I used the traccuracy library to calculate the CTC score as follows:

import pprint

from traccuracy import run_metrics
from traccuracy.loaders import load_ctc_data
from traccuracy.matchers import CTCMatcher
from traccuracy.metrics import CTCMetrics, DivisionMetrics

pp = pprint.PrettyPrinter(indent=4)

# Ground truth in CTC format
gt_data = load_ctc_data(
    '/data/trackastra/data/deepcell/test/001_GT/TRA',
    '/data/trackastra/data/deepcell/test/001_GT/TRA/man_track.txt',
)

# Tracking results exported by graph_to_ctc above
pred_data = load_ctc_data(
    '/data/trackastra/data/deepcell/test/001_RES',
    '/data/trackastra/data/deepcell/test/001_RES/man_track.txt',
)

ctc_results = run_metrics(
    gt_data=gt_data,
    pred_data=pred_data,
    matcher=CTCMatcher(),
    metrics=[
        CTCMetrics(),
        DivisionMetrics(),
    ],
)

pp.pprint(ctc_results)

I sincerely appreciate your time and attention to this matter; any insights or guidance you can provide would be most welcome.

Thank you again for your help!

@bentaculum
Member

Hi @1595813520,

thanks for pointing out these issues.

I am able to reproduce the AOGM results you report on Fluo-N2DL-HeLa with the public general_2d model when I use only the silver truth 0x_ST/SEG folder as masks.

To clarify, in the paper we used the element-wise maximum of 0x_ST/SEG and 0x_GT/TRA as masks for both training and evaluation (see here). The reason is that the silver truth is missing some detections, which are marked with small disks in the gold truth and which the maximum operator therefore brings back in. Leaving these detections out inflates the AOGM score, since false-negative detections are weighted by a factor of 10 in the default AOGM.
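For illustration, a minimal sketch of how such combined masks could be built; the paths and the assumption that the sorted ST/SEG and GT/TRA file lists line up frame by frame are mine, not the exact script used in the paper:

from pathlib import Path

import numpy as np
import tifffile

# Hypothetical paths for one Fluo-N2DL-HeLa video; adjust to your data layout.
st_dir = Path("/data/ctc/Fluo-N2DL-HeLa/01_ST/SEG")   # silver-truth segmentation
gt_dir = Path("/data/ctc/Fluo-N2DL-HeLa/01_GT/TRA")   # gold-truth tracking markers

st_files = sorted(st_dir.glob("*.tif"))   # man_seg000.tif, man_seg001.tif, ...
gt_files = sorted(gt_dir.glob("*.tif"))   # man_track000.tif, man_track001.tif, ...

# Element-wise maximum keeps every silver-truth mask and additionally picks up
# the small gold-truth marker disks where the silver truth has no detection
# (background is 0 in both label images).
masks = np.array([
    np.maximum(tifffile.imread(str(st)), tifffile.imread(str(gt)))
    for st, gt in zip(st_files, gt_files)
])

These combined masks would then be passed to model.track(...) in place of the plain SEG masks.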

With the current PyPI versions of Trackastra and Traccuracy, I obtain AOGM scores similar to the ones reported in the paper (e.g. AOGM 163.5 for HeLa training videos 1&2 with our general_2d model and greedy linking), admittedly with a small performance gap at the moment.

Hope this helps.
