
Request for Guidance on Discrepancy in Evaluation Results for Trackastra #23

Open
1595813520 opened this issue Jan 15, 2025 · 1 comment

@1595813520

Dear Author,

I hope this message finds you well! First of all, thank you very much for the valuable contribution this paper makes. I have encountered a problem while reproducing the experiments and would appreciate your guidance.

Specifically, when I evaluated Trackastra on DynamicNuclearNet, I noticed discrepancies between my results and the data presented in your paper. In particular, I trained a Trackastra model on ground-truth segmentation with a 91:27:12 train/validation/test split, but its evaluation score was significantly lower than the result in Table 2 of your paper ("Tracking results on DeepCell"). Below is a screenshot of the evaluation result I obtained:

[Screenshot: CTC evaluation results on the DeepCell test set]

The first two rows show the testing results of the model weights I trained on DeepCell, while the last two rows show the testing results using the pre-trained general-2d weights.

To investigate further, I suspected there might be an issue with my training setup. Therefore, I also ran the out-of-domain evaluation on Fluo-N2DL-HeLa (since the Fluo-N2DL-HeLa test set does not have ground-truth annotations, I used the training set of the same size for evaluation). The results are shown below:

[Screenshot: CTC evaluation results on Fluo-N2DL-HeLa]

Again, the first two rows show the results of my own trained weights, while the last two rows show the results using the pre-trained general-2d weights. Although the scores from the two sets of weights are similar, they still differ slightly from the results presented in Table 3 of your paper ("Out-of-domain results on Hela"). This suggests that my training process may not be the issue, since I followed the methodology presented in the paper.

However, despite extensive investigation, I have been unable to pinpoint the cause of the discrepancy. Could you kindly offer any insights into what might be causing this difference in evaluation results compared to the tables in your paper?

Below is the evaluation code I used for your reference:

import os
from pathlib import Path

import numpy as np
import tifffile
import torch

from trackastra.model import Trackastra
from trackastra.tracking import graph_to_ctc

# device was not defined in the original snippet; assuming GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"

root = Path("/data/trackastra/data/deepcell/test")
idx = "001"

image_dir = root / idx
seg_dir = root / f"{idx}_GT" / "SEG"
output_dir = root / f"{idx}_RES"
os.makedirs(output_dir, exist_ok=True)

image_files = sorted(image_dir.glob("*.tif"))    # t000.tif, t001.tif, ...
seg_files = sorted(seg_dir.glob("*.tif"))        # man_seg000.tif, man_seg001.tif, ...

# Load raw images and ground-truth segmentation masks as (T, Y, X) stacks
imgs = np.array([tifffile.imread(str(f)) for f in image_files])
masks = np.array([tifffile.imread(str(f)) for f in seg_files])

# Model weights I trained on DeepCell
model_path = "/data/trackastra/scripts/runs/2024-12-28_14-01-30_example/"
model = Trackastra.from_folder(model_path, device=device)

# Greedy linking, then export of the tracks in CTC format
track_graph = model.track(imgs, masks, mode="greedy")

ctc_tracks, masks_tracked = graph_to_ctc(
    track_graph,
    masks,
    outdir=output_dir,
)

And I used the traccuracy library to calculate the CTC score as follows:

import pprint

from traccuracy import run_metrics
from traccuracy.loaders import load_ctc_data
from traccuracy.matchers import CTCMatcher
from traccuracy.metrics import CTCMetrics, DivisionMetrics

pp = pprint.PrettyPrinter(indent=4)

# Ground truth in CTC format
gt_data = load_ctc_data(
    '/data/trackastra/data/deepcell/test/001_GT/TRA',
    '/data/trackastra/data/deepcell/test/001_GT/TRA/man_track.txt',
)

# Tracking results exported by graph_to_ctc above
pred_data = load_ctc_data(
    '/data/trackastra/data/deepcell/test/001_RES',
    '/data/trackastra/data/deepcell/test/001_RES/man_track.txt',
)

ctc_results = run_metrics(
    gt_data=gt_data,
    pred_data=pred_data,
    matcher=CTCMatcher(),
    metrics=[
        CTCMetrics(),
        DivisionMetrics(),
    ],
)

pp.pprint(ctc_results)

I sincerely appreciate your time and attention to this matter; any insights or guidance you can provide would be most welcome.

Thank you again for your help!

@bentaculum
Member

Hi @1595813520,

thanks for pointing out these issues.

I am able to reproduce the AOGM results you report on Fluo-N2DL-HeLa with the public general_2d model when I use only the silver truth 0x_ST/SEG folder as masks.

To clarify, in the paper we used the element-wise maximum of 0x_ST/SEG and 0x_GT/TRA as masks for both training and evaluation (see here). The reason is that the silver truth is missing some detections, which are marked with small disks in the gold truth and which the maximum operator therefore brings back in. Leaving these detections out inflates the AOGM score, since false-negative detections are weighted by a factor of 10 in the default AOGM.
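For illustration, a minimal sketch of how such combined masks could be built; the paths and the assumption that the sorted ST/SEG and GT/TRA file lists line up frame by frame are mine, not the exact script used in the paper:

from pathlib import Path

import numpy as np
import tifffile

# Hypothetical paths for one Fluo-N2DL-HeLa video; adjust to your data layout.
st_dir = Path("/data/ctc/Fluo-N2DL-HeLa/01_ST/SEG")   # silver-truth segmentation
gt_dir = Path("/data/ctc/Fluo-N2DL-HeLa/01_GT/TRA")   # gold-truth tracking markers

st_files = sorted(st_dir.glob("*.tif"))   # man_seg000.tif, man_seg001.tif, ...
gt_files = sorted(gt_dir.glob("*.tif"))   # man_track000.tif, man_track001.tif, ...

# Element-wise maximum keeps every silver-truth mask and additionally picks up
# the small gold-truth marker disks where the silver truth has no detection
# (background is 0 in both label images).
masks = np.array([
    np.maximum(tifffile.imread(str(st)), tifffile.imread(str(gt)))
    for st, gt in zip(st_files, gt_files)
])

These combined masks would then be passed to model.track(...) in place of the plain SEG masks.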

With the current PyPI versions of Trackastra and Traccuracy, I obtain AOGM scores similar to the ones reported in the paper (e.g. AOGM 163.5 for HeLa training videos 1&2 with our general_2d model and greedy linking), admittedly with a small performance gap at the moment.

Hope this helps.
