Dear Author,

I hope this message finds you well! First of all, thank you very much for your valuable work on this paper. I have run into a problem while reproducing the experiments and would appreciate your guidance.

Specifically, when I evaluated Trackastra on DynamicNuclearNet, I noticed discrepancies between my results and the numbers presented in your paper. In particular, I trained a Trackastra model using the ground truth segmentation with a 91:27:12 train/validation/test split, but the evaluation score was significantly lower than the result in Table 2 of your paper ("Tracking results on DeepCell"). Below is a screenshot of the evaluation result I obtained:
The first two rows show the test results of the model weights I trained on DeepCell, while the last two rows show the results obtained with the pre-trained general_2d weights.
To investigate further, I suspected there might be an issue with my training setup, so I also evaluated the out-of-domain setting on HeLa (since the Fluo-N2DL-HeLa test set does not have ground truth annotations, I used the training set, which is of the same size, for evaluation). The results are shown below:

Again, the first two rows show the results of my own trained weights, while the last two rows show the results with the pre-trained general_2d weights. Although the scores from both evaluations are similar, they still differ slightly from the results presented in Table 3 of your paper ("Out-of-domain results on HeLa"). This suggests that my training process may not be the issue, since I followed the methodology described in the paper.
However, despite extensive investigation, I have been unable to pinpoint the cause of the discrepancy. Could you kindly offer any insights into what might be causing this difference from the tables in your paper? I sincerely appreciate your time and attention to this matter, and any guidance you can provide would be greatly appreciated. Thank you again for your help!

For reference, below is the evaluation code I used; the CTC scores are calculated with the traccuracy library:
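In outline it does the following (this is a simplified sketch rather than my exact script; the dataset paths, the mask folder, and the model directory are placeholders):

```python
import numpy as np
import tifffile
import torch
from pathlib import Path

from trackastra.model import Trackastra
from trackastra.tracking import graph_to_ctc
from traccuracy import run_metrics
from traccuracy.loaders import load_ctc_data
from traccuracy.matchers import CTCMatcher
from traccuracy.metrics import CTCMetrics

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the image and mask time series (placeholder paths).
data_dir = Path("Fluo-N2DL-HeLa/01")
imgs = np.stack([tifffile.imread(p) for p in sorted(data_dir.glob("*.tif"))])
masks = np.stack(
    [tifffile.imread(p) for p in sorted((data_dir.parent / "01_ST/SEG").glob("man_seg*.tif"))]
)

# Either my own trained model ...
# model = Trackastra.from_folder("path/to/my_trained_model", device=device)
# ... or the published pre-trained weights.
model = Trackastra.from_pretrained("general_2d", device=device)

# Track and export the result in Cell Tracking Challenge (CTC) format.
track_graph = model.track(imgs, masks, mode="greedy")
ctc_tracks, masks_tracked = graph_to_ctc(track_graph, masks, outdir="01_RES")

# Score the exported result against the ground truth with traccuracy.
gt = load_ctc_data("Fluo-N2DL-HeLa/01_GT/TRA", "Fluo-N2DL-HeLa/01_GT/TRA/man_track.txt")
pred = load_ctc_data("01_RES", "01_RES/res_track.txt")
results = run_metrics(gt_data=gt, pred_data=pred, matcher=CTCMatcher(), metrics=[CTCMetrics()])
print(results)
```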
I am able to reproduce the AOGM results you report on Fluo-N2DL-HeLa with the public general_2d model when I use only the silver truth 0x_ST/SEG folder as masks.
To clarify: in the paper, we used the maximum of 0x_ST/SEG and 0x_GT/TRA as masks for both training and evaluation (see here). The reason is that the silver truth is missing some detections, which are marked only with small disks in the gold truth; the maximum operator includes these markers. Leaving these detections out inflates the AOGM score, since false-negative detections are weighted by a factor of 10 in the default AOGM.
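For a single frame, the idea is roughly the following (an illustrative sketch only: file names follow the CTC convention and the label conventions of the two folders are assumed to be compatible; the linked helper is the authoritative version):

```python
import numpy as np
import tifffile

# Combine the silver-truth segmentation with the gold-truth TRA markers
# by a pixel-wise maximum: cells that are missing from 0x_ST/SEG remain
# present as the small marker disks from 0x_GT/TRA.
seg = tifffile.imread("01_ST/SEG/man_seg000.tif")    # dense silver-truth masks
tra = tifffile.imread("01_GT/TRA/man_track000.tif")  # sparse gold-truth markers
masks = np.maximum(seg, tra)
```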
With the current PyPI versions of trackastra and traccuracy I am obtaining AOGM scores similar to the ones reported in the paper (e.g. AOGM 163.5 for HeLa training videos 1 and 2 with our general_2d model and greedy linking), admittedly still with a small performance gap at the moment.
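If you want to narrow down the remaining difference, it may help to first record the exact package versions you are running, e.g.:

```python
# Print the installed versions of the two relevant packages,
# so results can be compared against a known combination.
from importlib.metadata import version

for pkg in ("trackastra", "traccuracy"):
    print(pkg, version(pkg))
```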