
No segmentation of first and last slice in the ground truth #22

Open
rohanbanerjee opened this issue Oct 4, 2023 · 14 comments

@rohanbanerjee
Collaborator

Opening this issue based on the comment here: #21 (comment)

Find a strategy either in pre-processing/training/post-processing to make sure that this does not reflect in the performance of the segmentation model.

@rohanbanerjee rohanbanerjee changed the title No segmentation of first and last slice in the ground truths No segmentation of first and last slice in the ground truth Oct 4, 2023
@MerveKaptan
Collaborator

Hi @rohanbanerjee ,

Just FYI, I saw this in some other datasets as well (not only Stanford rest).

I was thinking that we could crop out the slices that are not segmented before feeding the data into the segmentation model. Let me know what you think!

@jcohenadad
Member

I was thinking that we could crop out the slices that are not segmented before feeding the data into the segmentation model. Let me know what you think!

why not add the missing segmentation?

@MerveKaptan
Collaborator

Because those slices are usually problematic and were not used in further analysis, drawing a mask would be difficult or, in some cases, impossible. This may be due to signal drop-outs (I assume); more importantly, in the case of the Stanford and Weber datasets, we do not use the first and last slices because of 3D motion correction.

@jcohenadad
Member

Because those slices are usually problematic and were not used in further analysis, drawing a mask would be difficult or, in some cases, impossible. This may be due to signal drop-outs (I assume); more importantly, in the case of the Stanford and Weber datasets, we do not use the first and last slices because of 3D motion correction.

In the examples shown, the cord is pretty visible (at least as well as in the other slices), so I don't see why we should not create a mask in those slices as well.

About the argument of not using those slices for processing: that is a decision that comes later in the pipeline. In general, it is not advised to act on one step of the preprocessing pipeline because of a choice made later at the statistics stage, such as dropping one slice. What if you change the analysis and drop two slices instead, or none? That would imply revising the entire preprocessing pipeline, which is (i) cumbersome and (ii) not replicable (eg in the case of a shared dataset with derivatives), so I would not encourage this practice. Instead, it is recommended to remove the slices at the stat/processing stage, eg, by only selecting all slices minus the edge ones. I hope that makes sense.
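To make this concrete, here is a minimal sketch of the idea; the mask file name, the `n_edge_slices` parameter, and the voxel-count CSA proxy are hypothetical, purely for illustration, not part of the actual pipeline:

```python
# Hypothetical sketch: the exclusion is a parameter of the analysis,
# so the ground-truth masks on disk are never modified.
import nibabel as nib

n_edge_slices = 1  # change to 2 or 0 without touching the derivatives

seg = nib.load("sub-01_seg.nii.gz").get_fdata()  # hypothetical file name
n_slices = seg.shape[2]  # assuming slices along the third (I-S) axis

# Select all slices minus the edge ones for downstream statistics
kept = range(n_edge_slices, n_slices - n_edge_slices)
area_per_slice = {z: float(seg[:, :, z].sum()) for z in kept}  # voxel count as a CSA proxy
```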

@MerveKaptan
Collaborator

MerveKaptan commented Oct 9, 2023

Hi @jcohenadad,

Thank you so much! I am not sure I understand the problem with excluding the slices, sorry about that. I would understand if it were a decision made because those slices do not look optimal.
On the other hand, as far as I know, in some of the datasets (such as Stanford), the decision to exclude the top and bottom slices was not subjective. It is because of the motion correction strategies (2D and 3D) that we employ that we cannot use those slices (top and bottom). For instance, if one looks at the time series after motion correction, those slices (top and bottom) move all over the place, and when we take a temporal mean of them, it will not be accurate.

That being said, I would like to ask @kennethaweberii for his input - I may be wrong about this!

@jcohenadad
Member

Thank you so much! I am not sure I understand the problem with excluding the slices, sorry about that. I would understand if it were a decision made because those slices do not look optimal.
On the other hand, as far as I know, in some of the datasets (such as Stanford), the decision to exclude the top and bottom slices was not subjective. It is because of the motion correction strategies (2D and 3D) that we employ that we cannot use those slices (top and bottom). For instance, if one looks at the time series after motion correction, those slices (top and bottom) move all over the place, and when we take a temporal mean of them, it will not be accurate.

Let me try to re-explain. I am not criticizing the act of removing the top/bottom slices; this is perfectly justified. What I am criticizing is where in the pipeline the slices are removed. My suggestion is to remove them after preprocessing, so you have more flexibility to try alternative pipelines without removing those slices, or removing two slices instead of one, etc. The philosophy is to act on a config file as opposed to manually modifying a mask; the latter gives you less flexibility down the line in your pipeline.

So, all I'm saying is:

  • let's keep the segmentation of the top/bottom slices so that the DL segmentation model can learn better
  • it's OK to remove the slices in your analysis, but do it programmatically instead of by manipulating the segmentation mask (see the sketch below)

I hope this is clearer.
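As an illustration of the config-file idea (the file keys and function name are made up, not from any existing tool), slice exclusion could live in a small JSON config that the analysis reads:

```python
# Minimal sketch, assuming a hypothetical analysis config; editing these
# values changes the analysis without touching any segmentation mask.
import json

cfg = json.loads('{"n_exclude_bottom": 1, "n_exclude_top": 1}')  # would normally live in a file

def slices_for_analysis(n_slices: int, cfg: dict) -> list:
    """Return the slice indices to keep, given the exclusion config."""
    return list(range(cfg["n_exclude_bottom"], n_slices - cfg["n_exclude_top"]))

print(slices_for_analysis(12, cfg))  # [1, 2, ..., 10]
```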

@kennethaweberii
Collaborator

Hello @jcohenadad and Team,

I am just adding to what @MerveKaptan stated. In our previous analysis pipelines, our first step of motion correction was 3D and allowed translation along the Z-axis. While we may be able to segment the top and bottom slices on the mean image, the top and bottom slices may have missing data across the timeseries depending on the magnitude of translation along the Z-axis. This is the main reason we exclude these slices at this step. We are not excluding the top and bottom slices due to poor image quality.

Reviewers have been critical about our 3D motion correction step, especially with highly anisotropic voxels (1 mm × 1 mm × 5 mm), and I am currently only doing slicewise motion correction going forward.

That said, there are several datasets from multiple sites that exclude the top and bottom slices (sometimes multiple top and bottom slices) from the ground truth, and I do not think it is feasible to go back and segment these slices. Also, if we want to use the timeseries data (20 volumes), there may be missing data in the top and bottom slices across the timeseries, which is another reason to exclude these slices.

I would recommend training only on slices with the ground truth present and calculating the segmentation performance metrics only on slices with the ground truth present. Besides excluding some of the data for training the model, I do not see another major limitation to this approach.
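A hedged sketch of this proposal (the function and the axis convention are assumptions, not the project's actual evaluation code): Dice computed only on slices where the ground truth contains labels.

```python
# Sketch: restrict the Dice computation to slices that have ground truth.
# Assumes binary 3D arrays with slices along axis 2.
import numpy as np

def dice_on_labeled_slices(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice score over the slices with at least one ground-truth voxel."""
    labeled = [z for z in range(gt.shape[2]) if gt[:, :, z].any()]
    p = pred[:, :, labeled].astype(bool)
    g = gt[:, :, labeled].astype(bool)
    denom = p.sum() + g.sum()
    return 2.0 * np.logical_and(p, g).sum() / denom if denom else 1.0
```

The same per-slice filter could, in principle, select the training samples.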

@jcohenadad
Member

Besides excluding some of the data for training the model, I do not see another major limitation to this approach.

The problem is during cross-validation of model performance: at test time, the output segmentation will be compared against the ground truth, resulting in apparent false positive voxels (whereas, in fact, these would be true positives).

When dealing with multiple datasets from various centers, we need to have some sort of harmonization in terms of ground truth creation protocols, otherwise we end up with case-by-case scenarios, which in my experience is prone to human error.

Given that we have been reviewing some of the segmentations, why not add the top/bottom slices only for the purpose of the SC segmentation project, and down the line, if you'd rather exclude those slices for analysis purposes, you can do so programmatically?

@kennethaweberii
Collaborator

@jcohenadad I understand and appreciate your concerns. A few follow-up comments:

I agree that having a ground truth creation protocol would have been ideal, but re-segmenting the datasets would require significant effort and cost from collaborators, which may be too much for some. We could re-segment our datasets based on a protocol, but this would require 3-6 months to complete because we have other higher priority projects. While less ideal, the current approach of using previously segmented images is more practical. It was a significant effort for the collaborators to organize and share their data. I do not believe many would be willing to re-segment their images for a single project. I could be wrong.

I understand that the trained model will segment all slices of the image, and for the images that do not have the top and bottom slices segmented, this would lead to a greater false positive rate compared to the ground truth if the segmentation metrics were calculated across the entire image (all slices); however, the segmentation metrics could just be calculated on those slices with a ground truth. Do we need to calculate them across the entire image?

We could add the top and bottom slices on our end, but this would most likely need to be done by a different rater than the one who segmented the original dataset, which does not feel ideal; still, we could do this.

Training and computing the segmentation metrics on only slices with a ground truth seems like the most straightforward and efficient solution to me, but I am of course open to your suggestions.

@jcohenadad
Member

I agree that having a ground truth creation protocol would have been ideal, but re-segmenting the datasets would require significant effort and cost from collaborators, which may be too much for some. We could re-segment our datasets based on a protocol, but this would require 3-6 months to complete because we have other higher priority projects. While less ideal, the current approach of using previously segmented images is more practical. It was a significant effort for the collaborators to organize and share their data. I do not believe many would be willing to re-segment their images for a single project. I could be wrong.

@rohanbanerjee started to 'fix' some of the obvious issues with segmentation (eg #13), and we hope to rely on an active learning approach after applying other models in the future (eg: https://github.com/sct-pipeline/contrast-agnostic-softseg-spinalcord). I believe we could review/fix all segmentations within one week with an efficient modus operandi. And these would all be modified within a git-annex branch for full transparency with the collaborating teams.

I understand that the trained model will segment all slices of the image, and for the images that do not have the top and bottom slices segmented, this would lead to a greater false positive rate compared to the ground truth if the segmentation metrics were calculated across the entire image (all slices); however, the segmentation metrics could just be calculated on those slices with a ground truth. Do we need to calculate them across the entire image?

It is of course possible, but we would need to specify which datasets have top/bottom slices missing, which introduces arbitrary measures into the evaluation (hence the human error component I mentioned earlier). Also, a more fundamental issue: computing Dice scores only on slices with a ground truth would introduce a bias during model training/evaluation. If we only train with data having a segmented spinal cord, we will end up with a segmentation model that always tries to find a spinal cord, even if there is none present (eg: below the cauda equina, or in areas of strong signal dropout where segmenting a cord would not be reasonable). Our methodology needs to be scalable to additional datasets where the cord is not present in some slices (as is currently the case with the Zurich dataset, for example).

I understand your concerns about asking all contributors to re-do the segmentations, but as I mentioned above, I'm happy to ask my team to do it.

@kennethaweberii
Collaborator

It is of course possible, but we would need to specify which datasets have top/bottom slices missing, which introduces arbitrary measures into the evaluation (hence the human error component I mentioned earlier). Also, a more fundamental issue: computing Dice scores only on slices with a ground truth would introduce a bias during model training/evaluation. If we only train with data having a segmented spinal cord, we will end up with a segmentation model that always tries to find a spinal cord, even if there is none present (eg: below the cauda equina, or in areas of strong signal dropout where segmenting a cord would not be reasonable). Our methodology needs to be scalable to additional datasets where the cord is not present in some slices (as is currently the case with the Zurich dataset, for example).

@jcohenadad Great point regarding the cauda equina and areas of signal dropout. My suggestion was a little myopic and cervical cord centric. I agree with this plan. Thanks for the discussion and thorough replies. :D

@MerveKaptan
Collaborator

Dear all,

Thank you for the discussion! @jcohenadad Please let me know how I can contribute and help :)

Best,
Merve

@jcohenadad
Member

Thank you for the discussion! @jcohenadad Please let me know how I can contribute and help :)

Let's do a Zoom call after the ISMRM deadline / SC workshop, so we are all on the same page.

@MerveKaptan
Collaborator

Thank you for the discussion! @jcohenadad Please let me know how I can contribute and help :)

Let's do a Zoom call after the ISMRM deadline / SC workshop, so we are all on the same page.

Okay, that would be great, thank you very much!
