Standardizing Neuropoly datasets (One more step towards BIDS) #282
Comments
Hey @NathanMolinier! I have to say that the issue description summarizes the BIDS format quite well (much much better than the original documentation, which is hard to understand). Great job! One minor clarification: Are you sure that the
Also, a very important key in this is the
|
Hey @naga-karthik ! Thanks for your comment ! Regarding the
The date field that you added is here not relevant because it could be different for all the images in the derivative folder. |
Nathan's description of the difference is good; basically we can put all the information we want in the json sidecars, but for
|
Thanks, @NathanMolinier, for this initiative! The description is clear and makes sense! 👍🏻
We usually call many different algorithms within a single pipeline (e.g., process_data.sh bash script), for example, Using multiple Also, for the retrospective rearrangement of the existing datasets, we do not always know if SC seg was obtained using |
Thanks @valosekj ! Great questions !
For this workflow, all the files generated using this pipeline can be kept inside the same derivative folder. In this case we will need to add the different functions computed inside the dataset_description.json:
{
    "BIDSVersion": "1.9.0",
    "Name": "<dataset_name>",
    "DatasetType": "derivative",
    "GeneratedBy": [
        {
            "Name": "sct_deepseg_sc",
            "Version": "SCT v6.1"
        },
        {
            "Name": "sct_deepseg_gm",
            "Version": "SCT v6.1"
        },
        {
            "Name": "sct_label_vertebrae",
            "Version": "SCT v6.1"
        },
        {
            "Name": "Manual",
            "Description": "Manually corrected by Nathan Molinier"
        }
    ]
}
Regarding the name of the folder, I believe there is nothing covering that. However, I think we should agree on a naming pattern for the lab datasets.
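As a rough illustration of the JSON above, here is a minimal Python sketch that assembles such a description for a derivative folder. The helper name build_dataset_description and the "example_dataset" name are hypothetical; only the keys (BIDSVersion, Name, DatasetType, GeneratedBy) and the tool entries come from the example in this comment.

```python
import json


def build_dataset_description(name, generated_by, bids_version="1.9.0"):
    """Return a dict mirroring the dataset_description.json shown above.

    `generated_by` is a list of dicts, one per tool or manual step.
    Hypothetical helper: not part of SCT or of any BIDS library.
    """
    for entry in generated_by:
        if "Name" not in entry:
            raise ValueError(f"GeneratedBy entry without a Name: {entry}")
    return {
        "BIDSVersion": bids_version,
        "Name": name,
        "DatasetType": "derivative",
        "GeneratedBy": list(generated_by),
    }


description = build_dataset_description(
    "example_dataset",  # placeholder dataset name
    [
        {"Name": "sct_deepseg_sc", "Version": "SCT v6.1"},
        {"Name": "sct_label_vertebrae", "Version": "SCT v6.1"},
        {"Name": "Manual", "Description": "Manually corrected by Nathan Molinier"},
    ],
)
print(json.dumps(description, indent=4))
```

The printed text could then be saved as dataset_description.json at the root of the derivative folder.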
I understand that this point is a bit problematic. However, if you look at all the code written in the lab, I believe we can identify more ways to import the data than the number of people currently working in the lab. Having a different way to import the data depending on the dataset is not a good thing. I truly believe that standardizing could help us find the "good" way to use/import our data.
Thanks to BIDS standard, this information should be located also in the |
The items in the |
The dataset curation section on our intranet was updated. Feel free to give me some feedback on it @mguaypaq @naga-karthik @valosekj @jcohenadad @sandrinebedard @plbenveniste @Nilser3 |
Hey @NathanMolinier, thank you! |
@Nilser3 please feel free to add it directly, or to suggest the change via a PR |
@NathanMolinier I took the liberty to do this neuropoly/intranet.neuro.polymtl.ca@5666e57, and i'm planning to do the same for the rest of the document-- please let me know if you disagree |
I was hesitant to do that while writing the documentation, but I finally kept the format with details to reduce the overall size of the document. However, I do understand that being able to access specific features using relative links can be interesting. We could also decide to use these titles in combination with the details. |
Regarding this, I need to figure out how we should handle this case. |
You mean, keep the headings (ie: hyperlink) sections, and below the section, use the "details" to make it look shorter? I'm not a big fan of that. It will involve more clicking. Putting myself in the shoes of someone who wants to create a BIDS dataset, they will need to go through all sub-sections, and they can do so easily from the clickable TOC, so why add more clicks? |
Yes it was exactly what I meant. So, it's fine with me if we decide to switch to your version with more headings. |
This is a pretty big deal: In the past we decided to keep the contrast and add the derivatives entities (eg: |
About this: In the past, I vaguely remember we discussed this and agreed to use We also need to keep in mind that the BIDS-derivative standard is changing quite fast, so it is possible that the "standard" will change in 1y from now (hence the pragmatic decision to also allow our internal rules to prevail) |
I'm not sure we should introduce the distinction between seg and label: With our use cases, I see more confusion being introduced by future members, which means more trouble for the core team to maintain the consistency of our database |
I would not introduce these two: Instead, I would use a single suffix |
Regarding the fact that raw suffixes (contrasts in our case) are removed from the original image names and replaced by the
About this last point, I strongly believe that standards such as BIDS are dependent on trust. Put differently, if individuals opt not to adhere entirely to the fundamental principles of a standard, we will ultimately end up with numerous BIDS-like datasets that are incompatible. Regarding our old scripts, and the fact that they may not work on these new datasets, it is important to remember that they were created based on specific versions of the datasets and, thanks to Git, it is always possible to check out an earlier version (commit). Also, I tried to find related conversations on the BIDS repository and I found this issue where people are talking about an entity Finally, the practice of excluding the image suffix during the creation of derived data is not a novel concept. This feature was originally described in the initial version of the documentation addressing derivative datasets (v.1.4.0). |
#282 (comment) these are excellent points, @NathanMolinier |
We still added the suffix |
Beyond thinking between Currently, most of our datasets are private and are only used for our different projects mainly to train and test AI models. However, for reproducibility, all the scripts used for these projects are available on github. Therefore, changing the way we standardize/use our data will impact directly these scripts: Workaround 1:
Workaround 2: see mguaypaq's slides
Workaround 3:
If we decide to go with workaround 1, we will need to clarify within all our scripts that we are not complying with the BIDS derivatives format, and thus people will have to modify our scripts or transform their data to comply with our own custom format. From my point of view, this will make our scripts less likely to be used, and scripts like manual correction will only serve our internal usage. If we decide to go with workaround 3, BIDS users would be more likely to be interested in our scripts. Regarding entities and suffixes, having custom ones will not be as impactful because this could be easily mitigated using special parser flags. I’m not saying that BIDS is perfect (I personally think that suffixes should be avoided), but I think that complying with a standard could positively impact the scope of our research work, because having our own custom standard will only create more barriers with other researchers. |
Thank you for your insights @NathanMolinier . These are solid arguments. For the sake of comprehensiveness, I will copy/paste the minutes from our previous meeting: About the derivative suffix (ie. Workaround 1 vs. 2), here are some additional pros/cons:
About the use of About “blabel”, “dlabel”: We decided to not use them. These are too confusing and unnecessary. Some Examples:
|
cross ref to gslide explaining additional BIDS context |
another meeting on 2023-12-12-- trying to find consensual decision about what to do: About the derivative suffix (ie. Workaround 1 vs. 2 vs. 3):
About the use of probseg: We opt for “softseg”.
About “blabel”, “dlabel”: We decided to not use them. These are too confusing and unnecessary. Some Examples: (someone please continue that list) todo:
|
One last question about our convention: BIDS says that Derivative data obtained using DIFFERENT processes/workflows should be stored using DIFFERENT derivatives folders, example:
The idea is to use the However, I believe we mentioned that we maybe would like to keep only one folder under derivatives ? Solution 1: We use multiple folders to track the processes used to create the data. Solution 2: We decide to regroup all the files under a same derivative dataset called What do we decide ? I just want to properly update the intranet. |
Yeah, I think so too. I.e., "Solution 2":
I believe this solution is also better for bash scripts where we check if a label already exists under |
It also means that we will have to remove the derivative folder |
I would say a mix between solution 1 and solution 2. For most datasets, we only have binary labels. So let's call all of them 'labels' for the reason raised by @valosekj (easier to crawl inside multiple datasets). on the other hand, in some cases, we have software specific, or different types of labels, that we would like to distinguish from (eg: labels_softseg). In those cases, I see no alternative than doing a subfolder per set of labels. |
I would go with Julien's approach!
For the |
IIUC, soft segmentations such as These file name examples are taken from #282 (comment) |
I understand completely, that's why I believe that solution 1 should not be set aside. The contrast agnostic project will not be the last project asking for particular data. In fact, having special data for a specific project is not new: the spinegeneric dataset and canproco do contain labels generated in a different image space than the one used for images; however, the data is still currently stored under |
isn't it what the JSON sidecar is supposed to address? About the canproco: i'm surprised it contains labels that are not in the same voxel space-- @plbenveniste do you have more details about this? |
A note from
This happened because I got inspiration from the spine-generic analysis pipeline. |
Yes also, but people could miss this information from the JSON sidecars, from a human perspective having separate folder may be more visual. Also, I'm not saying that I'm completely against having all the files within the same derivative folder, but I do think that we should not completely avoid solution 1. This could be helpful for other projects like the contrast-agnostic SC segmentation. |
Right, that makes perfect sense. In any case, it looks like we are converging towards a solution where some labels are not in the same space, and the pre-processing (before training) pipeline will take care of putting them back into the appropriate space. |
Labels in a different space than the one used for images will also have a new entity |
Description
This issue will be used to centralize the main discussions regarding our standardization strategies. The ultimate objective is to enable the execution of BIDS validator scripts within the derivatives folders of each dataset in our collection.
BIDS conventions
1. Important links
BIDS specifications
Neuropoly intranet
2. Important points regarding the raw dataset for MRI
As you may know, subject folders in the raw datasets are structured as follows for MRI:
Raw structure
With folders corresponding to subjects, [sessions] and MRI modalities.
Then, regarding the way filenames are constructed, we can identify 3 main types of elements:
Raw entities
Characterized by a key word (sub, ses, acq, etc.) and a value (label = an alphanumeric value, index = a nonnegative integer, etc) separated with a dash -
sub-<label>
[ses-<label>]
[acq-<label>]
[ce-<label>]
[rec-<label>]
[run-<index>]
[part-<mag|phase|real|imag>]
[dir-<label>]
Entities are then separated with an underscore _
Raw suffixes
An alphanumeric string located after all the entities following a final underscore _ (i.e. the <suffix>), corresponding in our case to the MRI contrast:
T1w
MP2RAGE
dwi
Raw extensions
Files extensions:
.nii.gz
.json
.bval
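To make the three elements concrete, here is a minimal Python sketch that splits a raw filename into entities, suffix and extension. The function name parse_bids_filename and the example filename are hypothetical, and only the extensions listed above are handled; this is not a replacement for the BIDS validator.

```python
# Extensions listed above (illustrative subset, not the full BIDS list).
EXTENSIONS = (".nii.gz", ".json", ".bval")


def parse_bids_filename(filename):
    """Split a raw filename into (entities, suffix, extension).

    Hypothetical helper for illustration only.
    """
    for ext in EXTENSIONS:
        if filename.endswith(ext):
            stem = filename[: -len(ext)]
            break
    else:
        raise ValueError(f"Unsupported extension: {filename}")

    # Entities are separated by "_"; the last chunk is the suffix.
    *entity_parts, suffix = stem.split("_")
    if "-" in suffix:
        raise ValueError(f"Missing suffix in {filename}")

    entities = {}
    for part in entity_parts:
        # Each entity is a key and a value separated by a dash.
        key, sep, value = part.partition("-")
        if not sep or not value:
            raise ValueError(f"Malformed entity '{part}' in {filename}")
        entities[key] = value
    return entities, suffix, ext


entities, suffix, extension = parse_bids_filename("sub-01_ses-02_acq-sag_T2w.nii.gz")
print(entities, suffix, extension)
```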
3. Important points regarding derivatives datasets
First, it is important to understand what BIDS derivatives folders are:
"Derivatives are outputs of common processing pipelines, capturing data and meta-data sufficient for a researcher to understand and (critically) reuse those outputs in subsequent processing. Standardizing derivatives is motivated by use cases where formalized machine-readable access to processed data enables higher level processing."
Basically, derivative folders have to be seen as processed data obtained from the raw dataset, with each derivative folder corresponding to a DIFFERENT process. Let's consider a basic workflow that typically follows the receipt of a new dataset. Let's imagine that we would like to generate spinal cord (SC) segmentations and intervertebral disc labels from this raw data using sct_deepseg_sc and sct_label_vertebrae.
Since the wanted outputs will be generated using TWO different algorithms, we will have to create TWO different folders under derivatives (e.g. derivatives/labels_deepseg and derivatives/labels_vertebrae). Moreover, a dataset_description.json file will have to be added at the root of EACH derivative folder to keep track of all the operations applied to the data (processes used, manual corrections...).
dataset_description.json
To keep a record of complex processing steps applied to the data, a descriptions.tsv file can be used.
descriptions.tsv
This file must be composed of at least two columns:
desc_id: labels corresponding to the desc entities (see Derivative entities below)
description: human readable descriptions of the processing steps
This file MAY be located at the root of the derivative dataset, or at the subject or session level
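As an illustration, the two-column file described above could be written with Python's csv module. This is a minimal sketch: the helper name write_descriptions_tsv and the example rows are hypothetical, and whether desc_id should hold the bare label or the full desc-<label> string should be checked against the BIDS specification.

```python
import csv
from pathlib import Path


def write_descriptions_tsv(folder, descriptions):
    """Write descriptions.tsv with the two required columns.

    `descriptions` maps a desc_id to a human readable description.
    Hypothetical helper for illustration only.
    """
    path = Path(folder) / "descriptions.tsv"
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["desc_id", "description"])
        for desc_id, text in descriptions.items():
            writer.writerow([desc_id, text])
    return path


# Example rows are made up for this sketch.
example_rows = {
    "desc-T2w": "Segmentation computed on the T2w image",
    "desc-T1w": "Segmentation computed on the T1w image",
}
```

Calling write_descriptions_tsv("derivatives/labels", example_rows) would place the file at the root of that derivative dataset, assuming the folder already exists.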
Then, derivative folders follow the same structure as the raw folders:
Derivatives structure
With folders corresponding to subjects, [sessions] and MRI modalities.
Finally, to construct the filenames, we can identify the same 3 types of elements as before (entities, suffixes and extensions) plus 1 extra consideration related to the raw data:
source_entities
This element corresponds to the entire source filename, with the omission of the source suffix and extension.
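A minimal Python sketch of this definition, assuming the source file follows the raw naming rules above (the function name get_source_entities and the example filename are hypothetical):

```python
def get_source_entities(source_filename):
    """Return the source filename without its suffix and extension.

    Hypothetical helper; assumes labels contain no "." or "_".
    """
    # Drop the extension (everything from the first dot onwards).
    stem = source_filename.split(".", 1)[0]
    # Drop the suffix (everything after the last underscore).
    entities, _, _suffix = stem.rpartition("_")
    if not entities:
        raise ValueError(f"No entities found in {source_filename}")
    return entities


print(get_source_entities("sub-01_ses-02_acq-sag_T2w.nii.gz"))
```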
Derivative entities
Characterized by a key word (space, res, den, etc.) and a value (label = an alphanumeric value, index = a nonnegative integer, etc) separated with a dash -
[space-<space>]: image space if different from raw space: template space (e.g. MNI305 etc), individual, study etc. (see BIDS for allowed spaces)
[res-<label>]: for changes in resolution
[den-<label>]
[desc-<label>]: should be used to specify the contrast (e.g. _desc-T1w and _desc-T2w)
[label-<label>]: to avoid confusion if multiple masks are available we can specify the masked structure (e.g. _label-WM for white matter, _label-GM for gray matter, _label-L for lesions etc.)
Entities are then separated with an underscore _
Derivative suffixes
An alphanumeric string located after all the entities following a final underscore _:
mask: for binary masks (0 and 1 only)
dseg: for discrete segmentations representing multiple anatomical structures
probseg: for probabilistic segmentations representing a single anatomical structure with values ranging from 0 to 1
Derivatives extensions
Files extensions:
.nii.gz
.json
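Putting the pieces of this section together, a derivative filename could be assembled as in the sketch below. It follows the description in this issue only (source entities, then derivative entities, then a derivative suffix) and does not reflect any later lab-specific decision. The function name derivative_filename is hypothetical, and the entity order simply follows the list above; the BIDS entity table defines the canonical order, which may differ.

```python
# Entities and suffixes as listed in this section (order is an assumption).
DERIVATIVE_ENTITIES = ("space", "res", "den", "desc", "label")
DERIVATIVE_SUFFIXES = ("mask", "dseg", "probseg")


def derivative_filename(source_filename, suffix, extension=".nii.gz", **entities):
    """Build a derivative filename from a raw (source) filename.

    Hypothetical helper for illustration only.
    """
    if suffix not in DERIVATIVE_SUFFIXES:
        raise ValueError(f"Unknown derivative suffix: {suffix}")
    unknown = set(entities) - set(DERIVATIVE_ENTITIES)
    if unknown:
        raise ValueError(f"Unknown derivative entities: {sorted(unknown)}")

    # source_entities: source filename without suffix and extension.
    stem = source_filename.split(".", 1)[0]
    source_entities, _, _source_suffix = stem.rpartition("_")
    if not source_entities:
        raise ValueError(f"No entities found in {source_filename}")

    parts = [source_entities]
    for key in DERIVATIVE_ENTITIES:
        if key in entities:
            parts.append(f"{key}-{entities[key]}")
    parts.append(suffix)
    return "_".join(parts) + extension


print(derivative_filename("sub-01_T2w.nii.gz", "mask", desc="T2w", label="WM"))
```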
Neuropoly strategy
To overcome this standardization challenge, multiple steps have to be considered:
Neuropoly Dataset state 2023-11-13 slides
MRI Datasets
Micro Datasets
EEG Datasets