BIDSification of Winkler et al. dataset and other datasets for training/validation of ICA labeling models #8

adam2392 · 2022-04-05T03:16:32Z

The dataset referenced in https://github.com/agramfort/artifact-learn/issues/1#issuecomment-906141483 has good and bad components labeled after ICA.

To facilitate training/testing, it would be good to write a BIDSification script that converts the dataset into BIDS format using mne-bids.

cc: @jacobf18 @mscheltienne @anandsaini024

adam2392 · 2022-04-05T03:17:34Z

For that matter, bidsification of as many datasets as possible, but organized semantically according to the criterion that was specified in #5 (comment) (i.e. different recording montages, hardware, etc.)

mscheltienne · 2022-04-05T08:48:14Z

I can have a look at the bidsification of this dataset, as I will have to do the ANT and EGI ones anyway.
I have never used mne-bids, but it has been on my to-do list for a long time 😄

adam2392 · 2022-04-06T14:18:49Z

Some things to think about:

upper bound on the number of seconds included in an ICA decomposition (e.g. a minute)
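A minimal sketch of such a cap, assuming the limit is applied by cropping from the start of the recording before the ICA is fitted. The helper name `ica_window` and the 60 s default are illustrative only; nothing in this thread fixes either.

```python
from typing import Tuple


def ica_window(duration_s: float, max_s: float = 60.0) -> Tuple[float, float]:
    """Return (tmin, tmax) in seconds, with tmax capped at ``max_s``."""
    if duration_s <= 0 or max_s <= 0:
        raise ValueError("duration_s and max_s must be positive")
    return 0.0, min(duration_s, max_s)


# Possible use with MNE-Python (assumes `raw` is an mne.io.Raw instance):
#   tmin, tmax = ica_window(raw.times[-1])
#   raw_for_ica = raw.copy().crop(tmin=tmin, tmax=tmax)
```

Recordings shorter than the cap are passed through unchanged, so the same call can be used on every file.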

mscheltienne · 2022-04-13T14:00:02Z

For the raw EEG datasets that can be used for benchmarking, I have:

ANT Neuro: I included 19 recordings of 4 minutes for now, and I can add more later.
EGI: Still working on bidsification.

It would be good to centralize those datasets, any idea where?
The ANT Neuro one is small (less than 1 GB) but the EGI one is above 30 GB. I will crop the files during bidsification.

Additional points to keep in mind for the processing of those datasets into labelled ICs:

BP filter applied before ICA. (1, 100) Hz? I think that's what is used to train ICLabel.
Number of components included in the ICA decomposition
ICA algorithm used
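One way to keep these three choices explicit is to store them next to each decomposition. The sketch below assumes MNE-Python; the `IcaConfig` name, the 30-component default, and the extended Infomax choice are placeholders until the questions above are settled.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class IcaConfig:
    """Parameters to record alongside every ICA decomposition (hypothetical)."""

    l_freq: float = 1.0               # band-pass low edge in Hz (open question)
    h_freq: float = 100.0             # band-pass high edge in Hz (open question)
    n_components: Optional[int] = 30  # placeholder value
    method: str = "infomax"           # placeholder; MNE also offers "fastica" and "picard"
    extended: bool = True             # only used when method == "infomax"


def fit_ica(raw, cfg: IcaConfig):
    """Band-pass filter a copy of a preloaded ``raw`` and fit an ICA using ``cfg``."""
    # Imported lazily so the config itself can be used without MNE installed.
    from mne.preprocessing import ICA

    filtered = raw.copy().filter(l_freq=cfg.l_freq, h_freq=cfg.h_freq)
    fit_params = {"extended": cfg.extended} if cfg.method == "infomax" else None
    ica = ICA(n_components=cfg.n_components, method=cfg.method, fit_params=fit_params)
    ica.fit(filtered)
    return ica


cfg = IcaConfig()
print(asdict(cfg))  # the dict can be dumped to a JSON sidecar next to the ICA file
```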

For Winkler et al., I had a quick look and the ICA decomposition seems to use: https://www.jmlr.org/papers/volume5/ziehe04a/ziehe04a.pdf
I'm not familiar with this algorithm, so I have to go over the paper to figure out what A_ffdiag and W_ffdiag represent.
Each file has the following attributes:

A_ffdiag (n x 30), W_ffdiag (30 x 3) with n around 120-ish
cnt: I think that's information about the raw data
filename: oddball is the only keyword I recognize in file names like 'oddball_fasor_B3_K2_VPii'
goodcomp: either (1, k) or (k, ) (with k up to 30) containing the IDx of the good components. 0-index or 1-index?
mnt: dunno
nComps: dunno.. int from 30 to 810?

We basically have 30 IC / file labeled.
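Because `goodcomp` comes in two shapes and its index base is still unknown, a loader could normalize it explicitly. This is only a sketch: `good_mask` is a hypothetical helper, and `one_based` has to be set by whoever confirms the convention (MATLAB indices are usually 1-based, but that has not been verified for this dataset).

```python
import numpy as np


def good_mask(goodcomp, n_components: int = 30, one_based: bool = True) -> np.ndarray:
    """Return a boolean mask of length ``n_components``; True marks a good IC."""
    idx = np.asarray(goodcomp, dtype=int).ravel()  # accepts (1, k) and (k,)
    if one_based:
        idx = idx - 1  # shift to 0-based indexing
    if idx.size and (idx.min() < 0 or idx.max() >= n_components):
        raise ValueError("component index out of range; check the index base")
    mask = np.zeros(n_components, dtype=bool)
    mask[idx] = True
    return mask


# Both layouts described above give the same mask.
assert (good_mask([[1, 3, 30]]) == good_mask([1, 3, 30])).all()
```

The range check is a cheap way to catch a wrong guess: a 0 in 1-based data or a 30 in 0-based data raises instead of silently mislabeling a component.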

But anyway, for bidsification, I looked into the specification derivatives for electrophysiological data, and the only reference I find is https://docs.google.com/document/d/1PmcVs7vg7Th-cGC-UrX8rAhKUHIzOI-uIOh69_mvdlw/edit#heading=h.f548zgpgxhiu so it's still at an extension proposal stage? I'll follow that for now, but if you know of another specification for ICA, I'm looking for it.

adam2392 · 2022-04-13T15:07:25Z

> It would be good to centralize those datasets, any idea where? The ANT Neuro one is small (less than 1 Gb) but the EGI one is above 30 Gb. I will crop the files during bidsification.

Perhaps for now we can share via OneDrive or Dropbox? I still have access to OneDrive through my institution and can set that up if you guys don't have Dropbox Pro.

Open to other ideas too.

I think long-term we want to store it on openneuro.org if that's okay? We might actually be able to leverage openneuro.org right away if you're okay with it. We can store and create private BIDSified datasets that we can then pull from even. I think programmatic access would require the dataset to be public(?)

> But anyway, for bidsification, I looked into the specification derivatives for electrophysiological data, and the only reference I find is https://docs.google.com/document/d/1PmcVs7vg7Th-cGC-UrX8rAhKUHIzOI-uIOh69_mvdlw/edit#heading=h.f548zgpgxhiu so it's still at an extension proposal stage? I'll follow that for now, but if you know of another specification for ICA, I'm looking for it.

Yeah there is no agreed-upon spec yet for ICA, but I think we should just follow the "format" that is suggested for derivatives and ICA. E.g. filenaming and directory structure at the very minimum. We can store files in the ICA format output by MNE-Python.

mscheltienne · 2022-04-13T15:26:07Z

Good idea for openneuro.org, I'll check if we can make those datasets public. Else OneDrive is a good option.
Sounds reasonable for the ICA, let's go with FIFF then, I'll try to convert those .mat files.

adam2392 · 2022-05-04T14:23:28Z

Other datasets: https://github.com/agramfort/artifact-learn/tree/master/1-%20extract%20basic%20info%20from%20databases

looks like we can probably download online?

adam2392 · 2022-06-14T14:50:18Z

For more data and inspiration, we can look at this review paper: https://iopscience.iop.org/article/10.1088/1741-2560/12/3/031001/pdf

adam2392 · 2022-06-15T19:25:22Z

For references:
mara -> BIDS: https://gist.github.com/mscheltienne/680f46336aec8a0408f30952c3d72e8d

ANT to BIDS: https://gist.github.com/mscheltienne/fe3dcc7dafef7539018a6a00ba73afed

adam2392 · 2022-06-16T13:52:32Z

See https://github.com/adam2392/improve_icalabel. For now, the scripts are centralized in that one repo.

adam2392 · 2022-06-27T15:03:50Z

Some open questions that we can defer to later of course:

how do we annotate components without the raw data?
the issue in mne-python is that it requires an inst in order to apply the fitted ICA and get the estimated "sources". We probably need to hack together an API for interfacing with the stored ICA time-series. Perhaps we just use them as RawArray?
how to run the benchmarks on these two settings?
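A rough sketch of the RawArray idea, assuming the stored sources are an array of shape (n_components, n_times) and the sampling rate is known. The `ICA###` names mirror how MNE names its components; typing the channels as `misc` is an assumption made here so that no EEG montage is needed.

```python
from typing import List


def ic_channel_names(n_components: int) -> List[str]:
    """Channel names in the same style MNE uses for ICA components."""
    return [f"ICA{i:03d}" for i in range(n_components)]


def sources_to_raw(sources, sfreq: float):
    """Wrap stored IC time-series (n_components, n_times) in an mne.io.RawArray."""
    import mne  # imported lazily; only needed when actually building the Raw

    info = mne.create_info(
        ch_names=ic_channel_names(sources.shape[0]),
        sfreq=sfreq,
        ch_types="misc",  # assumption: treat ICs as generic channels
    )
    return mne.io.RawArray(sources, info)
```

With the sources stored this way, annotation and plotting can go through the usual Raw tooling without needing the original recording or a fitted ICA object.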

@anandsaini024 do you have any existing code for building out benchmark models that you want to push up to the improve-icalabel repo?

Anything that you are able to work on while we sort out the GUI and pipeline for annotating the data?

mscheltienne · 2022-06-27T19:43:43Z

+1 for the IC time-series as RawArray, with an extension e.g. '*-sources-raw.fif' for the MARA dataset.

In this convert function:

https://github.com/adam2392/improve_icalabel/blob/96522dacd045a5caa50f7f4653d9fd988a29bfa1/mnestudy/ica_to_bids/mara.py#L35-L44
For each iteration, the missing steps are to save the ICA (ica) with the extension -ica.fif, save the IC time-series (sources) as a RawArray, and save the good/bad components in a sidecar with the annotation function you added (all of that at the correct BIDS path).
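The three outputs per file could be laid out as in the sketch below. It only illustrates naming: the `derivatives` folder, the `sub-`/`task-` stem, and the `-components.tsv` sidecar name are assumptions, and the real conversion function should build its paths with mne-bids and write the sidecar with the annotation function mentioned above.

```python
from pathlib import Path
from typing import Dict


def output_paths(root, subject: str, task: str) -> Dict[str, Path]:
    """Return hypothetical output locations for one converted file."""
    stem = f"sub-{subject}_task-{task}"
    folder = Path(root) / "derivatives" / f"sub-{subject}" / "eeg"
    return {
        "ica": folder / f"{stem}-ica.fif",              # fitted ICA
        "sources": folder / f"{stem}-sources-raw.fif",  # IC time-series as RawArray
        "labels": folder / f"{stem}-components.tsv",    # good/bad sidecar
    }


# Possible use inside the conversion loop (assumes `ica` and `sources_raw` exist):
#   paths = output_paths(bids_root, subject="01", task="oddball")
#   paths["ica"].parent.mkdir(parents=True, exist_ok=True)
#   ica.save(paths["ica"])
#   sources_raw.save(paths["sources"])
```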

@anandsaini024 This is what I briefly described to you this evening. It would be great if you could finish this conversion function.

anandsaini024 · 2022-06-27T22:43:27Z

Alright, I will pick this up.

adam2392 · 2022-06-28T02:10:53Z

@anandsaini024 have you preprocessed the ANTS dataset already? I am going to use one of the subjects as a test subject for the hs student to QA his annotations.

If you did, do you mind pushing up the script to mnestudy/ica_to_bids/ants.py or somewhere there?

mscheltienne · 2022-06-28T14:34:16Z

@adam2392 You should have received something from openneuro on your Gmail for dataset ds004178. It contains the ANT Neuro raw files and the preprocessed (pp) files with their ICA decomposition. It is however an automatic pipeline, so please have a look at the preprocessed data when you load it. Also, I did not check when I picked up those files, but maybe some of them are bad recordings with a lot of bridged electrodes (it does happen from time to time). If this is the case, then the recording has to be excluded and I can provide a different one instead.

Note: I deleted the old dataset with only raw data and replaced it with this one.. I did not figure out how to easily update the existing dataset 🤯

adam2392 · 2022-06-28T14:46:15Z

Can you share with me again?

Yeah, updating is a pain. Adding files is easy, deleting files is kind of a pain, and modifying files is a super pain.

Then my plan for the hs student is to:

determine bad electrodes in the ANT raw dataset
run ICA (with my help)
determine ICA components
maybe do some benchmarking using existing sklearn classifiers (with my help)

It seems we might not be able to get him to fully annotate the ICA components as desired, but hopefully we can get at least some of the raw annotated.

mscheltienne · 2022-06-28T15:07:37Z

So.. the dataset does not appear even on my account.. except if I explicitly enter the corresponding URL.
Indeed, in the "share" tab you were not appearing anymore.. I've sent again the share invite. Let me know..

adam2392 · 2022-06-28T15:48:41Z

I unfortunately did not. Perhaps it just didn't finish uploading yet?

Openneuro even tho it's "nice" seems pretty buggy -__-

mscheltienne · 2022-06-28T16:27:08Z

Yep.. I had multiple issues with it recently.
It did finish uploading for sure (I was very careful about that :p).. I'll upload it to dropbox or GoogleDrive this evening and I'll share it on your Gmail account.

adam2392 · 2022-06-28T17:22:43Z

Oh I see it now on openneuro :p

mscheltienne · 2022-06-28T17:30:39Z

Same it finally popped up on my account..
So just to be clear, the proc-raw files are raw data as it comes out of the amplifier, and the proc-pp and proc-ica files are what comes out of https://github.com/adam2392/improve_icalabel/tree/master/mnestudy/raw_to_bids
All 3 files are actually generated from a non-BIDS-compliant dataset with this script: https://github.com/adam2392/improve_icalabel/blob/master/mnestudy/raw_to_bids/ant.py
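To pick one of the three variants programmatically, the `proc-` entity can be read from the file name. A small sketch, using a hypothetical file name since the actual names in the dataset are not shown in this thread:

```python
from typing import Dict


def bids_entities(fname: str) -> Dict[str, str]:
    """Split a BIDS-style basename into its key-value entities."""
    stem = fname.split(".", 1)[0]  # drop the extension(s)
    # Parts without a hyphen (e.g. the trailing suffix "eeg") are skipped.
    return dict(part.split("-", 1) for part in stem.split("_") if "-" in part)


entities = bids_entities("sub-01_task-rest_proc-pp_eeg.fif")  # hypothetical name
is_preprocessed = entities.get("proc") == "pp"
```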

adam2392 · 2022-06-29T01:48:51Z

Agenda for tomorrow so I don't forget:

sharing of the preprocessed MARA dataset.
ICLabel existing dataset?
GUI
preprocessing workflow for ANTs

Outcome action items:

-> @anandsaini024 to share on Dropbox?
Let's not use the ICLabel dataset because it's not very well-coded. We agreed to use it only for the port of ICLabel.
GUI @mscheltienne can possibly get this done by the time Adam comes back July 10thish.
We want Aaron (@ayoun25) to double-check the proc-pp dataset to determine whether the automated pipeline marked bad/good electrodes correctly. Note: if any of the datasets have more than about 10 bridged electrodes, notify @mscheltienne.
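The bridged-electrode rule can be checked automatically. The sketch assumes bridged pairs are given as (i, j) index tuples, which is the form of the first output of `mne.preprocessing.compute_bridged_electrodes` in recent MNE-Python versions; the threshold of 10 is the informal figure from the notes above, and `too_many_bridged` is a hypothetical helper.

```python
from typing import Iterable, Tuple


def too_many_bridged(
    bridged_pairs: Iterable[Tuple[int, int]], max_electrodes: int = 10
) -> bool:
    """Return True if more than ``max_electrodes`` distinct electrodes are bridged."""
    electrodes = {ch for pair in bridged_pairs for ch in pair}
    return len(electrodes) > max_electrodes


# Possible use (assumes `raw` is a preloaded mne.io.Raw with EEG channels):
#   from mne.preprocessing import compute_bridged_electrodes
#   bridged_idx, _ = compute_bridged_electrodes(raw)
#   if too_many_bridged(bridged_idx):
#       print("Recording should be reported and replaced")
```

Counting distinct electrodes rather than pairs avoids inflating the number when one electrode is bridged with several neighbours.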

chmendoza · 2022-11-23T20:26:56Z

Hello,

Thank you for mne-iclabel! I would like to test a new IC feature extracted from the time series in training a multi-class IC classifier (ideally, more than 2 types of ICs).

Is the preprocessed MARA dataset publicly available?
Is there any other dataset you would recommend that has more than two IC classes and has expert-annotated labels?
If there is no such dataset, which standard EEG datasets would you recommend to run ICLabel on it and use that as noisy labels to train my classifier?

Thank you!! 😃

jacobf18 · 2022-11-24T19:04:05Z

@chmendoza As far as I know, there is no large publicly available dataset for IC classification. We were working on processing a dataset (referenced above) to test the IC classification. The feature/label dataset for ICLabel is available, but not the original ICs. That dataset is available here: https://github.com/lucapton/ICLabel-Dataset.

adam2392 · 2022-11-28T16:06:35Z

> Thank you for mne-iclabel! I would like to test a new IC feature extracted from the time series in training a multi-class IC classifier (ideally, more than 2 types of ICs).

@chmendoza Feel free to make a separate GH issue/PR, if/when your model is ready for review. We would love to include this into MNE-ICALabel to propagate it to the MNE community.

adam2392 changed the title ~~BIDSification of Winkler et al. dataset~~ BIDSification of Winkler et al. dataset and other datasets for training/validation of ICA labeling models May 4, 2022
