BIDSification of Winkler et al. dataset and other datasets for training/validation of ICA labeling models #8

Open
adam2392 opened this issue Apr 5, 2022 · 25 comments

Comments

@adam2392
Member

adam2392 commented Apr 5, 2022

The dataset referenced in https://github.com/agramfort/artifact-learn/issues/1#issuecomment-906141483 has good and bad components labeled after ICA.

To facilitate easy training/testing, it would be good to write a BIDSification script that converts the dataset into BIDS format using mne-bids.

cc: @jacobf18 @mscheltienne @anandsaini024

@adam2392
Member Author

adam2392 commented Apr 5, 2022

For that matter, bidsification of as many datasets as possible, but organized semantically according to the criterion that was specified in #5 (comment) (i.e. different recording montages, hardware, etc.)

@mscheltienne
Member

I can have a look at the bidsification of this dataset, as I will have to do the ANT and EGI ones anyway.
I have never used mne-bids, but it has been on my to-do list for a long time 😄

@adam2392
Member Author

adam2392 commented Apr 6, 2022

Some things to think about:

  • upper bound on the number of seconds included in an ICA decomposition (e.g. a minute)
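
One way to enforce such an upper bound is to compute a crop window before fitting ICA. The sketch below is only an illustration: the 60-second default comes from the "e.g. a minute" above, while the function name and the choice to centre the window in the recording are assumptions.

```python
def crop_window(duration_s, max_s=60.0):
    """Return (tmin, tmax) in seconds for a window of at most max_s seconds.

    The window is centred in the recording (an assumption, to stay away
    from the start and end of a session).
    """
    if duration_s <= 0 or max_s <= 0:
        raise ValueError("duration_s and max_s must be positive")
    if duration_s <= max_s:
        # Short recordings are kept whole.
        return 0.0, float(duration_s)
    tmin = (duration_s - max_s) / 2.0
    return tmin, tmin + max_s


tmin, tmax = crop_window(240.0)  # a 4-minute recording
```

With MNE-Python, the result could be passed to raw.crop(tmin=tmin, tmax=tmax) before calling ica.fit(raw).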

@mscheltienne
Member

mscheltienne commented Apr 13, 2022

For the raw EEG datasets that can be used for benchmarking, I have:

  • ANT Neuro: I included 19 recordings of 4 minutes for now, and I can add more later.
  • EGI: Still working on bidsification.

It would be good to centralize those datasets, any idea where?
The ANT Neuro one is small (less than 1 GB) but the EGI one is above 30 GB. I will crop the files during bidsification.


Additional points to keep in mind for the processing of those datasets into labelled ICs:

  • BP filter applied before ICA. (1, 100) Hz? I think that's what is used to train ICLabel.
  • Number of components included in the ICA decomposition
  • ICA algorithm used
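
To keep these three choices identical across datasets, they could be recorded in one small settings object. This is a sketch under assumptions: the (1, 100) Hz band is the one floated above, while the class name, the default of 30 components, and the "infomax" placeholder are hypothetical and not decided in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ICAPreprocConfig:
    """Settings shared by every dataset (hypothetical names and defaults)."""

    l_freq: float = 1.0      # band-pass lower edge in Hz, as floated in the thread
    h_freq: float = 100.0    # band-pass upper edge in Hz, as floated in the thread
    n_components: int = 30   # assumption: placeholder value
    method: str = "infomax"  # assumption: placeholder algorithm name

    def check(self, sfreq, n_channels):
        """Raise ValueError if the settings cannot apply to a recording."""
        if not 0 < self.l_freq < self.h_freq:
            raise ValueError("need 0 < l_freq < h_freq")
        if self.h_freq >= sfreq / 2.0:
            raise ValueError("h_freq must be below the Nyquist frequency")
        if not 0 < self.n_components <= n_channels:
            raise ValueError("n_components must be in 1..n_channels")


config = ICAPreprocConfig()
config.check(sfreq=1000.0, n_channels=64)  # passes silently
```

Calling check() per recording would catch, for example, a file sampled at 200 Hz, where a 100 Hz upper edge is not below the Nyquist frequency.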

For winkler et al. I looked quickly and the ICA decomposition seems to use: https://www.jmlr.org/papers/volume5/ziehe04a/ziehe04a.pdf
I'm not familiar with this algo, so I have to go over the paper to figure out what A_ffdiag and W_ffdiag represent.
Each file has the following attributes:

  • A_ffdiag (n x 30), W_ffdiag (30 x 3) with n around 120-ish
  • cnt: I think that's information about the raw data
  • filename: oddball is the only keyword I recognize in file names like 'oddball_fasor_B3_K2_VPii'
  • goodcomp: either (1, k) or (k, ) (with k up to 30) containing the indices of the good components. 0-indexed or 1-indexed?
  • mnt: dunno
  • nComps: dunno.. int from 30 to 810?

We basically have 30 IC / file labeled.
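
A possible way to turn goodcomp into one label per component, whichever of the two shapes it has, is sketched below. The index base is still an open question, so it is a parameter; the function name and the 1-based default (a guess based on the files being .mat) are assumptions.

```python
def component_labels(goodcomp, n_components=30, one_based=True):
    """Return a list of "good"/"bad" labels, one per component.

    goodcomp may be flat (k,) or nested (1, k); NumPy arrays are accepted
    through their tolist() method.
    """
    if hasattr(goodcomp, "tolist"):
        goodcomp = goodcomp.tolist()

    def flatten(x):
        # Walk nested lists/tuples and yield plain integers.
        if isinstance(x, (list, tuple)):
            for item in x:
                yield from flatten(item)
        else:
            yield int(x)

    offset = 1 if one_based else 0
    good = {idx - offset for idx in flatten(goodcomp)}
    if any(idx < 0 or idx >= n_components for idx in good):
        raise ValueError("index out of range; check the one_based setting")
    return ["good" if i in good else "bad" for i in range(n_components)]


labels = component_labels([[1, 3, 30]])  # (1, k) layout, 1-based indices
```

The range check doubles as a cheap test of the index base: an index of 0 fails under one_based=True, and an index of 30 fails under one_based=False.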

But anyway, for bidsification, I looked into the specification derivatives for electrophysiological data, and the only reference I find is https://docs.google.com/document/d/1PmcVs7vg7Th-cGC-UrX8rAhKUHIzOI-uIOh69_mvdlw/edit#heading=h.f548zgpgxhiu so it's still at an extension proposal stage? I'll follow that for now, but if you know of another specification for ICA, I'm looking for it.

@adam2392
Member Author

adam2392 commented Apr 13, 2022

It would be good to centralize those datasets, any idea where? The ANT Neuro one is small (less than 1 GB) but the EGI one is above 30 GB. I will crop the files during bidsification.

Perhaps for now, we can share via OneDrive or Dropbox? I still have access to OneDrive through my institution and can set that up if you guys don't have Dropbox Pro.

Open to other ideas too.

I think long-term we want to store it on openneuro.org if that's okay? We might actually be able to leverage openneuro.org right away if you're okay with it. We can store and create private BIDSified datasets that we can then pull from even. I think programmatic access would require the dataset to be public(?)

But anyway, for bidsification, I looked into the specification derivatives for electrophysiological data, and the only reference I find is https://docs.google.com/document/d/1PmcVs7vg7Th-cGC-UrX8rAhKUHIzOI-uIOh69_mvdlw/edit#heading=h.f548zgpgxhiu so it's still at an extension proposal stage? I'll follow that for now, but if you know of another specification for ICA, I'm looking for it.

Yeah there is no agreed-upon spec yet for ICA, but I think we should just follow the "format" that is suggested for derivatives and ICA. E.g. filenaming and directory structure at the very minimum. We can store files in the ICA format output by MNE-Python.
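
As a rough illustration of "filenaming and directory structure at the very minimum", a derivative path could be assembled as below. The entity order, the "ica" suffix, and the pipeline folder name are assumptions for illustration only; they are not taken from an accepted BIDS specification.

```python
from pathlib import PurePosixPath


def ica_derivative_path(root, pipeline, subject, task, suffix="ica", ext=".fif"):
    """Build <root>/derivatives/<pipeline>/sub-<subject>/eeg/<basename>.

    The layout is hypothetical and only mirrors the general BIDS
    derivatives pattern.
    """
    basename = f"sub-{subject}_task-{task}_{suffix}{ext}"
    return (
        PurePosixPath(root)
        / "derivatives"
        / pipeline
        / f"sub-{subject}"
        / "eeg"
        / basename
    )


path = ica_derivative_path("bids_root", "ica", "01", "oddball")
print(path)
```

Keeping the naming in one function means the layout can be changed in one place once a spec for ICA derivatives is agreed upon.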

@mscheltienne
Member

Good idea for openneuro.org, I'll check if we can make those datasets public. Else OneDrive is a good option.
Sounds reasonable for the ICA, let's go with FIFF then, I'll try to convert those .mat files.

@adam2392
Member Author

adam2392 commented May 4, 2022

Other datasets: https://github.com/agramfort/artifact-learn/tree/master/1-%20extract%20basic%20info%20from%20databases

looks like we can probably download online?

@adam2392 adam2392 changed the title BIDSification of Winkler et al. dataset BIDSification of Winkler et al. dataset and other datasets for training/validation of ICA labeling models May 4, 2022
@adam2392
Member Author

For more data and inspiration, we can look at this review paper: https://iopscience.iop.org/article/10.1088/1741-2560/12/3/031001/pdf

@adam2392
Member Author

See: https://github.com/adam2392/improve_icalabel. The scripts are centralized in this one repo for now.

@adam2392
Member Author

Some open questions that we can defer to later of course:

  • how do we annotate components without the raw data?
  • the issue in mne-python is that it requires an inst in order to apply the fitted ICA to get the estimated "sources". We need to probably hack an API for interfacing with the stored ICA time-series. Perhaps we just use them as RawArray?
  • how to run the benchmarks on these two settings?
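
One option for the first two points is to compute the sources once, while the raw data is still at hand, and store them so that annotation never needs the original recording. The sketch below assumes a plain unmixing matrix of shape (n_components, n_channels) and synthetic data; it ignores any centering or pre-whitening that a fitted MNE ICA object applies, and all names are hypothetical.

```python
import numpy as np


def compute_sources(unmixing, data):
    """Project channel data onto components: sources = unmixing @ data.

    unmixing: (n_components, n_channels); data: (n_channels, n_times).
    """
    unmixing = np.asarray(unmixing, dtype=float)
    data = np.asarray(data, dtype=float)
    if unmixing.shape[1] != data.shape[0]:
        raise ValueError("unmixing columns must match the number of channels")
    return unmixing @ data


rng = np.random.default_rng(0)
data = rng.standard_normal((4, 1000))   # 4 channels, 1000 samples (synthetic)
unmixing = rng.standard_normal((3, 4))  # 3 components (synthetic)
sources = compute_sources(unmixing, data)
ch_names = [f"ICA{i:03d}" for i in range(sources.shape[0])]
```

The resulting array could then be wrapped with mne.io.RawArray(sources, mne.create_info(ch_names, sfreq, ch_types="misc")) and saved, so that a GUI reads the stored time series like any other Raw object.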

@anandsaini024 do you have any existing code for building out benchmark models that you want to push up to the improve-icalabel repo?

Anything that you are able to work on while we sort out the GUI and pipeline for annotating the data?

@mscheltienne
Member

+1 for the IC time-series as RawArray, with an extension e.g. '*-sources-raw.fif' for the MARA dataset.

In this convert function:

https://github.com/adam2392/improve_icalabel/blob/96522dacd045a5caa50f7f4653d9fd988a29bfa1/mnestudy/ica_to_bids/mara.py#L35-L44
For each iteration, the missing steps are to save the ICA object (ica) with the suffix -ica.fif, save the IC time series (sources) as a RawArray, and save the good/bad component labels in a sidecar with the annotation function you added, all at the correct BIDS path.
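
The sidecar step could look roughly like the sketch below. It is a stand-in for the annotation function mentioned above, which is not shown in this thread; the column names, the "ICA000"-style component names, and the TSV layout are assumptions.

```python
import csv
import io


def write_component_tsv(fileobj, labels):
    """Write one row per component: its name and its good/bad status."""
    writer = csv.writer(fileobj, delimiter="\t", lineterminator="\n")
    writer.writerow(["component", "status"])
    for idx, status in enumerate(labels):
        if status not in ("good", "bad"):
            raise ValueError(f"unexpected status: {status!r}")
        writer.writerow([f"ICA{idx:03d}", status])


buffer = io.StringIO()  # stands in for an open sidecar file
write_component_tsv(buffer, ["good", "bad", "good"])
tsv_text = buffer.getvalue()
```

In the real conversion loop, the buffer would be replaced by a file opened next to the saved -ica.fif file.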

@anandsaini024 This is what I briefly described to you this evening. It would be great if you could finish this conversion function.

@anandsaini024
Contributor

Alright, I will pick this up.

@adam2392
Member Author

adam2392 commented Jun 28, 2022

@anandsaini024 have you preprocessed the ANTS dataset already? I am going to use one of the subjects as a test subject for the hs student to QA his annotations.

If you did, do you mind pushing up the script to mnestudy/ica_to_bids/ants.py or somewhere there?

@mscheltienne
Member

@adam2392 You should have received something from openneuro on your Gmail for dataset ds004178. It contains the ANT Neuro raw files and the preprocessed (pp) files with their ICA decomposition. It is however an automatic pipeline, so please have a look at the preprocessed data when you load it. Also, I did not check when I picked up those files, but maybe some of them are bad recordings with a lot of bridged electrodes (it does happen from time to time). If this is the case, then the recording has to be excluded and I can provide a different one instead.

Note: I deleted the old dataset with only raw data and replaced it with this one.. I did not figure out how to easily update the existing dataset 🤯

@adam2392
Member Author

Can you share with me again?

Yeah, updating is a pain. Adding files is easy, deleting files is kind of a pain, and modifying files is a super pain.

Then my plan for the hs student is to:

  • determine bad electrodes in the ANT raw dataset
  • run ICA (with my help)
  • determine ICA components
  • maybe do some benchmarking using existing sklearn classifiers (with my help)

It seems we might not be able to get him to fully annotate the ICA components as desired, but hopefully we can get at least some of the raw annotated.

@mscheltienne
Member

So.. the dataset does not appear even on my account.. except if I explicitly enter the corresponding URL.
Indeed, in the "share" tab you were not appearing anymore.. I've sent again the share invite. Let me know..

@adam2392
Member Author

I unfortunately did not. Perhaps it just didn't finish uploading yet?

Openneuro even tho it's "nice" seems pretty buggy -__-

@mscheltienne
Member

Yep.. I had multiple issues with it recently.
It did finish uploading for sure (I was very careful about that :p).. I'll upload it to dropbox or GoogleDrive this evening and I'll share it on your Gmail account.

@adam2392
Member Author

Oh I see it now on openneuro :p

@mscheltienne
Member

mscheltienne commented Jun 28, 2022

Same it finally popped up on my account..
So just to be clear, the proc-raw files are the raw data as it comes out of the amplifier, and the proc-pp and proc-ica files are what comes out of https://github.com/adam2392/improve_icalabel/tree/master/mnestudy/raw_to_bids
All 3 files are actually generated from a non-BIDS-compliant dataset with this script: https://github.com/adam2392/improve_icalabel/blob/master/mnestudy/raw_to_bids/ant.py

@adam2392
Member Author

adam2392 commented Jun 29, 2022

Agenda for tomorrow so I don't forget:

  1. sharing of the preprocessed MARA dataset.
  2. ICLabel existing dataset?
  3. GUI
  4. preprocessing workflow for ANTs

Outcome action items:

  1. -> @anandsaini024 to share on Dropbox?
  2. Let's not use the ICLabel dataset because it's not very well coded. We agreed to just use it for the port of ICLabel.
  3. GUI @mscheltienne can possibly get this done by the time Adam comes back July 10thish.
  4. We want Aaron (@ayoun25) to double-check the proc-pp dataset to determine whether the automated pipeline marked bad/good electrodes correctly. Note: if any of the datasets have more than about 10 bridged electrodes, notify @mscheltienne.
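
The bridged-electrode rule in item 4 could be checked mechanically along these lines. This is a sketch only: the threshold of 10 is the approximate figure from the note, and the function names and the input format (pairs of channel names) are assumptions.

```python
def bridged_electrodes(bridged_pairs):
    """Return the sorted list of electrodes that appear in any bridged pair."""
    return sorted({ch for pair in bridged_pairs for ch in pair})


def needs_review(bridged_pairs, max_bridged=10):
    """True if more than max_bridged distinct electrodes are bridged."""
    return len(bridged_electrodes(bridged_pairs)) > max_bridged


pairs = [("Fp1", "Fp2"), ("Fp2", "AF3"), ("C3", "C4")]
print(bridged_electrodes(pairs))
print(needs_review(pairs))
```

The pairs themselves would have to come from a detection step; MNE-Python's mne.preprocessing.compute_bridged_electrodes returns pairs of channel indices that could be mapped to names and fed into this check.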

@chmendoza

Hello,

Thank you for mne-iclabel! I would like to test a new IC feature extracted from the time series in training a multi-class IC classifier (ideally, more than 2 types of ICs).

  • Is the preprocessed MARA dataset publicly available?
  • Is there any other dataset you would recommend that has more than two IC classes and has expert-annotated labels?
  • If there is no such dataset, which standard EEG datasets would you recommend to run ICLabel on it and use that as noisy labels to train my classifier?

Thank you!! 😃

@jacobf18
Collaborator

@chmendoza As far as I know, there is no large publicly available dataset for IC classification. We were working on processing a dataset (referenced above) to test the IC classification. The feature/label dataset for ICLabel is available, but not the original ICs. That dataset is available here: https://github.com/lucapton/ICLabel-Dataset.

@adam2392
Member Author

adam2392 commented Nov 28, 2022

Thank you for mne-iclabel! I would like to test a new IC feature extracted from the time series in training a multi-class IC classifier (ideally, more than 2 types of ICs).

@chmendoza Feel free to make a separate GH issue/PR, if/when your model is ready for review. We would love to include this into MNE-ICALabel to propagate it to the MNE community.


5 participants