Initial high-level goals and outline #2

Open
rly opened this issue Mar 29, 2024 · 10 comments

@rly
Collaborator

rly commented Mar 29, 2024

Goal: A slightly technical perspective piece describing the problem of representing arrays in schemas and how array support within LinkML solves that problem

A rough first outline:

  1. Background
    • The historical challenges of representing arrays in schema with rich metadata (e.g., linkages to ontologies)
      • Previous work, including HDMF, CORAL, the NeXus data format for X-ray and neutron scattering and muon spectroscopy, netCDF?, OME-NGFF, geospatial data?, JSON Schema arrays, and previous approaches in LinkML
      • Many file formats for arrays: raw binary, NumPy npy/npz, HDF5, Zarr, N5, JSON, CSV/TSV, GRIB, TIFF, FITS
      • Many APIs for working with these formats: numpy, h5py, zarr, xarray, etc.
    • Need a unifying framework across these technologies
      • What is LinkML
  2. Adding array support in LinkML - some technical details
    • Specification of simple and labeled arrays
    • Generation of a Pydantic API for arrays (see the sketch after this outline)
    • Dumpers and loaders between various file formats and various (Python-based) array APIs
    • Validation
  3. Examples in at least two applications
    • Neurophysiology
    • Bioimaging
    • Environmental / geospatial
  4. Discussion
    • Challenges of open data in science
    • Expanding coverage / use cases of LinkML
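
As a rough illustration of the specification and generation items above, here is a minimal sketch of what a generated Pydantic model for a labeled array might look like, using numpydantic's NDArray type; the class and slot names are hypothetical, not actual generator output:

```python
# Hypothetical sketch: what a LinkML-generated Pydantic model for a
# labeled array might look like, using numpydantic so the same field
# accepts in-memory numpy arrays as well as HDF5- or zarr-backed ones.
import numpy as np
from pydantic import BaseModel, ValidationError
from numpydantic import NDArray, Shape

class TemperatureDataset(BaseModel):
    # a 3-D array with labeled axes (x, y, time); "*" = any length
    temperatures: NDArray[Shape["* x, * y, * time"], float]

# validation happens at model construction
ok = TemperatureDataset(temperatures=np.zeros((4, 4, 10)))

try:
    TemperatureDataset(temperatures=np.zeros((4, 4)))  # wrong rank
except ValidationError as e:
    print(e)  # the shape mismatch is reported by the validator
```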

Potential target journals:

  • Nature Biotechnology, as Perspective
  • Nature Scientific Data, as Article
  • eLife, as Tools and Resources?
  • Gigascience
  • IEEE Big Data

We would write a second paper on LinkML arrays for NWB/neurophysiology, more for a neuroscience audience.

@cmungall
Member

cmungall commented Apr 12, 2024

https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.1.0/schema.md

@oruebel

oruebel commented Apr 12, 2024

Just some links to a few other array-based file formats that may be of interest:

@sneakers-the-rat
Collaborator

OK, am back from vacation and ready to rumble.

Made some child issues off this one to start tracking different pieces and to make 'threads' for discussing those sub-points.

@sneakers-the-rat
Collaborator

Sorry, have been preoccupied with events on campus; will be returning to this next week. Numpydantic is near a 1.0 except for tests and docs, so the sequence is: Numpydantic 1.0 -> new array range generator -> update nwb-linkml to reflect the changes.

The intro to the paper can happen async, but once we have those pieces in place we can do the meat of the results.

@sneakers-the-rat
Collaborator

Hey everyone. I just "officially" released numpydantic, so the next step is to put that in the LinkML arrays generator (should only take ~a day) and then rework nwb-linkml (~a week) before starting the paper.

Unfortunately, my employer has decided to commit egregious unfair labor practices in the form of police violence against my students and colleagues, so starting next week I will be on strike and not doing any work that brings any benefit to my employer; unfortunately, my academic work is decidedly within the scope of struck work. The strike will last at most until June 30th, when the grad student contracts expire, but it may end sooner than that; pretty unclear at this point. Hope you all understand.

https://www.uaw4811.org/2024-ulp-charges

@sneakers-the-rat
Collaborator

Back at work and ready to roll. Got the numpydantic LinkML PR open: linkml/linkml#2178

I'm about to update nwb-linkml to reflect all the work we've done with LinkML arrays. My goal is to get read and write working with HDF5 and Zarr, and to demonstrate a YAML-based NWB form like this:

```yaml
my_dataset: !core.nwbfile.NWBFile
  # ... various metadata fields
  acquisition:
    probe_0_lfp: !core.ecephys.LFP
      probe_0_lfp_data: !core.ecephys.ElectricalSeries
        data:
          # ... metadata attributes
          array: # specify using a relative path and hash
            path: probe_0_lfp_data.zarr
            hash:
              value: # some long hash
              type: blake2b
          electrodes:
            table: !reference /general/extracellular_ephys/electrodes
            array: [0, 1, 2, 3, 4, 5] # inline arrays should work the same as path references
  general:
    extracellular_ephys:
      electrodes:
        # ... and so on
  stimulus:
    presentation:
      my_stimulus_video: !core.base.TimeSeries
        data:
          array:
            path: my_video.mp4  # videos behave the same as arrays
            hash: # ...
```

This is just one possible serialized form, as an example. It's sort of like what's going on in hdmf_zarr, except it's human readable, it can accept any kind of file (going to add an additional metadata field for plugins needed to read/write files), and it doesn't become dependent on hierarchical folder structure (that can still be used if it's helpful, but in general I think relying on directory structure in formats is another point of coupling between the abstract specification and the concrete implementation that seems to put a low cap on their expressiveness).
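
To make the path-plus-hash idea concrete, a loader could verify the referenced file before reading it. A minimal sketch, assuming the YAML layout above; the helper is hypothetical, not part of nwb-linkml:

```python
# Hypothetical sketch (not part of nwb-linkml): resolve and verify a
# path + hash array reference like the ones in the YAML above before
# handing the file to a reader. Works for single files (e.g. the mp4);
# a zarr store is a directory and would need a manifest/tree hash.
import hashlib
from pathlib import Path

def verify_array_reference(base_dir: Path, ref: dict) -> Path:
    # ref looks like {"path": "my_video.mp4",
    #                 "hash": {"value": "...", "type": "blake2b"}}
    target = base_dir / ref["path"]
    hasher = hashlib.new(ref["hash"]["type"])  # e.g. blake2b
    with open(target, "rb") as f:
        for chunk in iter(lambda: f.read(2**20), b""):
            hasher.update(chunk)
    if hasher.hexdigest() != ref["hash"]["value"]:
        raise ValueError(f"hash mismatch for {target}")
    return target  # safe to pass to h5py, a video reader, etc.
```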

@sneakers-the-rat
Collaborator

Sneak preview of the schema models: https://github.com/p2p-ld/nwb-linkml/tree/linkml-arrays/nwb_linkml/src/nwb_linkml/models/pydantic/core
and the generated schema: https://github.com/p2p-ld/nwb-linkml/tree/linkml-arrays/nwb_linkml/src/nwb_linkml/schema/linkml/core

That was super easy. I need to make some changes in upstream LinkML, and I think I can scrap pretty much all of my monkeypatching of the generator too. Then it's just a matter of writing dumpers and loaders (rough sketch of the idea below).

One of those times where you thank yourself for overengineering something before, because it was extremely simple to just swap out the translation/generation routine here. I think you all are going to get a kick out of how simple nwb-linkml is once it's working.
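
As a rough sketch of what a dumper/loader pair could do: write a model's metadata to YAML and its array to HDF5, then reassemble and re-validate on load. Everything here (the Series class, dump, load) is hypothetical, not the actual nwb-linkml API:

```python
# Hypothetical sketch of a dumper/loader pair (not the actual
# nwb-linkml API): metadata goes to YAML, the array goes to HDF5,
# and the loader reassembles and re-validates the model.
import h5py
import numpy as np
import yaml
from pydantic import BaseModel
from numpydantic import NDArray, Shape

class Series(BaseModel):  # stand-in for a generated model
    name: str
    data: NDArray[Shape["* time"], float]

def dump(model: Series, h5_path: str, yaml_path: str) -> None:
    with h5py.File(h5_path, "w") as f:
        f.create_dataset(model.name, data=np.asarray(model.data))
    meta = model.model_dump(exclude={"data"})
    meta["data"] = {"path": h5_path, "dataset": model.name}
    with open(yaml_path, "w") as f:
        yaml.safe_dump(meta, f)

def load(yaml_path: str) -> Series:
    with open(yaml_path) as f:
        meta = yaml.safe_load(f)
    ref = meta.pop("data")
    with h5py.File(ref["path"], "r") as f:
        meta["data"] = f[ref["dataset"]][:]  # read the array back
    return Series(**meta)  # re-validates shape and dtype on load
```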

@sneakers-the-rat
Collaborator

Alright my dogs, after a long delay I'm going to enter writing mode on this starting Wednesday. My goal (!!) is to shoot for a two-week writing sprint to get a draft in place, and I'm going to be doing additional demos and proofs of concept as I go.

What I want to shoot for:

  • Demo of custom schemas alongside NWB
  • FastAPI demo with pydantic models & NWB (sketch below)
  • SQL db dump with SQLModel (which will require an additional generator, but shouldn't be too tricky)

But as always, this is just what's going into my draft, so if y'all want to add other things then by all means. I'm going to be doing my own limited bit of history, but @cmungall and @oruebel, your input on the history of arrays in linked data, as well as the need for the hdmf-schema and hdmf, would be super valuable here :)
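
For the FastAPI piece, something like this minimal sketch is what I have in mind; the ElectricalSeries stand-in and the endpoint are hypothetical, not the real generated nwb-linkml classes, and it assumes numpydantic's pydantic JSON serialization:

```python
# Minimal sketch of the FastAPI demo idea (hypothetical names, not the
# real generated nwb-linkml classes): serve a pydantic model whose
# array field is validated and serialized via numpydantic.
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel
from numpydantic import NDArray, Shape

class ElectricalSeries(BaseModel):
    # stand-in for a generated core.ecephys.ElectricalSeries
    name: str
    rate: float
    data: NDArray[Shape["* time, * channels"], float]

app = FastAPI()

@app.get("/acquisition/{name}")
def get_series(name: str) -> ElectricalSeries:
    # a real demo would read from an NWB file; this returns dummy data
    return ElectricalSeries(name=name, rate=30_000.0,
                            data=np.zeros((10, 4)))
```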

Let's get this show on the road!!!!!

@rly
Collaborator Author

rly commented Nov 1, 2024

Great! November for me has unexpectedly filled up with grant writing, but I will try to help with writing and reviewing where I can.

@sneakers-the-rat
Collaborator

No problem. Reading back the initial issue, I realize as I am drafting that I am mixing in some of the NWB stuff just because it's part of the same thought to me. I figure, since we were planning on twinned pieces, that we can rearrange and split the words on the page once they are there if we want to, but I am starting with the stuff that would be for the perspective piece re: the schema-centric linkml-arrays approach to a data standard.
