RFE: One file/one jam paradigm not suited for large datasets #86

hendriks73 · 2015-11-06T09:35:57Z

When dealing with large datasets containing hundreds of thousands of files, the one file/one jam paradigm becomes a burden.

Filesystems like ext2 aren't made for this.
Because all of those files are relatively small, internal fragmentation wastes disk space (e.g. on my Mac, a 994-byte file uses 4 KB of disk space).
Adding that many files to GitHub is questionable.

Possible solutions:

1. Create a jam archive format, similar to jar, zip or .tar.bz2 (well supported by Python). This would let us keep the one file/one jam paradigm without polluting the filesystem.
2. Redefine the jam file format so that it can hold data for multiple files. This would also let us reuse annotation_metadata across multiple files and be less repetitive, if desired.

It would be nice if, whatever format is chosen, jams could read it right away, without having to manually split it into a gazillion parts.
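To make the archive option concrete, here is a minimal sketch using only Python's standard `zipfile` and `json` modules. The helper names (`write_archive`, `read_member`), the member naming scheme, and the toy payloads are all hypothetical, and plain `json.load` stands in for whatever loader jams would actually use.

```python
import io
import json
import zipfile


def write_archive(fileobj, jams_by_name):
    """Pack {member name: JSON-serializable dict} into a single zip archive."""
    with zipfile.ZipFile(fileobj, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, payload in jams_by_name.items():
            zf.writestr(name, json.dumps(payload))


def read_member(fileobj, name):
    """Load one member without extracting the rest of the archive to disk."""
    with zipfile.ZipFile(fileobj, mode="r") as zf:
        with zf.open(name) as member:
            # `member` is a file-like object; json.load stands in for a JAMS loader.
            return json.load(member)


# Toy collection (hypothetical content, not a valid JAMS document).
collection = {
    "track_%04d.jams" % i: {"file_metadata": {"title": "track %d" % i}, "annotations": []}
    for i in range(3)
}

buffer = io.BytesIO()  # an in-memory stand-in for an archive file on disk
write_archive(buffer, collection)
buffer.seek(0)

one = read_member(buffer, "track_0001.jams")
```

Each jam stays a separate archive member, so the one file/one jam layout is preserved logically while the filesystem only sees one file.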

bmcfee · 2015-11-06T13:12:25Z

This is quasi-redundant with #40, but different enough that we can talk about multi-file issues specifically here. (#40 was concerned more with intersecting collections against other files on disk, e.g. audio.)

On the one hand, I agree that having zillions of jams files on disk is undesirable for all the reasons @hendriks73 mentioned. Clearly, some kind of higher-level container is necessary. (FWIW, I think @ejhumphrey has tinkered with this idea offline.)

On the other hand, I don't think encoding entire collections within a single json object is feasible, since it would require the entire collection to be read and deserialized before a single object can be retrieved. For collections of the size you mention, this would most likely choke all json parsers. It might be possible to swap out the backend to a more incremental storage format (protobuf maybe?), but that seems pretty heavy-handed as well.

For now, I think the best approach to managing large collections of jams is to use a general-purpose container/database (e.g. mongo or postgres). Since jams 0.2 can now (de)serialize from file-like objects (and not just files), there's no technical barrier to loading a jams object out of a database/web socket/whatever.
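As a rough illustration of the database route, the sketch below stores one JSON document per track and hands it back as a file-like object. It uses the standard library's `sqlite3` purely as a stand-in for mongo or postgres; the table name, column names, and helper functions are hypothetical. `json.load` is used so the sketch runs without jams installed; given the file-like support described above, the same handle could presumably be passed to the jams loader instead.

```python
import io
import json
import sqlite3

# In-memory database as a stand-in for a real server (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jams_store (track_id TEXT PRIMARY KEY, body TEXT NOT NULL)"
)


def put(conn, track_id, payload):
    """Serialize one document and store it under its track identifier."""
    conn.execute(
        "INSERT OR REPLACE INTO jams_store (track_id, body) VALUES (?, ?)",
        (track_id, json.dumps(payload)),
    )
    conn.commit()


def get_handle(conn, track_id):
    """Return the stored document as a file-like object, not a path on disk."""
    row = conn.execute(
        "SELECT body FROM jams_store WHERE track_id = ?", (track_id,)
    ).fetchone()
    if row is None:
        raise KeyError(track_id)
    return io.StringIO(row[0])


# Toy payload (hypothetical content, not a valid JAMS document).
put(conn, "track-42", {"file_metadata": {"duration": 12.5}, "annotations": []})

handle = get_handle(conn, "track-42")
loaded = json.load(handle)  # only this one record is read and deserialized
```

Only the requested record is deserialized, which sidesteps the problem of parsing an entire collection to retrieve a single object.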

As for point 2 above: I think it's better to err on the side of redundancy in metadata, rather than minimality.

bmcfee · 2016-02-01T18:31:14Z

Update: see #97

bmcfee · 2019-08-12T16:38:05Z

I know it's only been 4 years here 🙄 but this is still an issue.

The current plan, time permitting, is to refactor the schema (#92) so that the individual classes can be validated independently. This would allow us to store JAMSy objects in a key-value store (e.g. mongodb) and validate each one on its own. What we currently think of as a JAMS object would then be a view of a particular query against that KV store, which aggregates all information relating to a specific track identifier.

Aside from refactoring the schema, this will require a bit more code to interact with a backend database.
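A toy sketch of that idea, under heavy assumptions: an in-memory dict stands in for the KV store, the `TrackStore` class and its methods are hypothetical, and the "validation" is a placeholder type check rather than the real schema. It only shows the shape of the plan, where objects are stored and checked one at a time and a JAMS-like document is assembled on demand per track identifier.

```python
from collections import defaultdict


class TrackStore:
    """In-memory stand-in for a key-value backend (hypothetical design)."""

    KINDS = ("file_metadata", "annotation")

    def __init__(self):
        self._objects = {}                   # key -> stored record
        self._by_track = defaultdict(list)   # track_id -> keys, in insertion order

    def put(self, key, track_id, kind, body):
        """Validate a single object on its own, then store it."""
        if kind not in self.KINDS:
            raise ValueError("unknown kind: %r" % (kind,))
        if not isinstance(body, dict):
            # Placeholder for per-class schema validation.
            raise TypeError("body must be a dict")
        if key in self._objects:
            raise KeyError("duplicate key: %r" % (key,))
        self._objects[key] = {"track_id": track_id, "kind": kind, "body": body}
        self._by_track[track_id].append(key)

    def view(self, track_id):
        """Aggregate everything stored for one track into a JAMS-like dict."""
        result = {"file_metadata": {}, "annotations": []}
        for key in self._by_track.get(track_id, []):
            record = self._objects[key]
            if record["kind"] == "file_metadata":
                result["file_metadata"].update(record["body"])
            else:
                result["annotations"].append(record["body"])
        return result


store = TrackStore()
store.put("m1", "track-1", "file_metadata", {"title": "Example"})
store.put("a1", "track-1", "annotation", {"namespace": "beat", "data": []})
store.put("a2", "track-1", "annotation", {"namespace": "chord", "data": []})
store.put("a3", "track-2", "annotation", {"namespace": "beat", "data": []})

track_view = store.view("track-1")
```

Here `track_view` is never stored as a document itself; it is recomputed from the individual objects each time it is requested.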

bmcfee added the question label Nov 6, 2015

bmcfee added the interoperability and schema labels and removed the question label Aug 12, 2019

bmcfee added this to the 0.4.0 milestone Aug 12, 2019

bmcfee mentioned this issue Apr 22, 2020

Next generation jams #208

Open
