RFE: One file/one jam paradigm not suited for large datasets #86

Open
hendriks73 opened this issue Nov 6, 2015 · 3 comments
Labels
interoperability (Making JAMS play nice with other packages), schema (Issues pertaining to schema definitions)

@hendriks73
Contributor

When dealing with large datasets containing hundreds of thousands of files, the one file/one jam paradigm becomes a burden.

  1. Filesystems like ext2 aren't designed to hold that many small files.
  2. With all of those files being relatively small, internal fragmentation leads to wasted disk space (e.g. on my Mac, a 994-byte file uses 4 KB of disk space).
  3. Adding that many files to GitHub is questionable.
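
To put rough numbers on point 2, here is a minimal sketch of the block-rounding arithmetic. The 4096-byte block size matches the 4 KB figure quoted above; the count of 100,000 files and the assumption that every file is 994 bytes are purely illustrative, not measurements from a real dataset.

```python
def allocated_bytes(file_size: int, block_size: int = 4096) -> int:
    """Space a file occupies when the filesystem allocates whole blocks."""
    # Ceiling division: round the size up to the next multiple of block_size.
    return -(-file_size // block_size) * block_size


file_size = 994      # bytes, the example size from the issue
n_files = 100_000    # hypothetical collection size (assumption)

used = allocated_bytes(file_size)   # 4096
wasted = used - file_size           # 3102 bytes lost per file
total_wasted = wasted * n_files     # 310,200,000 bytes, roughly 296 MiB

print(f"{wasted} bytes wasted per file, {total_wasted / 2**20:.0f} MiB in total")
```

Under these assumptions about three quarters of each allocated block is slack, which is why packing many small files into one container can save space.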

Possible solutions:

  1. Create a jam archive format, similar to jar, zip or .tar.bz2 (well supported by Python). This would let us keep the one file/one jam paradigm without polluting the filesystem.
  2. Redefine the jam file format so that it can hold data for multiple files. This would also allow us to re-use annotation_metadata for multiple files and not be so repetitive, if desired.
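
Option 1 could look roughly like the sketch below, which uses only Python's standard `zipfile` and `json` modules. The helper names (`write_archive`, `read_member`), the member names, and the placeholder documents are all hypothetical; the dicts are stand-ins and not schema-valid JAMS files. No archive format is actually defined by the project here.

```python
import io
import json
import zipfile


def write_archive(fileobj, docs_by_name):
    """Pack one JSON document per archive member (one member per jam)."""
    with zipfile.ZipFile(fileobj, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, doc in docs_by_name.items():
            zf.writestr(name, json.dumps(doc))


def read_member(fileobj, name):
    """Load a single member without extracting the rest of the archive."""
    with zipfile.ZipFile(fileobj, mode="r") as zf:
        with zf.open(name) as member:
            # `member` is a file-like object; a loader that accepts
            # file-like objects could be handed this stream directly.
            return json.load(member)


# Placeholder documents (assumption: not valid JAMS, just JSON stand-ins).
docs = {
    "track_0001.jams": {"file_metadata": {"title": "Track 1"}, "annotations": []},
    "track_0002.jams": {"file_metadata": {"title": "Track 2"}, "annotations": []},
}

archive = io.BytesIO()          # in-memory stand-in for a file on disk
write_archive(archive, docs)
doc = read_member(archive, "track_0002.jams")
print(doc["file_metadata"]["title"])
```

A zip archive keeps a central directory, so a single member can be located and decompressed on its own. A `.tar.bz2` archive is compressed as one stream, so random access to one member would generally be slower.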

It would be nice if, whatever format is chosen, it were readable by jams right away, without having to manually split it into a gazillion parts.

@bmcfee bmcfee added the question label Nov 6, 2015
@bmcfee
Contributor

bmcfee commented Nov 6, 2015

This is quasi-redundant with #40, but different enough that we can talk about multi-file issues specifically here. (#40 was concerned more with intersecting collections against other files on disk, e.g. audio.)

On the one hand, I agree that having zillions of jams files on disk is undesirable for all the reasons @hendriks73 mentioned. Clearly, some kind of higher-level container is necessary. (FWIW, I think @ejhumphrey has tinkered with this idea offline.)

On the other hand, I don't think encoding entire collections within a single json object is feasible, since it would require the entire collection to be read and deserialized before a single object can be retrieved. For collections of the size you mention, this would most likely choke all json parsers. It might be possible to swap out the backend to a more incremental storage format (protobuf maybe?), but that seems pretty heavy-handed as well.

For now, I think the best approach to managing large collections of jams is to use a general-purpose container/database (e.g. MongoDB or PostgreSQL). Since jams 0.2 can now (de)serialize from file-like objects (and not just files), there's no technical barrier to loading a jams object out of a database/web socket/whatever.
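
As a rough illustration of the database route, the sketch below stores one serialized document per row and hands it back as a file-like object. It swaps in the standard-library `sqlite3` module in place of MongoDB or PostgreSQL purely so the example is self-contained. The table layout, the `store_doc`/`open_doc` helpers, and the placeholder documents are assumptions, and plain `json.load` stands in for a loader that accepts file-like objects.

```python
import io
import json
import sqlite3

# In-memory database as a stand-in for a real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jams (track_id TEXT PRIMARY KEY, body TEXT NOT NULL)")


def store_doc(conn, track_id, doc):
    """Serialize one document and store it under its track identifier."""
    conn.execute(
        "INSERT OR REPLACE INTO jams (track_id, body) VALUES (?, ?)",
        (track_id, json.dumps(doc)),
    )


def open_doc(conn, track_id):
    """Return the stored document as a text stream (file-like object)."""
    row = conn.execute(
        "SELECT body FROM jams WHERE track_id = ?", (track_id,)
    ).fetchone()
    if row is None:
        raise KeyError(track_id)
    return io.StringIO(row[0])


# Placeholder documents, not schema-valid JAMS.
store_doc(conn, "track_0001", {"file_metadata": {"title": "Track 1"}, "annotations": []})
store_doc(conn, "track_0002", {"file_metadata": {"title": "Track 2"}, "annotations": []})

# Only the requested row is read and deserialized.
loaded = json.load(open_doc(conn, "track_0002"))
print(loaded["file_metadata"]["title"])
```

The point of the design is that lookup is delegated to the database index, so retrieving one track does not require parsing the rest of the collection.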

As for point 2 above: I think it's better to err on the side of redundancy in metadata, rather than minimality.

@bmcfee
Contributor

bmcfee commented Feb 1, 2016

Update: see #97

@bmcfee bmcfee added the interoperability and schema labels and removed the question label Aug 12, 2019
@bmcfee
Contributor

bmcfee commented Aug 12, 2019

I know it's only been 4 years here 🙄 but this is still an issue.

The current plan, time permitting, is to refactor the schema (#92) so that the individual classes can be validated independently. This would allow us to store JAMSy objects in a key-value store (e.g. MongoDB) and validate each one on its own. What we currently think of as a JAMS object would then be a view of a particular query against that KV store, one that aggregates all information relating to a specific track identifier.
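
A toy sketch of that idea, under loud assumptions: a plain dict stands in for the key-value store, `REQUIRED_KEYS` and `validate_annotation` are hypothetical placeholders for per-class schema validation, and the record fields are invented for illustration. None of this reflects the actual refactored schema or any existing JAMS API.

```python
# Hypothetical minimal "schema" for one independently stored record.
REQUIRED_KEYS = {"track_id", "namespace", "data"}


def validate_annotation(record):
    """Validate a single record on its own, without the rest of the collection."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")


class AnnotationStore:
    """Dict-backed stand-in for a key-value store such as MongoDB."""

    def __init__(self):
        self._records = {}

    def put(self, key, record):
        validate_annotation(record)  # each object is checked independently
        self._records[key] = record

    def view(self, track_id):
        """Aggregate everything stored for one track into a JAMS-like view."""
        annotations = [
            r for r in self._records.values() if r["track_id"] == track_id
        ]
        return {"track_id": track_id, "annotations": annotations}


store = AnnotationStore()
store.put("a1", {"track_id": "track_A", "namespace": "beat", "data": [0.5, 1.0]})
store.put("a2", {"track_id": "track_A", "namespace": "chord", "data": ["C:maj"]})
store.put("b1", {"track_id": "track_B", "namespace": "beat", "data": [0.4]})

jam_view = store.view("track_A")
print([a["namespace"] for a in jam_view["annotations"]])
```

A real backend would answer `view` with an indexed query on the track identifier instead of scanning every record.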

Aside from refactoring the schema, this will require a bit more code to interact with a backend database.

@bmcfee bmcfee added this to the 0.4.0 milestone Aug 12, 2019