Maybe improve our database #52

FynnBe · 2024-04-25T07:29:07Z

Our database is currently a series of files on S3.
So far this is sufficient, and the minio (Python) client (and our Client wrapper around it) allows for convenient access to and inspection of our database.
We might want to look into more standard approaches, so this issue serves as a place to take notes and eventually discuss them.

idea1: We could create an index DB for our collection:
https://aws.amazon.com/de/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/


FynnBe · 2024-04-25T07:52:16Z

SQLModel might be another fit for an index DB: https://github.com/tiangolo/sqlmodel

oeway · 2024-04-25T11:21:39Z

Databases are useful when we have complex queries, such as finding all the datasets linked to model A that also apply to model B. Right now it is enough to just use the S3 files: the summary files are essentially for search from the website, and we have a clear submission and publishing workflow in which each step is distinct. So we won't really need a dedicated database.

Separate files on S3 and a single database file are two different approaches. For now I would stick with S3, since it's much easier to make changes to individual files without impacting all the records, while editing database files is much less straightforward and requires more attention to backing up the database, migrating it, etc.

If we have both S3 files and a database, we are creating two sources of truth. If we end up needing a database, e.g. to create a hypha service for advanced model search, I would build the database on the fly from the S3 files and use S3 as the source of truth.
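
To make the "build it on the fly" idea concrete, here is a minimal sketch using only the standard library. It assumes each S3 object is a small JSON record; the field names (`id`, `type`, `linked_models`), the keys, and the helper names are hypothetical and not taken from our actual schema. The objects are passed in as `(key, bytes)` pairs, which in practice could be read with the minio client's `list_objects` and `get_object`.

```python
import json
import sqlite3
from typing import Iterable, List, Tuple

# Hypothetical stand-in for objects read from S3: key -> raw JSON bytes.
SAMPLE_OBJECTS = {
    "ds-1/rdf.json": b'{"id": "ds-1", "type": "dataset", "linked_models": ["model-a", "model-b"]}',
    "ds-2/rdf.json": b'{"id": "ds-2", "type": "dataset", "linked_models": ["model-a"]}',
    "model-a/rdf.json": b'{"id": "model-a", "type": "model"}',
}


def build_index(objects: Iterable[Tuple[str, bytes]]) -> sqlite3.Connection:
    """Build a throwaway in-memory index; the S3 files stay the source of truth."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE resources (key TEXT PRIMARY KEY, id TEXT, type TEXT)")
    con.execute("CREATE TABLE links (dataset_id TEXT, model_id TEXT)")
    for key, raw in objects:
        record = json.loads(raw)
        con.execute(
            "INSERT INTO resources VALUES (?, ?, ?)",
            (key, record["id"], record["type"]),
        )
        # Records without links (e.g. models) simply add no rows here.
        for model_id in record.get("linked_models", []):
            con.execute("INSERT INTO links VALUES (?, ?)", (record["id"], model_id))
    con.commit()
    return con


def datasets_linked_to_both(
    con: sqlite3.Connection, model_a: str, model_b: str
) -> List[str]:
    """Return ids of datasets linked to both models (two distinct model ids)."""
    rows = con.execute(
        "SELECT dataset_id FROM links WHERE model_id IN (?, ?) "
        "GROUP BY dataset_id HAVING COUNT(DISTINCT model_id) = 2 "
        "ORDER BY dataset_id",
        (model_a, model_b),
    ).fetchall()
    return [row[0] for row in rows]
```

Because the index lives in memory and is rebuilt from the files, there is nothing to back up or migrate, and a stale index can simply be discarded.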

Plus, we don't really need a dedicated database, since S3 also supports SQL syntax for searching over JSON files. See S3 Select: https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html
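
As a rough sketch of what an S3 Select query could look like through the minio Python client: the query below assumes each object is a single JSON document with top-level `id` and `type` fields, which is a hypothetical layout, and `select_json` is an illustrative helper name. It has not been tried against our bucket, and it requires the storage backend to support S3 Select.

```python
# Hypothetical query: assumes one JSON document per object with
# top-level "id" and "type" fields.
EXPRESSION = "SELECT s.id FROM S3Object s WHERE s.type = 'model'"


def select_json(client, bucket: str, key: str, expression: str = EXPRESSION) -> bytes:
    """Run an S3 Select query against one JSON object and return the raw result bytes.

    `client` is expected to be a `minio.Minio` instance.
    """
    # Imported lazily so this module loads even where minio is not installed.
    from minio.select import (
        JSONInputSerialization,
        JSONOutputSerialization,
        SelectRequest,
    )

    request = SelectRequest(
        expression,
        JSONInputSerialization(json_type="DOCUMENT"),
        JSONOutputSerialization(),
        request_progress=False,
    )
    # The reader streams the result in chunks; join them into one payload.
    with client.select_object_content(bucket, key, request) as result:
        return b"".join(result.stream())
```

Note that an S3 Select query runs against a single object, so searching across many per-resource files would still mean one call per file or querying a summary file.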

FynnBe · 2024-04-25T11:41:20Z

Creating an "index database" on the fly was more what I had in mind... but this should be left for future optimization in any case.
I suppose in the long term we could replace the collection.json with such a light database, but we should be fine with the JSON for quite some time 👍
