
Maybe improve our database #52

Open
FynnBe opened this issue Apr 25, 2024 · 3 comments

Comments

@FynnBe
Member

FynnBe commented Apr 25, 2024

Our database is now a series of files on S3.
So far this is sufficient: the minio (Python) client, together with our Client wrapper around it, allows convenient access to and inspection of our database.
We might want to look into more standard approaches eventually, so this issue serves as a place to take notes and discuss them.
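
For reference, a minimal sketch of what "files on S3 as a database" access can look like with the minio Python client. The helper name `iter_json_records`, the bucket name, the prefix, and the `.json` filter are assumptions for illustration; this is not our actual Client wrapper.

```python
import json
from typing import Any, Dict, Iterator, Tuple


def iter_json_records(
    client: Any, bucket: str, prefix: str = ""
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (object name, parsed JSON) for every .json object under prefix.

    `client` is expected to behave like `minio.Minio`: it needs
    `list_objects` and `get_object`.
    """
    for obj in client.list_objects(bucket, prefix=prefix, recursive=True):
        if not obj.object_name.endswith(".json"):
            continue  # skip covers, weights and other non-JSON files
        response = client.get_object(bucket, obj.object_name)
        try:
            yield obj.object_name, json.loads(response.read())
        finally:
            # The minio client asks callers to close and release the connection.
            response.close()
            response.release_conn()


# Hypothetical usage (endpoint, credentials and bucket are placeholders):
# from minio import Minio
# client = Minio("s3.example.org", access_key="...", secret_key="...")
# for name, record in iter_json_records(client, "my-bucket", "records/"):
#     print(name, sorted(record))
```

Every "query" here is a listing plus one download per file, which is fine for inspection but is the part a real index would eventually speed up.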

@FynnBe
Member Author

FynnBe commented Apr 25, 2024

SQLModel might be another option for an index db: https://github.com/tiangolo/sqlmodel

@oeway
Contributor

oeway commented Apr 25, 2024

Databases are useful when we have complex queries, such as finding all datasets linked to model A that are also linked to model B. Right now the S3 files are enough: the summary files are essentially there for search from the website, and we have a clear submission and publishing workflow in which each step is distinct. So we don't really need a dedicated database.

Separate files on S3 and a single database file are two different approaches. For now I would stick with S3, since it's much easier to change individual files without impacting all the records, while editing database files is much less straightforward and requires more attention to backing up the database, migrating it, etc.

If we have both S3 files and a database, we create two sources of truth. If we end up needing a database, e.g. to create a hypha service for advanced model search, I would build the database on the fly from the S3 files and keep S3 as the source of truth.
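
A rough sketch of that "build it on the fly" idea, using an in-memory SQLite database from the Python standard library. The record fields (`id`, `type`, `linked_models`) and the object keys are made up for illustration and do not reflect our actual file layout; the point is only that the index is disposable and S3 stays authoritative.

```python
import json
import sqlite3
from typing import Iterable, List, Tuple


def build_index(records: Iterable[Tuple[str, str]]) -> sqlite3.Connection:
    """Build a throwaway in-memory index from (S3 key, JSON text) pairs."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE resource (id TEXT PRIMARY KEY, type TEXT, s3_key TEXT)")
    con.execute("CREATE TABLE link (dataset_id TEXT, model_id TEXT)")
    for key, text in records:
        doc = json.loads(text)
        con.execute(
            "INSERT INTO resource VALUES (?, ?, ?)", (doc["id"], doc["type"], key)
        )
        # "linked_models" is a hypothetical field name.
        for model_id in doc.get("linked_models", []):
            con.execute("INSERT INTO link VALUES (?, ?)", (doc["id"], model_id))
    con.commit()
    return con


def datasets_linked_to_both(
    con: sqlite3.Connection, model_a: str, model_b: str
) -> List[str]:
    """Return ids of datasets linked to model_a and also to model_b."""
    rows = con.execute(
        "SELECT dataset_id FROM link WHERE model_id = ? "
        "INTERSECT "
        "SELECT dataset_id FROM link WHERE model_id = ? "
        "ORDER BY 1",
        (model_a, model_b),
    )
    return [row[0] for row in rows]


# Stand-in for files fetched from S3.
sample = [
    ("ds-1/rdf.json", json.dumps(
        {"id": "ds-1", "type": "dataset", "linked_models": ["model-a", "model-b"]})),
    ("ds-2/rdf.json", json.dumps(
        {"id": "ds-2", "type": "dataset", "linked_models": ["model-a"]})),
    ("model-a/rdf.json", json.dumps({"id": "model-a", "type": "model"})),
]
index = build_index(sample)
print(datasets_linked_to_both(index, "model-a", "model-b"))  # ['ds-1']
```

Because the index is rebuilt from the files, there is nothing to back up or migrate: if it gets out of date, it can be dropped and regenerated.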

Plus, we don't really need a dedicated database, since S3 also supports SQL syntax for searching over JSON files. See S3 Select: https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html
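
To give an idea of what such a request looks like, here is a hedged sketch that assembles the parameters for boto3's `select_object_content` (boto3 is used here only for illustration, not the minio client we use today). The bucket name, the object key, and the assumption that `collection.json` holds a top-level `collection` array with `id` and `type` fields are all hypothetical, and whether a given S3 endpoint supports S3 Select would need to be checked.

```python
from typing import Any, Dict


def s3_select_request(bucket: str, key: str, expression: str) -> Dict[str, Any]:
    """Assemble keyword arguments for boto3's S3 `select_object_content`."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # The object is one JSON document (as opposed to JSON Lines).
        "InputSerialization": {"JSON": {"Type": "DOCUMENT"}},
        # Emit one JSON record per line.
        "OutputSerialization": {"JSON": {"RecordDelimiter": "\n"}},
    }


request = s3_select_request(
    "my-bucket",        # placeholder bucket name
    "collection.json",  # assumed layout: {"collection": [{"id": ..., "type": ...}]}
    "SELECT s.id FROM S3Object[*].collection[*] s WHERE s.type = 'model'",
)

# Hypothetical usage, requires credentials and an endpoint with S3 Select:
# import boto3
# response = boto3.client("s3").select_object_content(**request)
# for event in response["Payload"]:
#     if "Records" in event:
#         print(event["Records"]["Payload"].decode())
```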

@FynnBe
Member Author

FynnBe commented Apr 25, 2024

Creating an "index database" on the fly was more what I had in mind... but this should be left for future optimization in any case.
I suppose in the long term we could replace the collection.json with such a lightweight database, but we should be fine with the JSON for quite some time 👍
