Restructuring the repo #200
Comments
The DB on the GUI server is updated by a dedicated script, which is now run by cron.
Having a submodule
If we move Data to another repository, we will also have to make some changes in the GUI part, where paths of Git-stored files are recalculated at the client level (when plotting graphs, for example). But we can schedule that with @fsuarezleston. We can first copy Data to the other repo, then wait for the fixes, and only then remove Data from the main repo.
Having Data in the cloud is also a temporary solution. I don't see much difference between GitHub or GitLab versioning and DVC, but I have never used DVC, so I don't have a very strong opinion. I suppose the solid solution is to store the data in a database and access it via some service. I suppose repositories like PDB are organized like that; no one stores this kind of data just as versioned files. But that is too hard for a pet project and should be outsourced to engineers after getting some funding for the entire project. Removing data from the code repository is in any case an inevitable step towards the solid solution.
One thing we should agree on: where do we store auxiliary files like mappings and UA-jsons? I suppose they should be in the code part. As for the info files, I would just remove them; I don't see any benefit in storing them. |
Just out of curiosity: since the DB for the GUI is hosted somewhere, can't we also store the data on the same server?
Same question as above: as we already have the DB that serves the GUI, can't we simply build a CLI/API around it? |
It's not a problem to host a database somewhere, including on this server. The problem is that this database was built just for GUI presentation. It doesn't contain complete data from There I also put the DB schema in the SQL file, so you can see how the tables are organized. The current database also doesn't support versioning. So it's quite a lot of work.
Everything is possible, but it's quite a lot of work. The Data should be removed from the scripts anyway. If we agree on that, let's do at least that. Then we can plan the next steps. |
I rechecked the schema. In principle, it already contains all the metadata on the computed JSON data. What it doesn't store is the JSON data files themselves. These could be stored in serialized form in a separate table (or even a separate database) in a BLOB field. There is also MongoDB, which stores JSON documents as BSON and returns them very quickly because it is specifically designed for that. We can use the same strategy to mine the analysis data and then add a layer that accepts JSONs into the database (in GitHub Actions or in a server-side cron script). We can occasionally synchronize this database with the GitHub repository in either direction. In the end, we can get rid of the GitHub data repository altogether. This strategy allows us to migrate gradually, without refactoring the whole project. In any case, we can separate the Data as the first step. |
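To make the BLOB idea above concrete, here is a minimal sketch in Python. It uses the standard-library `sqlite3` module purely for illustration; it is not the GUI database and it is not MongoDB. The table name `json_blobs`, the columns `system_id` and `name`, and the helper functions are all hypothetical and do not come from the existing schema.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open a database with one table that holds serialized JSON files."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS json_blobs (
            system_id INTEGER NOT NULL,  -- hypothetical key into the metadata tables
            name      TEXT    NOT NULL,  -- hypothetical logical file name
            payload   BLOB    NOT NULL,  -- UTF-8 encoded JSON text
            PRIMARY KEY (system_id, name)
        )
        """
    )
    return conn


def put_json(conn, system_id, name, data):
    """Serialize `data` to JSON and insert or overwrite the matching row."""
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO json_blobs (system_id, name, payload) "
            "VALUES (?, ?, ?)",
            (system_id, name, payload),
        )


def get_json(conn, system_id, name):
    """Return the stored object, or None if no such row exists."""
    row = conn.execute(
        "SELECT payload FROM json_blobs WHERE system_id = ? AND name = ?",
        (system_id, name),
    ).fetchone()
    return None if row is None else json.loads(row[0].decode("utf-8"))


if __name__ == "__main__":
    conn = open_store()
    put_json(conn, 1, "example.json", {"values": [0.1, 0.2]})
    print(get_json(conn, 1, "example.json"))
```

A layer like `put_json` is the kind of thing a GitHub Action or a server-side cron script could call for each JSON file. Versioning is not covered here and would need an extra column or table.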
It makes sense to initially create a Just for clarity, whenever I mention GUI I have this in mind. Let's plan the next steps after separating the |
What if we move the code to another repo and keep the data in the current repo? My feeling is that the GUI would then not need updates, because it is just plotting the data. Or does it also use some of the code? Regarding alternative storage for the data, I think it should be a solution that stays stable independently of any of us or any other individual (such as the git+Zenodo version). For example, the current GUI is available only as long as someone pays for the server and the company running it remains active. For me, the simplest next step would be separation into two git repositories. However, I understand that git may not be the best format for data in the long run. |
If we create a submodule, then the Data repository should contain everything inside current I will separate out the submodule in my local version to check how it works. You will be able to clone my version and play with it. |
I have done it clearly within our local GitLab. Please have a look: It's now 55M. Data is a submodule. It's 370M. Histories are fully separated. I did that using You can clone and then do
Or you can clone with the flag:
This is based on my current development branch, not the main branch, but that doesn't matter for viewing. |
Originally posted by @pbuslaev in #201 (comment) After we separate |
Hi all,
This issue was brought up by @comcon1 in Issue #195 and has also been on my mind for a while.
Currently we track and version both the code (in the Scripts directory) and the output (in the Data directory) together. This could make tracing the history rather difficult in the future and could cause headaches with merge requests. Hence, I'd like to discuss how we can restructure the Databank so that we can version and track the code and data separately.
One potential solution is to use Git submodules, as @comcon1 suggested. I'm not informed enough about what submodules can do.
Another potential solution is to move the data to cloud storage and version it with a data versioning tool like DVC. The data and code would then be hosted in different places, which might be a bit more difficult to maintain but is a neater solution.
Since we also have the GUI, we should be careful not to break its integration. Unfortunately, I'm not well aware of how the GUI integration works, so I would also like to gather some information about it here.
@markussmiettinen, @ohsOllila, any suggestions?