Restructuring the repo #200

Open
batukav opened this issue Aug 16, 2024 · 9 comments

@batukav
Collaborator

batukav commented Aug 16, 2024

Hi all,

This issue was brought up by @comcon1 in Issue #195 and has also been on my mind for a while.

Currently we track and version both the code (in the Scripts directory) and the output (in the Data directory) together. This could make tracing the history rather difficult in the future and could cause headaches with merge requests. Hence, I'd like to discuss how we can restructure the Databank so that the code and the data are versioned and tracked separately.

One potential solution is to use Git submodules, as @comcon1 suggested. I don't know enough about what submodules can do.

Another potential solution is to move the data to cloud storage and version it with a data versioning tool such as DVC. The data and the code would then be hosted in different places, which might be a bit more difficult to maintain but is a neater solution.

Since we also have the GUI, we should be careful not to break its integration. Unfortunately, I don't know well how the GUI integration works, so I would also like to gather some information about it here.

@markussmiettinen, @ohsOllila any suggestions?

@comcon1
Contributor

comcon1 commented Aug 16, 2024

The DB on the GUI server is updated by a dedicated script, which is currently run by cron. With Data as a submodule, you can update the Data part independently with git submodule update, and the folder structure will stay the same. On the server side we would then only need cron to update the Data part, whereas the Scripts part could be updated manually (even better). You can make a toy repository to check how it works.
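As a rough illustration of the cron idea, the server-side job could be a small Python wrapper around git. This is only a sketch under assumptions: the function names and the repository location are hypothetical, the submodule is assumed to live at Data, and `--remote` (which moves the submodule to the tip of its tracked branch rather than to the commit pinned in the main repo) is one possible policy, not something agreed in this thread.

```python
import subprocess


def submodule_update_command(repo_dir, submodule="Data"):
    """Build the git command that refreshes a single submodule.

    `repo_dir` is the checkout of the main repository (hypothetical path);
    `submodule` is the submodule path inside it, assumed to be "Data".
    """
    return [
        "git", "-C", str(repo_dir),
        "submodule", "update", "--init", "--remote", "--", submodule,
    ]


def update_data(repo_dir, submodule="Data"):
    """Run the update and raise CalledProcessError if git fails."""
    subprocess.run(submodule_update_command(repo_dir, submodule), check=True)
```

A cron entry would then call `update_data` with the server's checkout path, while the Scripts part stays at whatever commit was deployed manually.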

If we move Data to another repository, we will also have to make some changes in the GUI part, where the paths of Git-stored files are recalculated at the client level (when plotting graphs, for example). But we can schedule that with @fsuarezleston. We can first copy Data to the other repo, then wait for the fixes, and only then remove Data from the main repo.

Having Data in the cloud is also a temporary solution. I don't see much difference between GitHub or GitLab versioning and DVC, but I have never used DVC, so I don't have a very strong opinion. I suppose the solid solution is to store the data in a database and access it via some service. I suppose repositories like PDB are organized like that. No one stores this kind of data just as versioned files. But that is too hard for a pet project; it should be outsourced to engineers after getting some funding for the entire project.

But removing the data from the code repository is in any case an inevitable step on the way to the solid solution.

One thing we should agree on is where to store auxiliary files like mappings and UA JSONs. I suppose they should be in the code part. As for the info files, I would just remove them; I don't see any benefit in storing them.

@batukav
Collaborator Author

batukav commented Aug 18, 2024

Just out of curiosity: since the DB for the GUI is hosted somewhere, can't we also store the data on the same server?

But I suppose that solid solution - is to store data in the database and access it via some service. I suppose, repositories like PDB are organized like that. Noone stores this kind of data just as versioned files. But that is too hard for a pet-project - that should be outsourced to engineers after getting some funding for the entire project.

Same question as above: since we already have the DB that serves the GUI, can't we simply build a CLI/API around it?

@comcon1
Contributor

comcon1 commented Aug 18, 2024

It's not a problem to host a database somewhere, including on this server. The problem is that this database was built just for GUI presentation. It doesn't contain the complete data from the Data/simulations subfolder. It contains some links, and when graphs are plotted, the data are downloaded on the JS side directly from GitHub. You can see everything connecting the Databank and the GUI part here: https://github.com/comcon1/Databank/tree/modularize-r2/Scripts/updateGUI
It's a branch I'm currently working on, but the GUI part is almost unchanged.

There I also put the DB schema in an SQL file, so you can see how the tables are organized.

The current database also doesn't support versioning. So it's quite a lot of work

  • to build this DB
  • to build a CLI around it
  • to make the NMRlipids scripts work with it
  • to make the GUI work with it
  • to set up backups of the DB with a third party.

Everything is possible, but it's quite a lot of work. The Data should be removed from the scripts in any case. If we agree on that, let's at least do it. Then we can plan the next steps.

@comcon1
Contributor

comcon1 commented Aug 20, 2024

I rechecked the schema. In principle, it already contains all the metadata on the computed JSON data. What it doesn't store is the JSON data files themselves. They could be stored in serialized form in a separate table (or even a separate database) in a BLOB field. There is also MongoDB, which stores JSON documents as BSON and returns them very quickly because it is specifically designed for that.

We can use the same strategy to mine the analysis data and then add a layer that accepts JSONs into the database (in GitHub Actions or in a server-side cron script). We can occasionally synchronize this database with the GitHub repository in either direction. In the end we can get rid of the GitHub data repository altogether. This strategy allows us to move to it gradually without refactoring the whole project. In any case, we can separate the Data as the first step.
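To make the BLOB idea concrete, here is a minimal sketch of storing and reading back a serialized JSON file. It uses Python's built-in sqlite3 purely as a stand-in for whatever database engine the GUI actually uses, and the table name, column names, and example values are all hypothetical; the real schema is in the SQL file mentioned above and is not reproduced here.

```python
import json
import sqlite3

# Hypothetical table layout, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS json_payload (
    simulation_id INTEGER NOT NULL,
    name          TEXT    NOT NULL,
    payload       BLOB    NOT NULL,
    PRIMARY KEY (simulation_id, name)
)
"""


def store_json(conn, simulation_id, name, data):
    """Serialize `data` to UTF-8 JSON and insert or overwrite it as a BLOB."""
    blob = json.dumps(data, sort_keys=True).encode("utf-8")
    conn.execute(
        "INSERT OR REPLACE INTO json_payload (simulation_id, name, payload) "
        "VALUES (?, ?, ?)",
        (simulation_id, name, blob),
    )
    conn.commit()


def load_json(conn, simulation_id, name):
    """Return the decoded JSON object, or None if no matching row exists."""
    row = conn.execute(
        "SELECT payload FROM json_payload WHERE simulation_id = ? AND name = ?",
        (simulation_id, name),
    ).fetchone()
    return None if row is None else json.loads(row[0].decode("utf-8"))


# Round trip with made-up values in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
store_json(conn, 1, "example.json", {"0": 0.62, "10": 0.63})
print(load_json(conn, 1, "example.json"))
```

The "layer that accepts JSONs" would essentially call something like `store_json` for each computed file, whether from a CI job or from the server-side cron script.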

@batukav
Collaborator Author

batukav commented Aug 20, 2024

It makes sense to start by creating a submodule for the Data. If nobody objects, I'm okay with doing this and can handle it in the upcoming days.

Just for clarity, whenever I mention GUI I have this in mind.

Let's plan the next steps after separating the Data and updating everything with the results from Issue #195

@ohsOllila
Member

If we move Data to another repository, we will also have to make some changes in GUI part, when they recalculate paths of Git-stored files at client-level (when plotting graphs, for example).

What if we move the code to another repo and keep the data in the current repo? My feeling is that the GUI would then not need updates, because it is just plotting the data. Or does it also use some of the code?

Regarding alternative storage for the data, I think it should be a solution that stays stable independently of any of us or of any other individual person (such as the git+Zenodo version). For example, the current GUI is available only as long as someone pays for the server and the company running it remains active. For me the simplest next step would be a separation into two git repositories. However, I understand that git may not be the best format for data in the long run.

@comcon1
Contributor

comcon1 commented Aug 20, 2024

If we create a submodule, the Data repository should contain everything inside the current Data folder, so the paths will be broken anyway and will need to be fixed in the configs of the GUI code. If the path is hardcoded in the GUI, that is bad in any case and should be fixed.

I will split out the submodule in my local version to check how it works. You will be able to clone my version and play with it.
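One way to avoid hardcoded paths on the Python side is to resolve the Data location in a single place. The sketch below is only an illustration: the environment variable name and both function names are hypothetical, and the GUI's own client-side code would need an equivalent setting in its configs.

```python
import os
from pathlib import Path

# Hypothetical variable name, used only for this illustration.
ENV_VAR = "DATABANK_DATA_ROOT"


def data_root(repo_root):
    """Return the Data directory: the env override if set, else <repo_root>/Data."""
    override = os.environ.get(ENV_VAR)
    return Path(override) if override else Path(repo_root) / "Data"


def simulation_file(repo_root, relative_path):
    """Build the path of a file below the Data/simulations subfolder."""
    return data_root(repo_root) / "simulations" / relative_path
```

With this, a checkout that has Data as a submodule works unchanged, and a deployment that keeps Data elsewhere only has to set the variable.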

@comcon1
Contributor

comcon1 commented Aug 21, 2024

I have done a clean split in our local GitLab. Please have a look:
https://git.app.uib.no/Aleksei.Nesterenko/Databank

The main repository is now 55M. Data is a submodule of 370M. The histories are fully separated. I did that with the git-filter-repo utility, which rewrites history.

You can clone and then do

git submodule update --init --recursive

Or you can clone with the flag:

git clone --recurse-submodules https://path-to-repo

This is based on my current development branch, not the main branch, but that doesn't matter for viewing.

@comcon1
Contributor

comcon1 commented Dec 16, 2024

Out of curiosity, is it maybe worth storing all the package dictionaries as json or yaml files and loading those? While there is nothing wrong with having those dictionaries initialized as now, I wonder if it is a bit cleaner to store them as files. In principle, this can provide additional functionality to the users, since they can modify (add/change) these files, without changing the package. The potential use case, can be testing new lipids for example

Originally posted by @pbuslaev in #201 (comment)

After we separate Data, we can think about getting rid of the lipids dict altogether (in the code part). All information about molecules that users add should, in principle, be stored outside the code part, IMHO.
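The idea of keeping molecule definitions outside the code could look roughly like the sketch below: built-in defaults are merged with an optional user-supplied JSON file. The dictionary contents, the file layout, and the function name are hypothetical placeholders, not the package's current API.

```python
import json
from pathlib import Path

# Placeholder built-in entries; real lipid definitions are not reproduced here.
DEFAULT_LIPIDS = {"LIPID_A": {"mapping": "mapping_a.yaml"}}


def load_lipids(user_file=None):
    """Return the default lipid entries, overridden or extended by `user_file`.

    `user_file` is the path of a JSON file holding an object that maps
    lipid names to their entries. The defaults themselves are never mutated.
    """
    merged = {name: dict(entry) for name, entry in DEFAULT_LIPIDS.items()}
    if user_file is not None:
        with Path(user_file).open(encoding="utf-8") as fh:
            user_entries = json.load(fh)
        if not isinstance(user_entries, dict):
            raise ValueError("lipid file must contain a JSON object")
        merged.update(user_entries)
    return merged
```

A user testing a new lipid would then add an entry to their own JSON file instead of editing the package.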
