NEW: Mechanism to lock profile access within AiiDA #5135
Conversation
Problem with double connections in SQLAlchemy

So, right now this implementation seems to work for the Django backend, but I'm having some issues with the number of connections that are established when using SQLAlchemy. For some reason, every time I …

```sql
SELECT pid, client_port FROM pg_stat_activity WHERE datname='PUT_YOUR_DB_NAME_HERE';
```

If you make just that query, you will see the current connection you have with `psql` (or none, since you may run that command while connected to a different database). If you then run the … In principle this seems scary, because if it is due to the way in which we have incorporated the SQLAlchemy engine into the code, it might mean we won't be able to do the check that is necessary for the locking, and the whole mechanism is unusable. However, from the couple of tests I made, I think there might be at least some aspect of this that should be solvable, since the duplication from loading the backend seems to be not from the call to …

```python
In [1]: from aiida import load_profile
   ...: from aiida.manage.manager import get_manager
   ...: load_profile()
   ...: backend_manager = get_manager()._load_backend(schema_check=False/True)
```

Note that technically what I need is … However, it might be good to find a less patchy and more holistic solution to this. It seems like the "connections out of control" issue is not limited to this (issue #2039 also complains about double connections in daemons, which I see here as well, and #4374 could be related to this, although they seem to have the problem with Django?). @sphuber did you work on this …

Multiple connections during tests

As a side note, I get a similar problem of multiple connections when running the test suite (this in Django at least, I didn't try in SQLAlchemy). I am assuming the tests are made in a way that a connection is established with the DB in addition to those of AiiDA, in order to reset the state in between the tests or something like that. This however means I can't write any systematic test for this feature (hence why I included the …).
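For anyone wanting to reproduce this, here is a minimal sketch of how the connection count can be watched from Python itself rather than from a separate psql session. It assumes `psycopg2` is available and uses the placeholder database name `test_db_sqlalch`; adjust the connection parameters to your setup.

```python
# Hedged sketch: watch pg_stat_activity around loading the AiiDA backend.
# The database name and connection parameters below are placeholders.
import psycopg2

def list_connections(dbname='test_db_sqlalch', **conn_kwargs):
    """Return (pid, client_port) of every open connection to ``dbname``.

    Note: the monitoring connection opened here appears in the result as well.
    """
    conn = psycopg2.connect(dbname=dbname, **conn_kwargs)
    try:
        with conn.cursor() as cursor:
            cursor.execute(
                'SELECT pid, client_port FROM pg_stat_activity WHERE datname = %s',
                (dbname,),
            )
            return cursor.fetchall()
    finally:
        conn.close()

print('before loading the backend:', list_connections())

from aiida import load_profile
from aiida.manage.manager import get_manager

load_profile()
get_manager()._load_backend(schema_check=False)

print('after loading the backend: ', list_connections())
```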
Thanks Francisco! Thinking more about this: maybe we should do something a bit different.
Also, for a similar reason and along these lines, in the force-unlock method I would also check what the stored PID is.
Hey @giovannipizzi, thanks for the response!

You mean when I'm cycling iterations of …? Here is the output of two consecutive checks:

```
test_db_sqlalch=# SELECT pid, client_port FROM pg_stat_activity WHERE datname='test_db_sqlalch';
  pid  | client_port
-------+-------------
 96276 |       44124
   579 |       45240
   678 |       45278
   680 |       45282
   657 |       45268
   682 |       45286
   684 |       45290
   695 |       45294
   697 |       45298
   698 |       45300
(10 rows)

test_db_sqlalch=# SELECT pid, client_port FROM pg_stat_activity WHERE datname='test_db_sqlalch';
  pid  | client_port
-------+-------------
 96276 |       44124
   579 |       45240
   657 |       45268
   703 |       45310
(4 rows)
```
This is a risk we are kind of already accepting. Before locking, the check returns all connections, so if the DB is being accessed externally it is true that the profile won't be locked; but since the lock is only checked by AiiDA methods, once the lock is established you can still access the DB externally and make modifications, so this would not be new.
Ok, so the idea is to store the …
So, personally I would still prefer to warn the user that the locking PID no longer exists and have them manually unlock the profile, just because this makes sure they get notified (and acknowledge) that whatever they were trying to do (maintenance operations, for example) failed. Even if we design the maintenance so that failure just means having to start over, with minimal risk of corruption, it is still good to make sure the user knows this was the case so they know they have to redo it. Moreover, even in the safest designed operations there are critical steps that could cause corruption if interrupted, so it is not a bad idea for them to be a bit extra careful and check their data. On the other hand, how do we do the scanning of the system processes with the daemon? I found a cross-platform way of doing it using …
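Since the module name was cut off above, here is just one possible cross-platform way to perform that check, using `psutil` (an assumption on my part, not necessarily what was meant); it sketches how a force-unlock could first verify whether the stored PID is still alive.

```python
# Sketch only: psutil is assumed as the cross-platform process inspector.
import psutil

def locking_pid_is_alive(locking_pid: int) -> bool:
    """Return True if the process that wrote the lock is still running."""
    return psutil.pid_exists(locking_pid)

def force_unlock_with_warning(locking_pid: int, delete_setting):
    """``delete_setting`` is a hypothetical helper that removes the locking_pid setting."""
    if locking_pid_is_alive(locking_pid):
        raise RuntimeError(f'process {locking_pid} holding the lock is still running; not unlocking')
    print(f'lock holder {locking_pid} is gone: whatever it was doing may have failed and should be checked')
    delete_setting('locking_pid')
```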
The idea is that every time any process (verdi shell, jupyter notebook, daemon worker) loads the AiiDA DB, it would store its own PID there (so this would be done in the `load_profile`). In addition, the one that needs to lock would check the full list, and stop if there is at least one other entry besides itself. Another option is to remain with what you have been doing, but find a way to ask SQLAlchemy (if possible) for a list of all connections open in the current python process. Is this possible?
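A small sketch of what that registration scheme could look like. The `get_setting`/`set_setting` callables and the `active_pids` key are hypothetical stand-ins for whatever would read and write the settings table; this is not existing AiiDA API.

```python
import os

ACTIVE_PIDS_KEY = 'active_pids'  # hypothetical settings key

def register_current_process(get_setting, set_setting):
    """Would be called on profile load: record this process' PID."""
    pids = set(get_setting(ACTIVE_PIDS_KEY, default=[]))
    pids.add(os.getpid())
    set_setting(ACTIVE_PIDS_KEY, sorted(pids))

def can_lock(get_setting) -> bool:
    """A locker proceeds only if it is the sole registered process."""
    others = set(get_setting(ACTIVE_PIDS_KEY, default=[])) - {os.getpid()}
    return not others
```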
Well, technically we don't need an unload function... we can just keep track of every process that requests access to the DB there and, with some frequency, compare this list with the PIDs of the system and just remove whatever is no longer active (kind of like your automatic unlocking, except I think it is ok here to clean non-locking process access). Not sure what the adequate frequency for this is, though: doing this check every time a process loads the backend may be too much (or maybe not?), but doing it only before a lock is requested might be too little (by then the DB will have been completely polluted with IDs). Maybe always check the length of that entry and, if it is, say, over 200 (or some configurable threshold), then do the cleanup?
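Following up on that, the threshold-triggered cleanup could look roughly like this; again the helpers, key name and threshold value are assumptions, and `psutil` is used to test process liveness.

```python
import psutil

CLEANUP_THRESHOLD = 200  # hypothetical configurable threshold

def maybe_cleanup_pids(get_setting, set_setting):
    """Prune registered PIDs that no longer belong to live processes."""
    pids = get_setting('active_pids', default=[])
    if len(pids) <= CLEANUP_THRESHOLD:
        return
    alive = [pid for pid in pids if psutil.pid_exists(pid)]
    set_setting('active_pids', alive)
```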
Just to clarify, my idea was that the …
I guess something like that should exist, but SQLA really doesn't seem to like the users of the library getting their hands inside how the pools are managed internally (or understanding how they do it, for that matter...), so it is very difficult to find, and I anticipate it will be even more difficult to coordinate with how AiiDA has interfaced all that part. I'm still up for trying, but this could take significant time with uncertain results and/or I would require some help with it.
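For what it's worth, SQLAlchemy does expose a little read-only information about its pool without digging into the internals; the snippet below is a generic sketch with a placeholder connection URL and is not tied to how AiiDA wires up its engine.

```python
from sqlalchemy import create_engine, event, text

# Placeholder URL; replace with the actual profile database.
engine = create_engine('postgresql://user:password@localhost/test_db_sqlalch')

@event.listens_for(engine, 'connect')
def on_connect(dbapi_connection, connection_record):
    # Fires whenever the pool opens a brand new DBAPI connection.
    print('new DBAPI connection opened')

with engine.connect() as connection:
    connection.execute(text('SELECT 1'))
    print(engine.pool.status())      # human-readable summary of the pool state
    print(engine.pool.checkedout())  # connections currently checked out of the pool
```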
Superseded by #5270
This adds a safe-guard mechanism to temporarily block access to a profile for other AiiDA entities (daemons, verdi shells, script execution via `verdi run`). It implements the solution described by @giovannipizzi in this comment (and so potentially supersedes the respective PR #4924). The original description of the way it works is in the mentioned comment, which can be checked if something I say here is unclear, but below in this OP I'll try to keep updated all the important information necessary to understand the content of the PR. In the comments I'll add some circumstantial information about the current situation, such as some (important) problems I'm still dealing with.
Setting the lock

The lock is set by the context manager `get_locking_context` added to the `BackendManager` class. It performs the following steps in order to lock the profile in a safe way (a sketch of this shape follows after the list):

- An entry is added to `db_dbsettings` with the key `locking_pid` and a value equal to the process id of the locker. This is what effectively locks the profile, since all other processes will now check for this.
- A `yield` statement within a `try` wrap begins the context of the locking. After control gets back to the process, the context is closed (`finally` clause), which "deletes" the `locking_pid` setting.
- Because the deletion happens in a `finally` clause, it also takes place if an exception is raised inside the context.
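As a rough illustration of the try/yield/finally shape described in these steps, here is a minimal sketch. It is not the code of the PR: the helper callables and the pre-check on other connections are assumptions based on the surrounding discussion.

```python
import os
from contextlib import contextmanager

LOCKING_KEY = 'locking_pid'

@contextmanager
def locking_context(set_setting, delete_setting, list_other_connections):
    # ``set_setting``, ``delete_setting`` and ``list_other_connections`` are
    # hypothetical stand-ins for the backend settings and connection helpers.
    if list_other_connections():
        # Inferred from the discussion: refuse to lock while others are connected.
        raise RuntimeError('profile is in use by other processes; refusing to lock')
    set_setting(LOCKING_KEY, os.getpid())
    try:
        yield
    finally:
        # Runs on normal exit and also if an exception is raised inside the block.
        delete_setting(LOCKING_KEY)
```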
Checking the lock

This has been added to the `load_backend_environment` method of the `BackendManager` class, which is supposed to be called by all processes when trying to access the database of the profile. It simply tries to get the `locking_pid` setting and exits with an error message if it finds it (see the sketch below).

All of this had to be wrapped in an optional keyword that allows loading the profile while skipping this very check, because there is also a new CLI tool to force the unlocking of the profile, which has to be able to modify the DB even when it is locked (to remove the `locking_pid` setting): `verdi profile unlock`.
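A minimal sketch of what the check with an opt-out keyword could look like; `get_setting` is a hypothetical accessor and `LockedProfileError` stands in for the exception listed under the other changes below.

```python
class LockedProfileError(Exception):
    """Stand-in for the exception added in aiida/common/exceptions.py."""

def load_backend_environment(get_setting, check_lock=True):
    """``get_setting`` is a hypothetical accessor for the db_dbsettings table."""
    if check_lock:
        locking_pid = get_setting('locking_pid', default=None)
        if locking_pid is not None:
            raise LockedProfileError(
                f'profile is locked by process {locking_pid}; '
                'run `verdi profile unlock` if you are sure the lock is stale'
            )
    # ... continue with the normal loading of the backend environment ...
```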
.Other changes
The previous text describes the main modifications included in `aiida/backends/manager.py` and the addition of the command in `aiida/cmdline/commands/cmd_profile.py`. A second command `verdi profile lock N` (with N the number of seconds to wait with the profile locked) was added for testing purposes but would probably be removed in the final version. Then the following changes were also implemented in support of the PR:

- `aiida/common/exceptions.py`: two exceptions were added, for when trying to access a locked profile (`LockedProfileError`) and for when trying to lock a profile that is being used (`LockingProfileError`).
- `aiida/manage/manager.py` and `aiida/backends/djsite/manager.py`: some methods had to be adapted to enable the option to skip the lock check mentioned above. The option needs to be passed down from getting the backend manager to the step of loading the profile. The methods affected (besides `BackendManager.load_backend_environment`, where the option is used) are `Manager._load_backend`, `Manager.get_backend_manager` and `DjangoBackendManager.get_schema_generation_database`.
- `aiida/backends/utils.py`: auxiliary functions to get the current pid accessing the DB (`get_database_pid`) and to list the other connections to the DB (`list_database_connections`); a sketch of what these could look like follows after this list. I didn't put these methods in the `BackendManager` because they use the backend, but since they are anyway called from `BackendManager` there is some ugly dependency going on here that I'm not sure how to fix.
- `aiida/backends/manager.py`: besides adapting `load_backend_environment` and adding `get_locking_context`, there are also a couple of minor additions worth mentioning for completeness' sake: a `get_locking_pid` method (used by the check in `load_backend_environment`) and a `force_unlock` method (used by `verdi profile unlock`).
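To make the role of those two helpers in `aiida/backends/utils.py` concrete, here is a sketch of what they could look like against PostgreSQL. The exact queries and the `execute_raw` callable are assumptions, not the PR's actual implementation.

```python
def get_database_pid(execute_raw):
    """Server-side PID of the connection this process currently holds.

    ``execute_raw(sql, params=None)`` is a hypothetical helper that runs raw SQL
    and returns the result as a list of tuples.
    """
    return execute_raw('SELECT pg_backend_pid()')[0][0]

def list_database_connections(execute_raw, dbname):
    """PIDs of all other connections currently open on the profile database."""
    own_pid = get_database_pid(execute_raw)
    rows = execute_raw('SELECT pid FROM pg_stat_activity WHERE datname = %s', (dbname,))
    return [pid for (pid,) in rows if pid != own_pid]
```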