
NEW: Mechanism to lock profile access within AiiDA #5135

Closed

Conversation

ramirezfranciscof
Member

This adds a safeguard mechanism to temporarily block access to a profile for other AiiDA entities (daemons, verdi shells, script execution via verdi run). It implements the solution described by @giovannipizzi in this comment (and so potentially supersedes the respective PR #4924).

The original description of how this works is in the mentioned comment, which can be checked if something I say here is unclear, but below in this OP I'll try to keep updated all the important information necessary to understand the content of the PR. In the comments I'll add some circumstantial information about the current situation, such as some (important) problems I'm still dealing with.

Setting the lock

The lock is set by the context manager get_locking_context added to the BackendManager class. It performs the following steps in order to lock the profile in a safe way:

  1. Checks for other processes connected to the DB and stops if there is any (see SQL query below)
  2. Tries to add an entry to db_dbsettings with key locking_pid and value equal to the process id of the locker. This is what effectively locks the profile since all other processes will now check for this.
  3. Control is then returned to the caller through a yield statement inside a try block that opens the locking context. When the context is closed, control comes back to the locking process and the finally clause "deletes" the locking_pid setting.
    • Before yielding, there is a second check for other processes that may have connected between steps 1 and 2. If there is any such process, it raises and triggers the release in the finally clause (a rough sketch of the whole procedure follows this list).
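Just to make the flow above concrete, here is a minimal sketch of the procedure; the set_setting/delete_setting helpers and the exclude_pid keyword are hypothetical stand-ins, and the real implementation is the get_locking_context context manager on BackendManager:

from contextlib import contextmanager

from aiida.backends.utils import get_database_pid, list_database_connections
from aiida.common.exceptions import LockingProfileError

LOCKING_KEY = 'locking_pid'  # name of the key written to db_dbsettings

@contextmanager
def locking_context_sketch(backend_manager):
    """Illustrative sketch of the locking steps described above."""
    own_pid = get_database_pid()

    # 1. Stop if any other process is already connected to the DB.
    if list_database_connections(exclude_pid=own_pid):
        raise LockingProfileError('other processes are connected to the database')

    # 2. Write the lock entry; every other process will now check for it on load.
    backend_manager.set_setting(LOCKING_KEY, own_pid)

    try:
        # Second check, catching processes that connected between steps 1 and 2.
        if list_database_connections(exclude_pid=own_pid):
            raise LockingProfileError('another process connected while acquiring the lock')
        # 3. Hand control back to the caller while the profile stays locked.
        yield
    finally:
        # Release the lock whether the caller finished normally or raised.
        backend_manager.delete_setting(LOCKING_KEY)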

Checking the lock

This has been added to the load_backend_environment method of the BackendManager class. This is supposed to be called by all processes when trying to access the database of the profile. It simply tries to get the locking_pid setting and exits with an error message if it finds it.

All of this had to be wrapped in an optional keyword argument that allows loading the profile while skipping this very check, because there is also a new CLI tool to force the unlocking of the profile, which has to be able to modify the DB even when it is locked (to remove the locking_pid setting): verdi profile unlock.
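As a rough illustration of the check (the check_lock keyword name is made up here; the PR text does not spell out the actual name of the option):

# assuming: from aiida.common.exceptions import LockedProfileError

def load_backend_environment(self, profile, check_lock=True):
    """Sketch of the lock check performed when loading the profile environment."""
    if check_lock:
        locking_pid = self.get_locking_pid()  # None if the setting is absent
        if locking_pid is not None:
            raise LockedProfileError(
                f'profile `{profile.name}` is locked by process {locking_pid}; '
                'use `verdi profile unlock` only if that process no longer exists'
            )
    # ... proceed with the normal loading of the backend environment ...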

Other changes

The previous text describes the main modifications included in aiida/backends/manager.py and the addition of the command in aiida/cmdline/commands/cmd_profile.py. A second command, verdi profile lock N (where N is the number of seconds to keep the profile locked), was added for testing purposes but would probably be removed in the final version. The following changes were also implemented in support of the PR:

aiida/common/exceptions.py
Two exceptions were added: one for trying to access a locked profile (LockedProfileError) and one for trying to lock a profile that is being used (LockingProfileError).
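For concreteness, a sketch of what these could look like (in aiida/common/exceptions.py, presumably deriving from the existing AiidaException base class defined in that module; the docstrings here are illustrative):

class LockedProfileError(AiidaException):
    """Raised when trying to access a profile that is currently locked."""

class LockingProfileError(AiidaException):
    """Raised when trying to lock a profile that is being used by another process."""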

aiida/manage/manager.py
aiida/backends/djsite/manager.py
Some methods had to be adapted to enable the option to skip the lock check mentioned above. The option needs to be passed down from where the backend manager is obtained to the step that loads the profile. The methods affected by this (besides BackendManager.load_backend_environment, where the option is used) are:

  • Manager._load_backend
  • Manager.get_backend_manager
  • DjangoBackendManager.get_schema_generation_database

aiida/backends/utils.py
Auxiliary functions were added to get the PID of the current connection to the DB (get_database_pid) and to list the other connections to the DB (list_database_connections). I didn't put these methods in the BackendManager because they use the backend, but since they are called from BackendManager anyway, there is some ugly dependency going on here that I'm not sure how to fix.
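For reference, this is roughly what such helpers look like when written directly against PostgreSQL with psycopg2; in the PR they go through the loaded backend instead, and the connection handling below is purely illustrative:

import psycopg2

def get_database_pid(connection):
    """Return the PostgreSQL backend PID of the current connection."""
    with connection.cursor() as cursor:
        cursor.execute('SELECT pg_backend_pid();')
        return cursor.fetchone()[0]

def list_database_connections(connection, dbname, exclude_pid=None):
    """Return the backend PIDs of all connections to the given database."""
    with connection.cursor() as cursor:
        cursor.execute('SELECT pid FROM pg_stat_activity WHERE datname = %s;', (dbname,))
        return [row[0] for row in cursor.fetchall() if row[0] != exclude_pid]

connection = psycopg2.connect(dbname='PUT_YOUR_DB_NAME_HERE')  # placeholder credentials
print(get_database_pid(connection))
print(list_database_connections(connection, 'PUT_YOUR_DB_NAME_HERE'))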

aiida/backends/manager.py
Besides adapting load_backend_environment and adding get_locking_context, there were also a couple of minor additions worth mentioning for completeness' sake:

  • the template text for the errors raised.
  • the global variables for the name of the key in the db_dbsettings.
  • the get_locking_pid method (used by the check in load_backend_environment).
  • the force_unlock method (used by verdi profile unlock).

@ramirezfranciscof
Member Author

ramirezfranciscof commented Sep 14, 2021

Problem with double connections in SQLAlchemy

So, right now this implementation seems to work for the Django backend, but I'm having some issues with the number of connections that are established when using SQLAlchemy. For some reason, every time I call load_backend_environment two connections get established. You can check this by connecting to the DB via psql and using the SQL command found by @giovannipizzi here:

SELECT pid, client_port FROM pg_stat_activity WHERE datname='PUT_YOUR_DB_NAME_HERE';

If you run just that, you will see the current connection you have with psql (or none, since you may be running that command while connected to a different database). If you then run verdi profile lock N (or even open a verdi shell or verdi run a script that just sleeps for a couple of seconds, if you want to try this on a different branch) and run that SQL query again, you should see two connected processes.

In principle this seems worrying, because if it is due to the way in which we have incorporated the SQLAlchemy engine into the code, it might mean we won't be able to do the check that is necessary for the locking and the whole mechanism becomes unusable. However, from the couple of tests I made, I think there is at least some aspect of this that should be solvable, since the duplication from loading the backend seems to come not from the call to _load_backend_environment, but actually from validate_schema. This can be checked by doing the following in an ipython shell while monitoring the DB with the previous psql command:

In [1]: from aiida import load_profile
   ...: from aiida.manage.manager import get_manager
   ...: load_profile()
   ...: backend_manager = get_manager()._load_backend(schema_check=False)  # or schema_check=True

Note that technically what I need is get_manager().get_backend_manager(), which in turn calls _load_backend, but get_backend_manager does not have the schema_check option to pass down, although it could be easily added. In principle this solution could make this feature viable.

However, it might be good to find a less patchy and more holistic solution to this. It seems like the "connections out of control" issue is not limited to this (issue #2039 also complains about double connections in daemons, which I see here as well, and #4374 could be related to this, although they seem to have the problem with Django?). @sphuber did you work on this validate_schema method when you added the "generation" thing to the DB? Do you have any idea what might be happening here? I also made some extra tests and found some peculiarities which I will include in a separate comment, but the whole way in which the backends are implemented and the different paths to modify things in the DB is still very confusing to me, so I would appreciate some guidance for any modification that goes deeper than this.

Multiple connections during tests

As a side note, I get a similar problem of multiple connections when running the test suite (with Django at least; I didn't try SQLAlchemy). I am assuming the tests are set up in a way that a connection is established with the DB in addition to those of AiiDA, in order to reset the state in between tests or something like that. This however means I can't write any systematic test for this feature (hence why I included verdi profile lock N instead of just adding automatic tests). I'm not sure how or where this background connection is established/managed, and so what would be the best way to circumvent it for testing this.

@ramirezfranciscof
Member Author

SQLA pools and dispose()

Another thing I tried was using the dispose() method of the SQLA Engine. Apparently one possible problem is that SQLA manages the connections in pools and it doesn't like it when you try to have more individual control over these, but this method may be able to close them down (see here).
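To see what dispose() does in isolation (the connection URL below is a placeholder): after the with-block the pooled connection still shows up in pg_stat_activity, and it only goes away once dispose() is called:

from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:password@localhost:5432/PUT_YOUR_DB_NAME_HERE')

with engine.connect() as connection:
    connection.execute(text('SELECT 1'))  # checks out a connection; it returns to the pool on exit

engine.dispose()  # closes the connections currently held in the pool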

We call this method in reset_session() of aiida/backends/ANYBACKEND/__init__.py (not sure why we have this duplicated in each backend). From doing some experiments with this I managed to get some extra info on the behaviour of validate_schema. All of the following was run in an ipython shell.

Consider the following commands:

In [1]: from aiida import load_profile
   ...: from aiida.manage.manager import get_manager
   ...: 
   ...: load_profile()
   ...: manager = get_manager()
   ...: profile = manager.get_profile()

In [2]: backend_manager = manager.get_backend_manager()

In [3]: backend_manager.load_backend_environment(profile)

In [4]: reset_session()

If I keep repeating [3] (before running [4]) then nothing happens: no new connections get created beyond the initial 2. If I however start alternating [3] and [4], some connections get created and destroyed at random, with a net accumulation:

> SELECT pid, client_port FROM pg_stat_activity WHERE datname='test_db_sqlalch'; # before [2]

  pid  | client_port 
-------+-------------
 96276 |       44124
(1 row)

> SELECT pid, client_port FROM pg_stat_activity WHERE datname='test_db_sqlalch'; # after [2]

  pid  | client_port 
-------+-------------
 96276 |       44124
 98989 |       44854
 98990 |       44856
(3 rows)

> SELECT pid, client_port FROM pg_stat_activity WHERE datname='test_db_sqlalch'; # After many repetitions of [3] and [4]

  pid  | client_port 
-------+-------------
 96276 |       44124
 98989 |       44854 # This one stayed the same...
 99069 |       44880
 99367 |       44966
 99365 |       44962
 99368 |       44968
(6 rows)

Something similar happens if I replace [3] with backend_manager.load_backend_environment(profile, validate_schema=False).

If I comment out the schema validation, I get a single process connected after [2], but then a second process is connected after [3], and successive alternations of [3] and [4] have the same effect of connecting new processes without reliably closing the older connections, with the process connected by [2] always "surviving".

I just wanted to add these in case they help spark any idea of what is going on.

@giovannipizzi
Member

Thanks Francisco!
Indeed I didn't think of the fact that SQLAlchemy also pools connections, so it will indeed open more than one connection to the DB. If it's the pooling, it's not really leaking, it's done for performance, but it should stop. Does the number of connections keep increasing, or does it stop at some value? (The default pool size is 5 I think - but I see you already have 6? I didn't check whether we specify a custom value for the pool size, or maybe that's not the only issue.)

Thinking more about this: maybe we should do something a bit different.

  • We accept the risk that someone connects to the PSQL DB out of AiiDA - we can't control this
  • We store in the AiiDA DbSettings table the python process PID rather than the PSQL PID obtained from a query
  • We are OK if the process PID is the current one (so this should work independently of how many connection pools are opened, but other python processes will stop). Note that this will prevent running a daemon with more than one worker, but I think that's OK (the goal is locking for maintenance operations, not for normal usage)

Also, for a similar reason, along these lines, I would (in the force unlock method) also check what the stored PID is.
If the PID does not exist anymore (the process died, or the machine rebooted, ...) I think it's safe to just silently remove the lock. Similarly, I would do the same if, when loading, we find a lock but the PID does not exist anymore.
Instead, we should continue asking explicitly what to do if the PID is still running (and in principle we should really discourage using it - do we have a use case where this unlocking is needed? Thinking about it more, if we do a silent unlock when the PID does not exist anymore, we probably don't need to expose the functionality on the verdi command line - we just keep it in the Python API, and document not to use it unless one knows what to do). This is similar to what we do with the daemon PID - if we stop the daemon but the PID does not exist we just clear it, and then we can start; we don't offer a functionality to clear it. What do you think?
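A minimal sketch of this idea, assuming hypothetical get_setting/delete_setting helpers for the db_dbsettings table:

import os
import psutil

from aiida.common.exceptions import LockedProfileError

LOCKING_KEY = 'locking_pid'

def check_profile_lock(get_setting, delete_setting):
    """Refuse to load only if another *live* Python process holds the lock."""
    locking_pid = get_setting(LOCKING_KEY, default=None)

    if locking_pid is None or locking_pid == os.getpid():
        return  # not locked, or locked by this very process

    if not psutil.pid_exists(locking_pid):
        delete_setting(LOCKING_KEY)  # the locker died: silently clear the stale lock
        return

    raise LockedProfileError(f'profile is locked by running process {locking_pid}')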

@ramirezfranciscof
Member Author

Hey @giovannipizzi , thanks for the response!

> Thanks Francisco!
> Indeed I didn't think of the fact that SQLAlchemy also pools connections, so it will indeed open more than one connection to the DB. If it's the pooling, it's not really leaking, it's done for performance, but it should stop. Does the number of connections keep increasing, or does it stop at some value? (The default pool size is 5 I think - but I see you already have 6? I didn't check whether we specify a custom value for the pool size, or maybe that's not the only issue.)

You mean when I'm cycling iterations of reset_session and load_backend_environment? Yes, indeed at some point some of the connections get cleared, but not all and not reliably, and as you note it does get to exceed 5. For example, this is between two of those iterations:

test_db_sqlalch=# SELECT pid, client_port FROM pg_stat_activity WHERE datname='test_db_sqlalch';
  pid  | client_port 
-------+-------------
 96276 |       44124
   579 |       45240
   678 |       45278
   680 |       45282
   657 |       45268
   682 |       45286
   684 |       45290
   695 |       45294
   697 |       45298
   698 |       45300
(10 rows)

test_db_sqlalch=# SELECT pid, client_port FROM pg_stat_activity WHERE datname='test_db_sqlalch';
  pid  | client_port 
-------+-------------
 96276 |       44124
   579 |       45240
   657 |       45268
   703 |       45310
(4 rows)

> Thinking more about this: maybe we should do something a bit different.

> • We accept the risk that someone connects to the PSQL DB out of AiiDA - we can't control this

We are kind of already accepting this risk. I mean, before locking, the check returns all connections, so if the DB is being accessed externally it is true that the profile won't get locked; but since the lock is only checked by AiiDA methods, once the lock is established you can still access the DB externally and make modifications, so this would not be new.

> • We store in the AiiDA DbSettings table the python process PID rather than the PSQL PID obtained from a query

> • We are OK if the process PID is the current one (so this should work independently of how many connection pools are opened, but other python processes will stop). Note that this will prevent running a daemon with more than one worker, but I think that's OK (the goal is locking for maintenance operations, not for normal usage)

Ok, so the idea is to store os.getpid() in the DB and check that it matches when loading a profile; so far so good. However, I'm not sure then how we check whether there are other processes connected to the database before setting up the lock (meaning other AiiDA processes, e.g. if the user left another verdi shell open).

> Also, for a similar reason, along these lines, I would (in the force unlock method) also check what the stored PID is.
> If the PID does not exist anymore (the process died, or the machine rebooted, ...) I think it's safe to just silently remove the lock. Similarly, I would do the same if, when loading, we find a lock but the PID does not exist anymore.
> Instead, we should continue asking explicitly what to do if the PID is still running (and in principle we should really discourage using it - do we have a use case where this unlocking is needed? Thinking about it more, if we do a silent unlock when the PID does not exist anymore, we probably don't need to expose the functionality on the verdi command line - we just keep it in the Python API, and document not to use it unless one knows what to do). This is similar to what we do with the daemon PID - if we stop the daemon but the PID does not exist we just clear it, and then we can start; we don't offer a functionality to clear it. What do you think?

Personally, I would still prefer to warn the user about a lock held by a nonexistent PID and have them manually unlock the profile, just because this makes sure they are notified of (and acknowledge) the fact that whatever they were trying to do (maintenance operations, for example) failed. Even if we design the maintenance so that failure just means having to start over, with minimal risk of corruption, it is still good to make sure the user knows this was the case so they know they have to redo it. Moreover, even the most safely designed operations have critical steps that could cause corruption if interrupted, so it is not a bad idea for users to be a bit extra careful and check their data.

On the other hand, how do we do the scanning of the system processes for the daemon? I found a cross-platform way of doing it using the psutil python package, but if we already have a way of doing this, maybe there is no need to add an extra dependency.

@giovannipizzi
Member

> I'm not sure then how we check whether there are other processes connected to the database before setting up the lock (meaning other AiiDA processes, e.g. if the user left another verdi shell open).

The idea is that every time any process (verdi shell, jupyter notebook, daemon worker) loads the AiiDA DB, it would store its own PID there (so this would be done in load_profile). In addition, the one that needs to lock would check the full list, and stop if there is at least one other PID besides its own.
The issue I realise now is that there is no "unload" function... Maybe/probably we have some singleton variable representing the profile, and one could add the deletion to the __del__ of that class (even if it's a bit complex - very little should be done there, and we would instead need to connect to a DB...).
We could instead rely on automatic unlocking as I mentioned - I still think that the user should not even have a way to unlock - it's much riskier, since if the maintenance process takes time, typically a user gets bored of waiting and just kills it. Instead, at startup one could check with psutil (already a dependency) whether the processes exist and clean up the non-existing ones. Not perfect (the same PID could be taken by some other process), but e.g. one could also store the string of the executable and clean up if the process exists but the string has changed.
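A sketch of that check with psutil (the stored_cmdline value is assumed to have been saved next to the PID when the profile was loaded):

import psutil

def is_lock_stale(stored_pid, stored_cmdline):
    """Return True if the stored PID no longer refers to the original process."""
    if not psutil.pid_exists(stored_pid):
        return True
    try:
        current_cmdline = ' '.join(psutil.Process(stored_pid).cmdline())
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        return True
    # The PID exists but now belongs to a different executable: treat the lock as stale.
    return current_cmdline != stored_cmdline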

Another option is to stick with what you have been doing, but find a way to ask SQLAlchemy (if possible) for a list of all connections open in the current Python process. Is this possible?

@ramirezfranciscof
Member Author

> I'm not sure then how we check whether there are other processes connected to the database before setting up the lock (meaning other AiiDA processes, e.g. if the user left another verdi shell open).

> The idea is that every time any process (verdi shell, jupyter notebook, daemon worker) loads the AiiDA DB, it would store its own PID there (so this would be done in load_profile). In addition, the one that needs to lock would check the full list, and stop if there is at least one other PID besides its own.
> The issue I realise now is that there is no "unload" function... Maybe/probably we have some singleton variable representing the profile, and one could add the deletion to the __del__ of that class (even if it's a bit complex - very little should be done there, and we would instead need to connect to a DB...).

Well, technically we don't need an unload function... we can just keep track of every process that requests access to the DB there and, with some frequency, compare this list with the PIDs of the system and remove whatever is no longer active (kind of like your automatic unlocking, except I think it is ok here to clean up non-locking process access). I'm not sure what the adequate frequency for this is, though: doing this check every time a process loads the backend may be too much (or maybe not?), but doing it only before a lock is requested might be too little (by then the DB will have been completely polluted with PIDs). Maybe always check the length of that entry and, if it is, say, over 200 (or some configurable threshold), do the cleanup?
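Something along these lines, assuming hypothetical get_setting/set_setting helpers and an 'access_pids' key in db_dbsettings:

import os
import psutil

ACCESS_KEY = 'access_pids'
CLEANUP_THRESHOLD = 200  # the configurable threshold mentioned above

def register_access(get_setting, set_setting):
    """Record this process' PID and prune dead PIDs once the list grows too long."""
    pids = get_setting(ACCESS_KEY, default=[])
    pids.append(os.getpid())

    if len(pids) > CLEANUP_THRESHOLD:
        pids = sorted(pid for pid in set(pids) if psutil.pid_exists(pid))

    set_setting(ACCESS_KEY, pids)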

> We could instead rely on automatic unlocking as I mentioned - I still think that the user should not even have a way to unlock - it's much riskier, since if the maintenance process takes time, typically a user gets bored of waiting and just kills it. Instead, at startup one could check with psutil (already a dependency) whether the processes exist and clean up the non-existing ones. Not perfect (the same PID could be taken by some other process), but e.g. one could also store the string of the executable and clean up if the process exists but the string has changed.

Just to clarify, my idea was that verdi profile unlock only works if the process has died. If the process is still running but the user wants to abort anyway, they will have to first manually kill the process (with top or something like that) and only then run verdi profile unlock. So for me the unlock is something extra to put in the way of the user so that they are sure and aware of what they are doing; it should not make anything "easier".

> Another option is to stick with what you have been doing, but find a way to ask SQLAlchemy (if possible) for a list of all connections open in the current Python process. Is this possible?

I guess something like that should exist, but SQLA really doesn't seem to like library users getting their hands into how the pools are managed internally (or understanding how it does it, for that matter...), so it is very difficult to find, and I anticipate it will be even more difficult to coordinate with how AiiDA has interfaced all that part. I'm still up for trying, but this could take significant time with uncertain results and/or I would require some help with it.

@ramirezfranciscof ramirezfranciscof mentioned this pull request Sep 22, 2021
@sphuber
Contributor

sphuber commented Jan 21, 2022

Superseded by #5270

@sphuber sphuber closed this Jan 21, 2022
@ramirezfranciscof ramirezfranciscof deleted the profile_lock branch February 25, 2022 16:58