PersistentDict: fixes/workarounds for #227 #228

Closed
wants to merge 3 commits into from

matthiasdiener
Contributor

@matthiasdiener matthiasdiener commented Jun 1, 2024

Should address issues in #227.

@matthiasdiener matthiasdiener changed the title fixes/workarounds for #227 PersistentDict: fixes/workarounds for #227 Jun 1, 2024
@matthiasdiener
Contributor Author

Needs more testing on the beers, but otherwise ready for a first look @inducer

@matthiasdiener matthiasdiener marked this pull request as ready for review June 1, 2024 16:40
@@ -460,7 +460,8 @@ def __init__(self, identifier: str,

         # isolation_level=None: enable autocommit mode
         # https://www.sqlite.org/lang_transaction.html#implicit_versus_explicit_transactions
-        self.conn = sqlite3.connect(self.filename, isolation_level=None)
+        self.conn = sqlite3.connect(self.filename, isolation_level=None,
+                                    timeout=60)
Contributor

If I understand correctly, the default timeout is 5s and this increases it to 60s. In a concurrent setting, that just means that it'll wait longer for some other process to let go of the lock, right?

Why does that happen at all (i.e. 5s seems like plenty of time)? Are many processes continuously writing to the cache? This mostly feels like a workaround, but I haven't debugged things, so not sure :\

Contributor Author

> If I understand correctly, the default timeout is 5s and this increases it to 60s. In a concurrent setting, that just means that it'll wait longer for some other process to let go of the lock, right?

Yes, this is my understanding as well.
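
For illustration, here is a minimal sketch of that behavior. It is not code from this PR. It assumes a throwaway database in a temporary directory, and `try_write_while_locked` is a hypothetical helper name. One connection takes the write lock and keeps it, while a second connection opened with the given `timeout` tries to write. Once the timeout expires, the second write fails with `sqlite3.OperationalError`.

```python
import os
import sqlite3
import tempfile
import time


def try_write_while_locked(timeout: float) -> tuple[bool, float]:
    """Return (write_succeeded, seconds_waited) for a writer that is
    blocked by another connection holding the write lock."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "demo.sqlite")

        # isolation_level=None: autocommit mode, as in the diff above
        holder = sqlite3.connect(path, isolation_level=None)
        holder.execute("CREATE TABLE entries (key TEXT PRIMARY KEY, value TEXT)")
        # Take the write lock explicitly and do not release it yet
        holder.execute("BEGIN IMMEDIATE")
        holder.execute("INSERT INTO entries VALUES ('a', '1')")

        # Second connection: waits up to `timeout` seconds for the lock
        waiter = sqlite3.connect(path, isolation_level=None, timeout=timeout)
        start = time.monotonic()
        try:
            waiter.execute("INSERT INTO entries VALUES ('b', '2')")
            succeeded = True
        except sqlite3.OperationalError:
            # Typically reported as "database is locked"
            succeeded = False
        waited = time.monotonic() - start

        waiter.close()
        holder.execute("ROLLBACK")
        holder.close()
        return succeeded, waited
```

Calling `try_write_while_locked(0.2)` should report a failed write after a short wait. A larger `timeout` does not make the write succeed while the lock is still held; it only extends how long the writer waits before giving up.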

> Why does that happen at all (i.e. 5s seems like plenty of time)? Are many processes continuously writing to the cache? This mostly feels like a workaround, but I haven't debugged things, so not sure :\

We aren't sure how this dictionary will be used downstream, it could be that thousands of processes are hitting the same dict at the same time. This change restores the timeout from the previous implementation:

# Exit after 60 seconds if not able to acquire lock
exit_attempts = int(60/wait_time_seconds)
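
The snippet above only shows how the attempt budget was computed. As a rough sketch of how such a budget can bound a retry loop, assuming a lock file created with `os.O_EXCL` (the function name `acquire_lock_file`, its parameters, and the use of `TimeoutError` are illustrative assumptions, not the previous implementation itself):

```python
import os
import time


def acquire_lock_file(lock_path: str,
                      wait_time_seconds: float = 0.05,
                      total_seconds: float = 60.0) -> int:
    """Create `lock_path` exclusively, retrying until the budget runs out.

    Returns the open file descriptor of the lock file.
    """
    # Same idea as `exit_attempts = int(60/wait_time_seconds)`:
    # number of retries that fit into the total waiting time
    exit_attempts = int(total_seconds / wait_time_seconds)

    attempts = 0
    while True:
        try:
            # O_EXCL makes creation fail if the file already exists
            return os.open(lock_path, os.O_CREAT | os.O_WRONLY | os.O_EXCL)
        except FileExistsError:
            pass

        attempts += 1
        if attempts > exit_attempts:
            raise TimeoutError(
                f"could not acquire '{lock_path}' within {total_seconds} seconds")
        time.sleep(wait_time_seconds)
```

The caller is responsible for closing the descriptor and removing the lock file when done. With sqlite, this loop is replaced by the `timeout` argument of `sqlite3.connect`.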

Contributor

> We aren't sure how this dictionary will be used downstream,

That's fair, but we mostly know how pyopencl uses it. Why does it fail there in the tests? My understanding is that the tests run with pytest-xdist with -n 4, so it shouldn't have that many concurrent accesses. Does it?

> it could be that thousands of processes are hitting the same dict at the same time. This change restores the timeout from the previous implementation:

Hmm, I don't think that sqlite is meant to be that concurrent in writes. Will that work reasonably well?

Contributor Author

@matthiasdiener matthiasdiener Jun 2, 2024

> > We aren't sure how this dictionary will be used downstream,

> That's fair, but we mostly know how pyopencl uses it. Why does it fail there in the tests? My understanding is that the tests run with pytest-xdist with -n 4, so it shouldn't have that many concurrent accesses. Does it?

We are still trying to figure out the reason for the slowness observed in #227. So far, this seems to only affect the beers + @inducer's laptop, which is why I didn't see this issue earlier (I had tested on my Macs, Lassen, as well as GitHub CI, which are all 2 orders of magnitude faster than the beers in these tests).

> > it could be that thousands of processes are hitting the same dict at the same time. This change restores the timeout from the previous implementation:

> Hmm, I don't think that sqlite is meant to be that concurrent in writes. Will that work reasonably well?

It can be a matter of the right configuration, but people have reported thousands of reads/writes per second with sqlite (see e.g. https://www.reddit.com/r/golang/comments/16xswxd/comment/k34ppfo/).

Note that:

  • WAL mode doesn't seem to make a huge difference for the slow tests
  • It does not seem to be a concurrency issue: the tests are slow even when running just a single test (with single writes/reads)
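
One way to check both observations on a given machine is to time single autocommit writes under each journal mode. The following is only a sketch under assumptions: the table layout, the helper name `time_single_writes`, and the number of writes are made up for illustration, and it does not reproduce the pyopencl tests from #227.

```python
import os
import sqlite3
import tempfile
import time


def time_single_writes(journal_mode: str, n_writes: int = 100) -> tuple[str, float]:
    """Time `n_writes` single autocommit INSERTs.

    Returns (journal mode reported by sqlite, elapsed seconds).
    """
    if journal_mode.upper() not in {"DELETE", "WAL"}:
        raise ValueError("this sketch only compares DELETE and WAL")

    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "bench.sqlite")
        conn = sqlite3.connect(path, isolation_level=None, timeout=60)

        # The PRAGMA returns the mode that is actually in effect
        mode = conn.execute(f"PRAGMA journal_mode={journal_mode}").fetchone()[0]
        conn.execute("CREATE TABLE entries (key TEXT PRIMARY KEY, value TEXT)")

        start = time.perf_counter()
        for i in range(n_writes):
            # Autocommit: each INSERT is its own transaction
            conn.execute("INSERT INTO entries VALUES (?, ?)", (str(i), "x"))
        elapsed = time.perf_counter() - start

        conn.close()
    return mode, elapsed


if __name__ == "__main__":
    for requested in ("DELETE", "WAL"):
        mode, elapsed = time_single_writes(requested)
        print(f"{mode}: {elapsed:.3f} s for 100 single writes")
```

If single writes are already slow in both modes with only one process, that would be consistent with a per-write cost on that machine (for example, how expensive syncing to its disk is) rather than with lock contention.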

Co-authored-by: Alex Fikl <[email protected]>
@matthiasdiener
Contributor Author

Closing since #231 and #229 have been merged.

@matthiasdiener matthiasdiener deleted the sqlite-fixes branch June 10, 2024 19:41