Describe the bug
I was following up on our discussion from the meeting today about using qcportal.cache.get_records_with_cache to cache record downloads outside of datasets, and I noticed that I was occasionally getting failures from a very simple script, which often leads to the error below:
```
Traceback (most recent call last):
  File "/home/brent/omsf/scratch/qcportal-cache/simple.py", line 19, in <module>
    get_records_with_cache(
  File "/home/brent/mambaforge/envs/qcsubmit-test-basic/lib/python3.11/site-packages/qcportal/cache.py", line 653, in get_records_with_cache
    raise RuntimeError("Need to fetch some records, but not connected to a client")
RuntimeError: Need to fetch some records, but not connected to a client
```
but if I perturb the script slightly (such as by deleting the r1 and r2 assignments but keeping the function calls), it will run successfully. I think this means that there is some kind of timing/concurrency issue with how the records are being committed to the database, but adding a time.sleep call between the two didn't help, so that might be totally off base.
I also tried running this code in a loop to try to see how often it failed:
```python
import shutil
from pathlib import Path

from qcportal import PortalClient
from qcportal.cache import RecordCache, get_records_with_cache
from qcportal.optimization import OptimizationRecord

addr = "https://api.qcarchive.molssi.org:443/"
cache_dir = Path("api.qcarchive.molssi.org_443")

if cache_dir.exists():
    shutil.rmtree(cache_dir)

failed = 0
for i in range(100):
    client = PortalClient(addr, cache_dir=".")
    record_cache = RecordCache(
        f"{client.cache.cache_dir}/records.sqlite", False
    )
    r1 = get_records_with_cache(
        client, record_cache, OptimizationRecord, [137149103]
    )
    try:
        r2 = get_records_with_cache(
            None, record_cache, OptimizationRecord, [137149103]
        )
    except RuntimeError as e:
        assert "Need to fetch some records" in str(e)
        print(f"failed on iter {i}")
        failed += 1
        continue
    assert r1 == r2, f"mismatch on iter {i}"
print(failed)
```
But as written it always fails on the first iteration and then successfully accesses the existing cache on the other 99 iterations. However, if I move the rmtree call inside the loop, I get 100 instances of this error:
```
Traceback (most recent call last):
  File "/home/brent/mambaforge/envs/qcsubmit-test-basic/lib/python3.11/site-packages/qcportal/record_models.py", line 437, in __del__
    self.sync_to_cache(True)  # Don't really *have* to detach, but why not
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/brent/mambaforge/envs/qcsubmit-test-basic/lib/python3.11/site-packages/qcportal/record_models.py", line 523, in sync_to_cache
    self._record_cache.writeback_record(self)
  File "/home/brent/mambaforge/envs/qcsubmit-test-basic/lib/python3.11/site-packages/qcportal/cache.py", line 169, in writeback_record
    self._conn.execute(stmt, row_data)
  File "src/cursor.c", line 169, in resetcursor
apsw.ReadOnlyError: ReadOnlyError: attempt to write a readonly database
```
To Reproduce
See either snippet above. I sent the first one to Jeff (@j-wags) and he could also reproduce it, so it's not just my machine at least.
Expected behavior
I expect both calls to get_records_with_cache to return the same record without throwing an exception. The first call should populate the cache and the second should read from it.
Additional context
This is on the most recent version of qcportal, 0.55. I can upload my full conda environment if needed.
I think this is fixed in the current main branch. Originally, records would be written back to the cache when they were destructed/garbage collected. It's a cute idea, but it results in a whole host of problems, so that was removed in PR #843, resulting in more expected behavior.
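To illustrate why writeback-on-destruction is fragile, here is a minimal stdlib-only sketch. The `CachedRecord` class and its schema are hypothetical stand-ins, not qcportal's actual classes: the point is that a record persisting itself in `__del__` silently loses data whenever its connection is already closed (or read-only) by the time garbage collection runs, because exceptions raised there can't propagate to the caller.

```python
import os
import sqlite3
import tempfile


class CachedRecord:
    """Toy record that writes itself back to sqlite on destruction,
    mimicking the pattern removed in qcportal PR #843 (hypothetical names)."""

    def __init__(self, conn, rid, value):
        self._conn = conn
        self.rid = rid
        self.value = value

    def sync_to_cache(self):
        self._conn.execute(
            "INSERT OR REPLACE INTO records VALUES (?, ?)", (self.rid, self.value)
        )
        self._conn.commit()

    def __del__(self):
        # If the connection was closed before this record is collected,
        # sync_to_cache raises -- but nothing useful can be done with an
        # exception inside __del__, so the write is silently dropped.
        try:
            self.sync_to_cache()
        except sqlite3.Error:
            pass


path = os.path.join(tempfile.mkdtemp(), "records.sqlite")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, value TEXT)")
conn.commit()

rec = CachedRecord(conn, 1, "hello")
del rec  # connection still open: the writeback happens to succeed

rec2 = CachedRecord(conn, 2, "world")
conn.close()  # connection closed before rec2 is collected
del rec2      # __del__ fails silently; record 2 is never written

check = sqlite3.connect(path)
ids = [row[0] for row in check.execute("SELECT id FROM records")]
print(ids)  # [1] -- record 2 was lost
check.close()
```

Whether the lost write even surfaces depends on destruction order, which is exactly the kind of nondeterminism seen in this issue; making the synchronization an explicit call, as main now does, avoids it.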