Add cache management features #799

Merged
merged 9 commits into from
Aug 22, 2024
102 changes: 102 additions & 0 deletions docs/cli.rst
@@ -91,6 +91,108 @@ and tracing upwards through is_a and part_of relationships:

uberon viz -p i,p hand foot

Cache Control
-------------

OAK may download data from remote sources as part of its normal operations. For
example, using the :code:`sqlite:obo:...` input selector will cause OAK to
fetch the requested Semantic-SQL database from a centralised repository.
Whenever that happens, the downloaded data will be cached in a local directory
so that subsequent commands using the same input selector do not have to
download the file again.

By default, OAK will refresh (download again) a previously downloaded file if
it was last downloaded more than 7 days ago.
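
The expiry rule above can be sketched as a simple age comparison. This is a minimal illustration, not OAK's actual code: the ``is_stale`` helper and the ``DEFAULT_LIFETIME`` constant are hypothetical names introduced for this example.

```python
from datetime import datetime, timedelta

# Default lifetime stated in the text above; the constant name is hypothetical.
DEFAULT_LIFETIME = timedelta(days=7)


def is_stale(
    last_download: datetime,
    now: datetime,
    lifetime: timedelta = DEFAULT_LIFETIME,
) -> bool:
    """Return True if the file was downloaded more than `lifetime` ago."""
    return now - last_download > lifetime


now = datetime(2024, 8, 22)
print(is_stale(datetime(2024, 8, 10), now))  # file is 12 days old
print(is_stale(datetime(2024, 8, 20), now))  # file is 2 days old
```

In this sketch a file that is exactly 7 days old is not yet considered stale, matching the "more than 7 days ago" wording.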

The behavior of the cache can be controlled in two ways: with an option on the
command line and with a configuration file.

Controlling the cache on the command line
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The global option :code:`--caching` gives the user some control over how the
cache works.

To change the default cache expiry lifetime of 7 days, the :code:`--caching`
option accepts a value of the form :code:`ND`, where *N* is a positive integer
and *D* can be either :code:`s`, :code:`d`, :code:`w`, :code:`m`, or :code:`y`
to indicate that *N* is a number of seconds, days, weeks, months, or years,
respectively. If the *D* part is omitted, it defaults to :code:`d`.
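
As an illustration of the :code:`ND` format, here is one way such a value could be turned into a duration. This is only a sketch: ``parse_lifetime`` is a hypothetical helper, not the parser OAK uses, and the lengths assumed for a month (30 days) and a year (365 days) are assumptions of this example, as is the choice to treat the unit letter as case-sensitive.

```python
import re
from datetime import timedelta

# Seconds per unit letter. The month and year lengths are assumptions
# made for this sketch; the text above does not define them.
_SECONDS_PER_UNIT = {
    "s": 1,
    "d": 86400,
    "w": 7 * 86400,
    "m": 30 * 86400,
    "y": 365 * 86400,
}


def parse_lifetime(value: str) -> timedelta:
    """Parse a value of the form ND, where the unit D defaults to days."""
    match = re.fullmatch(r"([0-9]+)([sdwmy]?)", value.strip())
    if match is None or int(match.group(1)) == 0:
        raise ValueError(f"invalid cache lifetime: {value!r}")
    number = int(match.group(1))
    unit = match.group(2) or "d"  # an omitted unit means days
    return timedelta(seconds=number * _SECONDS_PER_UNIT[unit])


print(parse_lifetime("3w"))
print(parse_lifetime("10"))
```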

For example, :code:`--caching=3w` instructs OAK to refresh a cached file if it
was last refreshed more than 21 days ago.

The :code:`--caching` option also accepts the following special values:

- :code:`refresh` to force OAK to always refresh a file regardless of its age;
- :code:`no-refresh` to do the opposite, that is, to prevent OAK from
  refreshing a file regardless of its age;
- :code:`clear` to forcefully clear the cache (which will trigger a refresh as
  a consequence);
- :code:`reset`, which is a synonym of :code:`clear`.

Note the difference between :code:`refresh` and :code:`clear`. The former only
causes the requested file to be refreshed, leaving any other file in the cache
untouched. The latter deletes all cached files, so the requested file is
downloaded again and every other previously cached file must also be
downloaded again the next time it is requested.

In both cases, refreshing and clearing only happen if the OAK command in
which the :code:`--caching` option is used attempts to look up a cached file.
Otherwise, the option has no effect.
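
The difference between the special values can be made concrete with a toy model of the cache. This is a deliberately simplified sketch, not OAK's implementation: ``request_file`` is a hypothetical function, the cache is modelled as a plain dictionary mapping file names to their age in days, and downloading a file that is absent from the cache under every policy (including :code:`no-refresh`) is an assumption of this example.

```python
from typing import Dict

DEFAULT_MAX_AGE_DAYS = 7  # default expiry described earlier


def request_file(cache: Dict[str, int], name: str, policy: str = "") -> bool:
    """Simulate a lookup of `name`; return True if it has to be downloaded.

    `cache` maps file names to their age in days and is updated in place.
    """
    if policy in ("clear", "reset"):
        cache.clear()  # every cached file is deleted, not only `name`
    if name not in cache:
        download = True  # nothing cached yet (assumed for all policies)
    elif policy == "refresh":
        download = True  # forced refresh of the requested file only
    elif policy == "no-refresh":
        download = False  # keep the cached copy whatever its age
    else:
        download = cache[name] > DEFAULT_MAX_AGE_DAYS
    if download:
        cache[name] = 0  # a fresh copy replaces the old one
    return download


cache = {"uberon.db": 2, "fbbt.db": 30}
request_file(cache, "uberon.db", "refresh")
print(cache)  # fbbt.db is still present, with its original age
request_file(cache, "uberon.db", "clear")
print(cache)  # only the freshly downloaded uberon.db remains
```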

To forcefully clear the cache independently of any command, the
:ref:`cache-clear` command may be used. The contents of the cache may be
explored at any time with the :ref:`cache-ls` command.

Controlling the cache with a configuration file
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Finer control of how the cache works is possible through a configuration file
that OAK looks for in the following locations:

- under GNU/Linux: in ``$XDG_CONFIG_HOME/ontology-access-kit/cache.conf``;
- under macOS: in ``$HOME/Library/Application Support/ontology-access-kit/cache.conf``;
- under Windows: in ``%LOCALAPPDATA%\ontology-access-kit\ontology-access-kit\cache.conf``.
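
The locations listed above can be sketched as a small path-building function. This is an illustration only: ``cache_config_path`` is a hypothetical helper, and falling back to ``$HOME/.config`` when ``XDG_CONFIG_HOME`` is unset follows the XDG Base Directory convention, which OAK is assumed (not confirmed) to honour here.

```python
import ntpath
import posixpath
from typing import Mapping

APP = "ontology-access-kit"


def cache_config_path(system: str, env: Mapping[str, str], home: str) -> str:
    """Build the cache.conf path for a platform name as returned by platform.system()."""
    if system == "Linux":
        # Assumption: use the XDG default when the variable is unset or empty.
        base = env.get("XDG_CONFIG_HOME") or posixpath.join(home, ".config")
        return posixpath.join(base, APP, "cache.conf")
    if system == "Darwin":
        return posixpath.join(home, "Library", "Application Support", APP, "cache.conf")
    if system == "Windows":
        # The application name appears twice in the documented Windows path.
        return ntpath.join(env["LOCALAPPDATA"], APP, APP, "cache.conf")
    raise ValueError(f"unsupported platform: {system}")


print(cache_config_path("Linux", {}, "/home/alice"))
print(cache_config_path("Darwin", {}, "/Users/alice"))
```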

The file should contain lines of the form :code:`pattern = policy`, where:

- *pattern* is a shell-style globbing pattern that selects the files to which
  the policy on that line applies;
- *policy* is the same type of value as expected by the :code:`--caching`
  option, as explained in the previous section.

Blank lines and lines starting with :code:`#` are ignored.

If the *pattern* is :code:`default` (or :code:`*`), the corresponding policy
will be used for any cached file that does not have a matching policy.
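
To make the file format concrete, the following sketch reads such a file into an ordered list of rules. It is not OAK's parser: ``parse_cache_config`` is a hypothetical name, and tolerating leading whitespace before :code:`#` and rejecting lines without an :code:`=` sign are assumptions of this example.

```python
from typing import List, Tuple


def parse_cache_config(text: str) -> List[Tuple[str, str]]:
    """Return the (pattern, policy) pairs of a cache.conf file, in file order."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        pattern, sep, policy = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'pattern = policy', got {raw!r}")
        rules.append((pattern.strip(), policy.strip()))
    return rules


sample = """
# Uberon will be refreshed if older than 1 month
uberon.db = 1m
default = 3w
"""
print(parse_cache_config(sample))
```

Keeping the rules in a list rather than a dictionary preserves the order in which they appear in the file.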

Here is a sample configuration file:

.. code-block::

    # Uberon will be refreshed if older than 1 month
    uberon.db = 1m
    # FBbt will be refreshed if older than 2 weeks
    fbbt.db = 2w
    # Other FlyBase ontologies will be refreshed if older than 2 months
    fb*.db = 2m
    # All other files will be refreshed if older than 3 weeks
    default = 3w

Note that when looking up the policy to apply to a given file, patterns are
tried in the order they appear in the file. This is why the :code:`fbbt.db`
pattern in the example above must be listed *before* the less specific
:code:`fb*.db` pattern, otherwise it would be ignored. (This does not apply to
the default pattern -- whether it is specified as :code:`default` or as
:code:`*` -- which is always tried after all the other patterns.)
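
The lookup order described above can be sketched with the standard ``fnmatch`` module. This is an illustration under stated assumptions, not OAK's code: ``find_policy`` is a hypothetical function, matching is assumed to be case-sensitive, and if several default rules are present the last one is assumed to win.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

DEFAULT_PATTERNS = ("default", "*")


def find_policy(filename: str, rules: List[Tuple[str, str]]) -> Optional[str]:
    """Return the policy of the first matching pattern, else the default policy."""
    fallback = None
    for pattern, policy in rules:
        if pattern in DEFAULT_PATTERNS:
            fallback = policy  # remembered, but only used if nothing else matches
            continue
        if fnmatchcase(filename, pattern):
            return policy  # first match wins, so order matters
    return fallback


rules = [("uberon.db", "1m"), ("fbbt.db", "2w"), ("fb*.db", "2m"), ("default", "3w")]
print(find_policy("fbbt.db", rules))
print(find_policy("fbcv.db", rules))
print(find_policy("pato.db", rules))
```

Swapping the ``fbbt.db`` and ``fb*.db`` rules in this sketch makes the lookup for ``fbbt.db`` return the policy of the broader pattern, which is the pitfall the paragraph above warns about.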

The :code:`--caching` option described in the previous section always takes
precedence over the configuration file. That is, all rules set in the
configuration file are ignored if the :code:`--caching` option is specified on
the command line.

Commands
-----------

4 changes: 4 additions & 0 deletions docs/intro/tutorial07.rst
@@ -64,6 +64,10 @@ This will download the pato.db sqlite file once, and cache it.

PyStow is used to cache the file, and the default location is ``~/.data/oaklib``.

By default, a cached SQLite file will be automatically refreshed (downloaded
again) if it is older than 7 days. For details on how to alter the behavior of
the cache, see the :ref:`Cache Control` section in the CLI documentation.

Building your own SQLite files
-------------------

38 changes: 19 additions & 19 deletions src/oaklib/cli.py
@@ -9,14 +9,12 @@
 # See https://stackoverflow.com/questions/47972638/how-can-i-define-the-order-of-click-sub-commands-in-help
 import json
 import logging
-import os
 import statistics as stats
 import sys
 from collections import defaultdict
 from enum import Enum, unique
 from itertools import chain
 from pathlib import Path
-from time import time
 from types import ModuleType
 from typing import (
     Any,
@@ -28,7 +26,6 @@

 import click
 import kgcl_schema.grammar.parser as kgcl_parser
-import pystow
 import sssom.writers as sssom_writers
 import sssom_schema
 import yaml
@@ -42,6 +39,7 @@

 import oaklib.datamodels.taxon_constraints as tcdm
 from oaklib import datamodels
+from oaklib.constants import FILE_CACHE
 from oaklib.converters.logical_definition_flattener import LogicalDefinitionFlattener
 from oaklib.datamodels import synonymizer_datamodel
 from oaklib.datamodels.association import RollupGroup
@@ -149,6 +147,7 @@
     generate_disjoint_class_expressions_axioms,
 )
 from oaklib.utilities.basic_utils import pairs_as_dict
+from oaklib.utilities.caching import CachePolicy
 from oaklib.utilities.iterator_utils import chunk
 from oaklib.utilities.kgcl_utilities import (
     generate_change_id,
@@ -568,6 +567,11 @@ def _apply_changes(impl, changes: List[kgcl.Change]):
     show_default=True,
     help="If set, will profile the command",
 )
+@click.option(
+    "--caching",
+    type=CachePolicy.ClickType,
+    help="Set the cache management policy",
+)
 def main(
     verbose: int,
     quiet: bool,
@@ -587,6 +591,7 @@ def main(
     prefix,
     profile: bool,
     import_depth: Optional[int],
+    caching: Optional[CachePolicy],
     **kwargs,
 ):
     """
@@ -635,6 +640,7 @@ def exit():
         import requests_cache

         requests_cache.install_cache(requests_cache_db)
+    FILE_CACHE.force_policy(caching)
     resource = OntologyResource()
     resource.slug = input
     settings.autosave = autosave
@@ -5454,12 +5460,14 @@ def cache_ls():
     """
     List the contents of the pystow oaklib cache.

-    TODO: this currently only works on unix-based systems.
     """
-    directory = pystow.api.join("oaklib")
-    command = f"ls -al {directory}"
-    click.secho(f"[pystow] {command}", fg="cyan", bold=True)
-    os.system(command)  # noqa:S605
+    units = ["B", "KB", "MB", "GB", "TB"]
+    for path, size, mtime in FILE_CACHE.get_contents(subdirs=True):
+        i = 0
+        while size > 1024 and i < len(units) - 1:
+            size /= 1024
+            i += 1
+        click.echo(f"{path} ({size:.2f} {units[i]}, {mtime:%Y-%m-%d})")


 @main.command()
@@ -5475,17 +5483,9 @@ def cache_clear(days_old: int):
     Clear the contents of the pystow oaklib cache.

     """
-    directory = pystow.api.join("oaklib")
-    now = time()
-    for item in Path(directory).glob("*"):
-        if ".db" not in str(item):
-            continue
-        mtime = item.stat().st_mtime
-        curr_days_old = (int(now) - int(mtime)) / 86400
-        logging.info(f"{item} is {curr_days_old}")
-        if curr_days_old > days_old:
-            click.echo(f"Deleting {item} which is {curr_days_old}")
-            item.unlink()
+
+    for name, _, age in FILE_CACHE.clear(subdirs=False, older_than=days_old, pattern="*.db*"):
+        click.echo(f"Deleted {name} which was {age.days} days old")


 @main.command()
4 changes: 4 additions & 0 deletions src/oaklib/constants.py
@@ -2,9 +2,13 @@

 import pystow

+from oaklib.utilities.caching import FileCache
+
 __all__ = [
     "OAKLIB_MODULE",
+    "FILE_CACHE",
 ]

 OAKLIB_MODULE = pystow.module("oaklib")
+FILE_CACHE = FileCache(OAKLIB_MODULE)
 TIMEOUT_SECONDS = 30
4 changes: 2 additions & 2 deletions src/oaklib/implementations/llm_implementation.py
@@ -8,7 +8,6 @@
 from dataclasses import dataclass
 from typing import TYPE_CHECKING, Dict, Iterable, Iterator, List, Optional, Tuple

-import pystow
 from linkml_runtime.dumpers import yaml_dumper
 from sssom_schema import Mapping
 from tenacity import (
@@ -19,6 +18,7 @@
 )

 from oaklib import BasicOntologyInterface
+from oaklib.constants import FILE_CACHE
 from oaklib.datamodels.class_enrichment import ClassEnrichmentResult
 from oaklib.datamodels.item_list import ItemList
 from oaklib.datamodels.obograph import DefinitionPropertyValue
@@ -148,7 +148,7 @@ def config_to_prompt(configuration: Optional[ValidationConfiguration]) -> Option

     for obj in configuration.documentation_objects:
         if obj.startswith("http:") or obj.startswith("https:"):
-            path = pystow.ensure("oaklib", "documents", url=obj)
+            path = FILE_CACHE.ensure("documents", url=obj)
         else:
             path = obj
         with open(path) as f:
4 changes: 2 additions & 2 deletions src/oaklib/implementations/sqldb/sql_implementation.py
@@ -63,7 +63,7 @@

 import oaklib.datamodels.ontology_metadata as om
 import oaklib.datamodels.validation_datamodel as vdm
-from oaklib.constants import OAKLIB_MODULE
+from oaklib.constants import FILE_CACHE
 from oaklib.datamodels import obograph, ontology_metadata
 from oaklib.datamodels.association import Association
 from oaklib.datamodels.obograph import (
@@ -342,7 +342,7 @@ def __post_init__(self):
                 # Option 1 uses direct URL construction:
                 url = f"https://s3.amazonaws.com/bbop-sqlite/{prefix}.db.gz"
                 logging.info(f"Ensuring gunzipped for {url}")
-                db_path = OAKLIB_MODULE.ensure_gunzip(url=url, autoclean=False)
+                db_path = FILE_CACHE.ensure_gunzip(url=url, autoclean=False)
                 # Option 2 uses botocore to interface with the S3 API directly:
                 # db_path = OAKLIB_MODULE.ensure_from_s3(s3_bucket="bbop-sqlite", s3_key=f"{prefix}.db")
                 locator = f"sqlite:///{db_path}"