
initial make auto cache plugin #2912

Draft · wants to merge 3 commits into base: master
Conversation

@dansola (Contributor) commented Nov 7, 2024

Why are the changes needed?

Make caching easier to use in flytekit by reducing the cognitive burden of specifying cache versions.

What changes were proposed in this pull request?

An AutoCache protocol is added to flytekit with a salt argument and a get_version method. The task function is passed to get_version in the task decorator to determine a cache version, which then follows the same path a user-created cache version would.

An auto_cache plugin is created which will contain implementations of AutoCache. The first one is CacheFunctionBody, which hashes only the function body and ignores formatting and comments.
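As a rough sketch (not the PR's exact code), the protocol plus a body-hashing implementation could look like the following; the AST-based normalization is an assumption about how "ignores formatting and comments" might be implemented:

```python
import ast
import hashlib
import inspect
import textwrap
from typing import Callable, Protocol


class AutoCache(Protocol):
    """Any object with a salt and a get_version method can produce a cache version."""

    salt: str

    def get_version(self, func: Callable) -> str: ...


class CacheFunctionBody:
    """Versions a task by hashing its function body, ignoring formatting and comments."""

    def __init__(self, salt: str = "") -> None:
        self.salt = salt

    def get_version(self, func: Callable) -> str:
        # dedent so nested/indented functions still parse
        return self._hash_source(textwrap.dedent(inspect.getsource(func)))

    def _hash_source(self, source: str) -> str:
        # ast.parse drops comments; ast.dump normalizes whitespace and formatting
        normalized = ast.dump(ast.parse(source))
        return hashlib.sha256((self.salt + normalized).encode("utf-8")).hexdigest()
```

With this approach, reformatting a function or editing its comments leaves the version unchanged, while any change to the executable body (or the salt) produces a new version.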

How was this patch tested?

Unit tests were added which verify that versions are consistent and change when we expect. Since changes to the function name cause a different hash, we move dummy functions to a separate directory and import them, so we can keep the name the same but test that the hash changes when the contents change.

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

@dansola (Contributor Author) commented Nov 7, 2024

@eapolinario @cosmicBboy Here is a draft PR doing roughly what we talked about. Would be great to get your opinions before I implement more hashing methods. A couple questions:

  1. Is flytekit/core/task.py the right place to hash the task function body? i.e. in the decorator, right before we pass the cache version to TaskMetadata.
  2. Do we still like the idea of using protocols and having a different implementation for different things like the function body, imports, imagespec, external packages, etc.? Right now cache will take a list[AutoCache] and we combine the hashes (we can hash them later so the string isn't long), but is this the most user-friendly approach compared to, say, a dataclass config with a series of boolean arguments?

Given a function, generates a version hash based on its source code and the salt.
"""

def __init__(self, salt: str = "salt") -> None:

A reviewer (Contributor) commented:

nit: can the default value be ""?

Comment on lines +348 to +349
cache_versions = [item.get_version() for item in cache]
task_hash = "".join(cache_versions)

A reviewer (Contributor) commented:

@eapolinario does the cache version string have a character limit? If so we may need to re-hash the concatenated hashes.


dansola (Contributor Author) replied:


Ah I had a feeling this would be an issue. I just left the long join for simplicity and to not introduce a hashing function in flytekit/core/task.py , but happy to hash that.
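A minimal sketch of the re-hash being discussed, assuming a plain SHA-256 over the concatenated per-object versions (the function name is illustrative, not from the PR):

```python
import hashlib


def combine_versions(cache_versions: list[str]) -> str:
    """Fold per-object versions into one fixed-length cache version string."""
    joined = "".join(cache_versions)
    # Re-hash so the final version stays short no matter how many
    # AutoCache objects contributed to it
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Note that because the versions are concatenated before hashing, the order of the cache objects still matters, as discussed below.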

@cosmicBboy (Contributor) commented Nov 20, 2024

Just putting some thoughts down:

In the current design, the order of the cache objects matters, which I'm okay with... I'd like to document the pattern of defining one caching policy and reusing it like so:

cache_policy = [CacheFunctionBody(), CachePrivateModules(), ...]

@task(cache=cache_policy)
def my_task(): ...

I wonder if it would be useful to expose a Salt object in the case where I want to invalidate a list of cache objects without having to change the string on one of them arbitrarily:

#  multiple cache objects with separate salts
cache_policy = [CacheFunctionBody(salt="a"), CachePrivateModules(salt="a"), ...]

# I just want to invalidate the cache, but it feels weird and arbitrary to pick one object to bump
cache_policy = [CacheFunctionBody(salt="a"), CachePrivateModules(salt="b"), ...]

Doing this feels more natural

cache_policy = [CacheFunctionBody(), CachePrivateModules(), Salt("a")]

# bump cache
cache_policy = [CacheFunctionBody(), CachePrivateModules(), Salt("b")]

I still think it's nice to have a salt argument in the AutoCache.__init__ method for the case of a single cache object
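The Salt idea above could be sketched as a no-op policy entry that contributes only its string to the combined version (hypothetical, not part of the PR):

```python
from typing import Callable, Optional


class Salt:
    """Policy entry whose version is just its salt string; bumping the
    string invalidates the combined cache version without arbitrarily
    touching any other entry in the policy list."""

    def __init__(self, salt: str) -> None:
        self.salt = salt

    def get_version(self, func: Optional[Callable] = None) -> str:
        # Ignores the function entirely; only the salt matters
        return self.salt
```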

@@ -174,7 +175,7 @@ def task(
 def task(
     _task_function: Optional[Callable[P, FuncOut]] = None,
     task_config: Optional[T] = None,
-    cache: bool = False,
+    cache: Union[bool, list[AutoCache]] = False,

A reviewer (Contributor) commented:


can we also accept a single AutoCache object?

@dansola (Contributor Author) commented Nov 21, 2024

> [quotes @cosmicBboy's Nov 20 comment above in full]

Thanks so much! Definitely agree re: a single salt. One idea I had was to use a dataclass with a list of AutoCache as one parameter and a salt string as another. In this case some hashing would have to occur outside of the AutoCache implementations, but maybe that's not a big deal, especially if we want to hash the hashes of each individual implementation.

Also, I like that pattern of defining the cache policy as a constant outside of the task decorator and reusing it. +1 to that.

@@ -343,10 +344,16 @@ def launch_dynamically():
"""

 def wrapper(fn: Callable[P, Any]) -> PythonFunctionTask[T]:
+    if isinstance(cache, list) and all(isinstance(item, AutoCache) for item in cache):
+        cache_versions = [item.get_version() for item in cache]

dansola (Contributor Author) noted:

need to pass the function here

@cosmicBboy (Contributor) commented Nov 21, 2024

@dansola another idea is to have a CachePolicy class that takes a list of cache classes plus a salt:

class CachePolicy:
    def __init__(self, *args, salt=""):
        self.cache_objects = args
        self.salt = salt

    def get_version(self, func):
        # use self.cache_objects to generate hashes
        ...

cache_policy = CachePolicy(
    CacheFunctionBody(),
    CachePrivateModules(),
    ...,
    salt="mysalt"
)

@task(cache=cache_policy)
def task_fn(): ...

edit: CachePolicy is just a placeholder name, feel free to propose other names for this
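One hedged way to fill in get_version for the sketch above; the combining scheme (shared salt prepended, then a re-hash for fixed length) is an assumption, not the PR's final design:

```python
import hashlib
from typing import Callable


class CachePolicy:
    """Bundles cache-version sources with a shared salt."""

    def __init__(self, *cache_objects, salt: str = "") -> None:
        self.cache_objects = cache_objects
        self.salt = salt

    def get_version(self, func: Callable) -> str:
        # Collect each object's version, mix in the shared salt, then
        # re-hash so the final version string has a fixed length
        parts = [obj.get_version(func) for obj in self.cache_objects]
        combined = self.salt + "".join(parts)
        return hashlib.sha256(combined.encode("utf-8")).hexdigest()
```

This keeps all hashing inside the policy object, so the @task decorator only ever sees a single get_version call.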

@cosmicBboy (Contributor) added:

also if we go with the CachePolicy idea, we won't need to handle a list[AutoCache] objects in the @task decorator.

@dansola (Contributor Author) commented Nov 21, 2024

also if we go with the CachePolicy idea, we won't need to handle a list[AutoCache] objects in the @task decorator.

Yup, I really like that idea. I added it to the PR where I make a new implementation for image caching here: #2944

Right now all the PRs are branching off each other. It's probably cleaner to consolidate.
