
initial make auto cache plugin #2912

Draft · wants to merge 3 commits into base: master
Conversation

@dansola (Contributor) commented Nov 7, 2024

Why are the changes needed?

Make caching easier to use in flytekit by reducing the cognitive burden of specifying cache versions.

What changes were proposed in this pull request?

An AutoCache protocol is added to flytekit with a salt argument and a get_version method. The task function is passed to get_version in the task decorator to determine a cache version, which then follows the same path a user-created cache version would.

An auto_cache plugin is created which will contain implementations of AutoCache. The first one is CacheFunctionBody, which hashes only the function body and ignores formatting and comments.
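As a rough sketch (not the PR's exact code), the protocol plus a body-hashing implementation could look like the following; the AST-based normalization is an assumption about how "ignores formatting and comments" might be implemented:

```python
import ast
import hashlib
import inspect
import textwrap
from typing import Callable, Protocol


class AutoCache(Protocol):
    """Any object with a salt and a get_version method can produce a cache version."""

    salt: str

    def get_version(self, func: Callable) -> str: ...


class CacheFunctionBody:
    """Versions a task by hashing its function body, ignoring formatting and comments."""

    def __init__(self, salt: str = "") -> None:
        self.salt = salt

    def get_version(self, func: Callable) -> str:
        # dedent so nested/indented functions still parse
        return self._hash_source(textwrap.dedent(inspect.getsource(func)))

    def _hash_source(self, source: str) -> str:
        # ast.parse drops comments; ast.dump normalizes whitespace and formatting
        normalized = ast.dump(ast.parse(source))
        return hashlib.sha256((self.salt + normalized).encode("utf-8")).hexdigest()
```

With this approach, reformatting a function or editing its comments leaves the version unchanged, while any change to the executable body (or the salt) produces a new version.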

How was this patch tested?

Unit tests were added which verify that versions are consistent and change when we expect. Since changes to the function name cause a different hash, we move dummy functions to a separate directory and import them, so we can keep the name the same but test that the hash changes when the contents change.

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

@dansola (Contributor Author) commented Nov 7, 2024

@eapolinario @cosmicBboy Here is a draft PR doing roughly what we talked about. Would be great to get your opinions before I implement more hashing methods. A couple questions:

  1. Is flytekit/core/task.py the right place to hash the task function body? i.e. in the decorator, right before we pass the cache version to TaskMetadata.
  2. Do we still like the idea of using protocols and having a different implementation for different things like the function body, imports, imagespec, external packages, etc.? Right now cache will take a list[AutoCache] and we combine the hashes (we can hash them later so the string isn't long), but is this the most user-friendly approach compared to, say, a dataclass config with a series of boolean arguments?

Given a function, generates a version hash based on its source code and the salt.
"""

def __init__(self, salt: str = "salt") -> None:

A reviewer (Contributor) commented:

nit: can the default value be ""?

Comment on lines +348 to +349
cache_versions = [item.get_version() for item in cache]
task_hash = "".join(cache_versions)

A reviewer (Contributor) commented:

@eapolinario does the cache version string have a character limit? If so we may need to re-hash the concatenated hashes.


dansola (Contributor Author) replied:


Ah I had a feeling this would be an issue. I just left the long join for simplicity and to not introduce a hashing function in flytekit/core/task.py , but happy to hash that.
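A minimal sketch of the re-hash being discussed, assuming a plain SHA-256 over the concatenated per-object versions (the function name is illustrative, not from the PR):

```python
import hashlib


def combine_versions(cache_versions: list[str]) -> str:
    """Fold per-object versions into one fixed-length cache version string."""
    joined = "".join(cache_versions)
    # Re-hash so the final version stays short no matter how many
    # AutoCache objects contributed to it
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Note that because the versions are concatenated before hashing, the order of the cache objects still matters, as discussed below.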

@cosmicBboy (Contributor) commented Nov 20, 2024

Just putting some thoughts down:

In the current design, the order of the cache objects matters, which I'm okay with... I'd like to document the pattern of defining one caching policy and reusing it like so:

cache_policy = [CacheFunctionBody(), CachePrivateModules(), ...]

@task(cache=cache_policy)
def my_task(): ...

I wonder if it would be useful to expose a Salt object in the case where I want to invalidate a list of cache objects without having to change the string on one of them arbitrarily:

#  multiple cache objects with separate salts
cache_policy = [CacheFunctionBody(salt="a"), CachePrivateModules(salt="a"), ...]

# I just want to invalidate the cache, but it feels weird and arbitrary to pick one object to bump
cache_policy = [CacheFunctionBody(salt="a"), CachePrivateModules(salt="b"), ...]

Doing this feels more natural

cache_policy = [CacheFunctionBody(), CachePrivateModules(), Salt("a")]

# bump cache
cache_policy = [CacheFunctionBody(), CachePrivateModules(), Salt("b")]

I still think it's nice to have a salt argument in the AutoCache.__init__ method for the case of a single cache object
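The Salt idea above could be sketched as a no-op policy entry that contributes only its string to the combined version (hypothetical, not part of the PR):

```python
from typing import Callable, Optional


class Salt:
    """Policy entry whose version is just its salt string; bumping the
    string invalidates the combined cache version without arbitrarily
    touching any other entry in the policy list."""

    def __init__(self, salt: str) -> None:
        self.salt = salt

    def get_version(self, func: Optional[Callable] = None) -> str:
        # Ignores the function entirely; only the salt matters
        return self.salt
```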

@@ -174,7 +175,7 @@ def task(
 def task(
     _task_function: Optional[Callable[P, FuncOut]] = None,
     task_config: Optional[T] = None,
-    cache: bool = False,
+    cache: Union[bool, list[AutoCache]] = False,

A reviewer (Contributor) commented:


can we also accept a single AutoCache object?

@dansola (Contributor Author) commented Nov 21, 2024

> [quotes @cosmicBboy's Nov 20 comment above in full]

Thanks so much! Definitely agree re: a single salt. One idea I had was to use a dataclass with a list of AutoCache as one parameter and a salt string as another. In this case some hashing would have to occur outside of the AutoCache implementations, but maybe that's not a big deal, especially if we want to hash the hashes of each individual implementation.

Also, I like that pattern of defining the cache policy as a constant outside of the task decorator and reusing it. +1 to that.

@@ -343,10 +344,16 @@ def launch_dynamically():
"""

 def wrapper(fn: Callable[P, Any]) -> PythonFunctionTask[T]:
+    if isinstance(cache, list) and all(isinstance(item, AutoCache) for item in cache):
+        cache_versions = [item.get_version() for item in cache]

dansola (Contributor Author) noted:

need to pass the function here

@cosmicBboy (Contributor) commented Nov 21, 2024

@dansola another idea is to have a CachePolicy class that takes a list of cache classes plus a salt:

class CachePolicy:
    def __init__(self, *args, salt=""):
        self.cache_objects = args
        self.salt = salt

    def get_version(self, func):
        # use self.cache_objects to generate hashes
        ...

cache_policy = CachePolicy(
    CacheFunctionBody(),
    CachePrivateModules(),
    ...,
    salt="mysalt"
)

@task(cache=cache_policy)
def task_fn(): ...

edit: CachePolicy is just a placeholder name, feel free to propose other names for this
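One hedged way to fill in get_version for the sketch above; the combining scheme (shared salt prepended, then a re-hash for fixed length) is an assumption, not the PR's final design:

```python
import hashlib
from typing import Callable


class CachePolicy:
    """Bundles cache-version sources with a shared salt."""

    def __init__(self, *cache_objects, salt: str = "") -> None:
        self.cache_objects = cache_objects
        self.salt = salt

    def get_version(self, func: Callable) -> str:
        # Collect each object's version, mix in the shared salt, then
        # re-hash so the final version string has a fixed length
        parts = [obj.get_version(func) for obj in self.cache_objects]
        combined = self.salt + "".join(parts)
        return hashlib.sha256(combined.encode("utf-8")).hexdigest()
```

This keeps all hashing inside the policy object, so the @task decorator only ever sees a single get_version call.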

@cosmicBboy (Contributor) added:

also if we go with the CachePolicy idea, we won't need to handle a list[AutoCache] objects in the @task decorator.

@dansola (Contributor Author) commented Nov 21, 2024

also if we go with the CachePolicy idea, we won't need to handle a list[AutoCache] objects in the @task decorator.

Yup, I really like that idea. I added it to the PR where I make a new implementation for image caching here: #2944

Right now all the PRs are branching off each other. It's probably cleaner to consolidate.
