
Invalidate compiled script cache when it is updated #2434

Open
killme2008 opened this issue Sep 18, 2023 · 18 comments
Labels
C-bug (Category Bugs), C-enhancement (Category Enhancements), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@killme2008
Contributor

killme2008 commented Sep 18, 2023

What type of enhancement is this?

User experience, API

What does the enhancement do?

In cluster mode, when a script is updated, we have to invalidate the compiled Python script caches on the other frontends.

We need an event broadcast mechanism to do it.

Implementation challenges

No response

@killme2008 killme2008 added C-bug Category Bugs C-enhancement Category Enhancements labels Sep 18, 2023
@MichaelScofield
Collaborator

It can be done by calling Mailbox::broadcast to submit a "script invalidation request" to all the frontend nodes. There's an example to start with, see struct InvalidateCache.

@MichaelScofield MichaelScofield added the good first issue Good for newcomers label Feb 26, 2024
@killme2008 killme2008 added the help wanted Extra attention is needed label Feb 26, 2024
@xxxuuu
Contributor

xxxuuu commented Mar 24, 2024

Hi, I would like to try this one. Could you assign it to me?

@WenyXu
Member

WenyXu commented Mar 24, 2024

Hi, I would like to try this one. Could you assign it to me?

Have fun🚀

@xxxuuu
Contributor

xxxuuu commented Mar 27, 2024

Hi, I would like to try this one. Could you assign it to me?

Have fun🚀

Hi @WenyXu, I plan to communicate via Frontend --notify--> Metasrv --broadcast--> Frontend. While implementing it, I found that mailbox::send in the Frontend requires an InstructionReply, and it seems that an Instruction is always sent by Metasrv as a heartbeat response. Making the Frontend notify Metasrv seems to break this semantics and design. Should I do it this way? Could you give me any suggestions?

@WenyXu
Member

WenyXu commented Mar 28, 2024

Hi, I would like to try this one. Could you assign it to me?

Have fun🚀

Hi @WenyXu, I plan to communicate via Frontend --notify--> Metasrv --broadcast--> Frontend. While implementing it, I found that mailbox::send in the Frontend requires an InstructionReply, and it seems that an Instruction is always sent by Metasrv as a heartbeat response. Making the Frontend notify Metasrv seems to break this semantics and design. Should I do it this way? Could you give me any suggestions?

PTAL @waynexia

@WenyXu
Member

WenyXu commented Mar 29, 2024

Hi, I would like to try this one. Could you assign it to me?

Have fun🚀

Hi @WenyXu, I plan to communicate via Frontend --notify--> Metasrv --broadcast--> Frontend. While implementing it, I found that mailbox::send in the Frontend requires an InstructionReply, and it seems that an Instruction is always sent by Metasrv as a heartbeat response. Making the Frontend notify Metasrv seems to break this semantics and design. Should I do it this way? Could you give me any suggestions?

PTAL @waynexia

It seems we need to design an RPC service to accept cache invalidation requests from the Frontend. Would you like to implement this RPC service? We need another issue to track this feature.

@waynexia
Member

We still cannot ensure a correct result: the broadcast -> other frontends step is not synchronized. Let's implement this first to achieve a relaxed invalidation process.

@xxxuuu
Contributor

xxxuuu commented Mar 29, 2024

We still cannot ensure a correct result: the broadcast -> other frontends step is not synchronized. Let's implement this first to achieve a relaxed invalidation process.

In this context, what level of consistency are we expecting?

@waynexia
Member

Expect read-after-write consistency, like other insertion operations.

@xxxuuu
Contributor

xxxuuu commented Mar 29, 2024

Expect read-after-write consistency, like other insertion operations.

My plan is to add a version field to the scripts table (or use gmt_modified, but I'm not sure if there will be clock drift).

When a Frontend tries to execute a script, it needs to read the scripts table and determine whether the local cache is outdated.

If the read and write operations follow read-after-write consistency, this ensures we always see the latest data.

The cost is that we have to read the scripts table, which adds an extra RTT. However, I believe this is acceptable because I think caching the script is not intended to save this particular RTT, but rather to reduce the CPU overhead during script compilation.

Moreover, we no longer need to implement broadcasting in the MetaSrv.

Do you think this plan is feasible?

@waynexia
Member

A version should be enough. The map from script to "compiled binary" must (and does already) exist. We only need to leverage some cache invalidation mechanism over version. Having an extra read to the datanode also looks fine to me; the data that feeds into scripts also comes from the datanode, and there are ways to reduce this overhead.

This does override the need for broadcasting if we choose to implement this "strict" mode now. Can you give a more detailed per-step plan of how you would break this task down and implement it?

@waynexia
Member

cc @discord9

@xxxuuu
Contributor

xxxuuu commented Apr 1, 2024

version should be enough. The map from script to "compiled binary" must (and already) exist. We only need to leverage some cache invalidation mechanism over version. Having an extra read to datanode also looks fine to me. Data feeds into scripts also come from datanode. And there are ways to reduce this overhead.

This does override the need for broadcasting if we choose to implement this "strict" mode now Can you give a more detailed per-step plan of how you break this task down and plan to implement it?

Sure.

I have made some changes to my plan because I found a new issue. After inserting a script, it cannot be executed through SQL in other Frontends unless it is first executed through the HTTP API using /run-script. This is because scripts are only registered as UDFs in ScriptManager::compile. I previously overlooked the possibility of executing a script directly as a UDF instead of going through the /run-script API.

Taking the UDF part into consideration, here is my plan:

  1. First, add a providers field and a get_function_fallback function to FunctionRegistry. The get_function_fallback function will attempt to dynamically retrieve the Function through the providers.
  2. Then, in the QueryEngineState::udf_function function, use FUNCTION_REGISTRY.get_function_fallback to retrieve the latest UDF.
  3. The ScriptManager will implement this provider. I plan to modify the try_find_script_and_compile function to read from the scripts table each time and compare it with the locally cached version to determine whether to recompile and update the cache (ScriptManager::compiled).
  4. Of course, there will be an additional field in the cache to indicate the version. But I'm not sure whether to add a new version field to the scripts table or simply use gmt_modified. I'm concerned that adding a new field might cause some complications during the upgrade.

Now, whether it is called through the HTTP API or SQL UDF, the latest script can be executed.

@killme2008
Contributor Author

A version should be enough. The map from script to "compiled binary" must (and does already) exist. We only need to leverage some cache invalidation mechanism over version. Having an extra read to the datanode also looks fine to me; the data that feeds into scripts also comes from the datanode, and there are ways to reduce this overhead.
This does override the need for broadcasting if we choose to implement this "strict" mode now. Can you give a more detailed per-step plan of how you would break this task down and implement it?

Sure.

I have made some changes to my plan because I found a new issue. After inserting a script, it cannot be executed through SQL in other Frontends unless it is first executed through the HTTP API using /run-script. This is because scripts are only registered as UDFs in ScriptManager::compile. I previously overlooked the possibility of executing a script directly as a UDF instead of going through the /run-script API.

Taking the UDF part into consideration, here is my plan:

  1. First, add a providers field and a get_function_fallback function to FunctionRegistry. The get_function_fallback function will attempt to dynamically retrieve the Function through the providers.
  2. Then, in the QueryEngineState::udf_function function, use FUNCTION_REGISTRY.get_function_fallback to retrieve the latest UDF.
  3. The ScriptManager will implement this provider. I plan to modify the try_find_script_and_compile function to read from the scripts table each time and compare it with the locally cached version to determine whether to recompile and update the cache (ScriptManager::compiled).
  4. Of course, there will be an additional field in the cache to indicate the version. But I'm not sure whether to add a new version field to the scripts table or simply use gmt_modified. I'm concerned that adding a new field might cause some complications during the upgrade.

Now, whether it is called through the HTTP API or SQL UDF, the latest script can be executed.

Looks good to me! Let's take the first step.

@killme2008
Contributor Author

Expect read-after-write consistency, like other insertion operations.

My plan is to add a version field to the scripts table (or use gmt_modified, but I'm not sure if there will be clock drift).

When a Frontend tries to execute a script, it needs to read the scripts table and determine whether the local cache is outdated.

If the read and write operations follow read-after-write consistency, this ensures we always see the latest data.

The cost is that we have to read the scripts table, which adds an extra RTT. However, I believe this is acceptable because I think caching the script is not intended to save this particular RTT, but rather to reduce the CPU overhead during script compilation.

Moreover, we no longer need to implement broadcasting in the MetaSrv.

Do you think this plan is feasible?

A version field is great, but I wonder how to handle the atomic increment of this field; GreptimeDB doesn't currently support transactions or updates.

@xxxuuu
Contributor

xxxuuu commented Apr 1, 2024

Expect read-after-write consistency, like other insertion operations.

My plan is to add a version field to the scripts table (or use gmt_modified, but I'm not sure if there will be clock drift).
When a Frontend tries to execute a script, it needs to read the scripts table and determine whether the local cache is outdated.
If the read and write operations follow read-after-write consistency, this ensures we always see the latest data.
The cost is that we have to read the scripts table, which adds an extra RTT. However, I believe this is acceptable because I think caching the script is not intended to save this particular RTT, but rather to reduce the CPU overhead during script compilation.
Moreover, we no longer need to implement broadcasting in the MetaSrv.
Do you think this plan is feasible?

A version field is great, but I wonder how to handle the atomic increment of this field; GreptimeDB doesn't currently support transactions or updates.

This is indeed a problem, and I overlooked this point. Both options have some issues:

  • With version, concurrent updates may write the same version number, and they may also lead to version rollback.
  • Updating gmt_modified is not a 'read-modify-write' operation; it simply inserts the latest value, similar to 'last write wins'. However, it can still have similar issues during concurrency when there is clock drift between multiple frontends. If Node A has a faster clock than Node B, concurrent updates from A and B may occur, with B's write arriving later. The cache on Node A then cannot be updated, because its cached timestamp is greater than that of the data written by B.

I have browsed the documentation and found the Metasrv Distributed Lock. Perhaps it can be used to ensure atomicity when updating. The likelihood of concurrent updates to the same script is low, so lock contention should be minimal. Since users are aware of what they are doing, I believe this is acceptable.

@killme2008
Contributor Author

killme2008 commented Apr 1, 2024

I have browsed the documentation and found the Metasrv Distributed Lock. Perhaps it can be used to ensure atomicity when updating. The likelihood of concurrent updates to the same script is low, so lock contention should be minimal. Since users are aware of what they are doing, I believe this is acceptable.

@xxxuuu Yes, I believe it's ok.

@xxxuuu
Contributor

xxxuuu commented Apr 16, 2024

I don't have much time on weekdays due to work, but I'm still dedicated to this issue. In fact, I have already completed the logical implementation of the functionality, but testing has not been done yet. I will be able to finish it this week :)
