
[card-cache-service] optimize caching #417

Merged: 2 commits into Netflix:master on Mar 20, 2024

Conversation

valayDave (Contributor) commented on Feb 26, 2024

Core Changes:

  • A background service polls for all card/data updates for individual tasks.
  • The service runs one process per taskspec for a limited amount of time specified via the ENV vars (see the sketch after this list).
  • The service is launched when cards are requested.
  • Cards are read directly from the cache and the reads are "best-effort": in a load-balanced setting, one server can serve stale updates compared to another. The workaround is to pass the time of the update from the Metaflow task; this commit in Metaflow ensures that stale updates can be discarded.
  • API routes that get a card or list cards async-wait until the cache is available or a timeout is reached.
  • Requires the latest Metaflow client.
  • The new optimization also requires the MF GUI to be up to date with the new server.
  • The Metaflow UI changes make it poll the server every 500 ms.
  • The changes include async routines that clean up the cache and remove completed background processes.
  • Dead code has been removed.
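
A minimal sketch of what that per-taskspec background poller could look like; the names, defaults, and the `list_cards` callable are illustrative assumptions, not the actual implementation:

```python
import asyncio
import os
import time
from typing import Awaitable, Callable, Dict, List

# Assumed defaults; the real service reads these from the ENV vars listed under
# "Configuration Options" below.
MAX_UPTIME = float(os.environ.get("CARD_CACHE_PROCESS_MAX_UPTIME", "600"))
LIST_FREQ = float(os.environ.get("CARD_CACHE_CARD_LIST_POLLING_FREQUENCY", "5"))

async def poll_task_cards(
    task_spec: str,
    list_cards: Callable[[str], Awaitable[List[str]]],  # placeholder for the datastore call
    cache: Dict[str, List[str]],
) -> None:
    """Keep the card list for one taskspec fresh until the uptime budget is spent."""
    started = time.monotonic()
    while time.monotonic() - started < MAX_UPTIME:
        cache[task_spec] = await list_cards(task_spec)
        await asyncio.sleep(LIST_FREQ)
```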

Why not use the existing cache client:

  • The existing cache client loads the entire `Task` / `Card` object into memory and then returns the HTML/data from it.
  • This is inefficient because loading the `Card` object performs datastore list calls, which are time-expensive.
  • Once the path has been resolved, getting the object is a very fast operation.
  • For example, listing cards takes ~1-2 seconds, but getting the actual card once the path is resolved takes ~10 milliseconds.
  • The current cache actions are "stateless": once an action is done, the previous state is lost when a new action is called.
  • This stateless nature is a poor fit for cards, where the data may change frequently but the paths won't.
  • The new cache service retrieves the object paths once and then keeps updating their contents until the background process finishes execution (see the sketch after this list).
  • This approach improves latency drastically.
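
A sketch of that stateful pattern with illustrative names: the expensive path resolution (a datastore list call) is paid once per background process, while the cheap object fetch is repeated on every poll:

```python
import asyncio
from typing import Awaitable, Callable, Dict

async def refresh_card(
    resolve_path: Callable[[], Awaitable[str]],       # ~1-2 s: datastore list call
    fetch_object: Callable[[str], Awaitable[bytes]],  # ~10 ms once the path is known
    cache: Dict[str, bytes],
    polls: int = 10,
    interval: float = 1.0,
) -> None:
    path = await resolve_path()  # resolved once; the path does not change afterwards
    for _ in range(polls):
        cache[path] = await fetch_object(path)  # only the contents change between polls
        await asyncio.sleep(interval)
```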

Configuration Options:

  • `CARD_CACHE_PROCESS_NO_CARD_WAIT_TIME`: how long the process should wait for a card to become available before it exits.
  • `CARD_CACHE_PROCESS_MAX_UPTIME`: the maximum duration the process should run.
  • `CARD_CACHE_CARD_LIST_POLLING_FREQUENCY`: how frequently the process should poll for new cards to list.
  • `CARD_CACHE_CARD_UPDATE_POLLING_FREQUENCY`: how frequently the process should poll for the card HTML content.
  • `CARD_CACHE_DATA_UPDATE_POLLING_FREQUENCY`: how frequently the process should poll for data updates.
  • `CARD_CACHE_DISK_CLEANUP_INTERVAL`: the interval at which the on-disk card cache is cleaned up.
  • `CARD_API_HTML_WAIT_TIME`: the maximum time the card HTML retrieval API will busy-wait for the card to be ready before timing out and returning a null response (sketched below).
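
As an illustration of the `CARD_API_HTML_WAIT_TIME` behaviour, a sketch of the busy-wait; the default value and the cache shape are assumptions:

```python
import asyncio
import os
import time
from typing import Dict, Optional

HTML_WAIT_TIME = float(os.environ.get("CARD_API_HTML_WAIT_TIME", "3"))  # assumed default

async def wait_for_card_html(cache: Dict[str, str], card_key: str) -> Optional[str]:
    """Wait up to HTML_WAIT_TIME seconds for the card HTML to appear in the cache."""
    deadline = time.monotonic() + HTML_WAIT_TIME
    while time.monotonic() < deadline:
        html = cache.get(card_key)
        if html is not None:
            return html
        await asyncio.sleep(0.1)  # yield to the event loop between checks
    return None  # the API responds with a null payload on timeout
```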

TODOs


```python
async def _get_latest_return_code(process: Process):
    # Check whether the subprocess has exited, without blocking for more than a tiny timeout.
    with contextlib.suppress(asyncio.TimeoutError):
        await asyncio.wait_for(process.wait(), 1e-6)
```
A collaborator commented:
Just some notes for clarity: so the combination of wait_for and the async process.wait() with a minimal timeout in effect acts in a similar way to process.poll() from the synchronous implementation?
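
For comparison, a small sketch of the two patterns side by side (illustrative, not taken from this PR): both check for process exit without blocking, which is the equivalence the question points at:

```python
import asyncio
import contextlib
import subprocess
from asyncio.subprocess import Process
from typing import Optional

def sync_poll(proc: subprocess.Popen) -> Optional[int]:
    # poll() returns the exit code, or None while the process is still running.
    return proc.poll()

async def async_poll(process: Process) -> Optional[int]:
    # A near-zero timeout plus a suppressed TimeoutError: if the process has not
    # exited yet, returncode stays None, mirroring the synchronous poll().
    with contextlib.suppress(asyncio.TimeoutError):
        await asyncio.wait_for(process.wait(), 1e-6)
    return process.returncode
```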

valayDave added a commit to Netflix/metaflow-ui that referenced this pull request Mar 14, 2024
- make the default refresh interval smaller
- remove the timeout code block in card load
- leverage the `created_on` timestamps in data updates to discard stale data (see the sketch after this list)
- handle the case of idle refresh
- comments on what's changed
- compatible with new changes introduced in Netflix/metaflow-service#417
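
An illustration of that stale-update check; the actual change lives in the Metaflow UI rather than this service, so this Python sketch only shows the idea:

```python
from typing import Dict

# created_on timestamp of the last update applied per card, keyed by a card identifier.
last_seen: Dict[str, float] = {}

def should_apply(card_id: str, created_on: float) -> bool:
    """Apply a data update only if it is newer than the last one already applied."""
    if created_on <= last_seen.get(card_id, float("-inf")):
        return False  # stale update (e.g. from a lagging server); discard it
    last_seen[card_id] = created_on
    return True
```
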
@saikonen saikonen merged commit 93be3c3 into Netflix:master Mar 20, 2024
5 checks passed