[Feat] Add dynamic data and caching to data manager #398

Merged 45 commits into main from feat/data-manager on Apr 11, 2024

@antonymilne antonymilne commented Apr 2, 2024

Description

Vizro's most overdue PR is finally here! I will make some nice diagrams and explain this more in a learning session, but for now the best way to understand this change is to forget everything you currently know about the data manager and read the new docs: https://vizro--398.org.readthedocs.build/en/398/pages/user-guides/data/.

Reviewers

Don't be scared by the number of files changed. Most of this is find and replace in the docs. The only Python code that's changed substantially is data_manager.py, and even that is mainly lengthy comments. Don't worry too much about poring over the code. What's important to review here is:

  • completely updated docs section on data - easier to just look at the built docs
  • data manager functionality actually works as documented

Things to test @petar-qb @l0uden

  • Manually and automated
  • gunicorn, flask
  • redis
  • named DS, unnamed, dynamic, static
  • DASH_DEBUG mode

TODO in this PR

  • Fix tests and write some new ones
  • Try out callable with parametrised args like Max example - request should be possible but need to check page.build, see feat/data-manager-enhancements. Overriding *args, **kwargs should also be possible. Do have access to Dash ctx in data loading function.
  • Check you can update/invalidate/reset memoize on demand? Yes, also works for timeout=0 dynamic data. See feat/data-manager-enhancements.
  • Try out with live data changes, dcc.Interval etc.
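
The invalidate-on-demand item above can be illustrated with a minimal, stdlib-only sketch. This is not Vizro's or Flask-Caching's implementation: `TimedMemo`, `get` and `invalidate` are hypothetical names, and treating `timeout=0` as "never expires" is an assumption made for this sketch.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TimedMemo:
    """Toy memoizing cache keyed by data source name (hypothetical, for illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # name -> (time the value was stored, cached value)
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, name: str, loader: Callable[[], Any], timeout: float) -> Any:
        """Return the cached value for `name`, re-running `loader` once it has expired.

        Assumption in this sketch: timeout=0 means the cached value never expires.
        """
        now = self._clock()
        if name in self._store:
            stored_at, value = self._store[name]
            if timeout == 0 or now - stored_at < timeout:
                return value
        value = loader()
        self._store[name] = (now, value)
        return value

    def invalidate(self, name: str) -> None:
        """Drop the cached value so that the next `get` re-runs the loader."""
        self._store.pop(name, None)
```

With a never-expiring entry, the only way to pick up fresh data in this sketch is an explicit `invalidate("iris")` before the next `get`.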

TODO in future PRs

  • Convert all dynamic examples to static
  • Ban . in id (and other characters?). Note the need for things that appear in the page title though. Limit dataset names to valid_chars = set(string.ascii_letters + string.digits + "_.") so that they work with flask caching
  • Do we still need components to dataset mapping? No but save this for future PR. Put mapping to dataset in callable model itself. So it's still one DS to many components but no need to store mapping here.
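
As a rough sketch of the dataset-name restriction proposed in the list above: the `valid_chars` set is taken from the TODO item, while the function name `is_valid_dataset_name` and the rule that empty names are rejected are assumptions made for illustration.

```python
import string

# Character set quoted in the TODO item above.
VALID_CHARS = set(string.ascii_letters + string.digits + "_.")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if `name` is non-empty and uses only the allowed characters."""
    return bool(name) and set(name) <= VALID_CHARS
```

A name such as `iris_2024.v1` would pass this check, while names containing spaces or hyphens would not.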

Notice

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

    • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.
    • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorized to submit this contribution on behalf of the original creator(s) or their licensees.
    • I certify that the use of this contribution as authorized by the Apache 2.0 license does not violate the intellectual property rights of anyone else.
    • I have not referenced individuals, products or companies in any commits, directly or indirectly.
    • I have not added data or restricted code in any commits, directly or indirectly.

@antonymilne antonymilne changed the title from "Add refreshable cache to data manager" to "[Feat] Add dynamic data and caching to data manager" on Apr 5, 2024

Unlike static data, dynamic data cannot be supplied directly into the `data_frame` argument of a `figure`. Instead, it must first be added to the Data Manager and then referred to by name.

!!! example "Dynamic data"
Contributor Author

This is actually a terrible example of using dynamic data because the iris.csv file is static and we have a static png anyway so you can't see anything update live.

We should come up with a better example and an animated gif to show off the live update but not in this PR.

Contributor

Understand that we won't update the example in this PR, but before I forget - when we add an example, some weather forecast dashboard might be nice. Then, we can also show them how to refresh the data daily.

https://openweathermap.org/api

Or some model interaction example, where we update the model output after each run (this might be a bit more complex though)?

Contributor

One idea, maybe still for this PR, could be to make this function time-dependent and filter the DF for even or odd data given the hour of the day. Then at some point we could make a nice real example with stock market data, weather data or the like.

Contributor Author

Just as a quick thing for now I will do something where it just plots the iris dataset but with different random points selected or something like that. And then in the future we can make this a more interesting example like the weather where we call some external API.
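
A minimal sketch of that idea, using only the standard library so it runs without the CSV file: `ROWS` is a hypothetical stand-in for the rows of iris.csv, and `load_sample` is an illustrative name rather than the function added in the PR.

```python
import random
from typing import List, Optional

# Hypothetical stand-in data; the real example reads iris.csv instead.
ROWS = [{"id": i, "sepal_length": round(4.0 + 0.1 * i, 1)} for i in range(150)]


def load_sample(n: int = 50, rng: Optional[random.Random] = None) -> List[dict]:
    """Return `n` randomly chosen rows, so repeated loads generally give different data."""
    rng = rng or random.Random()
    return rng.sample(ROWS, n)
```

Each call without a seeded `rng` picks a fresh subset, which is what makes a re-run of the loading function visible in the plot.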

Contributor Author

Done in acb546a - see what you think.

@maxschulz-COL you don't know of any reason why we wouldn't be able to host iris.csv on RTD and download it as done here?

Contributor

No, but I also haven't tried it. But I like the new approach. The only thing I am unsure about is the following:

Should the code example really have pd.read_csv (I know it is to make it clear you are fetching from a file) or actually px.data.iris(), ie something that will always work?

I kinda prefer the former, but with a message similar to the one we already have, ie that here we kinda simulate the dynamic case

Contributor Author

Yeah, I had this done with px.data.iris() originally and then decided that wasn't clear enough because it's too far removed from a "real" working example where the data can change. So better to use pd.read_csv even though that means downloading a file to get the example working.

The download works ok on the automatically built docs for this PR so presumably it will also work with the released docs 🤞

@huong-li-nguyen huong-li-nguyen left a comment

WOOOOOOOOOW!!! 🚀 🔥 🦸‍♂️

I haven't properly tested this out, as I plan to do that after the unit tests have been added. But let me know if you still want me to test this prior to having all the required tests in.

Just wanted to say, that I love the documentation! Wrote down some notes for myself 👩‍🎓 Not only the user guides but also the documentation and code examples left in the docstrings of the relevant classes. It was very clear and easy to understand how I need to configure static/dynamic data and cache 👍

I have a few questions, but I think we can all discuss them during TL 👥 📚


@stichbury stichbury left a comment

Looks great to me. I made a couple of minor suggestions based on my pedantry about data updates -- see what you think. I'll make a vale update on the branch if that's OK. (Vale isn't running at present as I need to finish PR #391 with @maxschulz-COL but it's in the codebase)

@maxschulz-COL maxschulz-COL left a comment

Approval from my side, given the pending docs discussion.

petar-qb commented Apr 11, 2024

For documentation purposes (as requested), here is the set of cases I tested:

  • Time/memory usage differences when using static data:
    • when the DataFrame is supplied directly by value, e.g. vm.Graph(figure=px.scatter(px.data.iris(), ...)),
    • when the DataFrame is supplied directly by reference, e.g. vm.Graph(figure=px.scatter(iris, ...)),
    • when the DataFrame is supplied by a data_manager name reference, e.g. vm.Graph(figure=px.scatter("iris", ...)).
  • How the app behaves while the source data changes (focusing on unexpected outputs, bugs, user warning messages, memory and time performance, cache misses, process IDs, DataFrame memory locations, and clearing the cache and its persistence). Every possible combination of the following is tested:
    1. When the cache used is (with various input configurations):
      • NullCache
      • SimpleCache
      • FileSystemCache
      • RedisCache
    2. When the app is run with:
      • app.run(),
      • app.run(processes=2, threaded=False),
      • gunicorn -w 2 app:server -b localhost:8050,
      • gunicorn -w 2 app:server -b localhost:8050 --preload
    3. When the timeout per data source is set to: "CACHE_DEFAULT_TIMEOUT", 0, 20, -1.
    4. When the data source is used in zero, one or many figures.
    5. When some source code changes are made like: these.
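
To give a sense of the size of that test matrix, here is a small sketch that enumerates the combinations with `itertools.product`. The values are copied from the list above; the constant names and `build_matrix` are illustrative, and the source code changes dimension is left out because its details are not given here.

```python
from itertools import product
from typing import List, Tuple

CACHES = ["NullCache", "SimpleCache", "FileSystemCache", "RedisCache"]
RUN_MODES = [
    "app.run()",
    "app.run(processes=2, threaded=False)",
    "gunicorn -w 2 app:server -b localhost:8050",
    "gunicorn -w 2 app:server -b localhost:8050 --preload",
]
TIMEOUTS = ["CACHE_DEFAULT_TIMEOUT", 0, 20, -1]
FIGURE_COUNTS = ["zero", "one", "many"]


def build_matrix() -> List[Tuple]:
    """Return every (cache, run mode, timeout, figure count) combination."""
    return list(product(CACHES, RUN_MODES, TIMEOUTS, FIGURE_COUNTS))
```

Even without the source code changes, the four listed dimensions already give 4 × 4 × 4 × 3 = 192 combinations.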

@antonymilne antonymilne enabled auto-merge (squash) April 11, 2024 15:38
@huong-li-nguyen huong-li-nguyen self-requested a review April 11, 2024 18:03
@antonymilne antonymilne merged commit f4c627f into main Apr 11, 2024
34 checks passed
@antonymilne antonymilne deleted the feat/data-manager branch April 11, 2024 19:24
huong-li-nguyen commented Apr 11, 2024

@antonymilne - just a note: the PR initially couldn't be merged automatically because it failed the Snyk tests. I've double-checked, and most of the findings are low-severity security issues connected to the newly added requirements (redis, flask-caching, etc.).


I've marked the Snyk tests as passing on Snyk so that you can merge, but we might want to come back to the requirements at some point, or decide whether these Snyk tests make sense at all. The one on flask-caching in particular we might have to double-check again.

@antonymilne

@huong-li-nguyen ah, thank you for merging. I was actually battling Snyk and trying to figure out what to do for the last couple of hours, though. The result is #417.

6 participants