Update docs
antonymilne committed Apr 5, 2024
1 parent f0763bf commit e0b8725
Showing 6 changed files with 55 additions and 44 deletions.
8 changes: 4 additions & 4 deletions vizro-core/docs/pages/user-guides/data.md
@@ -2,19 +2,19 @@

Vizro supports two different types of data:

* [Static data](static-data.md): pandas DataFrame. This is the simplest method that is most suitable for beginners and anyone who does not need the more advanced functionality of dynamic data.
* [Static data](static-data.md): pandas DataFrame. This is the simplest method and best to use if you do not need the more advanced functionality of dynamic data.
* [Dynamic data](dynamic-data.md): function that returns a pandas DataFrame. This is a bit more complex to understand but has more advanced functionality such as the ability to refresh data while the dashboard is running.

??? note "Static vs. dynamic data comparison"

Do not worry if you do not yet understand everything in this table. It will become clearer after reading this page!
Do not worry if you do not yet understand everything in this table. It will become clearer after reading more about [static data](static-data.md) and [dynamic data](dynamic-data.md)!

| | Static | Dynamic |
|---------------------------------------------------------------|------------------|------------------------------------------|
| Required Python type | pandas DataFrame | Function that returns a pandas DataFrame |
| Can be supplied directly in `data_frame` argument of `figure` | Yes | No |
| Can be referred to by name after adding to Data Manager | Yes | Yes |
| Can be refreshed while dashboard is running | No | Yes |
| Production-ready | Yes | Yes (assuming suitable cache backend) |
| Production-ready | Yes | Yes |

If you have a Kedro project or would like to use the [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html) to manage your data independently of a Kedro project then you should use Vizro's [integration with the Kedro Data catalog](kedro-data-catalog.md_). This provides helper functions to add [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) datasets as dynamic data in the Vizro Data Manager.
If you have a [Kedro](https://kedro.org/) project or would like to use the [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html) to manage your data independently of a Kedro project then you should use Vizro's [integration with the Kedro Data Catalog](kedro-data-catalog.md). This provides helper functions to add [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) datasets as dynamic data in the Vizro Data Manager.
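The static/dynamic distinction summarized in the table above can be sketched in plain Python. This is an illustration only, not Vizro API usage, and the column values are made up:

```python
import pandas as pd

# Static data: a concrete DataFrame, built once
static_source = pd.DataFrame({"sepal_length": [5.1, 4.9], "species": ["setosa", "setosa"]})

# Dynamic data: a function that returns a fresh DataFrame each time it is called
def dynamic_source():
    return pd.DataFrame({"sepal_length": [5.1, 4.9], "species": ["setosa", "setosa"]})

assert isinstance(static_source, pd.DataFrame)  # static: a value
assert callable(dynamic_source)                 # dynamic: a callable returning a value
```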
64 changes: 34 additions & 30 deletions vizro-core/docs/pages/user-guides/dynamic-data.md
@@ -1,6 +1,6 @@
# Dynamic data

A dynamic data source is a Python function that returns a pandas DataFrame. This function is executed when the dashboard is initially started and _can be executed again while the dashboard is running_. This makes it possible to refresh the data shown in your dashboard without restarting the dashboard itself. If you do not need to this functionality then you should use [static data](static-data.md) instead.
A dynamic data source is a Python function that returns a pandas DataFrame. This function is executed when the dashboard is initially started and _can be executed again while the dashboard is running_. This makes it possible to refresh the data shown in your dashboard without restarting the dashboard itself. If you do not need this functionality then you should use [static data](static-data.md) instead.

Unlike static data, dynamic data cannot be supplied directly into the `data_frame` argument of a `figure`. Instead, it must first be added to the Data Manager and then referred to by name.

@@ -32,7 +32,7 @@ Unlike static data, dynamic data cannot be supplied directly into the `data_fram
Vizro().build(dashboard).run()
```

1. To use `load_iris_data` as dynamic data it must first be added to the Data Manager. You should **not** actually call the function `load_iris_data()`; doing so would result in static data that cannot be reloaded.
1. To use `load_iris_data` as dynamic data it must first be added to the Data Manager. You should **not** actually call the function as `load_iris_data()`; doing so would result in static data that cannot be reloaded.
2. Dynamic data is referred to by the name of the data source `"iris"`.

=== "Result"
@@ -42,50 +42,29 @@ Unlike static data, dynamic data cannot be supplied directly into the `data_fram

Since dynamic data sources must always be added to the Data Manager and referred to by name, they may be used in YAML configuration [exactly the same way as for static data sources](static-data.md#reference-by-name).

## Data refresh
## Configure cache

By default, dynamic data is cached in the Data Manager for 5 minutes. A refresh of the dashboard within this time interval will fetch the cached pandas DataFrame and not reload the data from disk. Once the cache timeout period has elapsed, the next refresh of the dashboard will re-execute the dynamic data loading function. The resulting pandas DataFrame will again be put into the cache and not expire until another 5 minutes has elapsed.
By default, every time the dashboard is refreshed a dynamic data function will be executed again. This means that your dashboard will always show the very latest data. In fact, if there are multiple graphs on the same page using the same dynamic data source then the loading function will be executed _multiple_ times, once for each graph on the page. Hence, if loading your data is a slow operation, your dashboard performance may suffer.
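The repeated-execution cost described above can be imitated with a plain-Python stand-in (no Vizro API involved; `load_data` is a hypothetical loading function):

```python
call_count = 0

def load_data():
    """Stand-in for a dynamic data function, e.g. one wrapping pd.read_csv."""
    global call_count
    call_count += 1          # track how often the (potentially slow) load runs
    return [5.1, 4.9, 4.7]   # pretend this is a pandas DataFrame

# With the cache turned off, every figure on the page triggers its own load
# on every page refresh:
data_for_graph_a = load_data()
data_for_graph_b = load_data()

assert call_count == 2  # same data source, loaded twice for a single page view
```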

You can change the timeout of the cache independently for each dynamic data source in the Data Manager using the `timeout` setting (measured in seconds). A `timeout` of 0 indicates that the cache does not expire.
```py title="Set the cache timeout"
from vizro.managers import data_manager

# Cache of default_expire_data expires every 5 minutes, the default
data_manager["default_expire_data"] = ...

# Set cache of fast_expire_data to expire every 10 seconds
data_manager["fast_expire_data"] = ...
data_manager["fast_expire_data"].timeout = 10

# Set cache of slow_expire_data to expire every hour
data_manager["slow_expire_data"] = ...
data_manager["slow_expire_data"].timeout = 60 * 60

# Set cache of no_expire_data to never expire
data_manager["no_expire_data"] = ...
data_manager["no_expire_data"].timeout = 0
```

The Vizro Data Manager has a caching mechanism to help solve this. Vizro's cache uses [Flask-Caching](https://flask-caching.readthedocs.io/en/latest/), which supports a number of possible cache backends and [configuration options](https://flask-caching.readthedocs.io/en/latest/#configuring-flask-caching). By default, the cache is turned off.

## Cache configuration

In addition to controlling the timeout of your dynamic data, you can configure the underlying caching mechanism and its settings. Vizro's Data Manager cache uses [Flask-Caching](https://flask-caching.readthedocs.io/en/latest/), which supports a number of possible cache backends and [configuration options](https://flask-caching.readthedocs.io/en/latest/#configuring-flask-caching).

By default, the Data Manager uses a [simple memory cache](https://cachelib.readthedocs.io/en/stable/simple/) with the default configuration options. This is equivalent to the following:
In a development environment the easiest way to enable caching is to use a [simple memory cache](https://cachelib.readthedocs.io/en/stable/simple/) with the default configuration options:

```py title="Simple cache with default timeout of 5 minutes"
from flask_caching import Cache

data_manager.cache = Cache(config={"CACHE_TYPE": "SimpleCache"})
data_manager["iris"] = load_iris_data
```

By default, dynamic data is cached in the Data Manager for 5 minutes. A refresh of the dashboard within this time interval will fetch the pandas DataFrame from the cache and not re-run the data loading function. Once the cache timeout period has elapsed, the next refresh of the dashboard will re-execute the dynamic data loading function. The resulting pandas DataFrame will again be put into the cache and not expire until another 5 minutes has elapsed.
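The expiry behaviour described above can be imitated with a small hand-rolled cache. This is an illustration of the timeout mechanism only, not Flask-Caching's actual implementation; `TimedCache` and `load_data` are hypothetical names:

```python
import time

class TimedCache:
    """Minimal sketch of a cache with a timeout, for illustration only."""

    def __init__(self, timeout):
        self.timeout = timeout
        self._value = None
        self._stored_at = None

    def get_or_load(self, load):
        now = time.monotonic()
        if self._stored_at is None or now - self._stored_at >= self.timeout:
            self._value = load()   # cache miss or expired: re-run the loading function
            self._stored_at = now
        return self._value         # cache hit: reuse the stored result

calls = 0

def load_data():
    global calls
    calls += 1
    return ["row"] * 3

cache = TimedCache(timeout=0.1)    # 0.1 s rather than 5 minutes, to keep the demo fast
cache.get_or_load(load_data)       # first access: executes load_data
cache.get_or_load(load_data)       # within the timeout: served from the cache
time.sleep(0.15)
cache.get_or_load(load_data)       # timeout elapsed: executes load_data again
assert calls == 2
```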

If you would like to alter some options, such as the default cache timeout, then you can specify a different cache configuration:

```py title="Simple cache with timeout set to 10 minutes"
data_manager.cache = Cache(config={"CACHE_TYPE": "SimpleCache", "CACHE_DEFAULT_TIMEOUT": 600})
```

The `timeout` setting, which can be set individually for each dynamic data source, takes precedence over the `CACHE_DEFAULT_TIMEOUT` setting, which just sets the value of `timeout` for data sources that do not explicitly set it.
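The precedence rule can be expressed as a one-line fallback. `effective_timeout` is a hypothetical helper for illustration, not part of Vizro's API:

```python
CACHE_DEFAULT_TIMEOUT = 600  # cache-wide default, in seconds

def effective_timeout(source_timeout=None):
    # A per-source `timeout` wins; otherwise fall back to the cache-wide default.
    # The check is against None, not truthiness, so that 0 ("never expire") is respected.
    return source_timeout if source_timeout is not None else CACHE_DEFAULT_TIMEOUT

assert effective_timeout() == 600   # no per-source setting: use the default
assert effective_timeout(10) == 10  # per-source setting wins
assert effective_timeout(0) == 0    # 0 means "never expire", not "use the default"
```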

!!! warning

Simple cache exists purely for single-process development purposes and is not intended to be used in production. If you deploy with multiple workers, [for example with gunicorn](run.md/#gunicorn), then you should use a production-ready cache backend. All of Flask-Caching's [built-in backends](https://flask-caching.readthedocs.io/en/latest/#built-in-cache-backends) other than `SimpleCache` are suitable for production. In particular, you might like to use [`FileSystemCache`](https://cachelib.readthedocs.io/en/stable/file/) or [`RedisCache`](https://cachelib.readthedocs.io/en/stable/redis/):
@@ -97,3 +76,28 @@ The `timeout` setting that can be set individually for each dynamic data source
# Use Redis key-value store
data_manager.cache = Cache(config={"CACHE_TYPE": "RedisCache", "CACHE_REDIS_HOST": "localhost", "CACHE_REDIS_PORT": 6379})
```

### Configure timeouts

You can change the timeout of the cache independently for each dynamic data source in the Data Manager using the `timeout` setting (measured in seconds). A `timeout` of 0 indicates that the cache does not expire. This is effectively the same as using [static data](static-data.md).
```py title="Set the cache timeout for each dynamic data source"
from vizro.managers import data_manager
from flask_caching import Cache

data_manager.cache = Cache(config={"CACHE_TYPE": "SimpleCache", "CACHE_DEFAULT_TIMEOUT": 600})

# Cache of default_expire_data expires every 10 minutes, the default set by CACHE_DEFAULT_TIMEOUT
data_manager["default_expire_data"] = load_iris_data

# Set cache of fast_expire_data to expire every 10 seconds
data_manager["fast_expire_data"] = load_iris_data
data_manager["fast_expire_data"].timeout = 10

# Set cache of slow_expire_data to expire every hour
data_manager["slow_expire_data"] = load_iris_data
data_manager["slow_expire_data"].timeout = 60 * 60

# Set cache of no_expire_data to never expire
data_manager["no_expire_data"] = load_iris_data
data_manager["no_expire_data"].timeout = 0
```
2 changes: 1 addition & 1 deletion vizro-core/docs/pages/user-guides/run.md
@@ -122,7 +122,7 @@ in the command line. For more Gunicorn configuration options, please refer to [G

!!! warning "In production"

If your dashboard uses [dynamic data](dynamic-data.md) that can be refreshed while the dashboard is running then you should [configure your Data Manager cache](data.md#cache-configuration) to use a backend that supports multiple processes. The Vizro default simple caching mechanism is only suitable for single-process development purposes and is not intended to be used in production.
If your dashboard uses [dynamic data](dynamic-data.md) that can be refreshed while the dashboard is running then you should [configure your Data Manager cache](data.md#configure-cache) to use a backend that supports multiple processes.

## Deployment

9 changes: 4 additions & 5 deletions vizro-core/docs/pages/user-guides/static-data.md
@@ -1,6 +1,6 @@
# Static data

A static data source is the simplest way to provide data to your dashboard and should be used for any data that does not need to be reloaded while the dashboard is running. It is production-ready and works out of the box in a multi-process deployment. If you need data that can be refreshed without restarting the dashboard then you should use [dynamic data](dynamic-data.md).
A static data source is the simplest way to provide data to your dashboard and should be used for any data that does not need to be reloaded while the dashboard is running. It is production-ready and works out of the box in a multi-process deployment. If you need data to be refreshed without restarting the dashboard then you should instead use [dynamic data](dynamic-data.md).

## Supply directly

@@ -35,15 +35,14 @@ You can directly supply a pandas DataFrame into components such as [graphs](grap

[DataBasic]: ../../assets/user_guides/data/data_pandas_dataframe.png

The [`Graph`][vizro.models.Graph], [`AgGrid`][vizro.models.AgGrid] and [`Table`][vizro.models.Table] models all have an argument called `figure`. This accepts a function (in the above example, `px.scatter`) which always takes a pandas DataFrame as its first argument. The name of this argument is always `data_frame`. When configuring the dashboard using Python, it is optional to give the name of the argument (so you could write `data_frame=iris`); when specifying the dashboard configuration through YAML, the argument name must be given.

The [`Graph`][vizro.models.Graph], [`AgGrid`][vizro.models.AgGrid] and [`Table`][vizro.models.Table] models all have an argument called `figure`. This accepts a function (in the above example, `px.scatter`) that takes a pandas DataFrame as its first argument. The name of this argument is always `data_frame`. When configuring the dashboard using Python, it is optional to give the name of the argument: if you like, you could write `data_frame=iris` instead of just `iris`.
!!! note

With static data, once the dashboard is running, the data shown in the dashboard cannot change even if the source data in `iris.csv` changes. The code `iris = pd.read_csv("iris.csv")` is only executed once when the dashboard is first started. If you would like changes to source data to flow through to the dashboard then you must use [dynamic data](dynamic-data.md).
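The "executed once" behaviour in the note above can be demonstrated with a plain-Python stand-in (`read_source` is a hypothetical substitute for `pd.read_csv`, and the module-level list plays the role of the changing CSV file):

```python
current_rows = [1, 2, 3]

def read_source():
    # Stand-in for pd.read_csv("iris.csv"); the "file" contents can change over time
    return list(current_rows)

static_data = read_source()   # static: executed once at startup, result frozen
dynamic_data = read_source    # dynamic: the function itself, re-executed on each load

current_rows = [1, 2, 3, 4]   # the source data changes while the "dashboard" runs

assert static_data == [1, 2, 3]        # static data does not see the change
assert dynamic_data() == [1, 2, 3, 4]  # dynamic data picks it up on the next load
```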

## Reference by name

If you would like to specify your dashboard configuration through YAML then you must first add your data to the Data Manager. The value of the `data_frame` argument in the YAML configuration should then refer to the name of your data in the Data Manager.
If you would like to specify your dashboard configuration through YAML then you must first add your data to the Data Manager, importable as `vizro.managers.data_manager`. The value of the `data_frame` argument in the YAML configuration should then refer to the name of your data in the Data Manager.

!!! example "Static data referred to by name"
=== "app.py"
@@ -89,4 +88,4 @@ If you would like to specify your dashboard configuration through YAML then you

[DataBasic]: ../../assets/user_guides/data/data_pandas_dataframe.png

It is also possible to refer to a named data source using the Python API: `px.scatter("iris", ...)` would work if the `"iris"` data source has been registered in the Data Manager. In fact, when it comes to dynamic data, using the data source name is the _only_ way to refer to a data source.
It is also possible to refer to a named data source using the Python API: `px.scatter("iris", ...)` or `px.scatter(data_frame="iris", ...)` would work if the `"iris"` data source has been registered in the Data Manager.
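The name-based lookup can be sketched with a minimal registry. This illustrates the idea only and is not Vizro's actual internals; `register` and `resolve` are hypothetical names:

```python
data_registry = {}

def register(name, data):
    """Associate a data source name with its data, like adding to the Data Manager."""
    data_registry[name] = data

def resolve(data_frame):
    # A string argument is treated as a registered data source name;
    # anything else (e.g. a DataFrame) is used directly.
    return data_registry[data_frame] if isinstance(data_frame, str) else data_frame

register("iris", [("setosa", 5.1)])
assert resolve("iris") == [("setosa", 5.1)]        # referred to by name
assert resolve([("direct", 1)]) == [("direct", 1)]  # supplied directly
```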
2 changes: 1 addition & 1 deletion vizro-core/examples/_dev/app.py
@@ -12,7 +12,7 @@
df = px.data.iris()

# Cache of default_expire_data expires every 5 minutes, the default
data_manager.cache = Cache(config={"CACHE_TYPE": "FileSystemCache", "CACHE_DIR": "cache", "CACHE_DEFAULT_TIMEOUT": 20})
# data_manager.cache = Cache(config={"CACHE_TYPE": "FileSystemCache", "CACHE_DIR": "cache", "CACHE_DEFAULT_TIMEOUT": 20})
data_manager["default_expire_data"] = lambda: px.data.iris()

# Set cache of fast_expire_data to expire every 10 seconds
14 changes: 11 additions & 3 deletions vizro-core/src/vizro/managers/_data_manager.py
@@ -18,11 +18,19 @@
# * set cache to null in all other tests
# * copy returned

# TODO: __main__ in this file: remove/move to docs

#####################
# Just for the purposes of easily seeing the right debug messages. Remove before merging to main and revert to this:
# logger = logging.getLogger(__name__)
class PrefixAdapter(logging.LoggerAdapter):
    def process(self, msg, kwargs):
        return f"[DATA MANAGER] {msg}", kwargs

logger = logging.getLogger(__name__)

logger = PrefixAdapter(logging.getLogger(__name__))
logger.setLevel(logging.DEBUG)
#####################

# Really ComponentID and DataSourceName should be NewType and not just aliases but then for a user's code to type check
# correctly they would need to cast all strings to these types.
# TODO: remove these type aliases once have moved component to data mapping to models
@@ -182,7 +190,7 @@ def __init__(self):
        self.__data: Dict[DataSourceName, Union[_DynamicData, _StaticData]] = {}
        self.__component_to_data: Dict[ComponentID, DataSourceName] = {}
        self._frozen_state = False
        self.cache = Cache(config={"CACHE_TYPE": "SimpleCache"})
        self.cache = Cache(config={"CACHE_TYPE": "NullCache"})
        # In future, possibly we will accept just a config dict. Would need to work out whether to handle merging with
        # default values though. We would do this with something like this:
        # def __set_cache(self, cache_config):
