ploomber · edublancas · Aug 13, 2024 · Aug 12, 2024 · Aug 12, 2024 · Aug 12, 2024
@@ -0,0 +1,46 @@
+# Plotting large datasets in Dash
+
+Interactive Dash applications that plot flight departure date and time (24h) against departure delay (minutes). In the `resampler` and `combined` apps, you can select the date range you want to visualize.
+
+The applications plot large datasets using one of:
+- [**WebGL**](https://plotly.com/python/webgl-vs-svg/) (in `webgl` folder): a technology that uses the GPU to accelerate rendering, so figures with many points draw faster. This method is generally suited to figures with up to 100,000-200,000 markers (Plotly's term for data points in a chart), depending on the power of your GPU. For larger figures, it is often better to aggregate the data points first.
+
+![](static/app_webgl.png)
+
+- [**`plotly-resampler`**](https://github.com/predict-idlab/plotly-resampler) (in `resampler` folder): an external library that dynamically aggregates time-series data according to the current graph view. This approach downsamples your dataset at the cost of losing some detail.
+
+![](static/app_resampler.png)
+
+- Combined approach (in `combined` folder): a WebGL trace (`Scattergl`) wrapped in `plotly-resampler`.
+
+![](static/app_combined.png)
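
The aggregation step suggested for figures beyond WebGL's comfortable range can be sketched with plain pandas. This is a minimal illustration, not what the apps in this PR do: the helper name `aggregate_delays`, the 60-minute bin size, and the synthetic data are all assumptions made for the example. Only the column names `DEP_DATETIME` and `DEP_DELAY` come from the cleaned dataset.

```python
import numpy as np
import pandas as pd


def aggregate_delays(df, freq="60min"):
    """Collapse raw points into one row per time bin (hypothetical helper)."""
    return (
        df.set_index("DEP_DATETIME")["DEP_DELAY"]
        .resample(freq)
        .agg(["mean", "max", "count"])
        .dropna(subset=["mean"])  # Drop bins that contain no flights.
        .reset_index()
    )


# Synthetic stand-in for the cleaned flight data (not the real dataset).
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "DEP_DATETIME": pd.date_range("2006-01-01", periods=120_000, freq="min"),
    "DEP_DELAY": rng.normal(10, 30, 120_000),
})

hourly = aggregate_delays(raw)
print(len(raw), "->", len(hourly))  # 120000 -> 2000
```

The aggregated frame is small enough to plot with an ordinary `go.Scatter` trace. The trade-off is the same one noted for `plotly-resampler`: individual outliers are only visible through the `max` column.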
+
+We will be using a commercial flight dataset that documents information such as flight departure date/time and delays. You can find it [here](https://github.com/vega/falcon/blob/master/data/flights-3m.csv).
+
+Once you download the dataset, run `python csv-clean.py flights-3m.csv` to obtain the cleaned CSV file `flights-3m-cleaned.csv`. Move the cleaned file to the `data` folder in whichever project folder (`webgl`, `resampler` or `combined`) you want to test.
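
To make the cleaning step concrete, here is a small sketch of the same transformation on a made-up three-row sample. It assumes `FL_DATE` is an integer in `YYYYMMDD` form and `DEP_TIME` a number in `HHMM` form, which is what the arithmetic in `csv-clean.py` implies. The vectorized `pd.to_datetime(..., format=...)` call is a variant chosen for this example, not the script's exact code.

```python
import pandas as pd

# Made-up sample in the raw layout that csv-clean.py appears to expect.
raw = pd.DataFrame({
    "FL_DATE": [20060101, 20060101, 20060102],
    "DEP_TIME": [930.0, 2400.0, None],
    "DEP_DELAY": [5.0, -3.0, None],
})

# Drop rows with missing departure time or delay.
clean = raw.dropna(subset=["DEP_TIME", "DEP_DELAY"]).copy()

# Like the script, map 2400 to 0 so the hour stays between 0 and 23.
clean.loc[clean["DEP_TIME"] == 2400, "DEP_TIME"] = 0

# 20060101 * 10000 + 930 -> 200601010930 -> "200601010930" -> 2006-01-01 09:30
stamp = (clean["FL_DATE"] * 10000 + clean["DEP_TIME"]).astype("int64").astype(str)
clean["DEP_DATETIME"] = pd.to_datetime(stamp, format="%Y%m%d%H%M")

clean = clean[["DEP_DATETIME", "DEP_DELAY"]].sort_values("DEP_DATETIME")
print(clean)
```

Note that a `DEP_TIME` of 2400 ends up as midnight of the same `FL_DATE`, mirroring the script's behavior.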
+
+## Local testing
+
+`cd` into the folder of the approach you want to test, then run `gunicorn app:server run --bind 0.0.0.0:80`. The app should then be reachable at `http://localhost:80` (binding to port 80 may require elevated privileges on some systems).
+
+## Upload to Ploomber Cloud
+
+Ensure that you are in the correct project folder.
+
+### Command line
+
+Go to your app folder and set your API key: `ploomber-cloud key YOURKEY`. Next, initialize your app: `ploomber-cloud init` and deploy it: `ploomber-cloud deploy`. For more details, please refer to our [documentation](https://docs.cloud.ploomber.io/en/latest/user-guide/cli.html).
+
+### UI
+
+Zip `app.py` together with `requirements.txt` and the `data` folder, then upload the archive to Ploomber Cloud. For more details, please refer to our [Dash deployment guide](https://docs.cloud.ploomber.io/en/latest/apps/dash.html).
+
+## Interacting with the App
+
+Once the app starts running, you will see a page similar to the screenshots above. Click and drag on the graph to zoom into any region.
+
+![](static/zoom_in.gif)
+
+To restore the figure to its original state, click the `Reset axes` button in the upper right corner of the figure.
+
+![](static/zoom_out.gif)
@@ -0,0 +1,67 @@
+from dash import dcc, html, Input, Output, Dash
+import pandas as pd
+from datetime import datetime as dt
+import plotly.graph_objects as go
+from plotly_resampler import FigureResampler
+
+app = Dash(__name__)
+server = app.server
+
+N = 100000
+
+df = pd.read_csv("data/flights-3m-cleaned.csv")
+
+app.layout = html.Div(children=[
+    html.H1("Plotting Large Datasets in Dash"),
+    html.P("Select range of flight dates to visualize"),
+    dcc.DatePickerRange(
+        id="date-picker-select",
+        start_date=dt(2006, 1, 1),
+        end_date=dt(2006, 4, 1),
+        min_date_allowed=dt(2006, 1, 1),
+        max_date_allowed=dt(2006, 7, 1),
+        initial_visible_month=dt(2006, 1, 1),
+    ),
+    dcc.Graph(id='example-graph'),
+
+])
+
+@app.callback(
+    Output("example-graph", "figure"),
+    [
+        Input("date-picker-select", "start_date"),
+        Input("date-picker-select", "end_date"),
+    ],
+)
+def update_figure(start, end):
+    # pd.to_datetime accepts date-only strings and ISO timestamps; a date-only string means midnight.
+    start, end = pd.to_datetime(start), pd.to_datetime(end)
+
+    dep_time = pd.to_datetime(df["DEP_DATETIME"])  # Parse once instead of twice.
+    df_filtered = df[(dep_time >= start) & (dep_time <= end)]
+
+    fig = FigureResampler(go.Figure())
+
+    fig.add_trace(go.Scattergl(
+            mode="markers", # Replace with "lines+markers" if you want to display lines between time series data.
+            showlegend=False,
+            line_width=0.3,
+            line_color="gray",
+            marker={
+                "color": abs(df_filtered["DEP_DELAY"]), # Color by absolute delay, taken from the filtered rows.
+                "colorscale": "Portland", # How marker color changes based on data point value.
+                "size": abs(5 + df_filtered["DEP_DELAY"] / 50) # Non-negative marker size, taken from the filtered rows.
+            }
+        ), 
+        hf_x=df_filtered["DEP_DATETIME"],
+        hf_y=df_filtered["DEP_DELAY"],
+        max_n_samples=N
+    )
+
+    fig.update_layout(
+        title="Flight departure delay",
+        xaxis_title="Flight date and time (24h)",
+        yaxis_title="Departure delay (minutes)"
+    )
+
+    return fig
@@ -0,0 +1,4 @@
+dash
+plotly-resampler
+pandas
+gunicorn
@@ -0,0 +1,27 @@
+import pandas as pd
+import sys
+
+if __name__ == "__main__":
+    if (len(sys.argv) != 2 or not sys.argv[1].endswith(".csv")):
+        raise ValueError("Usage: python csv-clean.py filename.csv")
+
+    in_file = sys.argv[1]
+    df = pd.read_csv(in_file)
+
+    # Clean out null values
+    df = df[df['DEP_TIME'].notnull() & df['DEP_DELAY'].notnull()]
+
+    # Ensure hour is between 0 and 23 for conversion
+    df.loc[df.DEP_TIME == 2400, 'DEP_TIME'] = 0
+
+    # Add time to date and convert
+    df["DEP_DATETIME"] = (df["FL_DATE"] * 10000 + df["DEP_TIME"]).astype("int64").astype(str)
+    df["DEP_DATETIME"] = pd.to_datetime(df["DEP_DATETIME"], format="%Y%m%d%H%M")  # Vectorized, e.g. "200601010930".
+
+    # Select relevant columns.
+    df = df[["DEP_DATETIME", "DEP_DELAY"]].sort_values(["DEP_DATETIME"])
+    print("Completed conversion. Resulting DataFrame:\n")
+    print(df)
+
+    out_file = in_file[:-4] + "-cleaned.csv"
+    df.to_csv(out_file, sep=",", index=False)  # Skip the index so no unnamed column is written.
@@ -0,0 +1,65 @@
+from dash import dcc, html, Input, Output, Dash
+import pandas as pd
+from datetime import datetime as dt
+import plotly.graph_objects as go
+from plotly_resampler import FigureResampler
+
+app = Dash(__name__)
+server = app.server
+
+N = 2000
+
+df = pd.read_csv("data/flights-3m-cleaned.csv")
+
+app.layout = html.Div(children=[
+    html.H1("Plotting Large Datasets in Dash"),
+    html.P("Select range of flight dates to visualize"),
+    dcc.DatePickerRange(
+        id="date-picker-select",
+        start_date=dt(2006, 1, 1),
+        end_date=dt(2006, 4, 1),
+        min_date_allowed=dt(2006, 1, 1),
+        max_date_allowed=dt(2006, 7, 1),
+        initial_visible_month=dt(2006, 1, 1),
+    ),
+    dcc.Graph(id='example-graph'),
+
+])
+
+@app.callback(
+    Output("example-graph", "figure"),
+    [
+        Input("date-picker-select", "start_date"),
+        Input("date-picker-select", "end_date"),
+    ],
+)
+def update_figure(start, end):
+    # pd.to_datetime accepts date-only strings and ISO timestamps; a date-only string means midnight.
+    start, end = pd.to_datetime(start), pd.to_datetime(end)
+
+    dep_time = pd.to_datetime(df["DEP_DATETIME"])  # Parse once instead of twice.
+    df_filtered = df[(dep_time >= start) & (dep_time <= end)]
+
+    fig = FigureResampler(go.Figure())
+
+    fig.add_trace(go.Scatter(
+            mode="markers", # Replace with "lines+markers" if you want to display lines between time series data.
+            showlegend=False,
+            line_width=0.3,
+            line_color="gray",
+            marker_size=abs(5 + df_filtered["DEP_DELAY"] / 50), # Non-negative marker size, taken from the filtered rows.
+            marker_colorscale="Portland", # How marker color changes based on data point value.
+            marker_color=abs(df_filtered["DEP_DELAY"]), # Color by absolute delay, taken from the filtered rows.
+        ),
+        hf_x=df_filtered["DEP_DATETIME"],
+        hf_y=df_filtered["DEP_DELAY"],
+        max_n_samples=N
+    )
+
+    fig.update_layout(
+        title="Flight departure delay",
+        xaxis_title="Flight date and time (24h)",
+        yaxis_title="Departure delay (minutes)"
+    )
+
+    return fig
@@ -0,0 +1,4 @@
+dash
+plotly-resampler
+pandas
+gunicorn
@@ -0,0 +1,39 @@
+from dash import dcc, html, Dash
+import pandas as pd
+import plotly.graph_objects as go
+
+app = Dash(__name__)
+server = app.server
+
+N = 100000 # Limit number of rows to plot.
+
+fig = go.Figure() # Initiate the figure.
+
+df = pd.read_csv("data/flights-3m-cleaned.csv")
+
+fig.add_trace(go.Scattergl(
+    x=df["DEP_DATETIME"][:N],
+    y=df["DEP_DELAY"][:N],
+    mode="markers", # Replace with "lines+markers" if you want to display lines between time series data.
+    showlegend=False,
+    line_width=0.3,
+    line_color="gray",
+    marker={
+        "color": abs(df["DEP_DELAY"][:N]), # Convert marker value to color.
+        "colorscale": "Portland", # How marker color changes based on data point value.
+        "size": abs(5 + df["DEP_DELAY"][:N] / 50) # Non-negative size of individual data point marker based on the dataset.
+    }
+    )
+)
+
+fig.update_layout(
+    title="Flight departure delay",
+    xaxis_title="Flight date and time (24h)",
+    yaxis_title="Departure delay (minutes)"
+)
+
+app.layout = html.Div(children=[
+    html.H1("Plotting Large Datasets in Dash"),
+    dcc.Graph(id='example-graph', figure=fig),
+
+])
@@ -0,0 +1,3 @@
+dash
+pandas
+gunicorn