diff --git a/docs/example-rain.ipynb b/docs/example-rain.ipynb new file mode 100644 index 0000000..3e991c1 --- /dev/null +++ b/docs/example-rain.ipynb @@ -0,0 +1,521 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "a2810ee9", + "metadata": {}, + "source": [ + "# Example: Lysimeter Readings" + ] + }, + { + "cell_type": "markdown", + "id": "2efc69b3", + "metadata": {}, + "source": [ + "In this tutorial, we will be composing a four-panel plot for multiscale visualization of rainfall time series data in Texas made available by [Evett et al. via the USDA](https://doi.org/10.15482/USDA.ADC/1528713).\n", + "Our data comprises recordings from a pair of rain gauges positioned in opposite corners of the study area.\n", + "\n", + "We'll need to tackle two key challenges into visualizing this rainfall time series: *1)* dealing with scrunched time/rainfall scales and *2)* co-visualizing dueling readings from our twin gauges.\n", + "\n", + "### Challenge 1\n", + "\n", + "To address the first challenge, we will use `matplotlib`'s `stackplot` to create area plots with a transparency (\"alpha\") effect.\n", + "(For those unfamiliar, area plots are line plots with the area between the line and the x-axis filled in.)\n", + "Because the gauges mostly agree, the majority of plotted area will be overlap from both gauges.\n", + "However, where they differ one area plot will show through.\n", + "\n", + "### Challenge 2\n", + "\n", + "The second challenge in visualizing this data arises because, in the particular study area, large amounts of rain falls in short spurts.\n", + "So, when we zoom out to see the whole month and the maximum rainfall rate, large spikes in the data cause the rest of the data to be scrunched into obscurity.\n", + "\n", + "To plot our data without losing information about low-magnitude rainfall and the short-time events, we will use the `outset` package draw supplementary views of the data at alternate magnifications.\n", + "These auxiliary plots will be combined with the main plot (overall view) as an axes grid.\n" + ] + }, + { + "cell_type": "markdown", + "id": "eae5ff55", + "metadata": {}, + "source": [ + "## Setup" + ] + }, + { + "cell_type": "markdown", + "id": "7139e2c2", + "metadata": {}, + "source": [ + "Begin by importing necessary packages.\n", + "\n", + "Notably:\n", + "- `datetime` for converting the month and day values from day-of-year\n", + "- `pandas` for data management\n", + "- `matplotlib` for plotting area charts using `stackplot`\n", + "- `outset` for managing multi-zoom grid and zoom indicators" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7a789ae1", + "metadata": {}, + "outputs": [], + "source": [ + "from datetime import datetime, timedelta\n", + "import itertools as it\n", + "import typing\n", + "\n", + "from matplotlib import pyplot as plt\n", + "from matplotlib import text as mpl_text\n", + "import numpy as np\n", + "import outset as otst\n", + "import opytional as opyt\n", + "import pandas as pd\n", + "import seaborn as sns" + ] + }, + { + "cell_type": "markdown", + "id": "01b3697f", + "metadata": {}, + "source": [ + "To install dependencies for this exercise,\n", + "\n", + "```bash\n", + "python3 -m pip install \\\n", + " matplotlib `# ==3.8.2`\\\n", + " numpy `# ==1.26.2` \\\n", + " outset `# ==0.1.4` \\\n", + " opytional `# ==0.1.0` \\\n", + " pandas `# ==2.1.3` \\\n", + " seaborn `# ==0.13.0`\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "65c21652", + "metadata": {}, + "source": [ + "## Data Preparation" + ] + }, + { + "cell_type": "markdown", + "id": "d69ed300", + "metadata": {}, + "source": [ + "Next, fetch our data and do a little work on it: rename columns and subset data down to just the month of March.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a6d5f41a", + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.read_csv(\"https://osf.io/6mx3e/download\") # download data\n", + "\n", + "nwls = \"NW Lysimeter\\n(35.18817624°N, -102.09791°W)\"\n", + "swls = \"SW Lysimeter\\n(35.18613985°N, -102.0979187°W)\"\n", + "df[nwls], df[swls] = df[\"NW precip in mm\"], df[\"SW precip in mm\"]\n", + "\n", + "# filter down to just data from March 2019\n", + "march_df = df[np.clip(df[\"Decimal DOY\"], 59, 90) == df[\"Decimal DOY\"]]\n", + "\n", + "march_df # show snippet of dataframe content" + ] + }, + { + "cell_type": "markdown", + "id": "7ba222ae", + "metadata": {}, + "source": [ + "Here's a preliminary look at the time series." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3b714c9f", + "metadata": {}, + "outputs": [], + "source": [ + "ax = sns.lineplot(data=march_df, x=\"Decimal DOY\", y=\"NW precip in mm\")\n", + "sns.lineplot(data=march_df, x=\"Decimal DOY\", y=\"SW precip in mm\", ax=ax)\n", + "\n", + "pass # sponge up last return value, if any" + ] + }, + { + "cell_type": "markdown", + "id": "747b0dc8", + "metadata": {}, + "source": [ + "We've certainly got some work to do to nice this up!\n", + "\n", + "Our visualization will focus on showing three details that are difficult to make out in a naive visualization *1)* a little shower around day 72, *2)* the big rainstorm around day 82, and *3)* light precipitation events over the course of the entire month.\n", + "We'll create a zoom panel to show each of these components." + ] + }, + { + "cell_type": "markdown", + "id": "973d64db", + "metadata": {}, + "source": [ + "# Setup Axes Grid" + ] + }, + { + "cell_type": "markdown", + "id": "40a42299", + "metadata": {}, + "source": [ + "Our first plotting step is to initialize an `outset.OutsetGrid` to manage content and layout of our planned axes grid.\n", + "This class operates analogously to seaborn's [`FacetGrid`](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html), if you're familiar with that.\n", + "\n", + "We'll provide a list of the main plot regions we want to magnify through the `data` kwarg.\n", + "Other kwargs provide styling and layout information, including how we want plots to be shaped and how many columns we want to have." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "be99c6e6", + "metadata": {}, + "outputs": [], + "source": [ + "grid = otst.OutsetGrid( # initialize axes grid manager\n", + " data=[\n", + " # (x0, y0, x1, y1) regions to outset\n", + " (71.6, 0, 72.2, 2), # little shower around day 72\n", + " (59, 0, 90, 0.2), # all light precipitation events\n", + " (81.3, 0, 82.2, 16), # big rainstorm around day 82\n", + " ],\n", + " x=\"Time\",\n", + " y=\"Precipitation (mm)\", # label axes\n", + " aspect=2, # make subplots wide\n", + " col_wrap=2, # wrap subplots into a 2x2 grid\n", + " # styling for zoom indicator annotations, discussed later\n", + " marqueeplot_kws={\"frame_outer_pad\": 0, \"mark_glyph_kws\": {\"zorder\": 11}},\n", + " marqueeplot_source_kws={\"zorder\": 10, \"frame_face_kws\": {\"zorder\": 10}},\n", + ")\n", + "\n", + "pass # sponge up last return value, if any" + ] + }, + { + "cell_type": "markdown", + "id": "2db75a06", + "metadata": {}, + "source": [ + "## Set Up Plot Content" + ] + }, + { + "cell_type": "markdown", + "id": "68aa0b8d", + "metadata": {}, + "source": [ + "Next, we'll set up the content of our plots --- overlapped area plots showing the two rain gauges' readings.\n", + "\n", + "Matplotlib's [`stackplot`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.stackplot.html) is designed to create area plots with areas \"stacked\" on top of each other instead of overlapping.\n", + "To get an overlap, we'll call `stackplot` twice so that each \"stack\" contains only one of our variables.\n", + "\n", + "We will use `OutsetGrid.broadcast` to draw the same content across all four axes in our grid.\n", + "This method take a plotter function as its first argument then calls it with subsequent arguments forwarded to it on each axis." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6afadc6c", + "metadata": {}, + "outputs": [], + "source": [ + "# draw semi-transparent filled lineplot on all axes for each lysimeter\n", + "for y, color in zip([nwls, swls], [\"fuchsia\", \"aquamarine\"]):\n", + " grid.broadcast(\n", + " plt.stackplot, # plotter\n", + " march_df[\"Decimal DOY\"], # all kwargs below forwarded to plotter...\n", + " march_df[y],\n", + " colors=[color],\n", + " labels=[y],\n", + " lw=2,\n", + " edgecolor=color,\n", + " alpha=0.4, # set to 60% transparent (alpha 1.0 is non-transparent)\n", + " zorder=10,\n", + " )\n", + "\n", + "pass # sponge up last return value, if any" + ] + }, + { + "cell_type": "markdown", + "id": "7ac55cf7", + "metadata": {}, + "source": [ + "To pretty things up, e'll lay down a white underlay around the stackplots for better contrast against background fills.\n", + "\n", + "We can do this by drawing another stackplot that tracks the maximum reading between our rain gauges at each point in time.\n", + "Specifying a lower `zorder` for this plot causes it to be drawn below the other stackplots." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f87f53a0", + "metadata": {}, + "outputs": [], + "source": [ + "ys = np.maximum(march_df[\"SW precip in mm\"], march_df[\"NW precip in mm\"])\n", + "grid.broadcast(\n", + " plt.stackplot, # plotter\n", + " march_df[\"Decimal DOY\"], # all kwargs below forwarded to plotter...\n", + " ys,\n", + " colors=[\"white\"],\n", + " lw=20, # thick line width causes protrusion of white border\n", + " edgecolor=\"white\",\n", + " zorder=9, # note lower zorder positions underlay below stackplots\n", + ")\n", + "\n", + "display(grid.figure) # show current progress\n", + "\n", + "pass # sponge up last return value, if any" + ] + }, + { + "cell_type": "markdown", + "id": "73a187d9", + "metadata": {}, + "source": [ + "# Add Zoom Indicators" + ] + }, + { + "cell_type": "markdown", + "id": "5ce10ffc", + "metadata": {}, + "source": [ + "Now it's time to add zoom indicator boxes, a.k.a. `outset` \"marquees,\" to show how the scales of our auxiliary plots relate to the scale of the main plot.\n", + "Note that we pass a kwarg to allow aspect ratios to vary between the main plot and outset plots.\n", + "That way, zoom areas can be expanded along their smaller dimension to take full advantage of available space." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8650b5cb", + "metadata": {}, + "outputs": [], + "source": [ + "# draw \"marquee' zoom indicators showing correspondences between main plot\n", + "# and outset plots\n", + "grid.marqueeplot(equalize_aspect=False) # allow axes aspect ratios to vary\n", + "\n", + "display(grid.figure) # show current progress\n", + "\n", + "pass # sponge up last return value, if any" + ] + }, + { + "cell_type": "markdown", + "id": "dde85fc7", + "metadata": {}, + "source": [ + "## Replace Numeric Tick Labels with Human-readable Timestamps" + ] + }, + { + "cell_type": "markdown", + "id": "8b108c82", + "metadata": {}, + "source": [ + "We're almost there!\n", + "But the x-axis tick labels are still numeric \"day of year\" values, which is not very intuitive.\n", + "I don't know off the top of my head what day 42 of the year corresponds to, do you?\n", + "\n", + "Let's fix that.\n", + "To replace the existing tick labels with timestamps, we'll define a function that takes a numeric day of the year and returns a human-readable timestamp.\n", + "We'll always include the time of day, but we'll only include the date on between-day transitions.\n", + "We'll also need helper function to convert numeric day of the year to a Python datetime object." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e4d21624", + "metadata": {}, + "outputs": [], + "source": [ + "# helper function\n", + "def to_dt(day_of_year: float, year: int = 2019) -> datetime:\n", + " \"\"\"Convert decimal day of the year to a datetime object.\"\"\"\n", + " return datetime(year=year, month=1, day=1) + timedelta(days=day_of_year)\n", + "\n", + "\n", + "def format_tick_value(\n", + " prev_value: typing.Optional[mpl_text.Text],\n", + " value: mpl_text.Text,\n", + ") -> str:\n", + " \"\"\"Create human-readable timestamp to replace a numeric axes tick.\n", + "\n", + " Adjust date string content based on whether previous tick is a different\n", + " calendar day than the current tick.\n", + " \"\"\"\n", + " decimal_doy = float(value.get_text())\n", + " prev_decimal_doy = opyt.apply_if(prev_value, lambda x: float(x.get_text()))\n", + "\n", + " # if the previous tick is the same day as this tick...\n", + " # (note: first tick has no previous tick, so prev_decimal_day is None)\n", + " if int(decimal_doy) == opyt.apply_if(prev_decimal_doy, int):\n", + " # ... then just label with time of day\n", + " return to_dt(decimal_doy).strftime(\"%H:%M\")\n", + " # otherwise, if prev tick is different day AND this tick near midnight...\n", + " elif decimal_doy % 1.0 < 1 / 24:\n", + " # ... then just label date and not time of day\n", + " return to_dt(decimal_doy).strftime(\"%b %-d\")\n", + " # otherwise, prev tick is different day and this tick NOT near midnight...\n", + " else:\n", + " # ... label with time of day AND date\n", + " return to_dt(decimal_doy).strftime(\"%H:%M\\n%b %-d\")" + ] + }, + { + "cell_type": "markdown", + "id": "6a77a7d7", + "metadata": {}, + "source": [ + "With this out of the way, we can loop over the axes in our grid and perform the label replacement." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0bd83292", + "metadata": {}, + "outputs": [], + "source": [ + "for ax in grid.axes.flat: # iterate over all axes in grid\n", + " # first, filter x ticks to keep only ticks that are within axes limits\n", + " ax.set_xticks(\n", + " [val for val in ax.get_xticks() if np.clip(val, *ax.get_xlim()) == val]\n", + " )\n", + " # second, map format_tick_value over tick positions to make tick timestamps\n", + " new_tick_labels = it.starmap(\n", + " format_tick_value, # mapped function\n", + " # ... taking sequential tick pairs as arguments,\n", + " # front-padded with None (first tick has no preceding tick)\n", + " it.pairwise((None, *ax.get_xticklabels())),\n", + " )\n", + " # third, replace existing tick labels with new timestamp tick labels\n", + " ax.set_xticklabels([*new_tick_labels]) # make list b/c mpl rejects iters\n", + "\n", + "display(grid.figure) # show current progress\n", + "\n", + "pass # sponge up last return value, if any" + ] + }, + { + "cell_type": "markdown", + "id": "0b865fcb", + "metadata": {}, + "source": [ + "## Final Details" + ] + }, + { + "cell_type": "markdown", + "id": "17f62773", + "metadata": {}, + "source": [ + "The last order of business is to add a legend to the upper left corner of the main plot." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "394527b0", + "metadata": {}, + "outputs": [], + "source": [ + "grid.source_axes.legend( # add legend to primary axes\n", + " loc=\"upper left\",\n", + " bbox_to_anchor=(0.02, 1.0), # legend positioning\n", + " frameon=True, # styling: turn on legend frame\n", + ")\n", + "\n", + "pass # sponge up last return value, if any" + ] + }, + { + "cell_type": "markdown", + "id": "30ea984d", + "metadata": {}, + "source": [ + "# Et Voilà!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "47fc94f2", + "metadata": {}, + "outputs": [], + "source": [ + "display(grid.figure) # show completed figure!\n", + "\n", + "pass # sponge up last return value, if any" + ] + }, + { + "cell_type": "markdown", + "id": "97b9aa30", + "metadata": {}, + "source": [ + "## Citations" + ] + }, + { + "cell_type": "markdown", + "id": "f60ff9d5", + "metadata": {}, + "source": [ + "> Evett, Steven R.; Marek, Gary W.; Copeland, Karen S.; Howell, Terry A. Sr.; Colaizzi, Paul D.; Brauer, David K.; Ruthardt, Brice B. (2023). Evapotranspiration, Irrigation, Dew/frost - Water Balance Data for The Bushland, Texas Soybean Datasets. Ag Data Commons. https://doi.org/10.15482/USDA.ADC/1528713. Accessed 2023-12-26.\n", + "\n", + "> J. D. Hunter, \"Matplotlib: A 2D Graphics Environment\", Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95, 2007. https://doi.org/10.1109/MCSE.2007.55\n", + "\n", + "> Marek, G. W., Evett, S. R., Colaizzi, P. D., & Brauer, D. K. (2021). Preliminary crop coefficients for late planted short-season soybean: Texas High Plains. Agrosystems, Geosciences & Environment, 4(2). https://doi.org/10.1002/agg2.20177\n", + "\n", + "> Matthew Andres Moreno. (2023). mmore500/outset. Zenodo. https://doi.org/10.5281/zenodo.10426106\n", + "\n", + "> Data structures for statistical computing in python, McKinney, Proceedings of the 9th Python in Science Conference, Volume 445, 2010. https://doi.org/ 10.25080/Majora-92bf1922-00a\n", + "\n", + "> Waskom, M. L., (2021). seaborn: statistical data visualization. Journal of Open Source Software, 6(60), 3021, https://doi.org/10.21105/joss.03021." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/example-wealth.ipynb b/docs/example-wealth.ipynb new file mode 100644 index 0000000..8dd3d40 --- /dev/null +++ b/docs/example-wealth.ipynb @@ -0,0 +1,470 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "27f43f21", + "metadata": {}, + "source": [ + "# Example: Billionaire Demographics" + ] + }, + { + "cell_type": "markdown", + "id": "11cc74b9", + "metadata": {}, + "source": [ + "This data visualization tutorial tackles a common pair of data visualization objectives: *1)* showing how data between categories relate and *2)* showing how data within each category are structured.\n", + "\n", + "To this end, we'll build a visualization for the demographics of the most wealthy members of different industries.\n", + "Data for this exercise comes from [Joy Shill via Kaggle](https://www.kaggle.com/datasets/joyshil0599/exploring-wealth-forbes-richest-people-dataset).\n", + "\n", + "In our visualization, we will look at how the demographics of the upper echelons of different industries are structured and how that structure compares between industries.\n", + "Scatterploting age versus wealth faceted per industry will provide a good view of individual industries' top-tier compositions.\n", + "To aid comparison of demographics between industries, we'll provide a summary plot and structure our industry-specific facets zoomed outsets of the summary plot.\n", + "We'll arrange the main plot and the zoom outsets together as an axes grid.\n", + "\n", + "To take on this latter task --- composing a primary axes with a faceted grid of zoomed views --- we'll make use of `outset` library, which provides convenient tools to this end.\n" + ] + }, + { + "cell_type": "markdown", + "id": "ce4c5068", + "metadata": {}, + "source": [ + "Let's start coding!" + ] + }, + { + "cell_type": "markdown", + "id": "9e0677d1", + "metadata": {}, + "source": [ + "## Setup" + ] + }, + { + "cell_type": "markdown", + "id": "fe3126bc", + "metadata": {}, + "source": [ + "Begin by importing necessary packages.\n", + "\n", + "Notably:\n", + "- `adjustText` for nudging scatterplot text labels to prevent overlap\n", + "- `matplotlib` as the graphics engine\n", + "- `outset` for managing multi-zoom grid and zoom indicators\n", + "- `pandas` for data management" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "676f9fd7", + "metadata": {}, + "outputs": [], + "source": [ + "import typing\n", + "\n", + "from adjustText import adjust_text\n", + "import outset as otst\n", + "from outset import patched as otst_patched\n", + "from matplotlib import pyplot as plt\n", + "import pandas as pd\n", + "\n", + "plt.style.use(\"bmh\") # aesthetics: switch matplotlib style sheet" + ] + }, + { + "cell_type": "markdown", + "id": "6ad12b89", + "metadata": {}, + "source": [ + "To install dependencies for this exercise,\n", + "\n", + "```bash\n", + "python3 -m pip install \\\n", + " adjustText `# ==0.8` \\\n", + " matplotlib `# ==3.8.2`\\\n", + " numpy `# ==1.26.2` \\\n", + " outset `# ==0.1.4` \\\n", + " opytional `# ==0.1.0` \\\n", + " pandas `# ==2.1.3`\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "4221755d", + "metadata": {}, + "source": [ + "## Data Preparation" + ] + }, + { + "cell_type": "markdown", + "id": "8d48041b", + "metadata": {}, + "source": [ + "Next, let's fetch our data and do a little work on it.\n", + "We'll need to clean up the \"Net Worth\" column, which is polluted with dollar signs and textual \"billion\" qualifiers.\n", + "We'll also shorten the names of the people in our dataset so they're easier to plot and rank everyone within their own industry in order to isolate the upper echelons of each industry." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d86dbbfa", + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.read_csv(\"https://osf.io/bvrjm/download\") # download data\n", + "df[\"Net Worth (Billion USD)\"] = (\n", + " # strip out pesky non-numeric characters then convert to float\n", + " df[\"Net Worth\"]\n", + " .str.replace(r\"[^.0-9]+\", \"\", regex=True)\n", + " .astype(float)\n", + ")\n", + "df[\"Who\"] = ( # shorten names so they're easier to plot...\n", + " df[\"Name\"]\n", + " .str.replace(r\"\\s*&.*\", \"\", regex=True) # ... remove \"& family\"\n", + " .replace(r\"(\\b[A-Za-z])\\w+\\s+\", r\"\\1. \", regex=True) # abbrev F. Lastname\n", + " .str.slice(0, 12) # chop long names to 12 characters\n", + ")\n", + "\n", + "# rank everyone by wealth within their industry\n", + "df[\"Industry Rank\"] = df.groupby(\"Industry\")[\"Net Worth (Billion USD)\"].rank(\n", + " method=\"dense\", ascending=False\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "cbbe0f27", + "metadata": {}, + "source": [ + "For tractability, we will visualize only a subset of industry categories.\n", + "Here's a few with interesting contrast in demographic structure." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1ac577c5", + "metadata": {}, + "outputs": [], + "source": [ + "focal_industries = [\n", + " \"Technology\",\n", + " \"Fashion & Retail\",\n", + " \"Sports\",\n", + " \"Finance & Investments\",\n", + " \"Automotive\",\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "fb6e69cc", + "metadata": {}, + "source": [ + "# Setup Axes Grid" + ] + }, + { + "cell_type": "markdown", + "id": "cb77df05", + "metadata": {}, + "source": [ + "Now it's time to get started plotting.\n", + "We can use an the outset library's `OutsetGrid` class to manage content and layout of our faceted axes grid.\n", + "This tool operates analogously to seaborn's [`FacetGrid`](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html), if you're familiar with that.\n", + "\n", + "We pass `OutsetGrid` a data frame containing our data, the names of the columns we want to use for the x and y axes, and the names of the columns we want to use to split our data into facets.\n", + "Passing the `hue` kwarg will supplement faceting with color-coding by industry.\n", + "Other kwargs provide styling and layout information, including how we want plots to be shaped and how many columns we want to have." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f616bc3f", + "metadata": {}, + "outputs": [], + "source": [ + "grid = otst.OutsetGrid( # setup axes grid manager\n", + " df[(df[\"Industry Rank\"] < 8)].dropna(), # only top 8 in each industry\n", + " x=\"Age\",\n", + " y=\"Net Worth (Billion USD)\",\n", + " col=\"Industry\",\n", + " hue=\"Industry\",\n", + " col_order=focal_industries, # subset to our focal industries\n", + " hue_order=focal_industries,\n", + " aspect=1.5, # widen subplots\n", + " col_wrap=3,\n", + ")\n", + "\n", + "pass # sponge up last return value, if any" + ] + }, + { + "cell_type": "markdown", + "id": "8092cbee", + "metadata": {}, + "source": [ + "## Set Up Plot Content" + ] + }, + { + "cell_type": "markdown", + "id": "c4e5ea3f", + "metadata": {}, + "source": [ + "Next, we'll scatterplot our billionaire's ages and wealths.\n", + "We need to add each billionaire to the main plot and to their industry's outset plot.\n", + "\n", + "We'll use a patched version of seaborn's `scatterplot` function bundled with `outset` due to an [open seaborn issue](https://github.com/mwaskom/seaborn/issues/3601).\n", + "We will use `OutsetGrid.map_dataframe` to plot appropriate data and colors on each axes.\n", + "The `map_dataframe` method works analogously it its equivalent on seaborn's `FacetGrid.\n", + "The first argument provides a plotter function, which is called with subsequent arguments forwarded on each axes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c6bb1eb3", + "metadata": {}, + "outputs": [], + "source": [ + "grid.map_dataframe( # map scatterplot over all axes\n", + " otst_patched.scatterplot,\n", + " x=\"Age\",\n", + " y=\"Net Worth (Billion USD)\",\n", + " legend=False,\n", + ")\n", + "\n", + "display(grid.figure) # show current progress\n", + "\n", + "pass # sponge up last return value, if any" + ] + }, + { + "cell_type": "markdown", + "id": "e7896b67", + "metadata": {}, + "source": [ + "# Annotate Outset Plots with Names" + ] + }, + { + "cell_type": "markdown", + "id": "fb38c597", + "metadata": {}, + "source": [ + "Our next task is to identify plotted billionaires by name by labeling scatter points on outset plots.\n", + "To get a good result, some care must be taken to avoid overlapping text labels with each other or with the plotted points.\n", + "The awesome [adjustText](https://github.com/Phlya/adjustText) library can help us here by automatically nudging text labels to avoid overlap.\n", + "\n", + "Let's put together a data-oriented helper function to do all this that we can map over the outset axes in our grid." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4df99682", + "metadata": {}, + "outputs": [], + "source": [ + "def annotateplot(\n", + " data: pd.DataFrame,\n", + " *,\n", + " x: str,\n", + " y: str,\n", + " text: str,\n", + " ax: typing.Optional[plt.Axes] = None,\n", + " adjusttext_kws: typing.Optional[typing.Mapping] = None,\n", + " **kwargs: dict,\n", + ") -> plt.Axes:\n", + " \"\"\"Annotate a plot coordinates with text labels, then apply adjustText to\n", + " rearrange the labels to avoid overlaps.\n", + "\n", + " Parameters\n", + " ----------\n", + " data : pd.DataFrame\n", + " The DataFrame containing the data to plot.\n", + " x : str\n", + " The name of the column in `data` to use for the x-axis values.\n", + " y : str\n", + " The name of the column in `data` to use for the y-axis values.\n", + " text : Optional[str], default None\n", + " The name of the column in `data` to use for text values.\n", + " ax : Optional[plt.Axes], default None\n", + " The matplotlib Axes object to draw the plot onto, if provided.\n", + " adjusttext_kws : Mapping, default {}\n", + " Additional keyword arguments forward to adjustText.\n", + " **kwargs : dict\n", + " Additional keyword arguments forward to seaborn's regplot.\n", + "\n", + " Returns\n", + " -------\n", + " plt.Axes\n", + " The matplotlib Axes containing the plot.\n", + " \"\"\"\n", + " if adjusttext_kws is None:\n", + " adjusttext_kws = {}\n", + "\n", + " if ax is None:\n", + " ax = plt.gca()\n", + "\n", + " kwargs.pop(\"legend\", None) # ignore these kwargs\n", + " kwargs.pop(\"label\", None)\n", + "\n", + " texts = [ # add text for each row in data\n", + " ax.text(row[x], row[y], row[text], **kwargs)\n", + " for _idx, row in data.iterrows()\n", + " ]\n", + " adjust_text(texts, ax=ax, **adjusttext_kws)\n", + "\n", + " return ax" + ] + }, + { + "cell_type": "markdown", + "id": "0d98742f", + "metadata": {}, + "source": [ + "A call to `OutsetGrid.map_dataframe_outset` applies this function to just the outset axes in our grid." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7e8106b0", + "metadata": {}, + "outputs": [], + "source": [ + "grid.map_dataframe_outset( # map name annotations over all outset axes\n", + " otst_patched.annotateplot,\n", + " x=\"Age\",\n", + " y=\"Net Worth (Billion USD)\",\n", + " text=\"Who\",\n", + " fontsize=8, # make text slightly smaller so it's easier to lay out\n", + " rotation=-5,\n", + " adjusttext_kws=dict( # tweak fiddly params for text layout solver\n", + " avoid_self=True,\n", + " autoalign=True,\n", + " expand_points=(1.8, 1.3),\n", + " arrowprops=dict(arrowstyle=\"-\", color=\"k\", lw=0.5),\n", + " expand_text=(1.8, 2),\n", + " force_points=(1.2, 1),\n", + " ha=\"center\",\n", + " va=\"top\",\n", + " ),\n", + ")\n", + "\n", + "display(grid.figure) # show current progress\n", + "\n", + "pass # sponge up last return value, if any" + ] + }, + { + "cell_type": "markdown", + "id": "adb56953", + "metadata": {}, + "source": [ + "# Add Zoom Indicators" + ] + }, + { + "cell_type": "markdown", + "id": "fe34dc5f", + "metadata": {}, + "source": [ + "Now it's time to add zoom indicator boxes, a.k.a. `outset` \"marquees,\" to show how the scales of our auxiliary plots relate to the scale of the main plot.\n", + "Note that we pass the `equalize_aspect` kwarg so that aspect ratios can vary between the main plot and outset plots.\n", + "That way, zoom areas will expand to take full advantage of available space." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "41fd80f0", + "metadata": {}, + "outputs": [], + "source": [ + "# draw \"marquee' zoom indicators showing correspondences between main plot\n", + "# and outset plots\n", + "grid.marqueeplot(equalize_aspect=False) # allow axes aspect ratios to vary\n", + "\n", + "display(grid.figure) # show current progress\n", + "\n", + "pass # sponge up last return value, if any" + ] + }, + { + "cell_type": "markdown", + "id": "72abca13", + "metadata": {}, + "source": [ + "# Et Voilà!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2a806b5a", + "metadata": {}, + "outputs": [], + "source": [ + "display(grid.figure) # show completed figure!\n", + "\n", + "pass # sponge up last return value, if any" + ] + }, + { + "cell_type": "markdown", + "id": "0b6bbc64", + "metadata": {}, + "source": [ + "## Citations" + ] + }, + { + "cell_type": "markdown", + "id": "eef970c6", + "metadata": {}, + "source": [ + "> Ilya Flyamer, Zhuyi Xue, Colin, Andy Li, JasonMendoza2008, Josh L. Espinoza, Nader Morshed, Oscar Gustafsson, Victor Vazquez, Ryan Neff, mski_iksm, Nikita Vaulin, scaine1, & Oliver Lee. (2023). Phlya/adjustText: 0.8.1 (0.8.1). Zenodo. https://doi.org/10.5281/zenodo.10016869\n", + "\n", + "> J. D. Hunter, \"Matplotlib: A 2D Graphics Environment\", Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95, 2007. https://doi.org/10.1109/MCSE.2007.55\n", + "\n", + "> Matthew Andres Moreno. (2023). mmore500/outset. Zenodo. https://doi.org/10.5281/zenodo.10426106\n", + "\n", + "> Data structures for statistical computing in python, McKinney, Proceedings of the 9th Python in Science Conference, Volume 445, 2010. https://doi.org/ 10.25080/Majora-92bf1922-00a\n", + "\n", + "> Joy Shil. (2023). Exploring Wealth: Forbes Richest People Dataset [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/5272751\n", + "\n", + "> Waskom, M. L., (2021). seaborn: statistical data visualization. Journal of Open Source Software, 6(60), 3021, https://doi.org/10.21105/joss.03021." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/index.rst b/docs/index.rst index 7cabac7..a6ebbcc 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -13,6 +13,8 @@ Overview index gallery quickstart + example-rain + example-wealth api contributing package