From 92f1a89c8ab0ee8c911042d2d7243ef1d12a6871 Mon Sep 17 00:00:00 2001 From: ncclementi Date: Tue, 24 Sep 2024 18:58:02 -0400 Subject: [PATCH] chore: udpate after duckdb 1.1.1 release --- .../index/execute-results/html.json | 4 +- docs/posts/ibis-overturemaps/index.qmd | 163 +++++++++--------- 2 files changed, 86 insertions(+), 81 deletions(-) diff --git a/docs/_freeze/posts/ibis-overturemaps/index/execute-results/html.json b/docs/_freeze/posts/ibis-overturemaps/index/execute-results/html.json index c67ac07b2ecf..456caf8be09f 100644 --- a/docs/_freeze/posts/ibis-overturemaps/index/execute-results/html.json +++ b/docs/_freeze/posts/ibis-overturemaps/index/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "dfab10a1fc121c101991cf67117a23ff", + "hash": "db193291d3a9a736f5805388df94bcb2", "result": { "engine": "jupyter", - "markdown": "---\ntitle: \"From query to plot: Exploring GeoParquet Overture Maps with Ibis, DuckDB, and Lonboard\"\nauthor: Naty Clementi and Kyle Barron\ndate: 2024-09-13\ncategories:\n - blog\n - duckdb\n - overturemaps\n - lonboard\n - geospatial\n---\n\n\nWith the release of `DuckDB 1.1.0`, now we have support for reading GeoParquet\nfiles! With this exciting update we can query rich datasets from Overture Maps using python via Ibis with the performance of `DuckDB`.\n\nBut the good news don't stop there, since `Ibis 9.2`, `lonboard` can plot data directly from an `Ibis` table, adding more simplicity and speed to your geospatial analysis.\n\nLet’s dive into how these tools come together.\n\n## Installation\n\nInstall Ibis with the dependencies needed to work with geospatial data using DuckDB. To be able to read geoparquet files the duckdb version should be `>=1.1.0`.\n\n::: {.callout-note}\nAt the moment duckdb 1.1.0 has a bug that prevents us from querying and writing the data to parquet using Ibis and DuckDB, so we are installing the `overturemaps` CLI to get the data.\n:::\n\n```bash\n$ pip install 'ibis-framework[duckdb,geospatial]' lonboard overturemaps\n```\n\n## Motivation\n\nOverture Maps offers a variety of datasets to query, but we thought that it would\nbe interesting to see some plots related to the power infrastructure. We'll look\ninto power plants, and power lines of most of the USA (excluding territories and\nAlaska for simplicity of the bounding box).\n\n## Download data\n\n**NOTE: not sure if I should have this here or avoid it because it's broken**\n\nWith Ibis and DuckDB we could download only the necessary columns needed but at\nthe moment this fails due to a bug.\n\nFor future reference when this gets fixed, the expression would look like this:\n\n```python\nimport ibis\nfrom ibis import _\n\ncon = ibis.get_backend()\n\n# look into type infrastructure\nurl = \"s3://overturemaps-us-west-2/release/2024-07-22.0/theme=base/type=infrastructure/*\"\nt = con.read_parquet(url, table_name=\"infra-usa\")\n\n# filter for USA bounding box, subtype=\"power\", and selecting only few columns\nexpr = t.filter(_.bbox.xmin > -125.0, _.bbox.ymin > 24.8,\n _.bbox.xmax < -65.8, _.bbox.ymax < 49.2,\n _.subtype==\"power\"\n ).select([\"names\",\n \"geometry\",\n \"class\",\n \"sources\",\n \"source_tags\"])\n\n```\n\n::: {.callout-note}\nIf you inspect expr, you can see that the filters and projections get pushed down\nmeaning you only bring the data that you asked for.\n:::\n\n```python\n# to write it to a parquet file we would do (currently broken)\ncon.to_parquet(expr, \"power-infra-usa.parquet\")\n```\n\nBut, no worries! Overture Maps has a nice [python CLI](https://docs.overturemaps.org/getting-data/overturemaps-py/) that allow us to download\nsome the data, we can filter by bounding box and type but any other filters will have to happen afterwards.\n\n```bash\n$ overturemaps download --bbox=-125.0,24.8,-65.8,49.2 -f geoparquet --type=infrastructure -o usa-infra.geoparquet\n```\n\nNow that we have the data lets explore it in Ibis interactive mode and make some beautiful maps.\n\n## Data exploration\n\n::: {#09d3591a .cell execution_count=1}\n``` {.python .cell-code}\nimport ibis\nfrom ibis import _\n\nibis.options.interactive = True\ncon = ibis.get_backend() # default duckdb backend\n```\n:::\n\n\n::: {#2905df8d .cell execution_count=2}\n``` {.python .cell-code}\nusa_infra = con.read_parquet(\"usa-infra.geoparquet\")\nusa_infra\n```\n\n::: {.cell-output .cell-output-display execution_count=9}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓\n┃ id                                geometry                                                                          bbox                                                                version  sources                                                                           subtype  class   surface  names                                                                             level  source_tags             wikidata ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩\n│ stringgeospatial:geometrystruct<xmin: float32, xmax: float32, ymin: float32, ymax: float32>int32array<struct<property: string, dataset: string, record_id: string, update_time:…stringstringstringstruct<primary: string, common: map<string, string>, rules: array<struct<varian…int32map<string, string>string   │\n├──────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────┼─────────┼──────────────────────────────────────────────────────────────────────────────────┼─────────┼────────┼─────────┼──────────────────────────────────────────────────────────────────────────────────┼───────┼────────────────────────┼──────────┤\n│ 08b4872910712fff0001bd2a98181d63<LINESTRING (-112.751 26.443, -112.752 26.443)>{'xmin': -112.75163269042969, 'xmax': -112.75121307373047, ... +2}0[{...}]barrierfence NULLNULLNULL{'barrier': 'fence'}NULL     │\n│ 08b48729107adfff0001be94b7094f98<LINESTRING (-112.752 26.443, -112.752 26.443, -112.752 26.443, -112.752 26....>{'xmin': -112.75247192382812, 'xmax': -112.75128936767578, ... +2}0[{...}]barrierfence NULLNULLNULL{'barrier': 'fence'}NULL     │\n│ 08b48729107acfff0001b9d2fa063cd0<LINESTRING (-112.752 26.443, -112.752 26.443, -112.752 26.443)>{'xmin': -112.7518310546875, 'xmax': -112.75159454345703, ... +2}0[{...}]barrierfence NULLNULLNULL{'barrier': 'fence'}NULL     │\n│ 08b48729107acfff0001b2a999988dee<LINESTRING (-112.752 26.443, -112.752 26.443, -112.752 26.443, -112.752 26....>{'xmin': -112.75196075439453, 'xmax': -112.75128936767578, ... +2}0[{...}]barrierfence NULLNULLNULL{'barrier': 'fence'}NULL     │\n│ 08b48729107acfff0001b78a05e4b26d<LINESTRING (-112.752 26.443, -112.751 26.443)>{'xmin': -112.7516098022461, 'xmax': -112.75145721435547, ... +2}0[{...}]barrierfence NULLNULLNULL{'barrier': 'fence'}NULL     │\n│ 08b48729107acfff0001b4f0fb8a8add<LINESTRING (-112.751 26.443, -112.751 26.443, -112.751 26.443)>{'xmin': -112.75147247314453, 'xmax': -112.75127410888672, ... +2}0[{...}]barrierfence NULLNULLNULL{'barrier': 'fence'}NULL     │\n│ 08b48729107a1fff0001b83d857e5f12<LINESTRING (-112.751 26.443, -112.751 26.443)>{'xmin': -112.75128936767578, 'xmax': -112.75121307373047, ... +2}0[{...}]barrierfence NULLNULLNULL{'barrier': 'fence'}NULL     │\n│ 08b4872910785fff0001bdb9d542ab99<LINESTRING (-112.751 26.443, -112.751 26.443, -112.751 26.444, -112.75 26.443)>{'xmin': -112.75137329101562, 'xmax': -112.75019836425781, ... +2}0[{...}]barrierfence NULLNULLNULL{'barrier': 'fence'}NULL     │\n│ 08b4872910780fff0001b7f776ed2f65<LINESTRING (-112.751 26.444, -112.751 26.444, -112.751 26.444, -112.75 26.4...>{'xmin': -112.75130462646484, 'xmax': -112.74915313720703, ... +2}0[{...}]barrierfence NULLNULLNULL{'barrier': 'fence'}NULL     │\n│ 08b48766e2a1dfff0001bf9082d6913e<LINESTRING (-112.512 26.374, -112.516 26.368)>{'xmin': -112.51585388183594, 'xmax': -112.51177978515625, ... +2}0[{...}]airportrunwayunpaved{'primary': '03/21', 'common': None, ... +1}NULL{'surface': 'unpaved'}NULL     │\n│         │\n└──────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────┴─────────┴──────────────────────────────────────────────────────────────────────────────────┴─────────┴────────┴─────────┴──────────────────────────────────────────────────────────────────────────────────┴───────┴────────────────────────┴──────────┘\n
\n```\n:::\n:::\n\n\nWe take a look at the subtypes of infrastructure:\n\n::: {#3d535ba3 .cell execution_count=3}\n``` {.python .cell-code}\nusa_infra.subtype.value_counts().preview(max_rows=15)\n```\n\n::: {.cell-output .cell-output-display execution_count=10}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓\n┃ subtype           subtype_count ┃\n┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩\n│ stringint64         │\n├──────────────────┼───────────────┤\n│ utility         522731 │\n│ bridge          837941 │\n│ aerialway       29681 │\n│ airport         139602 │\n│ communication   85119 │\n│ pier            173408 │\n│ tower           44226 │\n│ recreation      26958 │\n│ power           7188023 │\n│ transit         2609816 │\n│ barrier         3236898 │\n│ pedestrian      443787 │\n│ waste_management133733 │\n│ water           125102 │\n│ manhole         147852 │\n└──────────────────┴───────────────┘\n
\n```\n:::\n:::\n\n\nIf we want to look at power lines we need to look into `subtype==\"power\"`\nand see what kind of `class` we have in there.\n\n::: {#750a4c9a .cell execution_count=4}\n``` {.python .cell-code}\n(usa_infra.filter(_.subtype==\"power\")[\"class\"]\n .value_counts()\n .order_by(ibis.desc(\"class_count\"))\n )\n```\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n┃ class        class_count ┃\n┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n│ stringint64       │\n├─────────────┼─────────────┤\n│ power_pole 3007451 │\n│ power_tower2608933 │\n│ generator  952415 │\n│ power_line 252060 │\n│ minor_line 114593 │\n│ substation 65364 │\n│ portal     59237 │\n│ switch     52279 │\n│ transformer51710 │\n│ plant      12495 │\n│  │\n└─────────────┴─────────────┘\n
\n```\n:::\n:::\n\n\nLooks like we have `plants`, `power_lines` and `minor_lines`, so we can make get some nice maps.\n\n::: {#ecb97136 .cell execution_count=5}\n``` {.python .cell-code}\nplants = usa_infra.filter(_.subtype==\"power\", usa_infra[\"class\"]==\"plant\")\npower_lines = usa_infra.filter(_.subtype==\"power\", usa_infra[\"class\"]==\"power_line\")\nminor_lines = usa_infra.filter(_.subtype==\"power\", usa_infra[\"class\"]==\"minor_line\")\n```\n:::\n\n\n## Plotting\n\n**Note: maybe here explain why lonboard ?**\n\n::: {.callout-note}\nYou can try this in your machine, for the purpose the blog file size, we will show\nscreenshots of the visualization\n:::\n\n```python\nimport lonboard\nfrom lonboard.basemap import CartoBasemap # to choose color of basemap\n```\n\nLet's visualize the `power plants`\n\n```python\nlonboard.viz(plants,\n scatterplot_kwargs={\"get_fill_color\": \"red\"},\n polygon_kwargs={\"get_fill_color\": \"red\"},\n map_kwargs={\"basemap_style\": CartoBasemap.Positron,\n \"view_state\": {\"longitude\": -100, \"latitude\": 36, \"zoom\": 3}\n })\n```\n\n![Power plants in the USA](usa-power-plants.png)\n\nIf you are visualizing this in your machine, you can zoom in and see some of the \ngeometry where the plants are located. But as an example, we can plot in a small \narea of California:\n\n```python\nplants_CA = plants.filter(_.bbox.xmin.between(-118.6, -117.9),\n _.bbox.ymin.between(34.5, 35.3))[_.names.primary, _.geometry]\n```\n\n```python\nlonboard.viz(plants_CA,\n scatterplot_kwargs={\"get_fill_color\": \"red\"},\n polygon_kwargs={\"get_fill_color\": \"red\"},\n map_kwargs={\"basemap_style\": CartoBasemap.Positron,\n })\n```\n\n![Power plants near Lancaster CA](ca-power-plants.png)\n\nWe can also visualize together the `power_lines` and the `minor_lines` by doing:\n\n\n```python\nlonboard.viz([minor_lines, power_lines])\n```\n\n![Minor and Power lines of USA](usa-power-and-minor-lines.png)\n\nand that's how you can visualize ~7 million points from the comfort of\nyour laptop.\n\nNote: I got the ~7M by adding the number of points in power_lines and minor_lines. I'm not sure if plotting the lines that connect these points add points to this.\n\n```python\n>>> power_lines.geometry.n_points().sum()\n5329836\n>>> minor_lines.geometry.n_points().sum()\n1430042\n```\n\nWith Ibis and DuckDB working with geospatial data has never been easier or faster.\nWe saw how to query a dataset from Overture Maps with the simplicity of Python and\nthe performance of DuckDB. Last but not least, we saw how simple and quick lonboard\ngot us from query-to-plot. Together, these libraries make exploring and handling\ngeospatial data a breeze.\n\n\n## Resources\n- [Ibis Docs](https://ibis-project.org/)\n- [Lonboard Docs](https://developmentseed.org/lonboard/latest/)\n- [DuckDB spatial extension](https://duckdb.org/docs/extensions/spatial.html)\n- [DuckDB spatial functions docs](https://github.com/duckdb/duckdb_spatial/blob/main/docs/functions.md)\n\nChat with us on Zulip:\n\n- [Ibis Zulip Chat](https://ibis-project.zulipchat.com/)\n\n", + "markdown": "---\ntitle: \"From query to plot: Exploring GeoParquet Overture Maps with Ibis, DuckDB, and Lonboard\"\nauthor: Naty Clementi and Kyle Barron\ndate: 2024-09-13\ncategories:\n - blog\n - duckdb\n - overturemaps\n - lonboard\n - geospatial\n---\n\n\nWith the release of `DuckDB 1.1.1`, now we have support for reading GeoParquet\nfiles! With this exciting update we can query rich datasets from Overture Maps\nusing python via Ibis with the performance of `DuckDB`.\n\nBut the good news doesn't stop there, since `Ibis 9.2`, `lonboard` can plot data\ndirectly from an `Ibis` table, adding more simplicity and speed to your\ngeospatial analysis.\n\nLet’s dive into how these tools come together.\n\n## Installation\n\nFirst make sure you have `duckdb=1.1.1`, then install Ibis with the dependencies\nneeded to work with geospatial data using DuckDB. To be able to read GeoParquet\nfiles the DuckDB version should be `>=1.1.1`.\n\n```bash\n$ pip install duckdb==1.1.1\n$ pip install 'ibis-framework[duckdb,geospatial]' lonboard\n```\n\n## Motivation\n\nOverture Maps is an open-source initiative that provides high-quality,\ninteroperable map data by integrating contributions from leading companies and\nopen data sources to support a wide range of applications.\n\nOverture Maps offers a variety of datasets to query. For example, there is plenty\nof information about power infrastructure.\n\nLet's create some plots of the U.S. power infrastructure. We'll look into power\nplants and power lines for the lower 48 states (excluding Hawaii and Alaska for\nsimplicity of the bounding box).\n\n## Download data\n\nFirst we import Ibis, its [deferred expression object](https://ibis-project.org/reference/expression-generic.html#ibis.expr.api.deferred) `_` ,\nand we use our default backend, DuckDB:\n\n::: {#63d43dcf .cell execution_count=1}\n``` {.python .cell-code}\nimport ibis\nfrom ibis import _\n\ncon = ibis.get_backend() # default duckdb backend\n```\n:::\n\n\nWith Ibis and DuckDB we be more specific about the data we want thanks to the\nfilter push down. For example, if we want to select only a few columns and\nlook only at the power infrastructure when can do this as follow.\n\n\n```python\n# look into type infrastructure\nurl = (\n \"s3://overturemaps-us-west-2/release/2024-07-22.0/theme=base/type=infrastructure/*\"\n)\nt = con.read_parquet(url, table_name=\"infra-usa\")\n\n# filter for USA bounding box, subtype=\"power\", and selecting only few columns\nexpr = t.filter(\n _.bbox.xmin > -125.0,\n _.bbox.ymin > 24.8,\n _.bbox.xmax < -65.8,\n _.bbox.ymax < 49.2,\n _.subtype == \"power\",\n).select([\"names\", \"geometry\", \"bbox\", \"class\", \"sources\", \"source_tags\"])\n```\n\n::: {.callout-note}\nIf you inspect expr, you can see that the filters and projections get pushed down,\nmeaning you only download the data that you asked for.\n:::\n\n```python\ncon.to_parquet(expr, \"power-infra-usa.geoparquet\")\n```\n\nNow that we have the data lets explore it in Ibis interactive mode and make some\nbeautiful maps.\n\n## Data exploration\n\nTo explore the data interactively we turn on the interactive mode:\n\n::: {#78584082 .cell execution_count=2}\n``` {.python .cell-code}\nibis.options.interactive = True\n```\n:::\n\n\n::: {#766bc95b .cell execution_count=3}\n``` {.python .cell-code}\nusa_power_infra = con.read_parquet(\"power-infra-usa.geoparquet\")\nusa_power_infra\n```\n\n::: {.cell-output .cell-output-display execution_count=9}\n```{=html}\n
┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ names  geometry                                                                          bbox                                                                class        sources                                                                           source_tags                                  ┃\n┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stru…geospatial:geometrystruct<xmin: float32, xmax: float32, ymin: float32, ymax: float32>stringarray<struct<property: string, dataset: string, record_id: string, update_time:…map<string, string>                          │\n├───────┼──────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────┼─────────────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────┤\n│ NULL<POINT (-114.291 27.151)>{'xmin': -114.29095458984375, 'xmax': -114.29093933105469, ... +2}power_tower[{...}]{}                                           │\n│ NULL<POINT (-114.289 27.149)>{'xmin': -114.28852844238281, 'xmax': -114.28851318359375, ... +2}power_tower[{...}]{}                                           │\n│ NULL<POINT (-114.29 27.15)>{'xmin': -114.29006958007812, 'xmax': -114.29005432128906, ... +2}power_tower[{...}]{}                                           │\n│ NULL<POINT (-114.29 27.151)>{'xmin': -114.2900619506836, 'xmax': -114.29004669189453, ... +2}power_tower[{...}]{}                                           │\n│ NULL<POLYGON ((-114.29 27.152, -114.29 27.151, -114.29 27.15, -114.291 27.151, -...>{'xmin': -114.29100036621094, 'xmax': -114.28948974609375, ... +2}substation [{...}]{'location': 'outdoor', 'voltage': '115000'} │\n│ NULL<POINT (-114.291 27.151)>{'xmin': -114.29055786132812, 'xmax': -114.29054260253906, ... +2}portal     [{...}]{}                                           │\n│ NULL<POINT (-114.291 27.152)>{'xmin': -114.29069519042969, 'xmax': -114.29067993164062, ... +2}power_tower[{...}]{}                                           │\n│ NULL<POINT (-114.288 27.151)>{'xmin': -114.28756713867188, 'xmax': -114.28755187988281, ... +2}power_tower[{...}]{}                                           │\n│ NULL<POINT (-114.286 27.152)>{'xmin': -114.28639221191406, 'xmax': -114.286376953125, ... +2}power_tower[{...}]{}                                           │\n│ NULL<POINT (-114.285 27.152)>{'xmin': -114.2851333618164, 'xmax': -114.28511810302734, ... +2}power_tower[{...}]{}                                           │\n│                                             │\n└───────┴──────────────────────────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────┴─────────────┴──────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\nLet's quickly rename the `class` column, since this is a reserved word and causes\nconflicts when using the deferred operator\n\n::: {#496105f1 .cell execution_count=4}\n``` {.python .cell-code}\nusa_power_infra = usa_power_infra.rename(infra_class=\"class\")\n```\n:::\n\n\nWe take a look at the different classes of infrastructure under the subtype power:\n\n::: {#0bd956c7 .cell execution_count=5}\n``` {.python .cell-code}\nusa_power_infra.infra_class.value_counts().order_by(\n ibis.desc(\"infra_class_count\")\n).preview(max_rows=15)\n```\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓\n┃ infra_class    infra_class_count ┃\n┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩\n│ stringint64             │\n├───────────────┼───────────────────┤\n│ power_pole   2961946 │\n│ power_tower  2600392 │\n│ generator    736083 │\n│ power_line   246948 │\n│ minor_line   112266 │\n│ substation   64868 │\n│ portal       58033 │\n│ switch       51485 │\n│ transformer  50456 │\n│ plant        12416 │\n│ cable        4977 │\n│ catenary_mast2921 │\n│ insulator    1713 │\n│ connection   1226 │\n│ terminal     402 │\n│  │\n└───────────────┴───────────────────┘\n
\n```\n:::\n:::\n\n\nLooks like we have `plant`, `power_line` and `minor_line` among others.\n\n::: {#e0bcbb9e .cell execution_count=6}\n``` {.python .cell-code}\nplants = usa_power_infra.filter(_.infra_class==\"plant\")\npower_lines = usa_power_infra.filter(_.infra_class==\"power_line\")\nminor_lines = usa_power_infra.filter(_.infra_class==\"minor_line\")\n```\n:::\n\n\n## Plotting with Lonboard\n\nLonboard is a Python plotting library optimized for efficient visualizations\nof large geospatial data. It integrates well with Ibis and DuckDB, making\ninteractive plotting scalable.\n\n::: {.callout-note}\nYou can try this in your machine, for the purpose the blog file size, we will show\nscreenshots of the visualization\n:::\n\n```python\nimport lonboard\nfrom lonboard.basemap import CartoBasemap # to choose color of basemap\n```\n\nLet's visualize the `power plants`\n\n```python\nlonboard.viz(\n plants,\n scatterplot_kwargs={\"get_fill_color\": \"red\"},\n polygon_kwargs={\"get_fill_color\": \"red\"},\n map_kwargs={\n \"basemap_style\": CartoBasemap.Positron,\n \"view_state\": {\"longitude\": -100, \"latitude\": 36, \"zoom\": 3},\n },\n)\n```\n\n![Power plants in the USA](usa-power-plants.png)\n\nIf you are visualizing this in your machine, you can zoom in and see some of the\ngeometry where the plants are located. As an example, we can plot in a small\narea of California:\n\n```python\nplants_CA = plants.filter(\n _.bbox.xmin.between(-118.6, -117.9), _.bbox.ymin.between(34.5, 35.3)\n).select(_.names.primary, _.geometry)\n```\n\n```python\nlonboard.viz(\n plants_CA,\n scatterplot_kwargs={\"get_fill_color\": \"red\"},\n polygon_kwargs={\"get_fill_color\": \"red\"},\n map_kwargs={\n \"basemap_style\": CartoBasemap.Positron,\n },\n)\n```\n\n![Power plants near Lancaster, CA](ca-power-plants.png)\n\nWe can also visualize together the `power_lines` and the `minor_lines` by doing:\n\n```python\nlonboard.viz([minor_lines, power_lines])\n```\n\n![Minor and Power lines of USA](usa-power-and-minor-lines.png)\n\nand that's how you can visualize ~7 million coordinates from the comfort of\nyour laptop.\n\n```python\n>>> power_lines.geometry.n_points().sum()\n5329836\n>>> minor_lines.geometry.n_points().sum()\n1430042\n```\n\nWith Ibis and DuckDB working with geospatial data has never been easier or faster.\nWe saw how to query a dataset from Overture Maps with the simplicity of Python and\nthe performance of DuckDB. Last but not least, we saw how simple and quick Lonboard\ngot us from query-to-plot. Together, these libraries make exploring and handling\ngeospatial data a breeze.\n\n\n## Resources\n- [Ibis Docs](https://ibis-project.org/)\n- [Lonboard Docs](https://developmentseed.org/lonboard/latest/)\n- [DuckDB spatial extension](https://duckdb.org/docs/extensions/spatial.html)\n- [DuckDB spatial functions docs](https://github.com/duckdb/duckdb_spatial/blob/main/docs/functions.md)\n\nChat with us on Zulip:\n\n- [Ibis Zulip Chat](https://ibis-project.zulipchat.com/)\n\n", "supporting": [ "index_files" ], diff --git a/docs/posts/ibis-overturemaps/index.qmd b/docs/posts/ibis-overturemaps/index.qmd index 4bcbe0814ca5..741b82ea1c82 100644 --- a/docs/posts/ibis-overturemaps/index.qmd +++ b/docs/posts/ibis-overturemaps/index.qmd @@ -12,61 +12,71 @@ execute: freeze: false --- -With the release of `DuckDB 1.1.0`, now we have support for reading GeoParquet -files! With this exciting update we can query rich datasets from Overture Maps using python via Ibis with the performance of `DuckDB`. +With the release of `DuckDB 1.1.1`, now we have support for reading GeoParquet +files! With this exciting update we can query rich datasets from Overture Maps +using python via Ibis with the performance of `DuckDB`. -But the good news doesn't stop there, since `Ibis 9.2`, `lonboard` can plot data directly from an `Ibis` table, adding more simplicity and speed to your geospatial analysis. +But the good news doesn't stop there, since `Ibis 9.2`, `lonboard` can plot data +directly from an `Ibis` table, adding more simplicity and speed to your +geospatial analysis. Let’s dive into how these tools come together. ## Installation -Install Ibis with the dependencies needed to work with geospatial data using DuckDB. To be able to read GeoParquet files the DuckDB version should be `>=1.1.0`. - -::: {.callout-note} -At the moment DuckDB 1.1.0 has a bug that prevents us from querying and writing the data to parquet using Ibis and DuckDB, so we are installing the `overturemaps` CLI to get the data. -::: +First make sure you have `duckdb=1.1.1`, then install Ibis with the dependencies +needed to work with geospatial data using DuckDB. To be able to read GeoParquet +files the DuckDB version should be `>=1.1.1`. ```bash -$ pip install 'ibis-framework[duckdb,geospatial]' lonboard overturemaps +$ pip install duckdb==1.1.1 +$ pip install 'ibis-framework[duckdb,geospatial]' lonboard ``` ## Motivation -Overture Maps offers a variety of datasets to query. Let's create -some plots related to U.S. power infrastructure. We'll look -into power plants and power lines for the lower 48 states (excluding Hawaii and -Alaska for simplicity of the bounding box). +Overture Maps is an open-source initiative that provides high-quality, +interoperable map data by integrating contributions from leading companies and +open data sources to support a wide range of applications. -## Download data - -**NOTE: not sure if I should have this here or avoid it because it's broken** +Overture Maps offers a variety of datasets to query. For example, there is plenty +of information about power infrastructure. -With Ibis and DuckDB we could download only the necessary columns needed but at -the moment this fails due to a bug. +Let's create some plots of the U.S. power infrastructure. We'll look into power +plants and power lines for the lower 48 states (excluding Hawaii and Alaska for +simplicity of the bounding box). -For future reference when this gets fixed, the expression would look like this: +## Download data -```python +First we import Ibis, its [deferred expression object](https://ibis-project.org/reference/expression-generic.html#ibis.expr.api.deferred) `_` , +and we use our default backend, DuckDB: +```{python} import ibis from ibis import _ -con = ibis.get_backend() +con = ibis.get_backend() # default duckdb backend +``` + +With Ibis and DuckDB we be more specific about the data we want thanks to the +filter push down. For example, if we want to select only a few columns and +look only at the power infrastructure when can do this as follow. + +```python # look into type infrastructure -url = "s3://overturemaps-us-west-2/release/2024-07-22.0/theme=base/type=infrastructure/*" +url = ( + "s3://overturemaps-us-west-2/release/2024-07-22.0/theme=base/type=infrastructure/*" +) t = con.read_parquet(url, table_name="infra-usa") # filter for USA bounding box, subtype="power", and selecting only few columns -expr = t.filter(_.bbox.xmin > -125.0, _.bbox.ymin > 24.8, - _.bbox.xmax < -65.8, _.bbox.ymax < 49.2, - _.subtype=="power" - ).select(["names", - "geometry", - "class", - "sources", - "source_tags"]) - +expr = t.filter( + _.bbox.xmin > -125.0, + _.bbox.ymin > 24.8, + _.bbox.xmax < -65.8, + _.bbox.ymax < 49.2, + _.subtype == "power", +).select(["names", "geometry", "bbox", "class", "sources", "source_tags"]) ``` ::: {.callout-note} @@ -75,62 +85,53 @@ meaning you only download the data that you asked for. ::: ```python -# to write it to a parquet file we would do (currently broken) -con.to_parquet(expr, "power-infra-usa.parquet") -``` - -But, no worries! Overture Maps has a nice [python CLI](https://docs.overturemaps.org/getting-data/overturemaps-py/) that allow us to download -some the data, we can filter by bounding box and type but any other filters will have to happen afterwards. - -```bash -$ overturemaps download --bbox=-125.0,24.8,-65.8,49.2 -f geoparquet --type=infrastructure -o usa-infra.geoparquet +con.to_parquet(expr, "power-infra-usa.geoparquet") ``` -Now that we have the data lets explore it in Ibis interactive mode and make some beautiful maps. +Now that we have the data lets explore it in Ibis interactive mode and make some +beautiful maps. ## Data exploration +To explore the data interactively we turn on the interactive mode: ```{python} -import ibis -from ibis import _ - ibis.options.interactive = True -con = ibis.get_backend() # default duckdb backend ``` ```{python} -usa_infra = con.read_parquet("usa-infra.geoparquet") -usa_infra +usa_power_infra = con.read_parquet("power-infra-usa.geoparquet") +usa_power_infra ``` -We take a look at the subtypes of infrastructure: +Let's quickly rename the `class` column, since this is a reserved word and causes +conflicts when using the deferred operator ```{python} -usa_infra.subtype.value_counts().preview(max_rows=15) +usa_power_infra = usa_power_infra.rename(infra_class="class") ``` -If we want to look at power lines we need to look into `subtype=="power"` -and see what kind of `class` we have in there. +We take a look at the different classes of infrastructure under the subtype power: ```{python} -(usa_infra.filter(_.subtype=="power")["class"] - .value_counts() - .order_by(ibis.desc("class_count")) - ) +usa_power_infra.infra_class.value_counts().order_by( + ibis.desc("infra_class_count") +).preview(max_rows=15) ``` -Looks like we have `plants`, `power_lines` and `minor_lines`, so we can make get some nice maps. +Looks like we have `plant`, `power_line` and `minor_line` among others. ```{python} -plants = usa_infra.filter(_.subtype=="power", usa_infra["class"]=="plant") -power_lines = usa_infra.filter(_.subtype=="power", usa_infra["class"]=="power_line") -minor_lines = usa_infra.filter(_.subtype=="power", usa_infra["class"]=="minor_line") +plants = usa_power_infra.filter(_.infra_class=="plant") +power_lines = usa_power_infra.filter(_.infra_class=="power_line") +minor_lines = usa_power_infra.filter(_.infra_class=="minor_line") ``` -## Plotting +## Plotting with Lonboard -**Note: maybe here explain why lonboard ?** +Lonboard is a Python plotting library optimized for efficient visualizations +of large geospatial data. It integrates well with Ibis and DuckDB, making +interactive plotting scalable. ::: {.callout-note} You can try this in your machine, for the purpose the blog file size, we will show @@ -145,49 +146,53 @@ from lonboard.basemap import CartoBasemap # to choose color of basemap Let's visualize the `power plants` ```python -lonboard.viz(plants, - scatterplot_kwargs={"get_fill_color": "red"}, - polygon_kwargs={"get_fill_color": "red"}, - map_kwargs={"basemap_style": CartoBasemap.Positron, - "view_state": {"longitude": -100, "latitude": 36, "zoom": 3} - }) +lonboard.viz( + plants, + scatterplot_kwargs={"get_fill_color": "red"}, + polygon_kwargs={"get_fill_color": "red"}, + map_kwargs={ + "basemap_style": CartoBasemap.Positron, + "view_state": {"longitude": -100, "latitude": 36, "zoom": 3}, + }, +) ``` ![Power plants in the USA](usa-power-plants.png) If you are visualizing this in your machine, you can zoom in and see some of the -geometry where the plants are located. But as an example, we can plot in a small +geometry where the plants are located. As an example, we can plot in a small area of California: ```python -plants_CA = plants.filter(_.bbox.xmin.between(-118.6, -117.9), - _.bbox.ymin.between(34.5, 35.3))[_.names.primary, _.geometry] +plants_CA = plants.filter( + _.bbox.xmin.between(-118.6, -117.9), _.bbox.ymin.between(34.5, 35.3) +).select(_.names.primary, _.geometry) ``` ```python -lonboard.viz(plants_CA, - scatterplot_kwargs={"get_fill_color": "red"}, - polygon_kwargs={"get_fill_color": "red"}, - map_kwargs={"basemap_style": CartoBasemap.Positron, - }) +lonboard.viz( + plants_CA, + scatterplot_kwargs={"get_fill_color": "red"}, + polygon_kwargs={"get_fill_color": "red"}, + map_kwargs={ + "basemap_style": CartoBasemap.Positron, + }, +) ``` ![Power plants near Lancaster, CA](ca-power-plants.png) We can also visualize together the `power_lines` and the `minor_lines` by doing: - ```python lonboard.viz([minor_lines, power_lines]) ``` ![Minor and Power lines of USA](usa-power-and-minor-lines.png) -and that's how you can visualize ~7 million points from the comfort of +and that's how you can visualize ~7 million coordinates from the comfort of your laptop. -Note: I got the ~7M by adding the number of points in power_lines and minor_lines. I'm not sure if plotting the lines that connect these points add points to this. - ```python >>> power_lines.geometry.n_points().sum() 5329836