-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance Tuning Docs #415
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #415 +/- ##
==========================================
+ Coverage 95.01% 95.02% +0.01%
==========================================
Files 198 198
Lines 5658 5692 +34
Branches 178 175 -3
==========================================
+ Hits 5376 5409 +33
- Misses 282 283 +1
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a few minor suggestions and a couple of questions.
Nothing major.
Data Engineering | ||
################ | ||
|
||
It is important for customers to have Data Engineering (DE) tables tuned for performance. Mosaic has been designed to operate directly on standard interchange formats which may appear in various DataFrame columns; currently, WKT, WKB, and GeoJSON are supported, with WKB offering the best properties for storing and querying. Further, while capabilities are always advancing, Mosaic does not expose geometry types as defined in `OGC Simple Features for SQL <https://www.ogc.org/standard/sfs/>`__, making investments in DE all the more beneficial. Additionally, for best tuning, pre-standardize all your geospatial data into the same SRID; for most applications, this should be 4326, see `CRS Docs <https://databrickslabs.github.io/mosaic/usage/grid-indexes-bng.html#coordinate-reference-system>`__. *As of Mosaic 0.3 series, customers need to prepare tables for performant queries, though we are exploring options to improve this experience going forward.* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mjohns-databricks We highlight here to use WKB (WKT/WKB do not support SRIDs) while we are mentioning conversion to 4326 (I would avoid this since it isnt really true, it makes sense due to H3 only) and SRIDs are only supported in GeoJSON, so format conversion is required in the current version.
We can easily add support in the future for handling SR projections for WKT and WKB if both src and dst SRIDs are provided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to convey that for standardized DE, customers want to consider SRIDs. What we see often is Shapefiles or GDBs that need conversion so all data layers line up with other data, e.g. with GeoPandas:
- Customers might invoke
GeoDataFrame.to_crs(crs=4326)[[source]](https://github.com/geopandas/geopandas/blob/main/geopandas/geodataframe.py#L1344-L1427)
to standardize - Then customers store in Delta Lake the
geometry
column as WKT or WKB (agreed no SRID there)
We can have a small callout of BNG if you like, with H3 (uses 4326) as the primary path. Also, bear in mind this is for DE practices to allow spatial SQL queries to be ready elsewhere in our platform (DBR, eventually DBSQL). If you have a recommended sentence or two to better clarify, please call it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a very costly operation and introduces conversion errors. So I would only advise it when absolutely need that conversion. I am just conscious that people that come with strong GIS background will lose trust in the tool if we openly argue that 4326 is where all data should sit in.
To your point, we have BNG and Custom Grid, so for smaller estates this may not be necessary at all.
I am onboard with this being the main advise, but we should list caveats openly otherwise it will be incomplete advise since in some cases it really depends on other factors.
|
||
1. Call Mosaic's `grid_tessellateexplode <https://databrickslabs.github.io/mosaic/api/spatial-indexing.html#grid-tessellateexplode>`__ on your geometries at a given resolution; in addition to exploding to per row, this adds an ``index`` column, a struct with the following fields: ``is_core``, ``index_id``, ``wkb``; you essentially now have the result of H3 covers operation on the geometries and knowledge of which index space is fully contained within the geometry and WKB chips for the boundaries as well as core index; this is the fundamental call for performance | ||
2. Call something like ``select index.*, * except(index)`` to pull the struct fields into top level columns; also, adding them first to ensure they are part of the table statistics; you can be more thoughtful based on your data, e.g. don't need WKB in the statistics | ||
3. Save the table as delta lake and call ``optimize <table or delta path> zorder (index_id)`` to get proper layout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we add a sentance on timestamp columns here as well for proper spatio-temporal considerations, if z ordering by cell + timestamp in that order you get better fetching of data for constructing timeseries for each cell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good callout that often comes up! Another is I want to emphasize that H3 cell ids need to be numbers not strings (not a problem with Mosaic calls, but customers need to be aware of the perf differences and standardize everywhere to numbers).
|
||
(b) when you are interested in precise results, test boundary chip information | ||
|
||
.. code-block:: sql |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add python examples as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, ok.
|
||
sql(f"""OPTIMIZE {tbl_fqn} ZORDER BY ({cell_col})""") | ||
|
||
This pattern allows us to level out polygon "worst case" max areas, e.g. resolution 5 and then down to 9 for large area boundaries. You will notice that only ``is_core`` is checked at resolution 5 which is cheap, then resolution 9 has both ``is_core`` and ``st_contains`` in the clause. That query pattern then looks like (showing SQL): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a bit more explanation, why we use only core match at 5, I can see in code that you filter out cells based on is core flag, but to the first time reader this may not be obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can expand a little more here.
Tuning Tips | ||
########### | ||
|
||
Spark `Adaptive Query Exection (AQE) <https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution>`__ is tuned for data-heavy processing where each row has a cost. It likes those rows to be even and has a hard time reasoning about hidden compute costs as incurred when ``ST_`` functions, e.g. ``ST_Contains`` are invoked. To bring more control AQE can be turned completely off: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say here "each row has the same cost to process"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will add. I think the next sentence is clear expansion: "It likes those rows to be even and has a hard time reasoning about hidden compute costs..."
|
||
.. code-block:: py | ||
|
||
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", True) # <- default is True, may want False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say this should always be false for spatial use cases, and sometimes should be turned back on, and only if both tables are point tables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on the query pattern, can slow down queries when set to False, can also help (when set to False) e.g. when you are relying on shuffle partitions alone to tune your workload.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the config that gave me personally the most pain, I almost exclusively turn it off and havent seen many cases where I needed to turn it back on (except when data is over-fragmented on disk)
|
||
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", True) # <- default is True, may want False | ||
spark.conf.set("spark.sql.adaptive.coalescePartitions.parallelismFirst", False) # <- default is True (respect size) | ||
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "24MB") # <- default is 64MB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would go even lower here, but it is very hard to say what is the correct number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have messed with others (16MB being lower but also 36MB and 48MB) and frankly have landed on 24MB as a good goldilocks zone that is ok for smaller and larger data.
|
||
.. code-block:: py | ||
|
||
spark.conf.set("spark.sql.shuffle.partitions", 256) # <- default is 200 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think 256 is too close to 200 to demonstrate the point, I would go with 500 or something like that here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I pulled this from some ongoing auto-tuning work where too large of shuffle partitions can have a significant effect on query times for smaller data, but makes sense to bump up to 512 to be clear that it is a real adjustment.
a9aed81
to
3c5ab2e
Compare
Initial tuning docs for Mosaic.