Doc: new docs demands from slack #217

tuhaihe · 2024-12-18T14:32:54Z

A colleague of mine asked the following questions:

My questions around a cloudberry DB that is used as a datawarehouse (OLAP applications, reporting/BI)
• any specific optimization options for analytical queries / aggregation functions
• best practices in the data model design (star/snowflake-schema; normalized vs. denormalized; redundancy issues)
• best practices in the configuration
• any thoughts on massive deletes
• any thoughts on doing outer joins of 10+ tables
• any specific recommendations regarding partioning, indexing, views/materialized views,...

Slack Message

xinzweb · 2024-12-18T14:56:30Z

Few ideas here based on experiences:

Use ORCA and encourage hash join and hash aggregate for OLAP queries
Use "star", don't use "snowflake" to reduce the number of joins, denormalize your data to change join into filters
Configuration, this deserve its own book
Partition your data, so that you don't delete, just drop the entire partition
Outer join of 10+ tables? well, answered in the first one, rely on ORCA, or answered in the second one, use star schema to reduce the number of joins
Unless you do singleton lookup, don't index. Views are only for management purpose, not for performance. Materialized views would be a great idea if you have a relative fixed workloads. For the partition, make your partition medium sized, and leverage the static and dynamic partition elimination.

Again, those are very high level ideas to optimize for "append-only, truncate-only, mostly read-only" OLAP workloads. If you are thinking OLTP or any HTAP workloads, the above recommendation will not work.

I agree with @tuhaihe that, we need to have extended documentation with general guidance as well as real world cases to demonstrate the tradeoff for the guidance above. Those are great questions, and please keep asking. Thanks.

lmugnano4537 · 2024-12-20T19:55:01Z

Here is a deck I share with my clients who are building a data warehouse (deck is generic and part of larger demo I do). It's showing generic code I built and provide for customers but it also shows my conceptual best practices. You can see I lean towards a Kimball based model (star or snowflake) because most of the time I'm pushing my customers towards a self-service BI model and feel dimensional modeling is the easiest for business users to understand. Physical design on this for me generally leads towards dimensions being AO columnar compressed and facts being AO row compressed and usually partitioned on one of the primary dates.

I just started working with a new data modeling tool called DbSchema (https://dbschema.com/) and gave them code for Greenplum that they recently incorporated into the product so it's my tool of choice.

One warning though, A LOT depends on the BI tool being used. You need to make sure the tool is pushing down the aggregation and only pulling back the columns the user is "slicing and dicing on" as shown by the Tableau examples in the deck (look at the queries from the DB side). Other tools don't generate good queries and want to just extract everything (horrible) in which case columnar wouldn't work well (honestly nothing works well with those tools, might as well just feed them flat files)

reporting_db_design_example.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Doc: new docs demands from slack #217

Doc: new docs demands from slack #217

tuhaihe commented Dec 18, 2024

xinzweb commented Dec 18, 2024

lmugnano4537 commented Dec 20, 2024

Doc: new docs demands from slack #217

Doc: new docs demands from slack #217

Comments

tuhaihe commented Dec 18, 2024

xinzweb commented Dec 18, 2024

lmugnano4537 commented Dec 20, 2024