Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doc: new docs demands from slack #217

Open
tuhaihe opened this issue Dec 18, 2024 — with Slack · 2 comments
Open

Doc: new docs demands from slack #217

tuhaihe opened this issue Dec 18, 2024 — with Slack · 2 comments

Comments

Copy link
Member

tuhaihe commented Dec 18, 2024

A colleague of mine asked the following questions:

My questions around a cloudberry DB that is used as a datawarehouse (OLAP applications, reporting/BI)
• any specific optimization options for analytical queries / aggregation functions
• best practices in the data model design (star/snowflake-schema; normalized vs. denormalized; redundancy issues)
• best practices in the configuration
• any thoughts on massive deletes
• any thoughts on doing outer joins of 10+ tables
• any specific recommendations regarding partioning, indexing, views/materialized views,...

Slack Message

@xinzweb
Copy link

xinzweb commented Dec 18, 2024

Few ideas here based on experiences:

  • Use ORCA and encourage hash join and hash aggregate for OLAP queries
  • Use "star", don't use "snowflake" to reduce the number of joins, denormalize your data to change join into filters
  • Configuration, this deserve its own book
  • Partition your data, so that you don't delete, just drop the entire partition
  • Outer join of 10+ tables? well, answered in the first one, rely on ORCA, or answered in the second one, use star schema to reduce the number of joins
  • Unless you do singleton lookup, don't index. Views are only for management purpose, not for performance. Materialized views would be a great idea if you have a relative fixed workloads. For the partition, make your partition medium sized, and leverage the static and dynamic partition elimination.

Again, those are very high level ideas to optimize for "append-only, truncate-only, mostly read-only" OLAP workloads. If you are thinking OLTP or any HTAP workloads, the above recommendation will not work.

I agree with @tuhaihe that, we need to have extended documentation with general guidance as well as real world cases to demonstrate the tradeoff for the guidance above. Those are great questions, and please keep asking. Thanks.

@lmugnano4537
Copy link

Here is a deck I share with my clients who are building a data warehouse (deck is generic and part of larger demo I do). It's showing generic code I built and provide for customers but it also shows my conceptual best practices. You can see I lean towards a Kimball based model (star or snowflake) because most of the time I'm pushing my customers towards a self-service BI model and feel dimensional modeling is the easiest for business users to understand. Physical design on this for me generally leads towards dimensions being AO columnar compressed and facts being AO row compressed and usually partitioned on one of the primary dates.

I just started working with a new data modeling tool called DbSchema (https://dbschema.com/) and gave them code for Greenplum that they recently incorporated into the product so it's my tool of choice.

One warning though, A LOT depends on the BI tool being used. You need to make sure the tool is pushing down the aggregation and only pulling back the columns the user is "slicing and dicing on" as shown by the Tableau examples in the deck (look at the queries from the DB side). Other tools don't generate good queries and want to just extract everything (horrible) in which case columnar wouldn't work well (honestly nothing works well with those tools, might as well just feed them flat files)

reporting_db_design_example.pdf

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants