Create DomainMetrics SQL table #35415
base: master
Conversation
corehq/apps/data_analytics/models.py
Outdated

    class DomainStats(models.Model):
        domain = models.TextField(unique=True, db_index=True)
        stats = models.JSONField()
Do all stats have the same data type? Could they be normalized out so this table would have a name and a value field, with one row for each stat name/value pair?
Normalization can also be achieved if values have different data types, although it gets a bit more complicated.
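To make the suggestion concrete, here is a minimal sketch of the normalized name/value layout, using sqlite3 purely for illustration (the real table would be a Django model on Postgres; table, column, and stat names here are hypothetical, not from the PR):

```python
import sqlite3

# One row per domain/stat pair instead of a single JSON blob per domain.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE domain_stat (
        domain TEXT NOT NULL,
        name   TEXT NOT NULL,
        value  INTEGER,
        UNIQUE (domain, name)
    )
""")
rows = [
    ("domain-a", "cp_n_forms", 120),
    ("domain-a", "cp_n_cases", 45),
    ("domain-b", "cp_n_forms", 300),
    ("domain-b", "cp_n_cases", 10),
]
conn.executemany("INSERT INTO domain_stat VALUES (?, ?, ?)", rows)

# Collecting all stats for one domain is still a single indexed query...
stats = dict(conn.execute(
    "SELECT name, value FROM domain_stat WHERE domain = ?", ("domain-a",)
))

# ...and aggregate queries across domains become trivial.
total_forms, = conn.execute(
    "SELECT SUM(value) FROM domain_stat WHERE name = 'cp_n_forms'"
).fetchone()
```

The `UNIQUE (domain, name)` constraint doubles as the lookup index for fetching one domain's stats.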
Ah, sorry, I should have highlighted the rationale for lumping these into one JSON field. There are quite a few calculated properties, so it felt "easiest" to lump them into one field, though we had only considered fully denormalizing the data into one row, not a row for each domain/calculated-property pair. That does sound interesting, and it would still be easy to collect all calculated properties for a domain 🤔
I also didn't answer your question. Apart from some boolean fields (which could be represented by integers) and the "cp_last_updated" field (which could be replaced by the last_modified field), these are all integers.
Normalizing the data would make it MUCH easier to run aggregate queries across multiple domains, which seems like it may have some interesting use cases.
@gherceg I believe there are a few other date fields as well: "cp_first_form", "cp_last_form", "cp_300th_form" -- so ultimately we've got dates, integers, and booleans.
@millerdev One thing I'd be concerned about is the performance of updating ~55 rows per domain every night, instead of one large value in a single row as a combined JSONField would do. Even if we're only conditionally updating rows, and maybe only 10 of the properties change on an average domain on a given day, it seems like we'd still have to access all 55 properties for those domains to check whether they need updating. The task to update calculated properties currently takes ~2 hours to run when writing to ES, and I worry it would take even longer with one SQL row per property.
Some caveats, though: I don't personally know how ES compares to SQL in terms of efficiency, so it's possible it would still be fine, and I might be missing some obvious optimization here.
I think it depends on how often the data changes. If we simply overwrite every record every night, that would churn every single row in the table, because Postgres deletes the old record and creates a new one on every update (regardless of using a normalized table or a JSONField). If most of the 55 values do not change, then they can be efficiently updated with a query like

    UPDATE the_table SET value = %(value)s
    WHERE domain = %(domain)s AND name = %(name)s AND value != %(value)s;

which does nothing if the value did not change. A bulk UPSERT could be used if values sometimes need to INSERT instead of UPDATE.
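A small runnable sketch of both ideas, using sqlite3 for illustration (Postgres syntax is nearly identical for the `ON CONFLICT` upsert; the table and stat names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE the_table (
        domain TEXT NOT NULL,
        name   TEXT NOT NULL,
        value  INTEGER,
        UNIQUE (domain, name)
    )
""")
conn.execute("INSERT INTO the_table VALUES ('d1', 'cp_n_forms', 10)")

# Conditional UPDATE: touches the row only when the value changed.
cur = conn.execute(
    "UPDATE the_table SET value = :value "
    "WHERE domain = :domain AND name = :name AND value != :value",
    {"domain": "d1", "name": "cp_n_forms", "value": 10},
)
unchanged = cur.rowcount  # value was already 10, so no row is churned

# UPSERT: INSERTs new stats, and UPDATEs existing ones only on change.
conn.execute(
    "INSERT INTO the_table (domain, name, value) "
    "VALUES (:domain, :name, :value) "
    "ON CONFLICT (domain, name) DO UPDATE SET value = excluded.value "
    "WHERE the_table.value != excluded.value",
    {"domain": "d1", "name": "cp_n_cases", "value": 5},
)
final = dict(conn.execute(
    "SELECT name, value FROM the_table WHERE domain = 'd1'"
))
```

The `WHERE ... != excluded.value` clause on the upsert keeps the no-change case write-free, which is the row-churn concern raised above.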
Ended up adding a dedicated column for every property as that seemed like the simplest approach.
Also just noting that there are guaranteed to be columns that don't need to be recalculated once initially set (like the 300th form submission). It could be nice to build into the task that calculates these values a check for whether it needs to, in the cases where that applies.
Still viewing this as a WIP. I wanted to have a conversation around possible defaults for these fields, but I also think that when we first create a DomainMetrics row for a domain, we should populate every column. Once it exists, we only need to update columns that have changed. That is my logic for not providing any default values, nullable fields, or blank fields.
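The dedicated-column approach with skip-once-set columns can be sketched as follows; this is an illustrative sqlite3 mock-up, and the column names are loosely based on properties mentioned in the thread, not the actual model:

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE domain_metrics (
        domain        TEXT PRIMARY KEY,
        cp_n_forms    INTEGER NOT NULL,
        cp_is_active  INTEGER NOT NULL,  -- boolean stored as 0/1
        cp_first_form TEXT,              -- date, immutable once set
        cp_300th_form TEXT,              -- date, NULL until reached
        last_modified TEXT NOT NULL
    )
""")

# First write for a domain populates every column (no defaults needed)...
conn.execute(
    "INSERT INTO domain_metrics VALUES (?, ?, ?, ?, ?, ?)",
    ("d1", 120, 1, "2024-01-02", None, date.today().isoformat()),
)

# ...and nightly runs can skip immutable columns once set, e.g. only
# recalculate cp_300th_form for domains where it is still NULL.
needs_300th = conn.execute(
    "SELECT domain FROM domain_metrics WHERE cp_300th_form IS NULL"
).fetchall()
```

With one row per domain, "only update columns that have changed" becomes a single UPDATE per domain rather than up to ~55 row writes.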
Product Description
Technical Summary
https://dimagi.atlassian.net/browse/SAAS-16282
@nospame identified an issue in how we store calculated properties for domains in elasticsearch: the ElasticProcessor in pillows is set up to always index documents to elasticsearch. We could change the processor so that it updates instead of indexes (merging new changes into the existing doc instead of replacing it). However, there felt like a lot of unknowns associated with that solution, and it seemed easier to create a new SQL table to store calculated properties for domains.
That is where this PR comes in. This is the first in a series of PRs to switch our calculated properties from ES to SQL. We went with the name DomainStats since that felt concise and appropriate, but we are open to suggestions. We also chose to put it in the data_analytics app since it aligns with that idea and is similar in function to reports like the MALT. This means other calculated-properties code will be moved to this app as we work through this.
Feature Flag
Safety Assurance
Safety story
Adds a new, unused table.
Automated test coverage
QA Plan
No
Migrations
Rollback instructions
Labels & Review