Stale Case Data Mail Report #35446
Conversation
Code looks fine, though I only skimmed it. Do you know how many domains currently meet these criteria on prod? I wonder if it would be helpful to filter them, possibly by subscription, to domains that Solutions is most likely to take action on.
This is nice. Just a couple of suggestions.
@property
def rows(self):
Using table.rows twice will run the query twice. How about something like this?
def __init__(self):
    self._rows = None

@property
def rows(self):
    if self._rows is None:
        self._rows = []
        # etc.
    return self._rows
I like this idea! Currently the "caching" is done in the task itself by calling table_rows = StaleCasesTable().rows and then using the table_rows var locally, but the above suggestion will remove the need for storing the rows locally in the task.
corehq/apps/hqadmin/reports.py (outdated)
@staticmethod
def format_as_table(row_data, data_tables_header):
This is a useful function, but it is more generic than this class, and it expects its headers to be a list of DataTablesHeader instances. For those reasons, I would pull it out of here, move it somewhere under corehq/apps/reports/datatables/, and maybe rename it. If you feel that having a method is useful, you could change this to something like:
def format_as_table(self):
    return datatable_as_text(self.rows, self.headers)
(Obviously assumes that self.rows doesn't run the query again.)
(Not applicable here because it's probably not worth an additional requirement, but useful to know about: Tablib is cool, and it can do this. (Also, I'm biased, because I wrote that bit.))
That is a good point. We no longer need this function, as I'll be refactoring the task to save the data to a CSV file instead. I'm therefore going to completely remove this method.
I thought it might be useful to keep it around and put it in corehq/apps/reports/datatables as you suggested for future use, but in saying that, YAGNI.
Looks good to me.
To make the ES query performant enough and consistently successful, I have refactored how the report does its query a bit. The ES query now aggregates data in date slices of 180 days at a time. Please do let me know if you have any thoughts/concerns with the above strategy; I'm happy to discuss alternatives as well.
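For context, here is a minimal sketch of the date-slicing idea. Every name here, including the query_case_counts callable, is an illustrative stand-in rather than the actual HQ implementation:

from datetime import datetime, timedelta

SLICE_DAYS = 180

def aggregate_in_slices(query_case_counts, years_back=20):
    # query_case_counts(start, end) is assumed to run one ES aggregation
    # and return {domain: case_count} for that date window.
    end = datetime.utcnow()
    earliest = end - timedelta(days=365 * years_back)
    totals = {}
    while end > earliest:
        start = max(end - timedelta(days=SLICE_DAYS), earliest)
        for domain, count in query_case_counts(start, end).items():
            totals[domain] = totals.get(domain, 0) + count
        end = start
    return totals

Each iteration issues one bounded aggregation instead of a single unbounded query, which is what keeps the individual requests from timing out.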
corehq/apps/hqwebapp/tasks.py (outdated)
    f'{table.STALE_DATE_THRESHOLD_DAYS} days.'
)
if has_error:
    message += (
nit: In this scenario, I think it is probably better to emphasize the error message first and then mention that no domains were found in the partial compilation (or just keep the error message). I say this because I am not sure how relevant the above message is to the end user in case of an error.
That is a good point. It could be confusing to have a message that says no domains were found followed by one that says there was an error. I'll refactor this so that the default no-domains message is only shown if there wasn't an error.
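Roughly, that refactor could look like the following sketch. The names has_error, table_rows and table mirror snippets visible in this diff, and the exact message wording is illustrative:

if has_error:
    message = 'An error occurred while compiling the report.'
elif not table_rows:
    message = (
        'No domains found with open cases that have been inactive '
        f'for over {table.STALE_DATE_THRESHOLD_DAYS} days.'
    )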
Addressed in ee92b43
A great idea to split this into days. The other alternative would be to split by domain; however, days seem like a better candidate for achieving an even distribution. Good thinking on having the backoff strategy to make it more reliable.
Hey @zandre-eng Just checking if we considered the opposite approach?
@mkangia Testing this out, the query times out when trying to fetch all domains that have had a case update in the last year. To be able to fetch this we would need to implement date chunking here as well, and I believe the complexity/cost of implementing these further additional queries outweighs the benefits we get from filtering out these extra domains.
b228b40 to 7ec4e5b (Compare)
Hey @zandre-eng
Oh wow! I thought that would perform better. One suggestion I could think of is to use forms instead. I believe you could also make this better for either approach by iterating over active subscriptions, finding domains for those, and then looking for their last case update or form submission, flagging them as inactive if needed. Feel free to go with the approach currently in the PR if you think that will work well, or keep these as a follow-up. It does seem the requirement here is to find inactive domains, and that can be done in different ways.
@mkangia Docs in the
We could query using
The only problem with the above, however, is that we can't easily filter out forms that affect currently closed cases. This means querying for forms older than a year will return a lot more domains than we're interested in. We would need to further filter down this list by some other means, such as using the
Would this not significantly increase the number of queries we would need to do? If we had 100 active accounts this would already result in 200 queries, which is more than the ~38 we're doing now. Please do let me know though if I misunderstood the above.
All good! Just a suggestion.
corehq/apps/hqadmin/reports.py (outdated)
def _aggregate_case_count_data(self):
    end_date = datetime.now() - timedelta(days=self.STALE_DATE_THRESHOLD_DAYS)
    agg_res = {}
No need to change what you have. Just a suggestion: If you do this ...
- agg_res = {}
+ agg_res = defaultdict(lambda: 0)
... then you can do this ...
def _merge_agg_data(self, agg_res, query_res):
    for domain, case_count in query_res.items():
        agg_res[domain] += case_count
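(This assumes from collections import defaultdict at the top of the module; defaultdict(int) would behave the same and is the more common spelling.)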
This is a nice improvement! I completely forgot about the handy defaultdict type. I'll add this in.
Addressed in 25d9dc5
Hey @zandre-eng
Do you mean system forms?
That is correct. More queries, but ones that could run fast and effectively. I am assuming that running a full search through ES at once is going to be a pain over time. I'll be honest that I have not considered all the variables here, so I am just suggesting things from a high-level understanding. Feel free to keep the original implementation if that is the better one.
Hey @mkangia
That is a good idea; however, I'm still not aware of a way to easily filter out non-system forms that affect currently closed cases. I imagine this would still result in some additional queries to try and figure out which forms affect which cases.
We're not entirely doing a full search, since we do fetch a list of active domains above the community plan before doing the ES queries. I believe these queries should get faster as domains start to close off their old cases.
Thanks for the brainstorming suggestions, these are some interesting ideas to think about. Given that the total query completes in under a minute or two, and that this will be run very infrequently as an internal-only report at off-peak times, I'm going to leave the implementation as is for now. We can always follow up if any of the mentioned conditions change.
Sounds good to me @zandre-eng 👍
Product Description
No user-facing changes. This new report is only sent out via mail and is not included in HQ's UI.
Technical Summary
Link to ticket here.
A new periodic Celery task is created with the responsibility of sending out an email report to the SOLUTIONS_AES_EMAIL address on a monthly basis. This is a simple report that aggregates case counts for domains that have open cases that have not been modified in over a year.
The query logic itself has date chunking implemented, which means the queries are made in date slices of 180 days at a time, going back 20 years. This cutoff has been picked because prod testing has shown that this is about where the oldest cases reside.
Additionally, the query logic implements backoff: if a query times out, the date slice is made thinner by 30 days. If this happens more than 2 times, the table stops aggregating data entirely and simply returns the data it has gathered so far. This logic has been implemented specifically with production's scale in mind, as the query can sometimes time out on the 1-2 most recent time slices due to the amount of data there is to aggregate. A sketch of the loop follows.
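A hedged sketch of this chunk-and-backoff behaviour, using the same illustrative names as the slicing sketch earlier in the thread (TimeoutError stands in for whatever timeout exception the ES client actually raises; none of this is the literal PR code):

from datetime import datetime, timedelta

SLICE_DAYS = 180
BACKOFF_DAYS = 30
MAX_BACKOFFS = 2

def aggregate_with_backoff(query_case_counts):
    # Only cases untouched for over a year are of interest, and prod
    # testing suggests the oldest cases sit roughly 20 years back.
    end = datetime.utcnow() - timedelta(days=365)
    earliest = datetime.utcnow() - timedelta(days=365 * 20)
    slice_days, backoffs, totals = SLICE_DAYS, 0, {}
    while end > earliest:
        start = max(end - timedelta(days=slice_days), earliest)
        try:
            for domain, count in query_case_counts(start, end).items():
                totals[domain] = totals.get(domain, 0) + count
        except TimeoutError:
            backoffs += 1
            if backoffs > MAX_BACKOFFS:
                break  # stop early; return whatever has been aggregated so far
            slice_days -= BACKOFF_DAYS  # thin the slice and retry this window
            continue
        end = start
    return totals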
The goal of this report is to identify which domains could be potential candidates for auto-update rules to be implemented.
Feature Flag
None
Safety Assurance
Safety story
The Celery task has been tested on the staging environment to ensure that the email and attachment are correctly sent.
Automated test coverage
Unit tests exist for the new StaleCasesTable class and test to make sure that it retrieves the correct data and formats it appropriately.
QA Plan
No QA planned.
Rollback instructions
Labels & Review