
Stale Case Data Mail Report #35446

Merged: 17 commits into master from ze/stale-case-data-mail-report on Dec 17, 2024

Conversation

@zandre-eng (Contributor) commented Nov 26, 2024:

Product Description

No user-facing changes. This new report is only sent out via email and is not included in HQ's UI.

Technical Summary

Link to ticket here.

A new periodic Celery task is created that sends an email report to the SOLUTIONS_AES_EMAIL address on a monthly basis. This is a simple report that aggregates case counts for domains that have open cases which have not been modified in over a year.
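
For orientation, here is a minimal sketch of what such a monthly task could look like, assuming a plain Celery beat schedule and Django's EmailMessage; the task name, schedule, and the as_csv() helper are illustrative assumptions, not the PR's actual code:

    from celery import Celery
    from celery.schedules import crontab
    from django.conf import settings
    from django.core.mail import EmailMessage

    # Assumed import of the PR's table class, e.g.:
    # from corehq.apps.data_interfaces.utils import StaleCasesTable

    app = Celery('hq')  # illustrative; HQ configures its own Celery app

    # Illustrative beat schedule: first day of each month at 02:00.
    app.conf.beat_schedule = {
        'stale-case-data-report': {
            'task': 'tasks.send_stale_case_data_report',
            'schedule': crontab(day_of_month='1', hour=2, minute=0),
        },
    }

    @app.task(name='tasks.send_stale_case_data_report')
    def send_stale_case_data_report():
        """Email a CSV of stale-case counts to the solutions team."""
        table = StaleCasesTable()
        csv_content = table.as_csv()  # hypothetical helper returning CSV text
        msg = EmailMessage(
            subject='Stale case data report',
            body='Case counts for domains with open cases untouched for over a year.',
            to=[settings.SOLUTIONS_AES_EMAIL],
        )
        msg.attach('stale_case_data.csv', csv_content, 'text/csv')
        msg.send()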

The query logic itself implements date chunking, meaning the queries are made in date slices of 180 days at a time, going back 20 years. This cut-off has been picked because prod testing has shown that this is roughly where the oldest cases reside.

Additionally, the query logic implements backoff: if a query times out, the date slice is made thinner by 30 days and retried. If this happens more than 2 times, the table stops aggregating data entirely and simply returns the data it has aggregated so far. This logic has been implemented specifically with production's scale in mind, as the query can sometimes time out on the 1-2 most recent time slices due to the amount of data there is to aggregate.
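
A minimal sketch of the chunking-plus-backoff loop described above; the get_counts_for_range callable stands in for the actual ES aggregation query, and the names and exact retry flow are assumptions, not the PR's code:

    from datetime import datetime, timedelta

    SLICE_DAYS = 180      # initial width of each date slice
    BACKOFF_DAYS = 30     # narrow the slice by this much after a timeout
    MAX_TIMEOUTS = 2      # give up entirely after more timeouts than this
    LOOKBACK_YEARS = 20   # prod testing shows the oldest cases live around here

    def aggregate_stale_case_counts(get_counts_for_range):
        """Walk backwards through time in date slices, narrowing on timeouts.

        ``get_counts_for_range(start, end)`` is assumed to return a
        ``{domain: case_count}`` dict and raise ``TimeoutError`` when the
        underlying ES query times out. Returns ``(agg_res, has_error)``,
        where ``has_error`` signals a partial result.
        """
        agg_res = {}
        end = datetime.now() - timedelta(days=365)  # the stale threshold
        oldest = end - timedelta(days=365 * LOOKBACK_YEARS)
        slice_days = SLICE_DAYS
        timeouts = 0
        while end > oldest:
            start = end - timedelta(days=slice_days)
            try:
                for domain, count in get_counts_for_range(start, end).items():
                    agg_res[domain] = agg_res.get(domain, 0) + count
            except TimeoutError:
                timeouts += 1
                if timeouts > MAX_TIMEOUTS:
                    return agg_res, True  # give up; return partial data
                slice_days -= BACKOFF_DAYS  # retry this slice, but thinner
                continue
            end = start  # move on to the next (older) slice
        return agg_res, False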

The goal of this report is to identify domains that could be candidates for implementing auto-update rules.

Feature Flag

None

Safety Assurance

Safety story

  • Local testing
  • Unit tests

The Celery task has also been tested on the staging environment to ensure that the email and attachment are correctly sent.

Automated test coverage

Unit tests exist for the new StaleCasesTable class and verify that it retrieves the correct data and formats it appropriately.
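
As a rough illustration of what such a test might look like (the patch target mirrors the _aggregate_case_count_data method seen later in this review; the import path and row shape are assumptions):

    from unittest import TestCase
    from unittest.mock import patch

    # Assumed import of the class under test, e.g.:
    # from corehq.apps.data_interfaces.utils import StaleCasesTable

    class TestStaleCasesTable(TestCase):

        @patch.object(StaleCasesTable, '_aggregate_case_count_data',
                      return_value={'domain-a': 12, 'domain-b': 7})
        def test_rows_formats_aggregated_counts(self, _mock_agg):
            table = StaleCasesTable()
            # Expected shape is an assumption: one (domain, case_count)
            # row per domain that has stale open cases.
            self.assertEqual(sorted(table.rows),
                             [('domain-a', 12), ('domain-b', 7)])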

QA Plan

No QA planned.

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@zandre-eng added the product/invisible (Change has no end-user visible impact) label Nov 26, 2024
@zandre-eng zandre-eng marked this pull request as ready for review November 26, 2024 12:40
@orangejenny (Contributor) commented:

Code looks fine, though I only skimmed it. Do you know how many domains currently meet these criteria on prod? I wonder if it would be helpful to filter them, possibly by subscription, to domains that solutions is most likely to take action on.

@kaapstorm (Contributor) left a comment:

This is nice. Just a couple of suggestions.

    )

    @property
    def rows(self):
@kaapstorm (Contributor) commented Nov 29, 2024:

Using table.rows twice will run the query twice. How about something like this?

    def __init__(self):
        self._rows = None

    @property
    def rows(self):
        if self._rows is None:
            self._rows = []
            # etc.
        return self._rows

@zandre-eng (Author) replied:
I like this idea! Currently the "caching" is done in the task itself by calling table_rows = StaleCasesTable().rows and then using the table_rows var locally, but the above suggestion will remove the need for storing the rows locally in the task.

Comment on lines 538 to 539
    @staticmethod
    def format_as_table(row_data, data_tables_header):
@kaapstorm (Contributor) commented:

This is a useful function, but it is more generic than this class, and it expects its headers to be a list of DataTablesHeader instances. For those reasons, I would pull it out of here, move it somewhere under corehq/apps/reports/datatables/, and maybe rename it. If you feel that having a method is useful, you could change this to something like:

    def format_as_table(self):
        return datatable_as_text(self.rows, self.headers)

(Obviously assumes that self.rows doesn't run the query again.)

(Not applicable here because it's probably not worth an additional requirement, but useful to know about: Tablib is cool, and it can do this. (Also, I'm biased, because I wrote that bit.))

@zandre-eng (Author) replied:

That is a good point. We no longer need this function as I'll be refactoring the task to save the data to a CSV file instead, so I'm going to remove this method completely.
I thought it might be useful to keep it around and put it in corehq/apps/reports/datatables as you suggested for future use but, that said, YAGNI.

@ajeety4 (Contributor) left a comment:

Looks good to me.

@zandre-eng (Author) commented:

To make the ES query performant enough and consistently successful, I have refactored how the report does its query a bit. Now the ES query will aggregate data in date slices of 180 days at a time.
I have also implemented backoff logic so that the date slices become thinner if the query times out. When testing on production I have noted that this may happen, and it looks to be dependent on whether the index has been cached in some way. The worst case I have witnessed on production is that the first 1-2 queries time out before all of them eventually complete successfully.

Please do let me know if you have any thoughts/concerns with the above strategy; I am happy to discuss alternatives as well.

        f'{table.STALE_DATE_THRESHOLD_DAYS} days.'
    )
    if has_error:
        message += (
A reviewer (Contributor) commented:
nit: In this scenario, I think it is probably better to emphasize the error message first and then mention that no domains were found in the partial compilation (or just keep the error message). I say this because I am not sure how relevant the above message is to the end user in case of an error.

@zandre-eng (Author) replied:

That is a good point. It could be confusing to have a message that says no domains were found followed by one that says there was an error. I'll refactor this so that the default no-domains message is only shown if there wasn't an error.

@zandre-eng (Author) replied:

Addressed in ee92b43

@ajeety4 (Contributor) commented Dec 13, 2024:

> To make the ES query performant enough and consistently successful, I have refactored how the report does its query a bit. Now the ES query will aggregate data in date slices of 180 days at a time. I have also implemented backoff logic so that the date slices become thinner if the query times out. When testing on production I have noted that this may happen, and it looks to be dependent on whether the index has been cached in some way. The worst case I have witnessed on production is that the first 1-2 queries time out before all of them eventually complete successfully.
>
> Please do let me know if you have any thoughts/concerns with the above strategy; I am happy to discuss alternatives as well.

A great idea to split this into days. The other alternative would be to split by domain; however, days seem like a better candidate for achieving an even distribution. Good thinking on having the backoff strategy to make it more reliable.
I don't foresee any concerns, and this does seem like a step up from the previous approach.

@mkangia (Contributor) commented Dec 13, 2024:

Hey @zandre-eng

Just checking: have we considered the opposite approach?
Fetch domains that have had a case update in the last year, and then filter those out of a list of relevant domains, like the ones that have an active subscription?

@zandre-eng (Author) commented:

> Just checking: have we considered the opposite approach?
> Fetch domains that have had a case update in the last year, and then filter those out of a list of relevant domains, like the ones that have an active subscription?

@mkangia Testing this out, the query times out when trying to fetch all domains that have had a case update in the last year. To be able to fetch this we would need to implement date chunking here as well, and I believe the complexity/cost of implementing these further queries outweighs the benefit we get from filtering out these extra domains.

@zandre-eng force-pushed the ze/stale-case-data-mail-report branch from b228b40 to 7ec4e5b on December 13, 2024 09:31
@mkangia (Contributor) commented Dec 13, 2024:

Hey @zandre-eng

> Testing this out, the query times out when trying to fetch all domains that have had a case update in the last year.

Oh wow! I thought that would perform better.
I see that you have tried to make things work in the best possible way with case ES. It does appear it will continue to face challenges as we get more and more cases in the system.

One suggestion I could think of is to use forms instead.
It seems FormES already supports filtering by submission date, as seen here, so it should work better.
Maybe it's worth giving it a go.

I believe you could also improve either approach by iterating over active subscriptions, finding the domains for those, and then looking at their last case update or form submission, flagging them as inactive if needed.
Basically, instead of fetching information from all cases, fetch it only for the relevant domains/subscriptions.

Feel free to go with the approach currently in the PR if you think that will work well, or keep these as follow-ups. It does seem the requirement here is to find inactive domains, and that can be done in different ways.

@zandre-eng (Author) commented:

> Oh wow! I thought that would perform better.

@mkangia Docs in the case_search index are generally larger and more complex than those in the case index, so my suspicion is that this difference is big enough to affect performance quite a bit.

> One suggestion I could think of is to use forms instead.
> It seems FormES already supports filtering by submission date, as seen here, so it should work better.
> Maybe it's worth giving it a go.

We could query using FormES with something like:

from datetime import datetime, timedelta

from corehq.apps.es import FormES
from corehq.apps.es.aggregations import TermsAggregation

stale_date = datetime.now() - timedelta(days=365)
(
    FormES()
    .submitted(lt=stale_date)
    .aggregation(
        TermsAggregation('domain', 'domain.exact')
    )
    .size(0)
).run().aggregations.domain.counts_by_bucket().keys()

The only problem with the above, however, is that we can't easily filter out forms that affect currently closed cases. This means querying for forms older than a year will return far more domains than we're interested in. We would need to further filter down this list by some other means, such as using the Subscription model. At that point, I feel the form query isn't adding much benefit.

> I believe you could also improve either approach by iterating over active subscriptions, finding the domains for those, and then looking at their last case update or form submission, flagging them as inactive if needed.
> Basically, instead of fetching information from all cases, fetch it only for the relevant domains/subscriptions.

Would this not significantly increase the number of queries we would need to do? The Subscription.get_active_domains_for_account() function does a query to fetch all the active subscription domains for an account. For each account containing active subscription domains we would then need to run an aggregation query to identify which domains contain stale cases.

If we had 100 active accounts this would already result in 200 queries, which is more than the ~38 we're doing now. Please do let me know if I misunderstood the above.

@kaapstorm (Contributor) left a comment:

All good! Just a suggestion.


    def _aggregate_case_count_data(self):
        end_date = datetime.now() - timedelta(days=self.STALE_DATE_THRESHOLD_DAYS)
        agg_res = {}
@kaapstorm (Contributor) commented:

No need to change what you have. Just a suggestion: If you do this ...

Suggested change:
-    agg_res = {}
+    agg_res = defaultdict(lambda: 0)

... then you can do this ...

    def _merge_agg_data(self, agg_res, query_res):
        for domain, case_count in query_res.items():
            agg_res[domain] += case_count
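
For completeness, a self-contained version of this pattern; the standalone function and sample data here are illustrative, and defaultdict(int) is equivalent to defaultdict(lambda: 0) (note the required import):

    from collections import defaultdict

    def merge_agg_data(agg_res, query_res):
        # Accumulate per-domain case counts across date slices; unseen
        # domains start at 0 thanks to the defaultdict.
        for domain, case_count in query_res.items():
            agg_res[domain] += case_count

    agg_res = defaultdict(int)
    merge_agg_data(agg_res, {'domain-a': 10, 'domain-b': 3})
    merge_agg_data(agg_res, {'domain-a': 5})
    assert dict(agg_res) == {'domain-a': 15, 'domain-b': 3}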

@zandre-eng (Author) replied:

This is a nice improvement! I completely forgot about the handy defaultdict type. I'll add this in.

@zandre-eng (Author) replied:

Addressed in 25d9dc5

@mkangia (Contributor) commented Dec 16, 2024:

Hey @zandre-eng

> The only problem with the above, however, is that we can't easily filter out forms that affect currently closed cases.

Do you mean system forms?
If yes, those probably can be filtered out since they have the same XMLNS.

> Would this not significantly increase the number of queries we would need to do? The Subscription.get_active_domains_for_account() function does a query to fetch all the active subscription domains for an account. For each account containing active subscription domains we would then need to run an aggregation query to identify which domains contain stale cases.

That is correct, but they would be more queries that could run fast and effectively. I am assuming that running a full search through ES at once is going to become a pain over time.

I'll be honest that I have not considered all the variables here, so I am just suggesting things from a high-level understanding. Feel free to keep the original implementation if that is the better one.

@zandre-eng (Author) commented:

Hey @mkangia

> Do you mean system forms?
> If yes, those probably can be filtered out since they have the same XMLNS.

That is a good idea; however, I'm still not aware of a way to easily filter out non-system forms that affect currently closed cases. I imagine this would still require some additional queries to figure out which forms affect which cases.

> That is correct, but they would be more queries that could run fast and effectively. I am assuming that running a full search through ES at once is going to become a pain over time.

We're not doing an entirely full search, since we fetch a list of active domains above the community plan before doing the ES queries. I believe these queries should get faster as domains start to close off their old cases.

> I'll be honest that I have not considered all the variables here, so I am just suggesting things from a high-level understanding. Feel free to keep the original implementation if that is the better one.

Thanks for the brainstorming suggestions, these are some interesting ideas to think about. Given that the full set of queries completes in under a minute or two, and that this internal-only report will run very infrequently at off-peak times, I'm going to leave the implementation as is for now. We can always follow up if any of the mentioned conditions change.

@mkangia (Contributor) commented Dec 17, 2024:

Sounds good to me @zandre-eng 👍
Thanks for discussing the approach

@zandre-eng zandre-eng merged commit d2e21b1 into master Dec 17, 2024
13 checks passed
@zandre-eng zandre-eng deleted the ze/stale-case-data-mail-report branch December 17, 2024 08:17