
Stale Case Data Mail Report #35446

Merged: 17 commits into master from ze/stale-case-data-mail-report on Dec 17, 2024

Conversation

@zandre-eng (Contributor) commented Nov 26, 2024:

Product Description

No user-facing changes. This new report is only sent out via email and is not included in HQ's UI.

Technical Summary

Link to ticket here.

A new periodic Celery task is created that sends an email report to the SOLUTIONS_AES_EMAIL address on a monthly basis. This is a simple report that aggregates case counts for domains that have open cases which have not been modified in over a year.
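
For orientation, here is a minimal sketch of what such a monthly task could look like, assuming a plain Celery beat schedule and Django's EmailMessage; the task name, schedule, and the as_csv() helper are illustrative assumptions, not the PR's actual code:

    from celery import Celery
    from celery.schedules import crontab
    from django.conf import settings
    from django.core.mail import EmailMessage

    # Assumed import of the PR's table class, e.g.:
    # from corehq.apps.data_interfaces.utils import StaleCasesTable

    app = Celery('hq')  # illustrative; HQ configures its own Celery app

    # Illustrative beat schedule: first day of each month at 02:00.
    app.conf.beat_schedule = {
        'stale-case-data-report': {
            'task': 'tasks.send_stale_case_data_report',
            'schedule': crontab(day_of_month='1', hour=2, minute=0),
        },
    }

    @app.task(name='tasks.send_stale_case_data_report')
    def send_stale_case_data_report():
        """Email a CSV of stale-case counts to the solutions team."""
        table = StaleCasesTable()
        csv_content = table.as_csv()  # hypothetical helper returning CSV text
        msg = EmailMessage(
            subject='Stale case data report',
            body='Case counts for domains with open cases untouched for over a year.',
            to=[settings.SOLUTIONS_AES_EMAIL],
        )
        msg.attach('stale_case_data.csv', csv_content, 'text/csv')
        msg.send()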

The query logic itself implements date chunking, meaning the queries are made in date slices of 180 days at a time, going back 20 years. This cut-off has been picked because prod testing has shown that this is roughly where the oldest cases reside.

Additionally, the query logic implements backoff: if a query times out, the date slice is made thinner by 30 days and retried. If this happens more than 2 times, the table stops aggregating data entirely and simply returns the data it has aggregated so far. This logic has been implemented specifically with production's scale in mind, as the query can sometimes time out on the 1-2 most recent time slices due to the amount of data there is to aggregate.
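
A minimal sketch of the chunking-plus-backoff loop described above; the get_counts_for_range callable stands in for the actual ES aggregation query, and the names and exact retry flow are assumptions, not the PR's code:

    from datetime import datetime, timedelta

    SLICE_DAYS = 180      # initial width of each date slice
    BACKOFF_DAYS = 30     # narrow the slice by this much after a timeout
    MAX_TIMEOUTS = 2      # give up entirely after more timeouts than this
    LOOKBACK_YEARS = 20   # prod testing shows the oldest cases live around here

    def aggregate_stale_case_counts(get_counts_for_range):
        """Walk backwards through time in date slices, narrowing on timeouts.

        ``get_counts_for_range(start, end)`` is assumed to return a
        ``{domain: case_count}`` dict and raise ``TimeoutError`` when the
        underlying ES query times out. Returns ``(agg_res, has_error)``,
        where ``has_error`` signals a partial result.
        """
        agg_res = {}
        end = datetime.now() - timedelta(days=365)  # the stale threshold
        oldest = end - timedelta(days=365 * LOOKBACK_YEARS)
        slice_days = SLICE_DAYS
        timeouts = 0
        while end > oldest:
            start = end - timedelta(days=slice_days)
            try:
                for domain, count in get_counts_for_range(start, end).items():
                    agg_res[domain] = agg_res.get(domain, 0) + count
            except TimeoutError:
                timeouts += 1
                if timeouts > MAX_TIMEOUTS:
                    return agg_res, True  # give up; return partial data
                slice_days -= BACKOFF_DAYS  # retry this slice, but thinner
                continue
            end = start  # move on to the next (older) slice
        return agg_res, False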

The goal of this report is to identify domains that could be candidates for implementing auto-update rules.

Feature Flag

None

Safety Assurance

Safety story

  • Local testing
  • Unit tests

The Celery task has also been tested on the staging environment to ensure that the email and attachment are correctly sent.

Automated test coverage

Unit tests exist for the new StaleCasesTable class and verify that it retrieves the correct data and formats it appropriately.
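
As a rough illustration of what such a test might look like (the patch target mirrors the _aggregate_case_count_data method seen later in this review; the import path and row shape are assumptions):

    from unittest import TestCase
    from unittest.mock import patch

    # Assumed import of the class under test, e.g.:
    # from corehq.apps.data_interfaces.utils import StaleCasesTable

    class TestStaleCasesTable(TestCase):

        @patch.object(StaleCasesTable, '_aggregate_case_count_data',
                      return_value={'domain-a': 12, 'domain-b': 7})
        def test_rows_formats_aggregated_counts(self, _mock_agg):
            table = StaleCasesTable()
            # Expected shape is an assumption: one (domain, case_count)
            # row per domain that has stale open cases.
            self.assertEqual(sorted(table.rows),
                             [('domain-a', 12), ('domain-b', 7)])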

QA Plan

No QA planned.

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@zandre-eng added the product/invisible (Change has no end-user visible impact) label Nov 26, 2024
@zandre-eng zandre-eng marked this pull request as ready for review November 26, 2024 12:40
@orangejenny (Contributor) commented:

Code looks fine, though I only skimmed it. Do you know how many domains currently meet these criteria on prod? I wonder if it would be helpful to filter them, possibly by subscription, to domains that solutions is most likely to take action on.

@kaapstorm (Contributor) left a comment:

This is nice. Just a couple of suggestions.

    )

    @property
    def rows(self):
@kaapstorm (Contributor) commented Nov 29, 2024:

Using table.rows twice will run the query twice. How about something like this?

    def __init__(self):
        self._rows = None

    @property
    def rows(self):
        if self._rows is None:
            self._rows = []
            # etc.
        return self._rows

@zandre-eng (Author) replied:
I like this idea! Currently the "caching" is done in the task itself by calling table_rows = StaleCasesTable().rows and then using the table_rows var locally, but the above suggestion will remove the need for storing the rows locally in the task.

Comment on lines 538 to 539
    @staticmethod
    def format_as_table(row_data, data_tables_header):
@kaapstorm (Contributor) commented:

This is a useful function, but it is more generic than this class, and it expects its headers to be a list of DataTablesHeader instances. For those reasons, I would pull it out of here, move it somewhere under corehq/apps/reports/datatables/, and maybe rename it. If you feel that having a method is useful, you could change this to something like:

    def format_as_table(self):
        return datatable_as_text(self.rows, self.headers)

(Obviously assumes that self.rows doesn't run the query again.)

(Not applicable here because it's probably not worth an additional requirement, but useful to know about: Tablib is cool, and it can do this. (Also, I'm biased, because I wrote that bit.))

@zandre-eng (Author) replied:

That is a good point. We no longer need this function as I'll be refactoring the task to save the data to a CSV file instead, so I'm going to remove this method completely.
I thought it might be useful to keep it around and put it in corehq/apps/reports/datatables as you suggested for future use but, that said, YAGNI.

@ajeety4 (Contributor) left a comment:

Looks good to me.

@zandre-eng (Author) commented:

To make the ES query performant enough and consistently successful, I have refactored how the report does its query a bit. Now the ES query will aggregate data in date slices of 180 days at a time.
I have also implemented backoff logic so that the date slices become thinner if the query times out. When testing on production I have noted that this may happen, and it looks to be dependent on whether the index has been cached in some way. The worst case I have witnessed on production is that the first 1-2 queries time out before all of them eventually complete successfully.

Please do let me know if you have any thoughts/concerns with the above strategy; I am happy to discuss alternatives as well.

        f'{table.STALE_DATE_THRESHOLD_DAYS} days.'
    )
    if has_error:
        message += (
A reviewer (Contributor) commented:
nit: In this scenario, I think it is probably better to emphasize the error message first and then mention that no domains were found in the partial compilation (or just keep the error message). I say this because I am not sure how relevant the above message is to the end user in case of an error.

@zandre-eng (Author) replied:

That is a good point. It could be confusing to have a message that says no domains were found followed by one that says there was an error. I'll refactor this so that the default no-domains message is only shown if there wasn't an error.

@zandre-eng (Author) replied:

Addressed in ee92b43

@ajeety4 (Contributor) commented Dec 13, 2024:

> To make the ES query performant enough and consistently successful, I have refactored how the report does its query a bit. Now the ES query will aggregate data in date slices of 180 days at a time. I have also implemented backoff logic so that the date slices become thinner if the query times out. When testing on production I have noted that this may happen, and it looks to be dependent on whether the index has been cached in some way. The worst case I have witnessed on production is that the first 1-2 queries time out before all of them eventually complete successfully.
>
> Please do let me know if you have any thoughts/concerns with the above strategy; I am happy to discuss alternatives as well.

A great idea to split this into days. The other alternative would be to split by domain; however, days seem like a better candidate for achieving an even distribution. Good thinking on having the backoff strategy to make it more reliable.
I don't foresee any concerns, and this does seem like a step up from the previous approach.

@mkangia (Contributor) commented Dec 13, 2024:

Hey @zandre-eng

Just checking: have we considered the opposite approach?
Fetch domains that have had a case update in the last year, and then filter those out of a list of relevant domains, like the ones that have an active subscription?

@zandre-eng (Author) commented:

> Just checking: have we considered the opposite approach?
> Fetch domains that have had a case update in the last year, and then filter those out of a list of relevant domains, like the ones that have an active subscription?

@mkangia Testing this out, the query times out when trying to fetch all domains that have had a case update in the last year. To be able to fetch this we would need to implement date chunking here as well, and I believe the complexity/cost of implementing these further queries outweighs the benefit we get from filtering out these extra domains.

@zandre-eng force-pushed the ze/stale-case-data-mail-report branch from b228b40 to 7ec4e5b on December 13, 2024 09:31
@mkangia (Contributor) commented Dec 13, 2024:

Hey @zandre-eng

> Testing this out, the query times out when trying to fetch all domains that have had a case update in the last year.

Oh wow! I thought that would perform better.
I see that you have tried to make things work in the best possible way with case ES. It does appear it will continue to face challenges as we get more and more cases in the system.

One suggestion I could think of is to use forms instead.
It seems FormES already supports filtering by submission date, as seen here, so it should work better.
Maybe it's worth giving it a go.

I believe you could also improve either approach by iterating over active subscriptions, finding the domains for those, and then looking at their last case update or form submission, flagging them as inactive if needed.
Basically, instead of fetching information from all cases, fetch it only for the relevant domains/subscriptions.

Feel free to go with the approach currently in the PR if you think that will work well, or keep these as follow-ups. It does seem the requirement here is to find inactive domains, and that can be done in different ways.

@zandre-eng (Author) commented:

> Oh wow! I thought that would perform better.

@mkangia Docs in the case_search index are generally larger and more complex than those in the case index, so my suspicion is that this difference is big enough to affect performance quite a bit.

> One suggestion I could think of is to use forms instead.
> It seems FormES already supports filtering by submission date, as seen here, so it should work better.
> Maybe it's worth giving it a go.

We could query using FormES with something like:

from datetime import datetime, timedelta

from corehq.apps.es import FormES
from corehq.apps.es.aggregations import TermsAggregation

stale_date = datetime.now() - timedelta(days=365)
(
    FormES()
    .submitted(lt=stale_date)
    .aggregation(
        TermsAggregation('domain', 'domain.exact')
    )
    .size(0)
).run().aggregations.domain.counts_by_bucket().keys()

The only problem with the above, however, is that we can't easily filter out forms that affect currently closed cases. This means querying for forms older than a year will return far more domains than we're interested in. We would need to further filter down this list by some other means, such as using the Subscription model. At that point, I feel the form query isn't adding much benefit.

> I believe you could also improve either approach by iterating over active subscriptions, finding the domains for those, and then looking at their last case update or form submission, flagging them as inactive if needed.
> Basically, instead of fetching information from all cases, fetch it only for the relevant domains/subscriptions.

Would this not significantly increase the number of queries we would need to do? The Subscription.get_active_domains_for_account() function does a query to fetch all the active subscription domains for an account. For each account containing active subscription domains we would then need to run an aggregation query to identify which domains contain stale cases.

If we had 100 active accounts this would already result in 200 queries, which is more than the ~38 we're doing now. Please do let me know if I misunderstood the above.

@kaapstorm (Contributor) left a comment:

All good! Just a suggestion.


    def _aggregate_case_count_data(self):
        end_date = datetime.now() - timedelta(days=self.STALE_DATE_THRESHOLD_DAYS)
        agg_res = {}
@kaapstorm (Contributor) commented:

No need to change what you have. Just a suggestion: If you do this ...

Suggested change:
-    agg_res = {}
+    agg_res = defaultdict(lambda: 0)

... then you can do this ...

    def _merge_agg_data(self, agg_res, query_res):
        for domain, case_count in query_res.items():
            agg_res[domain] += case_count
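
For completeness, a self-contained version of this pattern; the standalone function and sample data here are illustrative, and defaultdict(int) is equivalent to defaultdict(lambda: 0) (note the required import):

    from collections import defaultdict

    def merge_agg_data(agg_res, query_res):
        # Accumulate per-domain case counts across date slices; unseen
        # domains start at 0 thanks to the defaultdict.
        for domain, case_count in query_res.items():
            agg_res[domain] += case_count

    agg_res = defaultdict(int)
    merge_agg_data(agg_res, {'domain-a': 10, 'domain-b': 3})
    merge_agg_data(agg_res, {'domain-a': 5})
    assert dict(agg_res) == {'domain-a': 15, 'domain-b': 3}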

@zandre-eng (Author) replied:

This is a nice improvement! I completely forgot about the handy defaultdict type. I'll add this in.

@zandre-eng (Author) replied:

Addressed in 25d9dc5

@mkangia (Contributor) commented Dec 16, 2024:

Hey @zandre-eng

> The only problem with the above, however, is that we can't easily filter out forms that affect currently closed cases.

Do you mean system forms?
If yes, those probably can be filtered out since they have the same XMLNS.

> Would this not significantly increase the number of queries we would need to do? The Subscription.get_active_domains_for_account() function does a query to fetch all the active subscription domains for an account. For each account containing active subscription domains we would then need to run an aggregation query to identify which domains contain stale cases.

That is correct, but they would be more queries that could run fast and effectively. I am assuming that running a full search through ES at once is going to become a pain over time.

I'll be honest that I have not considered all the variables here, so I am just suggesting things from a high-level understanding. Feel free to keep the original implementation if that is the better one.

@zandre-eng (Author) commented:

Hey @mkangia

> Do you mean system forms?
> If yes, those probably can be filtered out since they have the same XMLNS.

That is a good idea; however, I'm still not aware of a way to easily filter out non-system forms that affect currently closed cases. I imagine this would still require some additional queries to figure out which forms affect which cases.

> That is correct, but they would be more queries that could run fast and effectively. I am assuming that running a full search through ES at once is going to become a pain over time.

We're not doing an entirely full search, since we fetch a list of active domains above the community plan before doing the ES queries. I believe these queries should get faster as domains start to close off their old cases.

> I'll be honest that I have not considered all the variables here, so I am just suggesting things from a high-level understanding. Feel free to keep the original implementation if that is the better one.

Thanks for the brainstorming suggestions, these are some interesting ideas to think about. Given that the full set of queries completes in under a minute or two, and that this internal-only report will run very infrequently at off-peak times, I'm going to leave the implementation as is for now. We can always follow up if any of the mentioned conditions change.

@mkangia (Contributor) commented Dec 17, 2024:

Sounds good to me @zandre-eng 👍
Thanks for discussing the approach

@zandre-eng zandre-eng merged commit d2e21b1 into master Dec 17, 2024
13 checks passed
@zandre-eng zandre-eng deleted the ze/stale-case-data-mail-report branch December 17, 2024 08:17