Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not store frequently updated foreign key relation on model #14149

Open
4 of 9 tasks
kdelee opened this issue Jun 20, 2023 · 1 comment
Open
4 of 9 tasks

Do not store frequently updated foreign key relation on model #14149

kdelee opened this issue Jun 20, 2023 · 1 comment

Comments

@kdelee
Copy link
Member

kdelee commented Jun 20, 2023

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Feature type

Enhancement to Existing Feature

Feature Summary

On several models, we store information about a relationship to a foreign key that is frequently updated, and potentially risky for deadlocks. By saving this foreign key as a field on the model, we are optimizing read performance for information that is probably more frequently written than it is read.

e.g. we are optimizing for read performance when we probably ought to to be optimizing for write performance.

To be specific:

last_job_host_summary = models.ForeignKey(
'JobHostSummary',
related_name='hosts_as_last_job_summary+',
blank=True,
null=True,
default=None,
editable=False,
on_delete=models.SET_NULL,
)

last_job = models.ForeignKey(
'UnifiedJob',
null=True,
default=None,
editable=False,
related_name='%(class)s_as_last_job+',
on_delete=models.SET_NULL,
)
last_job_failed = models.BooleanField(
default=False,
editable=False,
)

current_job = models.ForeignKey(
'UnifiedJob',
null=True,
default=None,
editable=False,
related_name='%(class)s_as_current_job+',
on_delete=models.SET_NULL,
)

There is also:

last_job_run = models.DateTimeField(
null=True,
default=None,
editable=False,
)
# on_missed_schedule = models.CharField(
# max_length=32,
# choices=[],
# )
next_job_run = models.DateTimeField(
null=True,
default=None,
editable=False,
)
But that starts to get into schedules and I think requires further thinking.

For last_job_host_summary, last_job, current_job, and last_job_failed I propose we are doing more harm than good by storing these on the model. This is because job templates can have many jobs launched from them, and hosts can have many jobs running against them.

This causes us to run into potential deadlocks when multiple processes attempt to update same attribute on single hosts and job templates. I have personally observed this when testing with high volumes of small jobs against a few hosts. Increasing number of callback receivers increases possibility of running into this.

Problem areas:

# update the last_job_id and last_job_host_summary_id
# in single queries
host_mapping = dict((summary['host_id'], summary['id']) for summary in JobHostSummary.objects.filter(job_id=job.id).values('id', 'host_id'))
updated_hosts = set()
for h in all_hosts:
# if the hostname *shows up* in the playbook_on_stats event
if h.name in hostnames:
h.last_job_id = job.id
updated_hosts.add(h)
if h.id in host_mapping:
h.last_job_host_summary_id = host_mapping[h.id]
updated_hosts.add(h)
Host.objects.bulk_update(list(updated_hosts), ['last_job_id', 'last_job_host_summary_id'], batch_size=100)

awx/awx/main/signals.py

Lines 286 to 306 in fb8fadc

def _update_host_last_jhs(host):
jhs_qs = JobHostSummary.objects.filter(host__pk=host.pk)
try:
jhs = jhs_qs.order_by('-job__pk')[0]
except IndexError:
jhs = None
update_fields = []
try:
last_job = jhs.job if jhs else None
except Job.DoesNotExist:
# The job (and its summaries) have already been/are currently being
# deleted, so there's no need to update the host w/ a reference to it
return
if host.last_job != last_job:
host.last_job = last_job
update_fields.append('last_job')
if host.last_job_host_summary != jhs:
host.last_job_host_summary = jhs
update_fields.append('last_job_host_summary')
if update_fields:
host.save(update_fields=update_fields)

if parent_instance:
if self.status in ('pending', 'waiting', 'running'):
if parent_instance.current_job != self:
parent_instance_set('current_job', self)
# Update parent with all the 'good' states of it's child
if parent_instance.status != self.status:
parent_instance_set('status', self.status)
elif self.status in ('successful', 'failed', 'error', 'canceled'):
if parent_instance.current_job == self:
parent_instance_set('current_job', None)
parent_instance_set('last_job', self)
parent_instance_set('last_job_failed', self.failed)

Proposed alternative:
These values can be quickly calculated in the serializer or as a property on the job, the query would be quite cheap, something along lines of (psuedocode)

@property
def last_job(self):
    return UnifiedJobs.objects.filter(job_template_id=self.id, status__in=['successful', 'failed', 'error', 'canceled']).latest()

This uses https://docs.djangoproject.com/en/dev/ref/models/querysets/#latest but there is equivalent ways to do it if thats not available in our version of django

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Steps to reproduce

One way to force this would be to run many (hundreds or thousands) of small jobs from 1 job template (allow concurrent jobs) (have playbook have just 1 task) against 1 host. Increase the number of callback receiver workers to exacerbate.

Current results

Deadlocks

Sugested feature result

Calculate these values on read
Should not show in the list view as that would be expensive (only show in detail view)

Additional information

No response

@AlanCoding
Copy link
Member

I am trying to muck with these fields already in #13553 for performance reasons, and have asked around before about getting performance validation of this to clear it for merge.

The issue that motivated that change was about deadlocks, but in that case it was the task manager acting on an approval job and deadlocking deterministically, not as a race condition. The full reason was not identified, but I believe it was a state inconsistency in the task manager inside of a single run, probably as multiple approval jobs are changing state at the same time. The thing which was so frustrating was the the model .save code was doing a tremendous number of round-trips to the database to load objects which should already be present, but programmers do this defensively in distrust of the current model, not realizing that it can later on lead to even more insidious problems because changes which should be no-op are not identified correctly, and transactional conflicts arise on the database layer.

These values can be quickly calculated in the serializer or as a property on the job

While I agree something should be done, this is a non-starter because of the consequences for serializer performance. The challenge isn't so-much Django, but the django-polymorphic library we use. The query you give is more-or-less impossible to prefetch, as we have ~6 types, and we would hit all those tables to prefetch (even if the notorious jazzband/django-polymorphic#198 had a clean resolution), but we have on the order of a dozen similar fields, making the benefit marginal or negative compared to no optimization.

I would love to grant our API the freedom to remove fields like these, but there is a strong user experience case to be made for a field like last_job, as the user needs to be able to see the current status of templates in the list. More importantly, we have failed to recapture performance gains from things the UI has already removed, specifically summary_fields.recent_jobs.

The branch I had here was specifically put together to combine multiple non-interference improvements for model updates. I believe this is addressing the concerns you cite, but it is not attempting to make any change to the existing API contract, instead doing only purely internal changes:

https://github.com/ansible/awx/compare/devel...AlanCoding:awx:perf_pool?expand=1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants