Do not store frequently updated foreign key relation on model #14149

kdelee · 2023-06-20T17:48:32Z

Please confirm the following

I agree to follow this project's code of conduct.
I have checked the current issues for duplicates.
I understand that AWX is open source software provided for free and that I might not receive a timely response.

Feature type

Enhancement to Existing Feature

Feature Summary

On several models, we store information about a relationship to a foreign key that is frequently updated, and potentially risky for deadlocks. By saving this foreign key as a field on the model, we are optimizing read performance for information that is probably more frequently written than it is read.

e.g. we are optimizing for read performance when we probably ought to to be optimizing for write performance.

To be specific:

awx/awx/main/models/inventory.py

Lines 554 to 562 in fb8fadc

    
           last_job_host_summary = models.ForeignKey( 
        
               'JobHostSummary', 
        
               related_name='hosts_as_last_job_summary+', 
        
               blank=True, 
        
               null=True, 
        
               default=None, 
        
               editable=False, 
        
               on_delete=models.SET_NULL, 
        
           )

awx/awx/main/models/unified_jobs.py

Lines 126 to 137 in fb8fadc

    
           last_job = models.ForeignKey( 
        
               'UnifiedJob', 
        
               null=True, 
        
               default=None, 
        
               editable=False, 
        
               related_name='%(class)s_as_last_job+', 
        
               on_delete=models.SET_NULL, 
        
           ) 
        
           last_job_failed = models.BooleanField( 
        
               default=False, 
        
               editable=False, 
        
           )

awx/awx/main/models/unified_jobs.py

Lines 118 to 125 in fb8fadc

    
           current_job = models.ForeignKey( 
        
               'UnifiedJob', 
        
               null=True, 
        
               default=None, 
        
               editable=False, 
        
               related_name='%(class)s_as_current_job+', 
        
               on_delete=models.SET_NULL, 
        
           )

There is also:

awx/awx/main/models/unified_jobs.py

Lines 138 to 151 in fb8fadc

    
           last_job_run = models.DateTimeField( 
        
               null=True, 
        
               default=None, 
        
               editable=False, 
        
           ) 
        
           # on_missed_schedule = models.CharField( 
        
           #    max_length=32, 
        
           #    choices=[], 
        
           # ) 
        
           next_job_run = models.DateTimeField( 
        
               null=True, 
        
               default=None, 
        
               editable=False, 
        
           )

But that starts to get into schedules and I think requires further thinking.

For last_job_host_summary, last_job, current_job, and last_job_failed I propose we are doing more harm than good by storing these on the model. This is because job templates can have many jobs launched from them, and hosts can have many jobs running against them.

This causes us to run into potential deadlocks when multiple processes attempt to update same attribute on single hosts and job templates. I have personally observed this when testing with high volumes of small jobs against a few hosts. Increasing number of callback receivers increases possibility of running into this.

Problem areas:

awx/awx/main/models/events.py

Lines 588 to 601 in fb8fadc

    
           # update the last_job_id and last_job_host_summary_id 
        
           # in single queries 
        
           host_mapping = dict((summary['host_id'], summary['id']) for summary in JobHostSummary.objects.filter(job_id=job.id).values('id', 'host_id')) 
        
           updated_hosts = set() 
        
           for h in all_hosts: 
        
               # if the hostname *shows up* in the playbook_on_stats event 
        
               if h.name in hostnames: 
        
                   h.last_job_id = job.id 
        
                   updated_hosts.add(h) 
        
               if h.id in host_mapping: 
        
                   h.last_job_host_summary_id = host_mapping[h.id] 
        
                   updated_hosts.add(h) 
        
           Host.objects.bulk_update(list(updated_hosts), ['last_job_id', 'last_job_host_summary_id'], batch_size=100)

awx/awx/main/signals.py

Lines 286 to 306 in fb8fadc

    
           def _update_host_last_jhs(host): 
        
               jhs_qs = JobHostSummary.objects.filter(host__pk=host.pk) 
        
               try: 
        
                   jhs = jhs_qs.order_by('-job__pk')[0] 
        
               except IndexError: 
        
                   jhs = None 
        
               update_fields = [] 
        
               try: 
        
                   last_job = jhs.job if jhs else None 
        
               except Job.DoesNotExist: 
        
                   # The job (and its summaries) have already been/are currently being 
        
                   # deleted, so there's no need to update the host w/ a reference to it 
        
                   return 
        
               if host.last_job != last_job: 
        
                   host.last_job = last_job 
        
                   update_fields.append('last_job') 
        
               if host.last_job_host_summary != jhs: 
        
                   host.last_job_host_summary = jhs 
        
                   update_fields.append('last_job_host_summary') 
        
               if update_fields: 
        
                   host.save(update_fields=update_fields)

awx/awx/main/models/unified_jobs.py

Lines 815 to 826 in fb8fadc

    
           if parent_instance: 
        
               if self.status in ('pending', 'waiting', 'running'): 
        
                   if parent_instance.current_job != self: 
        
                       parent_instance_set('current_job', self) 
        
                   # Update parent with all the 'good' states of it's child 
        
                   if parent_instance.status != self.status: 
        
                       parent_instance_set('status', self.status) 
        
               elif self.status in ('successful', 'failed', 'error', 'canceled'): 
        
                   if parent_instance.current_job == self: 
        
                       parent_instance_set('current_job', None) 
        
                   parent_instance_set('last_job', self) 
        
                   parent_instance_set('last_job_failed', self.failed)

Proposed alternative:
These values can be quickly calculated in the serializer or as a property on the job, the query would be quite cheap, something along lines of (psuedocode)

@property
def last_job(self):
    return UnifiedJobs.objects.filter(job_template_id=self.id, status__in=['successful', 'failed', 'error', 'canceled']).latest()

This uses https://docs.djangoproject.com/en/dev/ref/models/querysets/#latest but there is equivalent ways to do it if thats not available in our version of django

Select the relevant components

Steps to reproduce

One way to force this would be to run many (hundreds or thousands) of small jobs from 1 job template (allow concurrent jobs) (have playbook have just 1 task) against 1 host. Increase the number of callback receiver workers to exacerbate.

Current results

Deadlocks

Sugested feature result

Calculate these values on read
Should not show in the list view as that would be expensive (only show in detail view)

Additional information

No response

The text was updated successfully, but these errors were encountered:

AlanCoding · 2023-06-20T18:42:38Z

I am trying to muck with these fields already in #13553 for performance reasons, and have asked around before about getting performance validation of this to clear it for merge.

The issue that motivated that change was about deadlocks, but in that case it was the task manager acting on an approval job and deadlocking deterministically, not as a race condition. The full reason was not identified, but I believe it was a state inconsistency in the task manager inside of a single run, probably as multiple approval jobs are changing state at the same time. The thing which was so frustrating was the the model .save code was doing a tremendous number of round-trips to the database to load objects which should already be present, but programmers do this defensively in distrust of the current model, not realizing that it can later on lead to even more insidious problems because changes which should be no-op are not identified correctly, and transactional conflicts arise on the database layer.

These values can be quickly calculated in the serializer or as a property on the job

While I agree something should be done, this is a non-starter because of the consequences for serializer performance. The challenge isn't so-much Django, but the django-polymorphic library we use. The query you give is more-or-less impossible to prefetch, as we have ~6 types, and we would hit all those tables to prefetch (even if the notorious jazzband/django-polymorphic#198 had a clean resolution), but we have on the order of a dozen similar fields, making the benefit marginal or negative compared to no optimization.

I would love to grant our API the freedom to remove fields like these, but there is a strong user experience case to be made for a field like last_job, as the user needs to be able to see the current status of templates in the list. More importantly, we have failed to recapture performance gains from things the UI has already removed, specifically summary_fields.recent_jobs.

The branch I had here was specifically put together to combine multiple non-interference improvements for model updates. I believe this is addressing the concerns you cite, but it is not attempting to make any change to the existing API contract, instead doing only purely internal changes:

https://github.com/ansible/awx/compare/devel...AlanCoding:awx:perf_pool?expand=1

github-actions bot added component:api needs_triage type:enhancement labels Jun 20, 2023

djyasin removed the needs_triage label Jun 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not store frequently updated foreign key relation on model #14149

Do not store frequently updated foreign key relation on model #14149

kdelee commented Jun 20, 2023 •

edited

Loading

AlanCoding commented Jun 20, 2023

Do not store frequently updated foreign key relation on model #14149

Do not store frequently updated foreign key relation on model #14149

Comments

kdelee commented Jun 20, 2023 • edited Loading

Please confirm the following

Feature type

Feature Summary

Select the relevant components

Steps to reproduce

Current results

Sugested feature result

Additional information

AlanCoding commented Jun 20, 2023

kdelee commented Jun 20, 2023 •

edited

Loading