
Reduce redundant validations on metadata instantiation #249

Open · wants to merge 5 commits into base: patches

Conversation

@gabelepoudre commented Jan 9, 2025

This is my first PR in this repo, so please do point out any deficiencies in the request. I am always open to discussion!

Background

  • In using MTH5, and by extension this library, we have found some performance issues. This has been most noticeable in our airborne processing, in which the number of "stations" can greatly exceed what would be normal in ground MT.
  • Despite some local hacks in our processing that greatly reduce data duplication in the airborne case by storing only the time indices (2 numbers) for every run in a station (as it relates to a flight time series), we noticed a surprising performance hit in MTH5's groups.run functions to_runts and from_runts. In these, one of the notable (but perhaps not the largest) costs is the time spent generating metadata objects. In a light investigation, some of this slowness comes from repeatedly re-validating the attr_dict objects on each instantiation of the metadata, even though they have already been validated on first import.

Changes

  • Adds an argument skip_validation (default False) that traces from metadata.Base._set_attr_dict() through to the actual attribute setting on the base_object in helpers.recursive_split_setattr(); a rough sketch of the idea follows this list. As "validation" also includes some amount of formatting, parsing, etc., this argument is only safe if you know that the incoming property or properties have already been validated. Fortunately, this seems to always be the case when any subclass of metadata.Base calls super().__init__(), as the global-scope JSON parsing performs this same validation when building each attr_dict. Therefore, we can skip re-validating the entire base schema on every instantiation of a subclass.
  • Minor .gitignore update to avoid committing PyCharm files to the repo
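
To illustrate the idea, here is a rough sketch (not the actual mt_metadata implementation; the signature and body of recursive_split_setattr are simplified here):

def recursive_split_setattr(base_object, name, value, skip_validation=False):
    """Set a possibly dotted attribute name, optionally bypassing validation."""
    if "." in name:
        # walk down to the nested object that owns the final attribute
        head, rest = name.split(".", 1)
        recursive_split_setattr(getattr(base_object, head), rest, value, skip_validation)
    elif skip_validation:
        # the value came from an attr_dict that was already validated at import
        # time, so store it directly and avoid re-parsing/re-formatting
        object.__setattr__(base_object, name, value)
    else:
        # normal path: goes through the validating setter
        setattr(base_object, name, value)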

Impact

  • Using the code attached at the bottom of this PR description, the median time to instantiate a Station() decreased on my machine from 0.0012488 s to 0.0003102 s, an improvement of roughly 4x.
  • I don't expect this to be a huge improvement to most processing flows, but in my (admittedly imperfect) survey of the slow MTH5 code, there is a surprising number of instantiations, especially in those schemas that have many sub-schemas, sub-sub-schemas, etc.
  • This change sits in about as load-bearing a location as it gets and is expected to impact all consumers of the library. Appropriate caution should be taken.

Testing

  • I have had some difficulty getting tests to run on my local machine, even on unchanged code. I will be monitoring the GitHub Actions on this PR, and if issues come up I plan to put more effort into replicating them locally.
  • Our airborne processing code has been able to run to completion for ~500 "stations" after the modifications to mt_metadata. This was done before a couple of cleanup commits, but aside from those, the changes are at least not fundamentally broken for MTH5.
  • The existing base of tests should do a good job of testing this addition, due to where the change is and what it affects. A comprehensive test would be extremely difficult... but I am open to implementing any tests that could tease out potential edge cases, if requested.

timing code:

import time
import numpy as np
from mt_metadata import timeseries

"""
0.0003101998008787632
0.0012488001957535744
"""

if __name__ == '__main__':
    x = 50000
    times = []
    for i in range(x):
        t0 = time.perf_counter()
        station = timeseries.Station()
        t1 = time.perf_counter()
        times.append(t1 - t0)
    m = np.median(times)
    print(f"Station(): {m} seconds")

@gabelepoudre (Author) commented

Thank you for running QA. From a quick survey the issue seems related to "_"-prefixed variables, which may be handled in a slightly different way. I will look into this.

@kujaku11 (Owner) commented Jan 24, 2025

@gabelepoudre This is great. I'll admit that when I wrote this I wasn't focused on speed, and there can definitely be a number of speed-ups. I could see how having a "skip_validation" attribute would be beneficial for removing any additional reads.

I think going through how things are initialized will probably fix some of the tests. I can have a look this weekend.

Other thoughts would be to use Pydantic for validation instead of the current base code, or to initialize in a different way, like reading in a single JSON. Not sure what the optimum way of initializing/validating is.
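
As a point of reference, a minimal sketch of what a Pydantic-backed schema could look like (the field names here are illustrative, not the actual mt_metadata schema):

from pydantic import BaseModel, Field

class Station(BaseModel):
    id: str = ""
    latitude: float = Field(default=0.0, ge=-90.0, le=90.0)
    longitude: float = Field(default=0.0, ge=-180.0, le=180.0)

# validation happens once, at construction, and raises on bad input
station = Station(latitude=45.0)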

@gabelepoudre (Author) commented

@kujaku11 Sorry that this work hasn't moved very quickly; some other work (still speed related) is currently taking priority.

To clarify, the slowdown addressed by this PR is likely the most significant performance hit in mt_metadata right now, as far as I can tell. I can't comment on what is optimal, but as for better, there are a couple of things that I noted/tried while looking at this:

  • One of the disadvantages of the object-based (not class-based) schema objects is this inherent slowdown. Even though the approach taken in this PR speeds up the attribute setting, there is still a (small) cost to reconstructing these objects more than once. This can almost be remedied at runtime (something I attempted) by having the first-time initialization in Base optionally use setattr on the class rather than on the object. In effect, the class would then define the default metadata: e.g. Station.some_field would hold the default value, which any particular Station() object could overwrite, but each instance would inherit some_field on instantiation, so Station's __init__ would no longer need to pass an attr_dict at all. However, this becomes a problem in the nested case, where one metadata's __init__ also instantiates other metadata objects. Setting those attributes on the class results in a single sub-metadata object shared across all instances, which is obviously not ideal (e.g. every Location would share one Latitude and one Longitude instance); see the first sketch after this list. This could be resolved with a bit more focused work, possibly with an isinstance check against Base to ensure sub-metadata is attached to each instance rather than the class, while the class keeps a reference to the sub-metadata's class (NOT an object), so that Location.latitude.hemisphere would refer to the default latitude.hemisphere. While technically compatible with the skip_validation changes here, this would have only a minor impact on speed, as it serves mostly the same purpose as the skip_validation field: deduplicating the init attributes.
  • Another notable performance hit is one-time, but would still be considered non-ideal behaviour. Importing mt_metadata.timeseries initializes all metadata in a non-lazy, order-dependent way. This includes reading all of the JSON files as well as the first-time validation/parsing of each attr as it goes into the respective attr_dict objects. This takes a surprising ~2 seconds on my laptop at import time, which can be very annoying for short scripts that may not even use the metadata directly, but import from a module whose functions do. The inter-dependence of the objects makes this complex to resolve, but it would be better to do on-use lazy initialization, so that a definition JSON is not read until the corresponding object is first initialized; see the second sketch after this list. The performance hit would then only occur when these objects are actually used, and if you only use 50% of the metadata, only 50% is loaded during runtime. Further initializations should piggyback off the first, so that you only pay that cost once.
  • One (very painful to remedy!) bit of "technical debt" is that the "validation" functions also perform some hard-to-track transformations of the input data. In an ideal world, "validate" methods would only be responsible for type-checking, bounds-checking, and value-checking, and setattr would also go through a transform()/parse() method; see the third sketch after this list. One benefit of this would be a (heavily discouraged) switch to disable validation altogether, or in parts, to optionally speed up known-safe code. Another would be the option to override a given transform()/parse() method for custom behaviour on special inputs. I'm not sure I really recommend making this change; I just wanted to note it while it is being discussed.
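
First sketch: the shared sub-metadata problem with class-level defaults (hypothetical classes, not the actual mt_metadata objects):

class Latitude:
    def __init__(self):
        self.value = 0.0

class Location:
    pass

# scalar defaults on the class are safe: assignment rebinds on the instance
Location.datum = "WGS84"

# nested defaults on the class are NOT safe: the object is created once
# and shared by every instance
Location.latitude = Latitude()

a, b = Location(), Location()
a.datum = "NAD83"
a.latitude.value = 45.0

print(b.datum)           # "WGS84" -- still the default, as expected
print(b.latitude.value)  # 45.0    -- the "default" leaked across instances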
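
Second sketch: on-use lazy loading via a PEP 562 module-level __getattr__ (the module name, file names, and helper here are hypothetical, not the current mt_metadata layout):

# hypothetical schema_registry.py
import json
from pathlib import Path

_SCHEMA_FILES = {"station": Path("station.json"), "run": Path("run.json")}
_CACHE = {}

def _load_schema(path):
    with open(path) as fid:
        return json.load(fid)

def __getattr__(name):
    # the JSON is only read (and parsed/validated) the first time it is requested
    if name in _SCHEMA_FILES:
        if name not in _CACHE:
            _CACHE[name] = _load_schema(_SCHEMA_FILES[name])
        return _CACHE[name]
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")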
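
Third sketch: separating transformation from validation in the attribute setter (again hypothetical, not the current Base implementation):

class Base:
    skip_validation = False  # heavily discouraged escape hatch for known-safe code

    def __setattr__(self, name, value):
        value = self._transform(name, value)  # normalization/parsing always runs
        if not self.skip_validation:
            self._validate(name, value)       # pure checks, no mutation
        super().__setattr__(name, value)

    def _transform(self, name, value):
        # e.g. cast strings to the declared type; override for special inputs
        return value

    def _validate(self, name, value):
        # e.g. type-, bounds-, and value-checks; raise on failure
        pass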

kujaku11 added a commit that referenced this pull request Feb 7, 2025
@kujaku11 mentioned this pull request Feb 7, 2025