
Reduce redundant validations on metadata instantiation #249

Open · wants to merge 5 commits into base: patches

Conversation

@gabelepoudre commented Jan 9, 2025

This is my first PR in this repo, so please do point out any deficiencies in the request. I am always open to discussion!

Background

  • In using MTH5, and by extension this library, we have found some performance issues. This has been most noticeable in our airborne processing, in which the number of "stations" can greatly exceed what would be normal in ground MT.
  • Despite some local hacks in our processing that greatly reduce data duplication in the airborne case by storing only the time indices (2 numbers) for every run in a station (as it relates to a flight time series), we noticed a surprising performance hit in MTH5's groups.run functions to_runts and from_runts. In these, one of the notable (but perhaps not the largest) costs is the time spent generating metadata objects. In a light investigation, some of this slowness comes from repeatedly re-validating the attr_dict objects on each instantiation of the metadata, even though they have already been validated on first import.

Changes

  • Adds an argument skip_validation (default False) that traces from metadata.Base._set_attr_dict() through to the actual attribute setting on the base_object in helpers.recursive_split_setattr(); a rough sketch of the idea follows this list. As "validation" also includes some amount of formatting, parsing, etc., this argument is only safe if you know that the incoming property or properties have already been validated. Fortunately, this seems to always be the case when any subclass of metadata.Base calls super().__init__(), as the global-scope JSON parsing performs this same validation when building each attr_dict. Therefore, we can skip re-validating the entire base schema on every instantiation of a subclass.
  • Minor .gitignore update to avoid committing PyCharm files to the repo
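
To illustrate the idea, here is a rough sketch (not the actual mt_metadata implementation; the signature and body of recursive_split_setattr are simplified here):

def recursive_split_setattr(base_object, name, value, skip_validation=False):
    """Set a possibly dotted attribute name, optionally bypassing validation."""
    if "." in name:
        # walk down to the nested object that owns the final attribute
        head, rest = name.split(".", 1)
        recursive_split_setattr(getattr(base_object, head), rest, value, skip_validation)
    elif skip_validation:
        # the value came from an attr_dict that was already validated at import
        # time, so store it directly and avoid re-parsing/re-formatting
        object.__setattr__(base_object, name, value)
    else:
        # normal path: goes through the validating setter
        setattr(base_object, name, value)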

Impact

  • Using the code attached at the bottom of this PR description, the median time to instantiate a Station() decreased on my machine from 0.0012488 s to 0.0003102 s, an improvement of roughly 4x.
  • I don't expect this to be a huge improvement to most processing flows, but in my (admittedly imperfect) survey of the slow MTH5 code, there is a surprising number of instantiations, especially in those schemas that have many sub-schemas, sub-sub-schemas, etc.
  • This change sits in about as load-bearing a location as it gets and is expected to impact all consumers of the library. Appropriate caution should be taken.

Testing

  • I have had some difficulty getting tests to run on my local machine, even on unchanged code. I will be monitoring the GitHub Actions on this PR, and if issues come up I plan to put more effort into replicating them locally.
  • Our airborne processing code has been able to run to completion for ~500 "stations" after the modifications to mt_metadata. This was done before a couple of cleanup commits, but aside from those, the changes are at least not fundamentally broken for MTH5.
  • The existing base of tests should do a good job of testing this addition, due to where the change is and what it affects. A comprehensive test would be extremely difficult... but I am open to implementing any tests that could tease out potential edge cases, if requested.

timing code:

import time
import numpy as np
from mt_metadata import timeseries

"""
0.0003101998008787632
0.0012488001957535744
"""

if __name__ == '__main__':
    x = 50000
    times = []
    for i in range(x):
        t0 = time.perf_counter()
        station = timeseries.Station()
        t1 = time.perf_counter()
        times.append(t1 - t0)
    m = np.median(times)
    print(f"Station(): {m} seconds")

@gabelepoudre (Author) commented

Thank you for running QA. From a quick survey the issue seems related to "_"-prefixed variables, which may be handled in a slightly different way. I will look into this.

@kujaku11 (Owner) commented Jan 24, 2025

@gabelepoudre This is great. I'll admit that when I wrote this I wasn't focused on speed, and there can definitely be a number of speed-ups. I could see how having a "skip_validation" attribute would be beneficial for removing any additional reads.

I think going through how things are initialized will probably fix some of the tests. I can have a look this weekend.

Other thoughts would be to use Pydantic for validation instead of the current base code, or to initialize in a different way, like reading in a single JSON. Not sure what the optimum way of initializing/validating is.
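
As a point of reference, a minimal sketch of what a Pydantic-backed schema could look like (the field names here are illustrative, not the actual mt_metadata schema):

from pydantic import BaseModel, Field

class Station(BaseModel):
    id: str = ""
    latitude: float = Field(default=0.0, ge=-90.0, le=90.0)
    longitude: float = Field(default=0.0, ge=-180.0, le=180.0)

# validation happens once, at construction, and raises on bad input
station = Station(latitude=45.0)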

@gabelepoudre (Author) commented

@kujaku11 Sorry that this work hasn't moved very quickly; some other work (still speed related) is currently taking priority.

To clarify, the slowdown addressed by this PR is likely the most significant performance hit in mt_metadata right now, as far as I can tell. I can't comment on what is optimal, but as for better, there are a couple of things that I noted/tried while looking at this:

  • One of the disadvantages of the object-based (not class-based) schema objects is this inherent slowdown. Even though the approach taken in this PR speeds up the attribute setting, there is still a (small) cost to reconstructing these objects more than once. This can almost be remedied at runtime (something I attempted) by having the first-time initialization in Base optionally use setattr on the class rather than on the object. In effect, the class would then define the default metadata: e.g. Station.some_field would hold the default value, which any particular Station() object could overwrite, but each instance would inherit some_field on instantiation, so Station's __init__ would no longer need to pass an attr_dict at all. However, this becomes a problem in the nested case, where one metadata's __init__ also instantiates other metadata objects. Setting those attributes on the class results in a single sub-metadata object shared across all instances, which is obviously not ideal (e.g. every Location would share one Latitude and one Longitude instance); see the first sketch after this list. This could be resolved with a bit more focused work, possibly with an isinstance check against Base to ensure sub-metadata is attached to each instance rather than the class, while the class keeps a reference to the sub-metadata's class (NOT an object), so that Location.latitude.hemisphere would refer to the default latitude.hemisphere. While technically compatible with the skip_validation changes here, this would have only a minor impact on speed, as it serves mostly the same purpose as the skip_validation field: deduplicating the init attributes.
  • Another notable performance hit is one-time, but would still be considered non-ideal behaviour. Importing mt_metadata.timeseries initializes all metadata in a non-lazy, order-dependent way. This includes reading all of the JSON files as well as the first-time validation/parsing of each attr as it goes into the respective attr_dict objects. This takes a surprising ~2 seconds on my laptop at import time, which can be very annoying for short scripts that may not even use the metadata directly, but import from a module whose functions do. The inter-dependence of the objects makes this complex to resolve, but it would be better to do on-use lazy initialization, so that a definition JSON is not read until the corresponding object is first initialized; see the second sketch after this list. The performance hit would then only occur when these objects are actually used, and if you only use 50% of the metadata, only 50% is loaded during runtime. Further initializations should piggyback off the first, so that you only pay that cost once.
  • One (very painful to remedy!) bit of "technical debt" is that the "validation" functions also perform some hard-to-track transformations of the input data. In an ideal world, "validate" methods would only be responsible for type-checking, bounds-checking, and value-checking, and setattr would also go through a transform()/parse() method; see the third sketch after this list. One benefit of this would be a (heavily discouraged) switch to disable validation altogether, or in parts, to optionally speed up known-safe code. Another would be the option to override a given transform()/parse() method for custom behaviour on special inputs. I'm not sure I really recommend making this change; I just wanted to note it while it is being discussed.
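
First sketch: the shared sub-metadata problem with class-level defaults (hypothetical classes, not the actual mt_metadata objects):

class Latitude:
    def __init__(self):
        self.value = 0.0

class Location:
    pass

# scalar defaults on the class are safe: assignment rebinds on the instance
Location.datum = "WGS84"

# nested defaults on the class are NOT safe: the object is created once
# and shared by every instance
Location.latitude = Latitude()

a, b = Location(), Location()
a.datum = "NAD83"
a.latitude.value = 45.0

print(b.datum)           # "WGS84" -- still the default, as expected
print(b.latitude.value)  # 45.0    -- the "default" leaked across instances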
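
Second sketch: on-use lazy loading via a PEP 562 module-level __getattr__ (the module name, file names, and helper here are hypothetical, not the current mt_metadata layout):

# hypothetical schema_registry.py
import json
from pathlib import Path

_SCHEMA_FILES = {"station": Path("station.json"), "run": Path("run.json")}
_CACHE = {}

def _load_schema(path):
    with open(path) as fid:
        return json.load(fid)

def __getattr__(name):
    # the JSON is only read (and parsed/validated) the first time it is requested
    if name in _SCHEMA_FILES:
        if name not in _CACHE:
            _CACHE[name] = _load_schema(_SCHEMA_FILES[name])
        return _CACHE[name]
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")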
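
Third sketch: separating transformation from validation in the attribute setter (again hypothetical, not the current Base implementation):

class Base:
    skip_validation = False  # heavily discouraged escape hatch for known-safe code

    def __setattr__(self, name, value):
        value = self._transform(name, value)  # normalization/parsing always runs
        if not self.skip_validation:
            self._validate(name, value)       # pure checks, no mutation
        super().__setattr__(name, value)

    def _transform(self, name, value):
        # e.g. cast strings to the declared type; override for special inputs
        return value

    def _validate(self, name, value):
        # e.g. type-, bounds-, and value-checks; raise on failure
        pass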

kujaku11 added a commit that referenced this pull request Feb 7, 2025
@kujaku11 mentioned this pull request Feb 7, 2025