Potential speedups in pipeline runtime #35
The issue was solved by adding a dictionary which stores the scanline, ellipse, and convex hull results for each plant (PR #44).

The testing issue was fixed by adding a function that appends results to the cache dictionary for the test functions of the different datasets.

This saves more than half of the computation time (2.781 s vs. 6.589 s for 5 rice plants on my desktop)!
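That per-plant cache could be sketched roughly like this (the `get_cached` helper and all names here are hypothetical stand-ins, not the actual code from PR #44):

```python
# Per-plant cache: plant name -> {intermediate name -> result}.
cache = {}

def get_cached(plant, key, compute):
    """Return the cached intermediate for a plant, computing it once if missing."""
    plant_cache = cache.setdefault(plant, {})
    if key not in plant_cache:
        plant_cache[key] = compute()
    return plant_cache[key]

calls = []

def expensive_hull():
    # Stand-in for a convex hull / ellipse / scanline computation.
    calls.append(1)
    return "hull"

first = get_cached("rice_01", "convex_hull", expensive_hull)
second = get_cached("rice_01", "convex_hull", expensive_hull)  # cache hit
```

With this shape, every derived trait that needs the hull for `rice_01` reuses one computation instead of recomputing it per feature.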
We should refactor the traits pipeline to actually use the trait map as a computation graph. Currently it just serves as metadata, but we don't use the graph structure at all -- we're in fact recomputing intermediate traits everywhere, since all of the inputs to all of the traits are just the original primary and/or lateral points.

In pseudo-code:

```python
trait_computation_order = ...  # Find breadth-first ordering of trait graph.

# Initialize traits container with initial points.
traits = {"pts": pts}

# Compute each trait.
for trait_name in trait_computation_order:
    fn, input_traits, kwargs = trait_map[trait_name]
    fn_outputs = fn(*[traits[input_trait] for input_trait in input_traits], **kwargs)
    traits[trait_name] = fn_outputs
```

Re-define the traits map so it's formatted as:

```python
{
    "trait_name": (function, ["input_trait1", "input_trait2", ...], {"additional_kwarg1": True, "additional_kwarg2": 0.5})
}
```

For example, for a derived convhull feature, we would specify it as:

```python
{
    "chull_area": (get_chull_area, ["convex_hull"], {}),
}
```

This indicates that the `chull_area` trait is computed by calling `get_chull_area` with the `convex_hull` trait as its only input and no fixed keyword arguments. The keyword arguments let us pass in fixed inputs that don't depend on the trait graph:

```python
{
    "scanline_intersection_counts": (
        count_scanline_intersections,
        ["primary_pts", "lateral_pts"],  # these come from traits dict
        {"height": 1080, "width": 2048, "n_line": 50, "monocots": monocots}  # these are fixed inputs that don't depend on the trait graph
    ),
}
```

This format would also make it easy to infer the edges of the graph automatically:

```python
edges = []
for output_trait, (_, input_traits, _) in trait_map.items():
    for input_trait in input_traits:
        edges.append((input_trait, output_trait))
```

Putting it all together:

```python
# Define trait map.
trait_map = {
    # Ignore these if precomputed already:
    # "primary_pts": (get_primary_pts, ["pts"], {}),
    # "lateral_pts": (get_lateral_pts, ["pts"], {}),
    # ...
    "scanline_intersection_counts": (
        count_scanline_intersections,
        ["primary_pts", "lateral_pts"],  # these come from traits dict
        {"height": 1080, "width": 2048, "n_line": 50, "monocots": monocots}  # these are fixed inputs that don't depend on the trait graph
    ),
    # ...
    "chull_area": (get_chull_area, ["convex_hull"], {}),
    # ...
}

# Initialize edges with precomputed top-level traits.
edges = [("pts", "primary_pts"), ("pts", "lateral_pts")]

# Infer edges from trait map.
for output_trait, (_, input_traits, _) in trait_map.items():
    for input_trait in input_traits:
        edges.append((input_trait, output_trait))

# Compute breadth-first ordering.
G = nx.DiGraph()
G.add_edges_from(edges)
trait_computation_order = [dst for (src, dst) in list(nx.bfs_tree(G, "pts").edges())[2:]]

# Initialize traits container with initial points.
traits = {"primary_pts": primary_pts, "lateral_pts": lateral_pts}

# Compute each trait.
for trait_name in trait_computation_order:
    if trait_name in traits:
        # Ignore traits that are already computed.
        continue
    fn, input_traits, kwargs = trait_map[trait_name]
    fn_outputs = fn(*[traits[input_trait] for input_trait in input_traits], **kwargs)
    traits[trait_name] = fn_outputs
```
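As a sanity check, here's a self-contained toy version of the proposed graph-ordered pipeline (the lambdas are stand-ins for the real trait functions, and the tiny trait map is hypothetical):

```python
import networkx as nx

# Toy trait map in the proposed format: name -> (function, input traits, fixed kwargs).
trait_map = {
    "convex_hull": (lambda pts: ("hull", pts), ["primary_pts"], {}),
    "chull_area": (lambda hull, scale=1.0: 21.0 * scale, ["convex_hull"], {"scale": 2.0}),
}

# Initialize edges with precomputed top-level traits, then infer the rest.
edges = [("pts", "primary_pts"), ("pts", "lateral_pts")]
for output_trait, (_, input_traits, _) in trait_map.items():
    for input_trait in input_traits:
        edges.append((input_trait, output_trait))

# Breadth-first ordering from the root, skipping the two precomputed edges.
G = nx.DiGraph()
G.add_edges_from(edges)
trait_computation_order = [dst for (src, dst) in list(nx.bfs_tree(G, "pts").edges())[2:]]

# Run the pipeline: each trait is computed exactly once, in dependency order.
traits = {"primary_pts": (0, 1), "lateral_pts": (2, 3)}
for trait_name in trait_computation_order:
    if trait_name in traits:
        continue
    fn, input_traits, kwargs = trait_map[trait_name]
    traits[trait_name] = fn(*[traits[t] for t in input_traits], **kwargs)
```

One caveat: a breadth-first ordering is only safe when every trait's inputs sit at a shallower BFS depth than the trait itself; for arbitrary DAGs, `nx.topological_sort(G)` gives the general guarantee.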
btw, pls rename
Doing some profiling, we see that we may be able to shave off ~20-40% by reducing repeated calls to convex hull and ellipse fitting routines.
Here's a profiling script:
This runs the pipeline without saving the CSVs for 5 plants.
You can profile by saving the above script to `get_traits.py` and running it under a profiler.
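The repo's actual profiling commands aren't shown here, but a `cProfile` run along these lines would surface the hot spots (the `slow_fit` function below is a stand-in, not the pipeline's real code):

```python
import cProfile
import io
import pstats

def slow_fit(pts):
    # Stand-in for an expensive routine like fit_ellipse / get_convhull.
    return sum(p * p for p in pts)

profiler = cProfile.Profile()
profiler.enable()
for _ in range(1000):
    slow_fit(range(100))
profiler.disable()

# Print the top entries sorted by cumulative time to find repeated calls.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Equivalently, from the shell: `python -m cProfile -s cumtime get_traits.py`.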
Here are the relevant parts:
The solution would be to cache the results from functions like `get_convhull`, `fit_ellipse`, etc. that are called repeatedly for each derived feature. I appreciate that it might be annoying to save those in the `data` dict in the graph pipeline, since they're not serializable later in a dataframe. One solution could be to use `functools.lru_cache`. This would work like:

Then, the next time `get_convhull` is called with the same arguments in the pipeline, it'll just return the cached results without recomputing them. If it works, it's a lot less work than refactoring the graph pipeline to carry the cached results around with the `data` dict :)