
Investigate performance based on Sprint presentation #603

Open
libpitt opened this issue Feb 4, 2025 · 5 comments · May be fixed by #657
@libpitt
Contributor

libpitt commented Feb 4, 2025

https://docs.google.com/presentation/d/1FbRHX-VOrvvlKQ66Dvtxub4NJUMLjRB0gdNJOYbbjuY/edit?usp=sharing

@libpitt libpitt added this to CODCC Feb 4, 2025
@libpitt libpitt converted this from a draft issue Feb 4, 2025
@maxsibilla maxsibilla moved this from Backlog to Ready in CODCC Feb 4, 2025
@libpitt
Contributor Author

libpitt commented Feb 14, 2025

  • Speed up get_dataset_title
  • Consolidate schema_neo4j_queries.get_children; there is an app_neo4j_queries.get_children with the exact same query (see the sketch after this list)
  • In trigger.get_normalized_upload_datasets, perhaps not all Dataset fields are needed; find out
  • In trigger.set_dataset_sources, perhaps not all Source fields are needed; find out
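A minimal sketch of the get_children consolidation, assuming the two helpers take the same arguments (the exact signatures here are an assumption, not verified against the codebase):

```python
# schema_neo4j_queries.py (sketch)
# Instead of maintaining a second copy of the same Cypher query,
# delegate to the single implementation in app_neo4j_queries.
import app_neo4j_queries


def get_children(neo4j_driver, uuid, property_key=None):
    # Assumed signature; adjust to whatever the real helpers accept
    return app_neo4j_queries.get_children(neo4j_driver, uuid, property_key)
```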

@libpitt
Contributor Author

libpitt commented Feb 14, 2025

If we ever want to speed up the time triggers take, a good place to start would be generate_triggered_data. Since properties_to_skip determines the target properties, the loop should actually be run from that perspective.

Before the call to generate_triggered_data, if it is an exclude action, the excluded fields should be filtered out first. Then, instead of properties = get_entity_properties(schema_section, normalized_class) yielding the entity's entire schema list, properties becomes only the properties to include (unless, of course, there is nothing to exclude). A rough sketch of such a filter follows.
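A minimal sketch of that pre-filter, assuming properties comes back as a dict keyed by property name (the helper name filter_target_properties is hypothetical):

```python
def filter_target_properties(properties: dict, properties_to_skip: list) -> dict:
    """Keep only the properties that still need trigger evaluation.

    `properties` is what get_entity_properties(schema_section, normalized_class)
    would return; `properties_to_skip` is the caller's exclude list.
    """
    if not properties_to_skip:
        # Nothing excluded, so the full schema list is still the target set
        return properties
    return {name: defn for name, defn in properties.items()
            if name not in properties_to_skip}
```

The per-Dataset loop inside generate_triggered_data would then iterate over this reduced dict instead of the full schema list.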

Why?
Because if we have, say, 100 Datasets (call that n), then to handle these 100 Dataset triggers we loop through all m Dataset properties for every Dataset. If we have some properties to exclude, call that x, we can improve the time a bit, or by a lot.

Without anything to exclude we have O(n*m). If we reduce that to O(n*(m-x)), it's still quadratic. But if x = m-1, meaning only 1 property is left to include, then we end up with O(n). That's linear.

Let's use real numbers:

`n = 100`
`m = 49` (a rough count of the current number of Dataset properties)

  • O(n*m) = O(100*49) = 4900 iterations (sometimes a trigger might even have loops of its own, so this is the best case)

  • If we actually only want 2 triggers calculated, meaning 47 excluded, we end up with
    O(n*(m-x)) = O(100*(49-47)) = O(100*2) = 200 iterations

  • And as said, if only 1 trigger field needs to be included, we have O(100*1) = 100 iterations. Linear time.

So instead of going through the entire list of properties every time, we should filter out the unneeded ones, such as non-trigger fields and trigger fields that are not needed for the response.

cc @maxsibilla @tjmadonna @yuanzhou

@yuanzhou
Member

Thanks @libpitt. It'll be nice to confirm the improvements with some code profiling.

@libpitt
Contributor Author

libpitt commented Feb 17, 2025

> Thanks @libpitt. It'll be nice to confirm the improvements with some code profiling.

@yuanzhou That would be neat. What's the tool you used in your report?

@yuanzhou
Member

@libpitt SnakeViz is the tool I used to visualize the results; you can also use Graphviz. There are various packages for the actual profiling; Python includes a built-in profiler called cProfile. A quick example is sketched below.
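For instance, a minimal cProfile + SnakeViz workflow might look like this (profile_me and the .prof filename are placeholders, not anything from the codebase):

```python
import cProfile
import pstats


def profile_me():
    # Placeholder for the code under investigation, e.g. a call that
    # exercises get_dataset_title or the trigger loop
    sum(i * i for i in range(1_000_000))


profiler = cProfile.Profile()
profiler.enable()
profile_me()
profiler.disable()

# Write stats that SnakeViz can visualize:  snakeviz triggers.prof
profiler.dump_stats("triggers.prof")

# Or print the top offenders straight to the console
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```

The same data can also be collected without code changes via `python -m cProfile -o triggers.prof your_script.py` and then opened with `snakeviz triggers.prof`.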

@libpitt libpitt linked a pull request Mar 6, 2025 that will close this issue
libpitt added a commit that referenced this issue Mar 6, 2025