
Investigate performance based on Sprint presentation #603

Open
libpitt opened this issue Feb 4, 2025 · 5 comments · May be fixed by #657
@libpitt
Contributor

libpitt commented Feb 4, 2025

https://docs.google.com/presentation/d/1FbRHX-VOrvvlKQ66Dvtxub4NJUMLjRB0gdNJOYbbjuY/edit?usp=sharing

@libpitt libpitt added this to CODCC Feb 4, 2025
@libpitt libpitt converted this from a draft issue Feb 4, 2025
@maxsibilla maxsibilla moved this from Backlog to Ready in CODCC Feb 4, 2025
@libpitt
Contributor Author

libpitt commented Feb 14, 2025

  • Speed up get_dataset_title
  • Consolidate schema_neo4j_queries.get_children; there is an app_neo4j_queries.get_children with the exact same query (see the sketch after this list)
  • In trigger.get_normalized_upload_datasets, perhaps not all Dataset fields are needed; find out
  • In trigger.set_dataset_sources, perhaps not all Source fields are needed; find out
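A minimal sketch of the get_children consolidation, assuming the two helpers take the same arguments (the exact signatures here are an assumption, not verified against the codebase):

```python
# schema_neo4j_queries.py (sketch)
# Instead of maintaining a second copy of the same Cypher query,
# delegate to the single implementation in app_neo4j_queries.
import app_neo4j_queries


def get_children(neo4j_driver, uuid, property_key=None):
    # Assumed signature; adjust to whatever the real helpers accept
    return app_neo4j_queries.get_children(neo4j_driver, uuid, property_key)
```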

@libpitt
Contributor Author

libpitt commented Feb 14, 2025

If we ever want to speed up the time triggers take, a good place to start would be generate_triggered_data. Since properties_to_skip determines the target properties, the loop should actually be run from that perspective.

Before the call to generate_triggered_data, if it is an exclude action, the excluded fields should be filtered out first. Then, instead of properties = get_entity_properties(schema_section, normalized_class) yielding the entity's entire schema list, properties becomes only the properties to include (unless, of course, there is nothing to exclude). A rough sketch of such a filter follows.
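A minimal sketch of that pre-filter, assuming properties comes back as a dict keyed by property name (the helper name filter_target_properties is hypothetical):

```python
def filter_target_properties(properties: dict, properties_to_skip: list) -> dict:
    """Keep only the properties that still need trigger evaluation.

    `properties` is what get_entity_properties(schema_section, normalized_class)
    would return; `properties_to_skip` is the caller's exclude list.
    """
    if not properties_to_skip:
        # Nothing excluded, so the full schema list is still the target set
        return properties
    return {name: defn for name, defn in properties.items()
            if name not in properties_to_skip}
```

The per-Dataset loop inside generate_triggered_data would then iterate over this reduced dict instead of the full schema list.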

Why?
Because if we have, say, 100 Datasets (call that n), then to handle these 100 Dataset triggers we loop through all m Dataset properties for every Dataset. If we have some properties to exclude, call that x, we can improve the time a bit, or by a lot.

Without anything to exclude we have O(n*m). If we reduce that to O(n*(m-x)), it's still quadratic. But if x = m-1, meaning only 1 property is left to include, then we end up with O(n). That's linear.

Let's use real numbers:

`n = 100`
`m = 49` (a rough count of the current number of Dataset properties)

  • O(n*m) = O(100*49) = 4900 iterations (sometimes a trigger might even have loops of its own, so this is the best case)

  • If we actually only want 2 triggers calculated, meaning 47 excluded, we end up with
    O(n*(m-x)) = O(100*(49-47)) = O(100*2) = 200 iterations

  • And as said, if only 1 trigger field needs to be included, we have O(100*1) = 100 iterations. Linear time.

So instead of going through the entire list of properties every time, we should filter out the unneeded ones, such as non-trigger fields and trigger fields that are not needed for the response.

cc @maxsibilla @tjmadonna @yuanzhou

@yuanzhou
Member

Thanks @libpitt. It'll be nice to confirm the improvements with some code profiling.

@libpitt
Contributor Author

libpitt commented Feb 17, 2025

> Thanks @libpitt. It'll be nice to confirm the improvements with some code profiling.

@yuanzhou That would be neat. What's the tool you used in your report?

@yuanzhou
Member

@libpitt SnakeViz is the tool I used to visualize the results; you can also use Graphviz. There are various packages for the actual profiling; Python includes a built-in profiler called cProfile. A quick example is sketched below.
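For instance, a minimal cProfile + SnakeViz workflow might look like this (profile_me and the .prof filename are placeholders, not anything from the codebase):

```python
import cProfile
import pstats


def profile_me():
    # Placeholder for the code under investigation, e.g. a call that
    # exercises get_dataset_title or the trigger loop
    sum(i * i for i in range(1_000_000))


profiler = cProfile.Profile()
profiler.enable()
profile_me()
profiler.disable()

# Write stats that SnakeViz can visualize:  snakeviz triggers.prof
profiler.dump_stats("triggers.prof")

# Or print the top offenders straight to the console
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```

The same data can also be collected without code changes via `python -m cProfile -o triggers.prof your_script.py` and then opened with `snakeviz triggers.prof`.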

@libpitt libpitt linked a pull request Mar 6, 2025 that will close this issue
libpitt added a commit that referenced this issue Mar 6, 2025