Missing matches in the result #1

Open
hadoth opened this issue Jul 11, 2023 · 1 comment

Comments

@hadoth

hadoth commented Jul 11, 2023

Steps to reproduce:

Set query molecule to didD@@QInUxV`@@B and run using the idorsia_toy_space_a.txt synthon space.

Expected behavior:

For Synthon_A of snar_b-25 there should be hits for four different synthons: dcLDpEtKhhbSiIf^v[hHBf@@, dcLDpEtKichYAIeY~kh@bf@@, dmtDPITKickHhdhcJz@Hf@@ and dcNDPAWPnfNdfUgzn`BJX@@ (560 in total).

Actual result:

Results for only one synthon are returned (440 in total).

Probable cause:

The break statements at lines 2763 and 2767 of SynthonSpace.java. As a result, mapped_frag always has size 1. This seems to be done on purpose, but it leads to rather unexpected behavior.
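
To illustrate the suspected mechanism, here is a minimal sketch. It is not the actual SynthonSpace.java code: the class name, the method `collectMatches`, and the `stopAtFirstHit` switch are all hypothetical and only mimic a matching loop that breaks after its first hit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for the matching loop; NOT the real SynthonSpace code.
class BreakOnFirstHitSketch {

    // Collects every synthon accepted by the predicate, or only the first one
    // when stopAtFirstHit is true (assumed to mirror the effect of the break).
    static List<String> collectMatches(List<String> synthons,
                                       Predicate<String> containsQuery,
                                       boolean stopAtFirstHit) {
        List<String> mapped = new ArrayList<>();
        for (String synthon : synthons) {
            if (containsQuery.test(synthon)) {
                mapped.add(synthon);
                if (stopAtFirstHit) {
                    break; // the result list can never grow beyond one entry
                }
            }
        }
        return mapped;
    }
}
```

With four matching synthons in the input, the loop returns a list of size 1 when `stopAtFirstHit` is true and a list of size 4 when it is false, which is the same shape as the difference between the actual and the expected result.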

@lithom
Collaborator

lithom commented Aug 29, 2023

Thanks for testing the hyperspace software thoroughly!

The observed behavior is indeed "a feature" and intentional. One of the challenges of implementing the algorithm was handling "very general" queries in a reasonable way. The reasoning behind this "break" is that if there is a complete substructure hit inside a single building block, we assume that the query is very general and will probably generate a very large number of hits. In the toy space, 500 / 37k is >1% of the complete space; in large spaces this might be millions or billions of structures, probably "more than we can easily handle in subsequent processing of the structures". The measures for handling "excessive results" are also described in the JCIM publication, in the subsection "Handling excessive enumeration of results" (it was not in the preprint but was rightly requested by one of the reviewers).
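
The scaling argument can be made concrete with a small back-of-the-envelope sketch. The numbers 500 and 37,000 are the rough figures quoted above; the size of the large space (one billion structures) and all names in the code are assumptions chosen purely for illustration.

```java
// Hypothetical helper for the scaling argument; not part of the hyperspace software.
class HitFractionSketch {

    // Fraction of the space that a query hits.
    static double hitFraction(long hits, long spaceSize) {
        return (double) hits / (double) spaceSize;
    }

    // Expected number of hits if the same fraction applied to another space.
    static long extrapolate(double fraction, long otherSpaceSize) {
        return Math.round(fraction * otherSpaceSize);
    }

    public static void main(String[] args) {
        double fraction = hitFraction(500, 37_000);            // a bit above 1%
        long projected = extrapolate(fraction, 1_000_000_000L); // assumed space size
        System.out.printf("fraction=%.4f projected=%d%n", fraction, projected);
    }
}
```

Under these assumptions a query hitting a bit over 1% of the toy space would correspond to more than ten million structures in a space of one billion, which is the kind of result size the cutoff is meant to avoid.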

I agree that this is somewhat confusing, and it might cut off interesting structures. The first implementation of the software used a "process all structures, then report the full result at once" approach, so such rather strict cutoff criteria were really necessary. I have since extended the software so that it can also "continuously stream" results, so removing these cutoff mechanisms from the algorithm could be an option. I don't know if this would really be helpful, though. Alternatively, it could make sense to include a "results might be truncated" flag in the results whenever one of the cutoff mechanisms is engaged (there are also two other hard limits in the algorithm).
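
A "results might be truncated" flag could look roughly like the sketch below. This is only one possible shape, not existing hyperspace API: `SearchResult`, `collectWithLimit`, and the `maxHits` parameter are hypothetical names, and a single hit-count limit stands in for whichever cutoff mechanism is engaged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical result container with a truncation flag; not existing hyperspace API.
class TruncationFlagSketch {

    static final class SearchResult {
        final List<String> hits;
        final boolean truncated; // true if a cutoff stopped the search early

        SearchResult(List<String> hits, boolean truncated) {
            this.hits = hits;
            this.truncated = truncated;
        }
    }

    // Collects at most maxHits candidates and records whether any were left out.
    static SearchResult collectWithLimit(Iterable<String> candidates, int maxHits) {
        List<String> hits = new ArrayList<>();
        boolean truncated = false;
        for (String candidate : candidates) {
            if (hits.size() >= maxHits) {
                truncated = true; // at least one further candidate was dropped
                break;
            }
            hits.add(candidate);
        }
        return new SearchResult(hits, truncated);
    }
}
```

A caller can then check `result.truncated` and decide whether to rerun with a more specific query or a higher limit, instead of silently receiving a partial result.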
