-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add comments to ff fitting example, + some function signatures/docstrings #2
base: main
Are you sure you want to change the base?
Add comments to ff fitting example, + some function signatures/docstrings #2
Conversation
|
||
# 2. Configure the objective tiers | ||
final = fits.objective_tier() | ||
# set the objectives for the final tier | ||
# each value is an ``objective_config`` | ||
final.objectives = { | ||
0: fits.objective_config_position( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not too sure what the keys final.objectives
represent -- do they represent order, or do they maybe correspond with the assignments.POSITIONS
index? If the latter, constructing the dict with {assignments.POSITIONS: fits.objective_config_position(...
might be clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The objective keys are completely arbitrary ints; they represent order. I did it more for the case of when the the objective list gets split up they maintain their "index"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thank you! So is it more of an administrative/logging device to track each objective individually, or does it affect how the objective is computed? If it's just a naming thing, are integers the only keys accepted and does it expect them to be contiguous?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They have no almost no impact on how the objective is computed, other than the order they are added (FIFO; put expensive things first). I haven't put much thought into the objective keys, but I've been assuming ints. Looking through the code, I think the only part that uses the objective key is when I break up the particular computation into pieces, I keep track of ownership using the key. So, I would guess anything hashable is usable as a key. For example I have this in optimizers_scipy
:
tasks[(n, (i, (0,)))] = task
where n
is the iteration number, i
is the objective key in question, and (0,) refers to no parameter deltas (e.g. the f(x0) in the numerical derivatives. This is mostly useful for Hessians, where a molecule can have 100s of deltas to compute, and I break them into pieces and distribute them for computation. I collect the results based on the objective key (i
above). That said, they shouldn't need to be continuous, just something hashable to organize computations with.
# create a graph and add it to the graph_db | ||
g: graphs.graph = gcd.smiles_decode(smi) | ||
# molecule_graph_id is the graph ID of the graph in the graph_db | ||
molecule_graph_id: int = assignments.graph_db_add_graph(gdb, smi, g) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to use besmarts with multi-molecule data (e.g. dimers) at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stuff like dimers is the reason I made the much more complicated graph_db data structure.. however right now nearly everything assumes a single molecule, so it would need some work/checking to get going correctly. Each molecule coordinates would be a graph_db_column and the graph_db_row could be considered a single snapshot/frame, where each of these would be associated to the graph_db_graph of the same graph ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh I think I understand. So can a graph.graph
contain multiple molecules? IIRC the graph_same check only checks encoded primitives so coordinates wouldn't differentiate the molecules. So I'm wondering if a dimer energy of identical molecules would need to reference the same graph ID twice, and how that would interact with graph_db_tables.gid
.
(Just to check my understanding -- does the below sound right to you?)
Each graph_db_entry in the gdb contains graph_db_tables, which contains graph_db_graphs that are referenced by gid
. Coordinates are stored separately by atom in graph_db_columns, which are collected in graph_db_rows. In the case of using an interaction energy between two different molecules, I think the value at gde.tables[assignments.ENERGIES].values[snapshot_i]
would be the interaction energy at gde.tables[assignments.POSITIONS].graphs[gid].rows[snapshot_i]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been treating all graphs as fully connected; I haven't gone into much thinking/testing whether disconnected graphs should be allowed as of yet. There are pros and cons to doing so, but right now I'm leaning towards having everything connected (and maybe adding empty bonds for specific problems; these would encode to . in the string representation). I believe everything in graphs
is strictly 2D based, so yes the graph operations won't pick up on different conformations. The dimer would most certainly reference the same graph ID; the underlying graph is how selections are referenced in the graph_db_columns (if you have IC data; perhaps less relevant here). As an example you can query bonds of a graph, say (1, 2) by calling graphs.graph_bonds(gdb.graphs[gid])
, which would be a selection in the graph_db_column, such as gde.tables[BONDS][gid][rid][cid][(1,2)]
(if you put data there). There is no way to specify conformations to a graph object; think of them as part of a regular FF topology. The easiest way to do so would be a assignments.graph_assignment
object, which are used in most places needing positions.
And yes, that is a quick way to make that work and probably the only sane way to do so, showing the usefulness of having a dumb values member :) I read that as the total energy for a particular snapshot. There may be two gids each with one column for snapshot_i
, or the same gid with two columns with different sets of coordinates.
Hi @lilyminium, I appreciate your effort here. I think the intention here is to leave this up for some time as a means to take notes, or are you done? In general I would be happy to merge your input as part of improving the example's clarity. |
energy_data = f.read().split("\n") | ||
# assume energy data is a list of floats with one energy per line | ||
energy_values = [*map(float, [x for x in energy_data if x])] | ||
energy_gdt.values.extend(energy_values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this example contained multiple observations to fit to (so multiple gids I assume), how would besmarts map each energy value to the corresponding coordinates? I could be missing something here but I don't see any modification of energy_gdt.graphs
(although I guess energies aren't actually part of this example!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I never really intended to use energies in this example, and this snippet is more destructive than constructive. The value member field was just a lazy way to append data without having to deal with the data structure as I worked on various objective functions (hessian freqs, DLC gradients, etc), so I consider the values member to be "user defined" :). To answer your question generally, the (proper/designed) way to do this would be to set up a graph_db table (with a null topology since it is per graph/molecule/system) and then you would assign energies as you would like. I note that a weakness here is that I don't have a great way to set a total energy anywhere, it would have to be a sum over all graph_db_graph objects for a particular row_id. This would give you a bunch of graph_db_columns, and for null topologies I reserve the selection (0,) to refer to the entire molecule, remembering that I use 1-index for selections for atoms so 0 is never used anywhere else. (I haven't actually done this anywhere, but this is how I will do it once I start fitting energies). This would be mostly for QM energies, but it should be possible to do IC energies instead for MM. One could set up a custom addressing to store the totals in the values member; this greatly simplifies computing the objective at the end, which is essentially why I added it in the first place.
Thanks Trevor, I still have a couple questions that might help with the notes but in general happy for you to use this however you like, and merge if you think it'll be helpful! I can let you know when I'm done updating. |
I've updated this example and refactored it to use stuff that is in core. |
Hi @trevorgokey, I was going through the 07_besmarts_fit example and making some notes for myself on what each line does / the Python object structures, as the tutorial tends to show a large portion of code at once and pull out some highlights to explain. Just to check my understanding, do you reckon the comments are accurate or have I completely misunderstood some of what's going on? I do have some questions remaining on the data structures that I'll add as comments.
In case you think the notes will be useful to anyone else, I've opened this PR just to suggest adding them in.