Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the connector hashing and extend condesc_t #1528

Merged
merged 11 commits into from
May 20, 2024

Conversation

ampli
Copy link
Member

@ampli ampli commented May 17, 2024

This patch enumerates the connector descriptors and uses this number in the connector hash functions.
(I also plan to use it as an array index in a future parse speedup.)

It also extends the information obtained through Connector::desc without increasing its 32-bit size by adding a pointer to condesc_t pointing to extra data. This will allow us to add more data per connector type, e.g., costs for multi-connectors (#1351, #1352) and a cost table per connection length (I think mentioned in #632, and a placeholder added in 2c1d979). The idea is that the data needed during parsing for easy_match is still directly available, while obtaining other data before and after the parsing may need an extra redirection.

The hashing functions have also improved, and they should now keep the collisions low, hopefully without pathological cases (the current hash doesn't differentiate between connectors with more than 4 UP and 4 LC characters, and increasing connector_hash_t to size_t turned out to be too costly).
I handled two types of collisions:

  1. Different connector lists that map to the same 32-bit hash value. Enabling the existing per-connector feedback hash (with different multiplications constants) reduced them by 2+ orders of magnitude.
  2. Extra collisions due to different hash values that hashed to the same slot. The added Fibonacci multiplication significantly reduced them.

I've also included a few minor efficiency tweaks. I tested the speed on the corpus batches and the big-dict (from #1479), and it seems there is a small speedup (but this patch is not about speedup).

ampli added 11 commits May 15, 2024 17:10
The second invocation of connector_list_equal(slot->clist, c) is not
needed, because at this point we know that its result was "true" 2 lines
before. So replace it by "true".

This is only a very slight speedup, if any, because the compiler (in
-O3) knows it is a "pure" function and thus returns its previous result.
Due to changes in include files I got a compiler warning that it shadows
the global verbosity variable.
Keep condesc_t at 32 bytes after adding con_num,
which will be used in connector hash and might later be used for
radix sort.
@ampli ampli changed the title Improve the connector hashing and add extend condesc_t Improve the connector hashing and extend condesc_t May 17, 2024
Copy link
Member

@linas linas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the circular pointers seem unusual. Are you sure you meant that? Merging anyway.

@@ -424,7 +424,7 @@ static bool sort_condesc_by_uc_constring(Dictionary dict)
condesc_t *condesc = dict->contable.hdesc[n].desc;

if (NULL == condesc) continue;
calculate_connector_info(condesc);
calculate_connector_info(&dict->contable.hdesc[n]);
sdesc[i++] = dict->contable.hdesc[n].desc;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems to be same as condesc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

condesc is dict->contable.hdesc[n].desc, not dict->contable.hdesc[n].
To be clearer, these lines can be changed to (not tested):

hdesc_t *hdesc = &dict->contable.hdesc[n];

if (NULL == hdesc->desc) continue;
calculate_connector_info(hdesc);
sdesc[i++] = hdesc->desc;

{
lc_enc_t lc_letters;
lc_enc_t lc_mask;
hdesc_t *more; /* More information, for keeping small struct size. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So hdesc points at condesc, and condesc points at hdesc. Interesting ... the old code did not have this loop.

Copy link
Member Author

@ampli ampli May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hdesc are the slots of the connector hash table. Each one that is used, has a desc pointer to a condesc_t element. Up to here it is a standard implementation of an open hash table.
The more elements of condesc_t need to point somewhere, and it was convenient to use the hdesc slots, which are already allocated (it requires less code). However, this is memory inefficient (and indirectly CPU inefficiently due to cache trashing) because most of the hdesc elements are empty (NULL desc).

The solution is to use a memory pool for the more elements. I will eventually send a cleanup PR for that (and the sort_condesc_by_uc_constring() code).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Ah! OK. I think I understand. This wasn't obvious from casual perusal, so I thought I'd point it out.

@linas linas merged commit 52bf569 into opencog:master May 20, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants