Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to control linkage deduplication #1412

Merged
merged 2 commits into from
Feb 5, 2023
Merged

Conversation

ampli
Copy link
Member

@ampli ampli commented Feb 5, 2023

See the discussion at PR #1396.

  • Add -test=linkage-dedup:N
    Without this argument, the default is to perform linkage deduplication iff there is sampling.
    With this argument:
    -test=linkage-dedup:0 : Never do linkage deduplication.
    -test=linkage-dedup:1 or -test=linkage-dedup: Always do linkage deduplication.
  • By default, perform connector encoding for any sentence length
    ... so linkage deduplication due to multi-connectors, which depend on connector memory sharing, will be done for short sentences too.

This patch replaces PR #1410.

return;

printf("DEDUP\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stray printf

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be fixed...
(My test program that comparte linkages ignores such junk...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have forced-push the fix.

* Update: For a consistent linkage deduplication, encoding (which
* included tracon sharing) should always be done. And now the overhead
* is negligible. */
#define SENTENCE_MIN_LENGTH_TRAILING_HASH 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the min_length_encoding is likely to be zero forevermore, it would be nice to remove this entirely, to keep the code less confusing.

However, I notice that

parse/preparation.c:	wc_word_list->min_len_encoding = 2; /* Don't share/encode. */

and I can't tell if that's important or not ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anyway, that doesn't have to be done in this pull req.

Copy link
Member Author

@ampli ampli Feb 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is essential to leave intact -test=min_length_encoding:N.

Here is what is currently done when encoding connectors:

  1. Always done (for any min_length_encoding): Disjunct and connector packing. It means putting all the disjuncts and connectors in a contiguous memory block. This speeds up the parsing due to better use of the CPU cache memory (and also maybe main memory because they can then be fetched from a contiguous memory block).
  2. Tracon sharing. This causes most of the connectors to share their memory so it drastically reduces the space they use. It is a very subtle step, and a great care is taken not to share connectors that have attributes that are used during the parsing and their values may affect the correctness of the parsing. If such attributes would be added to Connector_struct and the affected connectors would still be shared, it may result in some wrong parses. See the comment block before the line newc = NULL; /* Don't share it. */ in pack_connectors().
  3. Connector enumeration. The encoding which is done for the power pruning step consists of adding a memory-sharing reference count for each connector. With no encoding, no sharing is used so the connector reference count is always 1.
    For the parsing step, the encoding consists of enumerating the tracons (of course, shared tracons use the same tracon_id). With no encoding, they are just enumerated sequentially.

The tracon sharing is done in a slightly different way for the pruning and the parsing steps (for the pruning, the shallow connectors are never shared with non-shallow ones). To find out which tracons can be shared, the "tracon-set" module is used.

As you can see, the encoding step is complex and fragile. If something in it gets wrong (due to bit rot or a bug), or a problem in the code that depends on encoding happens, then some parses may become wrong (it can be very subtle - like one wrong parse per thousand sentences).

So for checking code changes (of you and me), I have a test program that produces up to 150k parses from each sentence, and I run it with -test=min-len-encoding:0 and also with -test=min-len-encoding:1 and then compare the results (it knows to remove duplicate linkages and to order same-metric linkages in canonical order, similarly to what you have recently introduced). Many times it found differences in the linkages due to bugs, especially in the encoding code, or due to wrong use of the encoding in the power pruning or the parsing steps.

To sum up, this is the main reason it is essential to leave intact -test=min_length_encoding:N. But in principle, it could be simplified to 0/1 values meaning whether to perform it or not (but the simplification may not be significant).

(This explanation appears using some other words also in library comments.)

Regarding
parse/preparation.c: wc_word_list->min_len_encoding = 2; /* Don't share/encode. */
it is to ensure the encoding will not be done. Doing it will just waste resources, as it will be redone later. This assumption should be checked again since connector sharing would reduce the size of the saved wildcard disjunct prototype block and make its copying faster, and maybe the tracon_ids in it may just be relocated for each word (instead of making the enumeration from scratch). In any case, I am in the last steps of rewriting the generation to improve its speed and I will add that to my checks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. That this was critical for testing wasn't obvious. I added a note in baacd39

@linas linas merged commit da618dc into opencog:master Feb 5, 2023
linas added a commit that referenced this pull request Feb 6, 2023
@ampli ampli deleted the linkage-dedup branch May 25, 2024 01:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants