
Fix multiple deduplication bugs #1396

Merged: 8 commits merged into opencog:master from dedupe2 on Jan 24, 2023

Conversation

@linas (Member) commented Jan 24, 2023

This fixes multiple issues reported in #1378, including:

  • Bad identification of region to copy
  • Failure to compare dictionary words
  • Inefficient copying

@linas mentioned this pull request Jan 24, 2023
@linas merged commit d58da3e into opencog:master Jan 24, 2023
@linas deleted the dedupe2 branch January 24, 2023 20:43
@ampli (Member) commented Jan 24, 2023

There is still a similar silence problem (all linkages get eliminated) with this sentence:
The response was that there are several barriers that have to come down.

In addition to fixing the problem, I propose to add an assert() that at least one linkage remains.
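Something along these lines, following the assert(condition, message) style used elsewhere in this code (the field name here is only a guess, not the actual one):

	/* Sketch only: num_valid_linkages is an assumed field name. */
	assert(sent->num_valid_linkages > 0,
	       "Linkage deduplication eliminated all linkages");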

Questions:

static void sort_linkages(Sentence sent, Parse_Options opts)
{
	if (0 == sent->num_linkages_found) return;
	/* If they're randomized, don't bother sorting */
	if (0 != sent->rand_state && sent->dict->shuffle_linkages) return;

	// Special-case hack.
	if ((0 == strncmp(dict->lang, "any", 3)) ||
	    (NULL != dict->affix_table->anysplit))
		dict->shuffle_linkages = true;

For any there is a special check that skips sorting. What is the reason?
Also, in that case there is no deduplication. Is that intended?
Now that we have dict #define, maybe it's better to define shuffle_linkages in the any dict?
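For illustration, a sketch of how that could look; dict_define_lookup() is a made-up stand-in for however dict #define values are actually read out:

	/* The any dict would carry:  #define shuffle-linkages 1;
	 * and the strncmp() special-case above would become a lookup.
	 * dict_define_lookup() is hypothetical, not the real API. */
	const char *shuf = dict_define_lookup(dict, "shuffle-linkages");
	if ((NULL != shuf) && (0 != atoi(shuf)))
		dict->shuffle_linkages = true;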

@linas (Member, Author) commented Jan 24, 2023

Yeah ... There's still something weird going on. I reordered the code to mark duplicates during sorting, and am now finding many, many more duplicates. And when this happens, somehow something doesn't work.

Yes, the any should be a dict #define

@linas (Member, Author) commented Jan 24, 2023

Ahhh there are more duplicates than valid linkages. Then it fails... because I dedupe the bad linkages too. Perhaps the bad linkages shouldn't be sorted or deduplicated, anyway!?

@linas (Member, Author) commented Jan 24, 2023

The any linkages all have the same cost, so in that sense, they are unsortable. However, the newest code (not yet pushed) could provide a unique sort, anyway, and thus allow deduplication. However ... we do not want that. Because ...

The any language is supposed to provide a uniform statistical sample of all possible planar trees. Duplicate removal might ...!??

Certainly, if num linkages < limit, then all planar linkages will be enumerated, one each, there won't be duplicates, and so these are by definition "uniformly distributed". If num linkages > limit, then ... we sample ... does that sampling result in a uniform distribution? I don't really know. I assumed it did ... Would removing duplicates make it "more uniform" if it is not? How many duplicates might there be, anyway?

The reason for the shuffle is that when I sample, I set limit=1000 but then take only the first 24 linkages, ever. I'd tripped over a bug, where, no matter what the sentence, I always got back exactly the same 24 parses. So that wasn't uniform at all, over thousands of sentences. The shuffle guarantees that a different set of 24 appears each time.

@linas (Member, Author) commented Jan 24, 2023

I won't be fixing the any #define thing ... if you want to do that, go ahead ...

@linas (Member, Author) commented Jan 24, 2023

#1397 fixes the too-many-duplicates bug mentioned above.

linas added a commit that referenced this pull request Jan 25, 2023
@ampli (Member) commented Jan 25, 2023

(EDITED to fix linkage number typos.)
I still get problems with my checks.
Here is one of them.
The other one is with a sentence having 214568 linkages which apparently has a problem due to checking connector addresses in linkage_equiv_p(). I will investigate this further, but also see below.

My test sentence is:
He said that, finding that it was impossible to get work as a waiter, he would work as a janitor

In the current release (5.12.0) I get:
Found 2280 linkages (241 of 1000 random linkages had no P.P. violations)
If I filter the results through my duplicate elimination Perl program, it finds 61 duplicates and 180 linkages after deduplication.

In the repository release (9c071f7, just before your recent dict changes):
Found 2280 linkages (185 of 944 random linkages had no P.P. violations)
My duplicate elimination program then still finds 5 duplicates!
For example, linkages 28 and 29 are duplicates of linkage 26.

I still don't understand why linkage_equiv_p() misses them.

So something in the deduplication algo apparently doesn't detect all the duplicates.

Some other issues:

Use Connector::desc not tracon_id per #1396

(I cannot find tracon_id mentioned in #1396, so I guess it was in another issue/PR.)

There are some problems in using desc::string:

  1. If anything, it makes the problem of potential false duplicates more severe: tracon addresses (or tracon IDs) can coincide only between different disjuncts of the same word (ones that happen to end with the same tracon), but connector names may be the same in totally unrelated disjuncts, not even on the same word! (There is no check that they are on the same word.)
  2. desc::string is shared between regular and multi connectors, i.e. it doesn't contain the @. This can surely lead to false duplicates, so Connector::multi should be checked too.

In addition, connector address comparisons are still done, as in:
if (lpv->link_array[li].lc != lnx->link_array[li].lc)
As I mentioned elsewhere, connector addresses may be equal on different disjuncts due to tracon compression.
This causes deduplication problems (and I still need to give you exact examples).

Last thing here:
For short sentences, there is about 10% added overhead. Since the linkage info index is negative when sampling is to be done, maybe linkage_equiv_p() can be skipped for negative indexes.
However, thinking about that brought me another idea for deduplication, which I will write about shortly.

@ampli (Member) commented Jan 25, 2023

However, thinking about that brought me another idea for deduplication, which I will write about shortly.

This idea is not as simple as I first thought, and anyway I recalled it had already been used in the old code:

In case there is no linkage overflow, say we found N linkages when L=linkage_limit.
The idea is to create L unique random numbers in [0..N) and use them as the linkage indexes.
However, it may not be particularly simple to create such a list.
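For concreteness, here is a minimal sketch of one standard way to do it, via a partial Fisher-Yates shuffle. This is not the library's code: rand_r() stands in for whatever RNG is actually used, and the modulo has a slight bias that a real implementation would avoid.

	#include <stdlib.h>

	/* Draw L distinct indexes, uniformly at random, from [0..N).
	 * Assumes L <= N. Needs O(N) scratch memory, which may matter
	 * when N is large (e.g. the 214568-linkage sentence above). */
	static void sample_linkage_indexes(unsigned *out, unsigned L,
	                                   unsigned N, unsigned *seed)
	{
		unsigned *pool = malloc(N * sizeof(unsigned));
		for (unsigned i = 0; i < N; i++) pool[i] = i;
		for (unsigned i = 0; i < L; i++)
		{
			/* Pick a random element from the not-yet-chosen tail. */
			unsigned j = i + (unsigned)rand_r(seed) % (N - i);
			unsigned tmp = pool[i]; pool[i] = pool[j]; pool[j] = tmp;
			out[i] = pool[i];
		}
		free(pool);
	}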

@ampli (Member) commented Jan 25, 2023

In #1378 you said

After fixing assorted bad tests etc. the deduplication code for Baird is no longer entered. But if I let dedupe run anyway, I get this:

  • No flags (default config): one duplicate found, the second linkage!
  • -test=min-len-encoding:0: same
  • -test=min-len-encoding:254: no duplicates seen

The difference between the first and second linkages is that the first uses vice.s and the second uses vice.n-u although the link types used are the same. To be backwards-compatible, I will need to also check for mismatched words.

There are still linkage differences in some sentences between min-len-encoding:0 and min-len-encoding:254, e.g.:
In this case, surface tension of the liquid solder will pull on one side more than the other, and the part will tombstone.

Since the linkages without deduplication are the same, I think that this is a bug.
To be on the safe side, this needs to be checked with the version just before the dict changes.
I also think that since the linkage comparison is subtle (and it is not clear yet how to fix it), it would be a good idea to add -test=no-deduplication so we can directly validate the results.

@ampli (Member) commented Jan 25, 2023

In this case, surface tension of the liquid solder will pull on one side more than the other, and the part will tombstone.

Note that it may not be easy to check that min-len-encoding:0 and min-len-encoding:254 indeed produce the same linkages when random sampling is not used.
To produce linkage diagrams file for this sentence, you can do:

echo 'SENTENCE' | link-parser -test=min-len-encoding:0,auto-next-linkage -lim=300000 > /tmp/l0
echo 'SENTENCE' | link-parser -test=min-len-encoding:254,auto-next-linkage -lim=300000 > /tmp/l254

But diff /tmp/l0 /tmp/l254 will then show many differences due to slight linkage order changes (I still need to investigate why there is an order change, but for now I found that it is not due to address randomization).
However, if I alphabetically sort the linkages there are no differences (and no duplicates; no random sampling has been used).

Another observation is that with -limit=1000 (the default linkage_limit) there are still 6 undetected duplicates.

@ampli (Member) commented Jan 25, 2023

I found the reason for the linkage order difference when tracon compression (connector address sharing) is used (the default).

static int linkage_equiv_p(Linkage lpv, Linkage lnx)
{
	// Compare links
	for (uint32_t li=0; li<lpv->num_links; li++)
	{
		if (lpv->link_array[li].lc != lnx->link_array[li].lc)
			return strcmp(
				lpv->link_array[li].lc->desc->string,
				lnx->link_array[li].lc->desc->string);
		if (lpv->link_array[li].rc != lnx->link_array[li].rc)
			return strcmp(
				lpv->link_array[li].rc->desc->string,
				lnx->link_array[li].rc->desc->string);
This is due to the connector address checks and the following strcmp().
Recall that when tracon address sharing is used (the default, same as -test=min-len-encoding:0), connectors on totally different disjuncts of the same word may share connector addresses. If such sharing is detected, then the order may change. However, with -test=min-len-encoding:254 no connector addresses are shared, and the order is preserved.

However, I still don't understand why some duplicates are not detected.

@linas (Member, Author) commented Jan 25, 2023

You are confusing me.

  • Random linkages will depend on the random number generator. This should generate exactly the same random numbers on two different runs. These will get used exactly the same way, but only if everything else is the same. Are you sure that -test=min-len-encoding is not changing how the random number generator is being called?
  • The failure to detect all duplicates does seem to be a concern. But only if there was not a mis-enumeration, due to random number generator differences. So again: are you sure the random number generator is being used in exactly the same way?
  • All of the various comments about shared connectors etc do not seem to be important or relevant. I am looping over the list of links in the parse, and all I care about are whether the two endpoints of the link are the same: the same connector type, or not. I don't care if the C structures are shared or not; this seems irrelevant.
  • Instead of comparing links, I guess I could have compared to see if chosen_disjuncts are the same, or not.

Anyway, for my statistical sampling, I'm starting to think that duplicates are not such a bad thing, after all, so I'm open to disabling this code, by default.

@linas (Member, Author) commented Jan 25, 2023

I found the bug about failing to detect differences. It's obvious and a result of late-night cloudy thinking. Fix coming shortly.

@linas (Member, Author) commented Jan 25, 2023

I apologize for changing the dictionary. I should not have done that.

@ampli (Member) commented Jan 25, 2023

Let's start with one point I partially agree with you about:

Instead of comparing links, I guess I could have compared to see if chosen_disjuncts are the same, or not.

Such a check is needed too. However, sentences with the same disjuncts may have different linkages.
Here is a minimal example:

test/4.0.dict:

w0: A+;
w1: A+;
w2: @A- & A+;
w3: @A-;

$ link-parser test
linkparser> w0 w1 w2 w3
...
Found 2 linkages (2 had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS= 0.00 LEN=1)

 +--A--+
 |  +-A+-A+
 |  |  |  |
w0 w1 w2 w3

Press RETURN for the next linkage.
linkparser> 
	Linkage 2, cost vector = (UNUSED=0 DIS= 0.00 LEN=2)

 +----A---+
 |  +-A+-A+
 |  |  |  |
w0 w1 w2 w3

@linas (Member, Author) commented Jan 25, 2023

I fixed the bad linkage compare in e35ea9f and pushed directly upstream.

@linas (Member, Author) commented Jan 25, 2023

Such a check is needed too.

Hmm. I won't be doing that check. If two different disjuncts give the same links and use the same dict words, then, from the point of view of the linkage, they are the same, even if the dict managed to include two different disjuncts for the same word (i.e. one with and one without a multi-connector, but otherwise the same).

Hmmm. At least, I think I want to ignore such differences. But for dict debugging, they do make a difference...

From what I can tell, it is enough to compare pointers for the chosen_disjuncts ....

@linas (Member, Author) commented Jan 26, 2023

The disjuncts are different

How can they possibly be different? We've just finished loops that checked everything there is to check, and found that everything is the same.

@ampli (Member) commented Jan 26, 2023

The disjuncts are different

How can they possibly be different? We've just finished loops that checked everything there is to check, and found that everything is the same.

What is supposed to ensure that they are the same? Let's see.

Loop 1: // Compare link endpoints
This of course doesn't ensure the disjuncts are the same.

Loop 2: Compare link names.
This also doesn't ensure the disjuncts are the same.
It also doesn't ensure the endpoint connectors have the same names (because the same link can theoretically be constructed from different connector pairs), but it is a heuristic that they are the same (because the dict is not specially designed to break that). I can add a debug check to see if it always holds.

Loop 3. // Compare words.
This surely doesn't validate that the disjuncts are the same (different disjuncts on the same word (2D slot) often have the same string).

Loop 4. // Compare connector
This is the loop we are discussing. As far as I can see, no loop before it ensures the disjuncts here are the same.

And indeed, when I added

	for (uint32_t wi=0; wi<lpv->num_words; wi++)
		assert(lpv->chosen_disjuncts[wi] == lnx->chosen_disjuncts[wi], "XXX");

just above #if DOUBLE_CHECK, it failed, for example, for this sentence:
Business is booming, Joe Smith, a car dealer, says

This proves that the disjuncts can be different before loop 4 even when loop 4 finishes normally with no return,
so lnx->dupe = true; can be reached when the disjuncts are not the same.
Have I missed something?

@linas (Member, Author) commented Jan 26, 2023

Business is booming, Joe Smith, a car dealer, says

right before the return 0; I placed this:

	for (uint32_t wi=0; wi<lpv->num_words; wi++) {
		if (lpv->chosen_disjuncts[wi] != lnx->chosen_disjuncts[wi]) {
			printf("duuude difference at %u %s %s vs %s\n", wi,
			       lnx->chosen_disjuncts[wi]->word_string,
			       linkage_get_disjunct_str(lpv, wi),
			       linkage_get_disjunct_str(lnx, wi));
		}
	}

prints

duuude difference at 3 boomingv @MXs+ @MXs+ Ss+ vs @MXs+ Ss+

Casual examination shows that neither of these disjunct strings is correct. WTF

    |                                 |          |    +------Xd-----+
    +-------->WV-------->+            |          |    | +---Ds**x---+
    +---->Wd----+---Ss---+--Osm-+     |   +---G--+    | |   +---AN--+-Xc-+
    |           |        |      |     |   |      |    | |   |       |    |
LEFT-WALL business.n-u is.v booming.v 

@linas (Member, Author) commented Jan 26, 2023

Oh never mind. The printed parse is not the parse being shown

@linas (Member, Author) commented Jan 26, 2023

It should have been this one:

    |                           +----------------------Ss------------------
    |                           +----------------MXs----------------+
    |                           +-------MXs------+    +------Xd-----+
    +-------->WV-------->+      |     +----Xd----+    | +---Ds**x---+
    +---->Wd----+---Ss---+      |     |   +---G--+-Xca+ |   +---AN--+-Xc-+
    |           |        |      |     |   |      |    | |   |       |    |
LEFT-WALL business.n-u is.v booming.v , Joe.b Smith.m , a car.n dealer.n ,

So the disjuncts @MXs+ @MXs+ Ss+ and @MXs+ Ss+ are printed differently, but they result in the same parse. The "de facto" disjunct would be 2@MXs+ Ss+ if you allow me to invent a new notation to say "the MXs connector was used exactly twice."

For linkage de-duplication, it is the de facto links that I'm interested in, and not the dictionary entries that lead to them. The reason for this is because ... (next post)

@linas (Member, Author) commented Jan 26, 2023

The reason for this ... just evaporated. I was going to write something, and I see it won't support my argument. One of my dicts has expressions entirely of the form {A+/-} & {B+/-} & {C+/+} & {D+/-} where the A,B,C,D can be any one of thousands or tens of thousands of connectors. An earlier/alternate version used @A+/- & @B+/- & @C+/+ & @D+/- but for some reason, I decided "zero or one" is better than "zero or more".

So when I get a linkage, I am only interested in the actual connectors used, and not the hypothetical multi's that were never used.

@linas (Member, Author) commented Jan 26, 2023

FYI, if I turn on deduplication, I find that linkages 9, 12, 35, 39, 47, 50, 63 and 67 are duplicates. Presumably, one linkage corresponds to the @MXs+ @MXs+ Ss+ disjunct while the other corresponds to the @MXs+ Ss+ disjunct.

This raises the question: should deduplication be turned on in general? Should it be a run-time flag?

@ampli (Member) commented Jan 27, 2023

For linkage de-duplication, it is the de facto links that I'm interested in

Previously I said (multiple times) that the default is -test=min-len-encoding:0. However, in the current code it is actually 6 and not 0, due to a very slight overhead of connector encoding for very short sentences. So for consistent results, it should be changed to 0. This is fine since the overhead is negligible. But I think it would be a good idea to ensure that the eliminated linkage is the one with more connectors. Is this what the current code does?

However, I think it will be better to just collapse "@x @x" sequences to a single @x in eliminate_duplicate_disjuncts().

@ampli (Member) commented Jan 27, 2023

This raises the question: should deduplication be turned on in general? Should it be a run-time flag?

At least, there should be a -test flag to disable the linkage deduplication.
If we find a good use of that, then we can convert it into a real option.

@ampli (Member) commented Jan 27, 2023

collapse "@x @x" sequences to a single @x in eliminate_duplocate_disjunct()

I just found an old code in which I added duplicate multi elimination in build_disjuncts(), in a hope of a speedup (and got a slight slowdown instead).
However, I did it then in an inefficient way and if eliminating such disjuncts is a good idea I can improve it.

@linas (Member, Author) commented Jan 27, 2023

Proceed as you wish; I'm done, here.

I'm now off to fry my explosively large dictionaries...

FYI, I'd like to publish version 5.12.1 in a few days or weeks, or whenever it's ready, before too long.

@ampli (Member) commented Feb 5, 2023

duuude difference at 3 boomingv @MXs+ @MXs+ Ss+ vs @MXs+ Ss+

The existence of @MXs+ @MXs+ means the disjunct should connect exactly two words to the right. See below for my question.

I said:

collapse "@x @x" sequences to a single @x in eliminate_duplocate_disjunct()

I wrote the code to do that, and debug printing also found things like hCO @hCO Wd turning into @hCO Wd.
However, in general, it is correct to just collapse sequences of N same-string connectors (when at least one of them is multi) into a single multi-connector only if there is no similar disjunct with a sequence of M such connectors with M < N. E.g. it will be wrong to collapse hCO @hCO Wd into @hCO Wd unless there is already an identical such disjunct.

So I propose another solution to my problem of automatically comparing linkage results after making a change:

This raises the question: should deduplication be turned on in general? Should it be a run-time flag?

Providing such a flag (e.g. -test=no-linkage-dedup) will solve the above problem. This can also serve us if we suspect that a linkage is wrongly suppressed, e.g. after changes in the sorting function. So I will implement such a flag.

@linas (Member, Author) commented Feb 6, 2023

Some confusion above...

@MXs+ & @MXs+

This just says "connect to zero or more MXs to the right" and is functionally equivalent to a single @MXs+

hCO & @hCO

This means "connect to one or more hCO" and is NOT the same as @hCO. However, it seems likely that the actual dictionary entry is {hCO} & @hCO which is zero or more and thus equivalent to a single @hCO.

FWIW, + and - directions commute, so if there are expressions like @MXs+ & B- & @MXs+ the B- can be slid past the MX's, and then the two MX's can be combined. I don't know if there's any code that automatically slides all the -'s to the left of all the +'s... This sliding or "normal ordering" is possible only for disjuncts, and not, in general, for expressions.

See below for my question.

I don't see a question.

@ampli (Member) commented Feb 6, 2023

is functionally equivalent to a single @MXs+

Does @MXs actually mean "connect to one or more MXs"? ({@MXs} is zero or more.)

I don't see a question.

It is an implied one: whether the "it will be wrong to collapse ..." sentence is correct.

@linas (Member, Author) commented Feb 7, 2023

Heh.

@MXs actually mean "connect to one or more MXs"

Yes it does. I was wrong; I misremembered. I was so used to seeing {@MXs} that I forgot the true meaning of "one or more"!

The following rewrites are wrong:

  • hCO @hCO to @hCO: the former is "two or more", the latter is "one or more".
  • @MX @MX to @MX: the former is "two or more".

Thus, it would appear that my earlier insistence that "they are the same" (in #1396 (comment) and other comments) is wrong.

And now that I see the error of my ways ... what to do? The disjuncts differ, but they give the same parse ... Hmmm .... perhaps you are correct, these should be considered to be different.

... Should I patch this? Do you want to patch this?

@linas (Member, Author) commented Feb 7, 2023

I opened #1417 because otherwise, this conversation will get lost in history.

@ampli (Member) commented Feb 7, 2023

I suppose that the last two loops (including the DOUBLE_CHECK loop) are to be removed.

The question is how to sort the linkages (that happen to look the same) in that case.
Maybe we can compare the disjuncts by their connectors (string and multi), taking into account that the number of connectors may be different (e.g. return -1 if an earlier NULL is encountered).
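A minimal sketch of the comparison I have in mind; the connector-chain field names here (left/right chains linked by next, desc->string, multi) follow the snippets quoted above, so treat them as assumptions:

	#include <string.h>

	/* Sort-order comparison of two connector chains by connector
	 * string and multi flag; an earlier NULL sorts first. */
	static int connector_chain_cmp(const Connector *a, const Connector *b)
	{
		while ((NULL != a) && (NULL != b))
		{
			int diff = strcmp(a->desc->string, b->desc->string);
			if (diff) return diff;
			diff = (int)a->multi - (int)b->multi;
			if (diff) return diff;
			a = a->next;
			b = b->next;
		}
		if (a == b) return 0;          /* both NULL: equal */
		return (NULL == a) ? -1 : 1;   /* shorter chain sorts first */
	}

	static int disjunct_cmp(const Disjunct *da, const Disjunct *db)
	{
		int diff = connector_chain_cmp(da->left, db->left);
		if (diff) return diff;
		return connector_chain_cmp(da->right, db->right);
	}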

@linas (Member, Author) commented Feb 7, 2023

I suppose that the last two loops (including the DOUBLE_CHECK loop) are to be removed.

??? I must be getting tired. The second-from-last loop, lines 352 to 372, bails only if there are differences, so we want to keep it.

// Compare connector types at the link endpoints. If we are here,
// then the link endpoints landed on the same words, and the link
// names were the same. The connector types might still differ,
// due to intersection. The multi-connector flag might differ.
// However, neither of these is likely. It is plausible to skip
// this check entirely; it's mostly a CPU-time-waster that will
// never find any differences in almost any situation.
for (uint32_t li=0; li<lpv->num_links; li++)
{
	Link * plk = &lpv->link_array[li];
	Link * nlk = &lnx->link_array[li];
	if (plk->lc != nlk->lc)
	{
		int diff = strcmp(plk->lc->desc->string, nlk->lc->desc->string);
		if (diff) return diff;
		int md = plk->lc->multi - nlk->lc->multi;
		if (md) return md;
	}
	if (plk->rc != nlk->rc)
	{
		int diff = strcmp(plk->rc->desc->string, nlk->rc->desc->string);
		if (diff) return diff;
		int md = plk->rc->multi - nlk->rc->multi;
		if (md) return md;
	}
}

We want to keep the above.

The last loop, commented out with #if DOUBLE_CHECK (lines 373-383): we want to enable that loop. It will enable the detection of different disjuncts, and will also provide a sort order. Perhaps the linkage_get_disjunct_str() at line 380 is CPU-wasteful, and can be replaced by a more direct comparison?

@ampli (Member) commented Feb 7, 2023

We want to keep the above.

I now think that you are right.
But I still do not like the if (plk->lc != nlk->lc) and if (plk->rc != nlk->rc) comparisons, since they depend on whether connector encoding is used and whether it decides to share the addresses of particular connectors.
Why not use instead if (plk->rc->desc->string == nlk->rc->desc->string) etc.?

Perhaps the linkage_get_disjunct_str() at line 380 is CPU-wasteful, and can be replaced by a more direct comparison?

Yes. My proposed connector comparison (string + multi) seems more direct.
I mean something similar to disjuncts_equal() (but returning a sort-order integer).

@ampli (Member) commented Feb 7, 2023

I said:

Why not use instead if (plk->rc->desc->string == nlk->rc->desc->string) etc.?

if (plk->rc->desc == nlk->rc->desc) is even better.

linas added a commit to linas/link-grammar that referenced this pull request Feb 7, 2023:
If the descs are the same, then we know the strings cannot differ.
This resolves opencog#1396 (comment).
@linas (Member, Author) commented Feb 7, 2023

But I still do not like

Read the code more carefully. It uses != not == ... the optimization of checking the descs is in #1421

@ampli (Member) commented Feb 7, 2023

Read the code more carefully. It uses != not == ... the optimization of checking the descs is in #1421

I added a comment on that in your last PR.

@linas (Member, Author) commented Feb 7, 2023

I added a comment on that in your last PR.

Click on "finish review" else I don't see comments.

@ampli (Member) commented Feb 7, 2023

I added it directly on the commit (in the repository commit list). I think such comments are immediate.

@ampli (Member) commented Feb 7, 2023

Click on "finish review" else I don't see comments.

I again forgot to do that...
Please check now.
