Multi column indexes are only used if all columns share operation #261

erizocosmico · 2018-07-04T13:23:29Z

Because our Index methods accept ...interface{}, where the length of that slice equals the number of columns in the index, every one of those methods requires all the values, one per column.

This creates a problem: all columns must use the exact same operation.

For example, consider we have an index on A and B:

A = 1 AND B = 1 will use the index.
A > 1 AND B > 5 will use the index.
A = 1 AND B < 5 will not, because = and < are not the same operation.
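
A minimal sketch of the restriction described above, not the actual go-mysql-server code. The `Filter` type and `CanUseMultiColumnIndex` function are hypothetical names introduced for illustration; the only thing taken from the issue is the rule that one call carries one value per column, so every column has to share the same operation.

```go
package main

import "fmt"

// Op is a hypothetical comparison operator attached to one filter.
type Op string

// Filter is a hypothetical single-column condition such as A = 1.
type Filter struct {
	Column string
	Op     Op
	Value  interface{}
}

// CanUseMultiColumnIndex mimics the restriction from the issue: one method
// call (one operation) receives one value per indexed column, so every
// indexed column must be filtered and all filters must share one operation.
func CanUseMultiColumnIndex(indexCols []string, filters []Filter) bool {
	if len(filters) == 0 || len(filters) != len(indexCols) {
		return false
	}
	byCol := make(map[string]Filter, len(filters))
	for _, f := range filters {
		byCol[f.Column] = f
	}
	op := filters[0].Op
	for _, col := range indexCols {
		f, ok := byCol[col]
		if !ok || f.Op != op {
			return false
		}
	}
	return true
}

func main() {
	idx := []string{"A", "B"}
	fmt.Println(CanUseMultiColumnIndex(idx, []Filter{{"A", "=", 1}, {"B", "=", 1}})) // true
	fmt.Println(CanUseMultiColumnIndex(idx, []Filter{{"A", ">", 1}, {"B", ">", 5}})) // true
	fmt.Println(CanUseMultiColumnIndex(idx, []Filter{{"A", "=", 1}, {"B", "<", 5}})) // false
}
```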

erizocosmico · 2018-07-04T13:27:29Z

Closing: as discussed via Slack with @ajnavarro, this is expected behaviour.

ajnavarro · 2018-07-04T14:19:42Z

Maybe we should have a look at how this works in MySQL or Postgres: https://dev.mysql.com/doc/refman/8.0/en/multiple-column-indexes.html

erizocosmico · 2018-07-06T07:54:44Z

MySQL uses a multi-column index as long as a value for the leftmost column is specified.

e.g., for an index on (a, b), it would be used for a = 1 AND b > 5
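
The leftmost-prefix rule mentioned above can be sketched roughly as follows. `UsablePrefix` is a hypothetical helper, not part of MySQL or go-mysql-server, and it deliberately ignores finer points of the MySQL optimizer (for example, how a range condition affects the columns after it).

```go
package main

import "fmt"

// UsablePrefix returns how many leading columns of a multi-column index are
// constrained by the query. Under the leftmost-prefix rule the index is a
// candidate whenever the result is at least 1.
func UsablePrefix(indexCols []string, filtered map[string]bool) int {
	n := 0
	for _, col := range indexCols {
		if !filtered[col] {
			break // a gap ends the usable prefix
		}
		n++
	}
	return n
}

func main() {
	idx := []string{"a", "b"}
	fmt.Println(UsablePrefix(idx, map[string]bool{"a": true, "b": true})) // 2
	fmt.Println(UsablePrefix(idx, map[string]bool{"a": true}))            // 1
	fmt.Println(UsablePrefix(idx, map[string]bool{"b": true}))            // 0
}
```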

erizocosmico · 2018-07-12T12:24:50Z

What are we going to do with this in the end @ajnavarro?

ajnavarro · 2018-07-12T12:40:24Z

Let's keep this issue open until we find a good solution for it. Right now it is not a high priority.

kuba-- · 2018-10-16T08:56:07Z

What if we do not have multi-column indexes internally, so that an index on (A, B) is the same as an index on A plus an index on B?
In the end we have to merge 2 indexes (which is pretty fast). Last but not least, it would also dedup indexes. Right now, if you create indexes on A, B, C, (A, B) and (A, B, C), that means 5 indexes, but if we treat the columns independently there would still be just 3 indexes (1 per column).
WDYT?
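
The counting argument above can be checked with a small sketch. `CountBackingIndexes` is a hypothetical helper that only counts distinct columns; it says nothing about how pilosa stores them.

```go
package main

import "fmt"

// CountBackingIndexes returns how many per-column structures are needed to
// serve all declared indexes when every column is indexed independently.
func CountBackingIndexes(declared [][]string) int {
	seen := make(map[string]struct{})
	for _, cols := range declared {
		for _, c := range cols {
			seen[c] = struct{}{}
		}
	}
	return len(seen)
}

func main() {
	declared := [][]string{{"A"}, {"B"}, {"C"}, {"A", "B"}, {"A", "B", "C"}}
	fmt.Println(len(declared), CountBackingIndexes(declared)) // 5 3
}
```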

erizocosmico · 2018-11-05T13:53:49Z

I think that might work. I can't think of any reason this may be bad right now. In fact, it should actually help with updates (fewer indexes to update once repos change). The only downside I can see is that you will have to check all remaining indexes to see whether you can delete a pilosa index, because it may still be in use by another gitbase index.

WDYT @src-d/data-retrieval?

ajnavarro · 2018-11-05T14:00:52Z

If it is done internally, at the pilosa implementation level, I'm totally in on that.

We should save in the pilosa metadata which pilosa indexes the go-mysql-server indexes are using, so we know whether we can delete the pilosa index on a DELETE INDEX statement.
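
One way to keep track of which per-column indexes are still in use is a reference count, sketched below. The `Registry` type and its methods are assumptions made for illustration; how the metadata would actually be persisted in pilosa is not covered.

```go
package main

import "fmt"

// Registry is a hypothetical reference counter: for each per-column backing
// index it tracks how many logical indexes still use it.
type Registry struct {
	refs map[string]int
}

func NewRegistry() *Registry {
	return &Registry{refs: make(map[string]int)}
}

// Add registers a logical index over the given columns.
func (r *Registry) Add(cols ...string) {
	for _, c := range cols {
		r.refs[c]++
	}
}

// Drop unregisters a logical index and returns the columns whose backing
// index is no longer referenced and can therefore be deleted.
func (r *Registry) Drop(cols ...string) []string {
	var free []string
	for _, c := range cols {
		if r.refs[c] == 0 {
			continue // unknown column, nothing to release
		}
		r.refs[c]--
		if r.refs[c] == 0 {
			delete(r.refs, c)
			free = append(free, c)
		}
	}
	return free
}

func main() {
	r := NewRegistry()
	r.Add("A", "B")
	r.Add("B", "C")
	fmt.Println(r.Drop("A", "B")) // [A]
	fmt.Println(r.Drop("B", "C")) // [B C]
}
```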

kuba-- · 2018-11-08T12:22:49Z

Ok, so I'll start prototyping the idea of dedup indexes (#261 (comment))

kuba-- · 2018-11-15T13:12:35Z

I noticed a small thing to improve. Having the following index:

CREATE INDEX email_idx ON commits USING pilosa (commit_author_email, commit_author_name);

The following AND query uses the index:

SELECT * FROM commits WHERE commit_author_email='...' AND commit_author_name='...';

but the OR query doesn't!

SELECT * FROM commits WHERE commit_author_email='...' OR commit_author_name='...';

Actually, if we have independent indexes (2 in this case), then for logic operations we can always merge the 2 indexes.
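
To illustrate why merging independent indexes handles both AND and OR, here is a toy sketch. The `Bitmap` type is a stand-in limited to 64 rows and the row numbers are made up; it is not the pilosa API.

```go
package main

import "fmt"

// Bitmap is a toy 64-row bitmap: bit i is set when row i matches a value.
// It stands in for a per-column lookup result.
type Bitmap uint64

// Of builds a bitmap from row numbers (each must be below 64).
func Of(rows ...uint) Bitmap {
	var b Bitmap
	for _, r := range rows {
		b |= 1 << r
	}
	return b
}

// And and Or merge two per-column lookups with one bitwise operation each.
func And(a, b Bitmap) Bitmap { return a & b }
func Or(a, b Bitmap) Bitmap  { return a | b }

func main() {
	email := Of(0, 2) // rows where commit_author_email matches
	name := Of(2, 3)  // rows where commit_author_name matches
	fmt.Printf("AND: %04b\n", And(email, name)) // AND: 0100
	fmt.Printf("OR:  %04b\n", Or(email, name))  // OR:  1101
}
```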

kuba-- · 2018-11-15T19:52:53Z

It's quite easy to fix the problem of an index on (A, B) not being usable with only one condition, e.g.:

WHERE A = '...'

It may require some convention between the driver and the analyzer. For instance, we may always pass to the index lookup as many keys as there are index expressions, keeping their order, and the lookup will skip nil keys, e.g.:

// index (A, B), WHERE A=5: the nil key for B is skipped by the lookup
index.Get(5, nil)
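
A rough sketch of the nil-skipping convention suggested above. `MultiIndex` and its `Get` method are hypothetical and only mirror the variadic shape mentioned at the top of the issue; one consequence of this convention is that nil can no longer be used to look up NULL values.

```go
package main

import "fmt"

// MultiIndex is a hypothetical index over several columns. Each column keeps
// its own value -> row bitmap map (bit i set means row i has that value).
type MultiIndex struct {
	columns []map[interface{}]uint64
}

// Get takes one key per indexed column, in index order. A nil key means
// "no condition on this column" and is skipped; the rest are intersected.
func (m *MultiIndex) Get(keys ...interface{}) (uint64, error) {
	if len(keys) != len(m.columns) {
		return 0, fmt.Errorf("expected %d keys, got %d", len(m.columns), len(keys))
	}
	rows := ^uint64(0) // start with every row and narrow down
	used := false
	for i, k := range keys {
		if k == nil {
			continue
		}
		rows &= m.columns[i][k]
		used = true
	}
	if !used {
		return 0, fmt.Errorf("at least one key must be non-nil")
	}
	return rows, nil
}

func main() {
	idx := &MultiIndex{columns: []map[interface{}]uint64{
		{5: 0x3, 7: 0x4},     // A: rows 0 and 1 have A=5, row 2 has A=7
		{"x": 0x5, "y": 0x2}, // B: rows 0 and 2 have B="x", row 1 has B="y"
	}}
	onlyA, _ := idx.Get(5, nil) // WHERE A=5
	both, _ := idx.Get(5, "x")  // WHERE A=5 AND B='x'
	fmt.Printf("%03b %03b\n", onlyA, both) // 011 001
}
```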

ajnavarro · 2018-11-16T09:30:01Z

But if you create one index with two columns, we shouldn't try to use the index with only one of the columns.

This can break the intended way of communicating between the Analyzer and the different kinds of indexes. Not all indexes can do what we are doing with the pilosa index, so we should keep the common interface with some constraints.

kuba-- · 2018-11-16T09:59:57Z

@ajnavarro - right, it was just an experiment, because with bitmaps it's doable.
But actually, with (A, B) only A AND B works. If you need A OR B, you need to create A and B independently.

kuba-- · 2018-11-16T10:02:30Z

Generally, with bitmaps I don't see the benefit of having multi-column indexes. It works better to have one index per expression and merge them, because merging bitmaps is super fast and it's more flexible from a composition point of view.

kuba-- · 2018-11-16T10:06:29Z

Notes from slack:

If you create one index on expressions (A, B), we don't actually index tuples; we index A values and B values independently as pilosa fields, so under the hood they are already independent structures.
But our index driver API requires the list of expressions that were specified at creation time (moreover, in the same order) so that it can internally compute the intersection over all expressions.
Because we use bitmaps per expression value, combining multiple expressions into one index is somewhat artificial.
For our internals, having an index on (A, B) is exactly the same as having an index on (A) and one on (B), except that for multi-expression indexes we always compute the intersection (A AND B). This means that for non-AND queries, e.g.:
WHERE A='...' OR B='...' the index won't be used. With independent indexes we can always merge 2 index lookups with any logic operation, so it will work for AND, OR, ...
Right now, if we try to reuse the (A, B) index to compute A OR B, it will compute (A AND B) OR B, which is incorrect.
So I propose that the index driver work with only one expression (the API gets a little simpler, and the lookup no longer has to compute intersections internally), while the analyzer/registry keeps a mapping like this:

[A, B] --> A, B
              ^
[B, C]--------|----> C

In this case, for a request with WHERE B = '...', the analyzer will have to check whether the B expression is indexed.
And if you register an index on (A, B), instead of getting one lookup with 2 expressions, we'll return 2 lookups (one per expression).
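
The notes above can be illustrated with a toy sketch. `Lookup`, `Registry` and `Expand` are hypothetical names and the row sets are made up. The sketch shows two things: an (A, B) index expanding into one lookup per expression, and why OR-ing on top of a pre-intersected lookup gives (A AND B) OR B, which always collapses to B.

```go
package main

import "fmt"

// Lookup is a toy per-expression lookup: bit i is set when row i matches.
type Lookup uint8

// Registry is a hypothetical analyzer-side mapping from an indexed
// expression to its own lookup, shared by every index that mentions it.
type Registry map[string]Lookup

// Expand returns one lookup per expression of a declared index, so an index
// on (A, B) yields two independent lookups instead of one intersected lookup.
func (r Registry) Expand(exprs ...string) ([]Lookup, bool) {
	out := make([]Lookup, 0, len(exprs))
	for _, e := range exprs {
		l, ok := r[e]
		if !ok {
			return nil, false // expression is not indexed
		}
		out = append(out, l)
	}
	return out, true
}

// IntersectedThenOr is what reusing a pre-intersected (A, B) lookup for
// A OR B computes: (A AND B) OR B.
func IntersectedThenOr(a, b Lookup) Lookup { return (a & b) | b }

func main() {
	reg := Registry{"A": 0x3, "B": 0x6, "C": 0x4} // A: rows 0,1; B: rows 1,2; C: row 2
	ls, _ := reg.Expand("A", "B")
	fmt.Printf("A OR B         = %03b\n", ls[0]|ls[1])                        // 111
	fmt.Printf("(A AND B) OR B = %03b\n", IntersectedThenOr(ls[0], ls[1])) // 110, row 0 is lost
}
```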

kuba-- · 2018-11-27T23:57:27Z

@erizocosmico - what is the main problem with using indexes with 2 different operations? E.g.:

A = 1 AND B < 5

And if we have one index per column, (A) and (B), instead of one on (A, B), will it work?

erizocosmico · 2018-11-28T07:12:39Z

There is no problem; we simply only implemented it for the case where they share operations.
It would work if we have (A), (B), yes.
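
As a sketch of why per-column indexes handle mixed operations, each column can answer its own operation and the results are then intersected. `ColumnIndex` is a hypothetical toy type over integer values and the rows are invented for the example.

```go
package main

import "fmt"

// ColumnIndex is a toy single-column index mapping each value to a row
// bitmap (bit i set means row i holds that value).
type ColumnIndex map[int]uint64

// Eq returns the rows whose value equals v.
func (c ColumnIndex) Eq(v int) uint64 { return c[v] }

// Lt returns the rows whose value is less than v by OR-ing the bitmaps of
// every smaller value.
func (c ColumnIndex) Lt(v int) uint64 {
	var rows uint64
	for val, bm := range c {
		if val < v {
			rows |= bm
		}
	}
	return rows
}

func main() {
	// Rows: 0 -> (A=1, B=3), 1 -> (A=1, B=7), 2 -> (A=2, B=4)
	a := ColumnIndex{1: 0x3, 2: 0x4}
	b := ColumnIndex{3: 0x1, 7: 0x2, 4: 0x4}

	// A = 1 AND B < 5: each column uses its own operation, then we intersect.
	rows := a.Eq(1) & b.Lt(5)
	fmt.Printf("%03b\n", rows) // 001, only row 0 matches
}
```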

erizocosmico added the bug Something isn't working label Jul 4, 2018

erizocosmico closed this as completed Jul 4, 2018

ajnavarro reopened this Jul 4, 2018

mcarmonaa mentioned this issue Jul 10, 2018

Indexes don't handle negation correctly #262

Closed

erizocosmico assigned erizocosmico and unassigned erizocosmico Jul 18, 2018

kuba-- self-assigned this Nov 8, 2018

kuba-- added the wip work in progress label Nov 8, 2018

kuba-- mentioned this issue Nov 13, 2018

Fix 261/multiindex #547

Closed

kuba-- added blocked Some other issue is blocking this and removed bug Something isn't working labels Dec 14, 2018

kuba-- removed their assignment May 20, 2019
