feat: make `BlockSegmentPostings::open` and `TermInfoStore` public #2520

b41sh · 2024-10-16T14:29:01Z

Databend uses tantivy to implement its inverted index. When the tantivy Searcher starts up, it needs to read all of the index file data, which is very large, and this results in poor query performance. To improve performance, we implemented our own Searcher to find the matching docs. It reads only the FST data, the TermInfo data, and the data related to the matched terms, which reduces the amount of data to be read. To do this, we need to use BlockSegmentPostings and TermInfoStore, as well as some fields from the Query.

src/query/boost_query.rs

src/query/const_score_query.rs

src/query/phrase_query/phrase_query.rs

src/query/phrase_prefix_query/phrase_prefix_query.rs

fulmicoton · 2024-10-21T09:56:26Z

> Databend uses tantivy to implement its inverted index. When the tantivy Searcher starts up, it needs to read all of the index file data, which is very large, and this results in poor query performance. To improve performance, we implemented our own Searcher to find the matching docs. It reads only the FST data, the TermInfo data, and the data related to the matched terms, which reduces the amount of data to be read. To do this, we need to use BlockSegmentPostings and TermInfoStore, as well as some fields from the Query.

I don't understand this paragraph. Can you explain in greater length what was your problem and why this helps?

fulmicoton

See comments.

(probably the best would be to explain the motivation better - in the PR comments)

b41sh · 2024-10-21T12:47:33Z

Thanks for your review.
tantivy's Searcher is not a good fit for Databend, mainly because the Postings and Positions files are very large and take a long time to read from S3 storage, which results in poor performance. So we've implemented our own custom Searcher to solve this problem. It needs to use some fields of the Query, but these fields are private. The purpose of this PR is to add some methods that make these fields easier to use.

fulmicoton · 2024-10-23T03:35:42Z

@b41sh I still don't understand. I'm sorry. This won't get merged without a proper justification. Can you add a link to the code maybe?

b41sh · 2024-10-23T07:49:16Z

@fulmicoton I'm sorry I didn't explain it clearly. Maybe you can look at the code of our Searcher implementation here. We do this mainly for performance reasons, and maybe you can give us some advice.

fulmicoton · 2024-10-23T08:41:30Z

    // If the term does not match, only the `fst` file needs to be read.
    // If the term matches, the `term_dict` and `postings`, `positions`
    // data of the related terms need to be read instead of all
    // the `postings` and `positions` data.

@b41sh This is already the case with stock tantivy, isn't it?

tantivy might create FileSlice objects that cover more, but a FileSlice is just a view. The actual byte requests are just as small as what you do today.

You could just focus on the Directory abstraction.
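To illustrate the "just a view" point, here is a minimal, self-contained sketch. It does not use tantivy at all: `CountingStore` and `SliceView` are hypothetical stand-ins invented for this example, and `SliceView` only models the idea of a range over a shared backing store, not the real `FileSlice` API. Creating or narrowing a view touches no bytes; only `read` does.

```rust
use std::ops::Range;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical backing store that counts how many bytes were actually fetched.
struct CountingStore {
    data: Vec<u8>,
    bytes_fetched: AtomicUsize,
}

impl CountingStore {
    fn new(data: Vec<u8>) -> Self {
        CountingStore { data, bytes_fetched: AtomicUsize::new(0) }
    }

    /// The only place where "I/O" happens in this sketch.
    fn fetch(&self, range: Range<usize>) -> Vec<u8> {
        self.bytes_fetched.fetch_add(range.len(), Ordering::Relaxed);
        self.data[range].to_vec()
    }
}

/// Hypothetical view: a byte range over a shared store. It holds no data.
#[derive(Clone)]
struct SliceView {
    store: Arc<CountingStore>,
    range: Range<usize>,
}

impl SliceView {
    /// A view covering the whole store. No bytes are fetched.
    fn whole(store: Arc<CountingStore>) -> Self {
        let len = store.data.len();
        SliceView { store, range: 0..len }
    }

    /// Narrow the view to `sub` (relative to this view). No bytes are fetched.
    fn slice(&self, sub: Range<usize>) -> SliceView {
        assert!(sub.start <= sub.end && sub.end <= self.range.len());
        let start = self.range.start + sub.start;
        let end = self.range.start + sub.end;
        SliceView { store: Arc::clone(&self.store), range: start..end }
    }

    /// Fetch exactly the bytes this view covers.
    fn read(&self) -> Vec<u8> {
        self.store.fetch(self.range.clone())
    }
}
```

In this model, a view over a whole "postings file" costs nothing until a narrowed sub-view for one term is read, and the counter then grows only by the size of that sub-range.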

b41sh · 2024-10-23T16:07:44Z

@fulmicoton Thank you for your advice, but there is still one thing I don't understand. Please give me some advice.
In the function SegmentReader::inverted_index, the termdict_file, postings_file and positions_file are all opened and read from the directory. For performance purposes, we could return empty FileSlices instead of reading the real data. However, if some terms are matched, we need to read the data related to those terms from the postings_file and positions_file, using term_info.postings_range and term_info.positions_range. But at this point the InvertedIndexReader has already been initialized, and the empty postings_file and positions_file cannot be changed. How can I reload the postings_file and positions_file with only the needed slice range after terms are matched?

b41sh · 2024-10-24T11:17:36Z

I'll close this PR for now. I am still not familiar with tantivy, and I will keep looking for an appropriate way to solve these problems. Thank you for your advice. @fulmicoton

fulmicoton · 2024-10-24T11:40:18Z

You could implement your own Directory, that emits its own implementation of file handle.

Your FileHandle would not hold any data at all. Instead it would just be holding the address of your file
(whatever makes sense in your case... a filepath, a uri, etc.). The point is that no IO needs to be done.

You then only need to implement .read_bytes(..) (ignore .read_bytes_async it is only used in quickwit).
https://github.com/quickwit-oss/tantivy/blob/main/common/src/file_slice.rs#L25

This one will always be called on a range that is as tight as possible.
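A rough sketch of the shape this could take is below. It is deliberately written against the standard library only, not against tantivy: `RangeRead`, `FakeRemote`, `LazyHandle` and `LazyDirectory` are all hypothetical names, and `RangeRead::read_bytes` merely stands in for the `.read_bytes(..)` method mentioned above. The real traits have further requirements (for example, a handle also has to report the file length) that are left out here.

```rust
use std::collections::HashMap;
use std::io;
use std::ops::Range;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the one method that has to be implemented.
trait RangeRead {
    fn read_bytes(&self, range: Range<usize>) -> io::Result<Vec<u8>>;
}

/// Simulated object storage that records every range request it receives.
struct FakeRemote {
    objects: HashMap<String, Vec<u8>>,
    requests: Mutex<Vec<(String, Range<usize>)>>,
}

impl FakeRemote {
    fn get_range(&self, key: &str, range: Range<usize>) -> io::Result<Vec<u8>> {
        // Log the request so callers can check how tight the ranges are.
        self.requests.lock().unwrap().push((key.to_string(), range.clone()));
        let object = self
            .objects
            .get(key)
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, key.to_string()))?;
        object
            .get(range)
            .map(|bytes| bytes.to_vec())
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range out of bounds"))
    }
}

/// Holds only the address of the file (here: an object key), never its data.
struct LazyHandle {
    remote: Arc<FakeRemote>,
    key: String,
}

impl RangeRead for LazyHandle {
    fn read_bytes(&self, range: Range<usize>) -> io::Result<Vec<u8>> {
        // I/O happens only here, and only for the requested range.
        self.remote.get_range(&self.key, range)
    }
}

/// Hypothetical directory that hands out lazy handles.
struct LazyDirectory {
    remote: Arc<FakeRemote>,
}

impl LazyDirectory {
    /// Opening a file performs no I/O; it only remembers where the file lives.
    fn open_read(&self, key: &str) -> LazyHandle {
        LazyHandle { remote: Arc::clone(&self.remote), key: key.to_string() }
    }
}
```

With this layout, opening the postings or positions file is free, and the request log of `FakeRemote` shows exactly which byte ranges were asked for afterwards.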

b41sh · 2024-10-25T03:27:29Z

Thanks, I will try.

b41sh added 3 commits September 17, 2024 03:20

make some function pub

37aeac0

add some functions

7502370

fix

cb54ca9