Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: make BlockSegmentPostings::open and TermInfoStore public #2520

Closed
wants to merge 5 commits into from

Conversation

b41sh
Copy link
Contributor

@b41sh b41sh commented Oct 16, 2024

Databend uses tantivy to implement inverted index, and tantivy Searcher needs to read all the index file data when starting up, which is very large, resulting in poor query performance. To improve performance, we implemented the Searcher ourselves to query the matched docs, and improve performance by reading only the FST data, TermInfo data, and the matched terms related data to reduce the size of data to be read. We need to use BlockSegmentPostings and TermInfoStore, as well as some fields from the Query.

@fulmicoton
Copy link
Collaborator

Databend uses tantivy to implement inverted index, and tantivy Searcher needs to read all the index file data when starting up, which is very large, resulting in poor query performance. To improve performance, we implemented the Searcher ourselves to query the matched docs, and improve performance by reading only the FST data, TermInfo data, and the matched terms related data to reduce the size of data to be read. We need to use BlockSegmentPostings and TermInfoStore, as well as some fields from the Query.

I don't understand this paragraph. Can you explain in greater length what was your problem and why this helps?

Copy link
Collaborator

@fulmicoton fulmicoton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments.

(probably the best would be to explain the motivation better - in the PR comments)

@b41sh
Copy link
Contributor Author

b41sh commented Oct 21, 2024

See comments.

(probably the best would be to explain the motivation better - in the PR comments)

Thanks to your reivew.
tantivy's Searcher is not suitable for the use of databend, mainly because the size of Postings and Position file is too big, which need a lot of time to read from S3 storage, resulting the poor performance. So we've implemented our own custom Searcher to solve this problem, and we need to uses some fields in the Query, but these fields are private. The purpose of this PR is to add some methods to make it easier to use these fields.

@b41sh b41sh requested a review from fulmicoton October 23, 2024 03:13
@fulmicoton
Copy link
Collaborator

@b41sh i still don't understand. I'm sorry. This won't get merged if there is a proper justification. Can you add a link to the code maybe?

@b41sh
Copy link
Contributor Author

b41sh commented Oct 23, 2024

@b41sh i still don't understand. I'm sorry. This won't get merged if there is a proper justification. Can you add a link to the code maybe?

@fulmicoton I'm sorry I didn't explain it clearly, maybe you can look at our implementation Searcher code here. We do this mainly for performance reasons, and maybe you can give us some advice.

@fulmicoton
Copy link
Collaborator

fulmicoton commented Oct 23, 2024

    // If the term does not match, only the `fst` file needs to be read.
    // If the term matches, the `term_dict` and `postings`, `positions`
    // data of the related terms need to be read instead of all
    // the `postings` and `positions` data.

@b41sh This is already the case with stock tantivy, isn't it?

tantivy might create FileSlice object that cover more, but FileSlice is just a view. The actual bytes request are just as small as what you do today.

You could just focus on the Directory abstraction.

@b41sh
Copy link
Contributor Author

b41sh commented Oct 23, 2024

    // If the term does not match, only the `fst` file needs to be read.
    // If the term matches, the `term_dict` and `postings`, `positions`
    // data of the related terms need to be read instead of all
    // the `postings` and `positions` data.

@b41sh This is already the case with stock tantivy, isn't it?

tantivy might create FileSlice object that cover more, but FileSlice is just a view. The actual bytes request are just as small as what you do today.

You could just focus on the Directory abstraction.

@fulmicoton Thank you for your advice, but I still have a problem I don't understand, please give me some advise.
In the function SegmentReader::inverted_index, the termdict_file, postings_file and positions_file are all open readed from the directory. For performance purposes, we can return some empty FileSlice instead of reading the real data. However, if some terms are matched, we need to read the term related data from the postings_file and positions_file by the term_info.postings_range and term_info.positions_range. But, at this point the InvertedIndexReader has been initialized and the empty postings_file and positions_file cannot be changed. How can I reload the postings_file and positions_file with only the needed slice range after terms are matched?

@b41sh
Copy link
Contributor Author

b41sh commented Oct 24, 2024

I close this PR first. I am still not familiar with tantivy. I will continue to find appropriate ways to solve these problems. Thank you for your advice. @fulmicoton

@b41sh b41sh closed this Oct 24, 2024
@fulmicoton
Copy link
Collaborator

You could implement your own Directory, that emits its own implementation of file handle.

Your FileHandle would not hold any data at all. Instead it would just be holding the address of your file
(whatever makes sense in your case... a filepath, a uri, etc.). The point is that no IO needs to be done.

You then only need to implement .read_bytes(..) (ignore .read_bytes_async it is only used in quickwit).
https://github.com/quickwit-oss/tantivy/blob/main/common/src/file_slice.rs#L25

This one will always be called on a range that is as tight as possible.

@b41sh
Copy link
Contributor Author

b41sh commented Oct 25, 2024

You could implement your own Directory, that emits its own implementation of file handle.

Your FileHandle would not hold any data at all. Instead it would just be holding the address of your file (whatever makes sense in your case... a filepath, a uri, etc.). The point is that no IO needs to be done.

You then only need to implement .read_bytes(..) (ignore .read_bytes_async it is only used in quickwit). https://github.com/quickwit-oss/tantivy/blob/main/common/src/file_slice.rs#L25

This one will always be called on a range that is as tight as possible.

Thanks, I will try.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants