Backlog for "Features" section #101

Open
2 of 6 tasks
amotl opened this issue Jul 20, 2024 · 4 comments

amotl commented Jul 20, 2024

About

Coming from GH-53, there are a few backlog items, and there will be more.

Details


amotl commented Jul 22, 2024

Discussion about Indexes

Hi guys, I'd like to better understand CrateDB indexes. I have a few questions; thanks in advance.

[1] https://cratedb.com/blog/indexing-and-storage-in-cratedb
[2] https://rockset.com/blog/converged-indexing-the-secret-sauce-behind-rocksets-fast-queries/

Questions

  1. In [1] we say that CrateDB uses "Inverted Indexes for text values, BKD-Trees for numeric values, and Doc Values." Is this still accurate in 2024, or do we implement any other data structures?
  2. If I understood [1] correctly: given a column with the default index (plain), do we build both an inverted index and a columnar store (doc values) for text values, for example, or are doc values reserved for things like objects and arrays? Put another way, do we use only one index data structure per data type/column, or do we build more than one and let the optimizer choose between them depending on the user's query?
  3. [1] says "...new documents are added to the existing index, they are added to the next segment ...the system may decide to merge some segments ...adding a new document does not require rebuilding the index structure". Is this the reason why our index-all strategy does not affect insert performance as much as it does in other databases, where you manually build indexes on several columns? Or is there another strategy in place to mitigate the overhead of indexing every column?
  4. In [2] they define a converged index as row (LSM trees) + columnar + search (posting lists). In your opinion, how fair would it be to say that we also 'implement' a converged index?
  5. [1] says "...CrateDB implements Column Store based on Doc Values in Lucene". Does this mean that we just use Lucene's DocValues, or did we write our own implementation based on or inspired by it?

Answers

  1. This is still accurate.
  2. Everything gets doc values; numeric and geo types also get a BKD index; text and index fields also get postings lists. And yes, which index structure is used depends on the query.
  3. Indexing is generally fast because it uses Lucene, which is optimized for fast writes. The segment-based structure means that all index files in a segment are written just once (with the exception of deletes), and segment merging is fast (mostly just concatenation) and done asynchronously. This is an old but still useful way of thinking about how the index is written.
  4. We don't use LSM trees; we use BKD trees, which are, I guess, sort of similar, so you could argue that we have something that at least 'looks like' a converged index.
  5. We use the Lucene implementations, with one minor change: text field doc values use a best-speed rather than a best-compression algorithm.
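
To make answers 2 and 3 a bit more tangible, here is a minimal toy model in Python. It is not CrateDB or Lucene code: `INDEX_STRUCTURES`, `Segment`, and `ToyIndex` are hypothetical names invented for illustration, and the type categories are simplified. It only sketches the two ideas stated above: each type family gets more than one index structure, and new documents land in a fresh write-once segment that can later be merged by concatenation.

```python
# Toy model only: NOT CrateDB or Lucene code. All names are hypothetical.
from dataclasses import dataclass

# Answer 2 restated as a lookup table (type categories are simplified).
INDEX_STRUCTURES = {
    "numeric": {"doc_values", "bkd_tree"},
    "geo": {"doc_values", "bkd_tree"},
    "text": {"doc_values", "postings"},
}


@dataclass(frozen=True)
class Segment:
    """A batch of documents that is written once and never modified."""
    docs: tuple


class ToyIndex:
    def __init__(self):
        self.segments = []

    def flush(self, docs):
        """Write new documents as a fresh segment; older segments stay untouched."""
        self.segments.append(Segment(tuple(docs)))

    def merge(self):
        """Combine all segments into one, here by plain concatenation."""
        merged = tuple(doc for seg in self.segments for doc in seg.docs)
        self.segments = [Segment(merged)]


index = ToyIndex()
index.flush([{"id": 1}, {"id": 2}])
index.flush([{"id": 3}])  # adding documents does not rewrite the first segment
segment_count_before_merge = len(index.segments)
index.merge()             # in a real system this would run asynchronously
```

The point of the sketch is that `flush` never touches existing segments, which is why adding documents does not require rebuilding the index structure; deletes and the asynchronous scheduling of merges are left out on purpose.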

Thoughts

  • Thanks for that Q&A, @surister and @romseygeek.
  • Slot into the "feature/storage" section, in one way or another.

References


amotl commented Jul 22, 2024

At crate-clients-tools, specifically CI run #9710088632.

WARNING: undefined label: 'guide:metrics'


amotl commented Jul 30, 2024

@surister suggested at #106 (comment):

At the beginning of the page about Hybrid Search, you talk about how vectors alone are not enough, hence the need to mix them with BM25. This is very well put in the description of https://haystackconf.com/us2023/talk-16/, maybe it can serve as inspiration?

Thanks!
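
For readers new to the "vectors plus BM25" idea, here is a minimal sketch of one common way to merge two ranked result lists, reciprocal rank fusion (RRF). This is an assumption chosen for illustration: the thread does not say which fusion method the Hybrid Search page describes, and the function name, the document ids, and the constant `k=60` are not taken from it.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) per list it appears in,
    with rank starting at 1; scores are summed across lists.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists: one from keyword (BM25) search, one from vector search.
bm25_ranking = ["a", "b", "c"]
vector_ranking = ["c", "a", "d"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)  # ['a', 'c', 'b', 'd']
```

Documents that rank well in both lists ("a" and "c") end up ahead of documents that only one retriever found, which is the intuition behind combining the two approaches.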


amotl commented Aug 1, 2024

My personal immediate favourite backlog items for the All Features at a Glance page would be:

  • Improve "Highlights" section: Add info cards about Hybrid Index and Hybrid Search. Add performance details, like the recent blog post by Henrik about it.
  • Think about renaming »Document Store« => »Document / JSON«.
  • Think about renaming »Relational / JOINs« => »Distributed Joins«.
