Backlog for "Features" section #101

amotl · 2024-07-20T21:33:14Z

About

Coming from GH-53, there are a few backlog items, and there will be more.

Details

Features: Improve feature pages, which are still a bit thin.
- Origin: Evolve "Features" and "Application Domains" sections #53 (review)
Feature / Search: Improve layout
- Origin: Evolve "Features" and "Application Domains" sections #53 (comment) by @surister
Feature / Vector: Add example using euclidean distance function
- Origin: Evolve "Features" and "Application Domains" sections #53 (comment) by @amotl
Feature / Geo: What about querying with "donut" shapes?
- Origin: Evolve "Features" and "Application Domains" sections #53 (review) by @seut
Feature / Search: Hybrid Search
- Origin: Evolve "Features" and "Application Domains" sections #53 (review)
- Worklog: Add article https://cratedb.com/blog/hybrid-search-in-cratedb by @surister
- Done: Fixed with Feature/Search: Reorganize section, and add content about hybrid search #106
Feature / Index: Hybrid Index
- Origin: https://cratedb.com/docs/guide/feature/index/
- Worklog: Use singular "Hybrid Index" instead of the plural form by @geragray
- Done: Fixed with c4fba3a.
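
For the "Feature / Vector" item above, a minimal sketch of what a Euclidean distance example could compute. This is plain Python for illustration only, not CrateDB's own function; the names `euclidean_distance`, `nearest`, and the sample vectors are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Return the Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, candidates, k=2):
    """Return the names of the k candidates closest to the query vector."""
    ranked = sorted(
        candidates,
        key=lambda name: euclidean_distance(query, candidates[name]),
    )
    return ranked[:k]


# Hypothetical sample data: three 2-dimensional vectors.
vectors = {
    "a": [0.0, 0.0],
    "b": [3.0, 4.0],
    "c": [1.0, 1.0],
}

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(nearest([0.0, 0.0], vectors))  # ['a', 'c']
```

A smaller distance means a closer match, so results are sorted in ascending order.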


amotl · 2024-07-22T18:22:57Z

Discussion about Indexes

Hi guys, I'd like to better understand CrateDB indexes, and I have a few questions. Thanks in advance.

[1] https://cratedb.com/blog/indexing-and-storage-in-cratedb
[2] https://rockset.com/blog/converged-indexing-the-secret-sauce-behind-rocksets-fast-queries/

Questions

1. In [1] we say that "Inverted Indexes for text values, BKD-Trees for numeric values, and Doc Values." Is this still accurate in 2024, or do we implement any other data structures?
2. If I understood [1] correctly: given a column with the default index (plain), do we build both an inverted index and a columnar store (doc values) for text values, for example, or are doc values reserved for things like objects/arrays? Put another way: do we use only one index data structure per data type/column, or do we apply more than one and then let the optimizer choose which one to query depending on the user's query?
3. In [1]: "...new documents are added to the existing index, they are added to the next segment ...the system may decide to merge some segments ...adding a new document does not require rebuilding the index structure". Is this the reason why our index-all strategy does not affect insert performance as much as in other databases, where you manually build indexes on several columns? Or is there any other strategy in place to mitigate the overhead of indexing every column?
4. In [2] they define a converged index as: row (LSM trees) + columnar + search (posting lists). In your opinion, how fair would it be to say that we also 'implement' a converged index?
5. In [1]: "...CrateDB implements Column Store based on Doc Values in Lucene". Does this mean that we just use Lucene's DocValues, or did we write our own implementation based on or inspired by it?

Answers

1. This is still accurate.
2. Everything gets doc values; numeric and geo types also get a BKD index; text and index fields also get postings lists. And yes, which index structure is used depends on the query.
3. Indexing is fast in general because it uses Lucene, which is optimized for fast writes. The segment-based structure means that all index files in a segment are written just once (with the exception of deletes), and segment merging is fast (mostly just concatenation) and done asynchronously. This is an old but still useful way of thinking about how the index is written.
4. We don't use LSM trees; we use BKD trees, which are, I guess, sort of similar, so you could argue that we have something which at least 'looks like' a converged index.
5. We use the Lucene implementations, with one minor change: text field doc values use a best-speed rather than a best-compression algorithm.
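
The write-once segments and merge-by-concatenation idea from the answers can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not how Lucene or CrateDB are implemented: the classes `Segment` and `ToyIndex` are hypothetical, deletes are ignored, and merging runs synchronously here, whereas the answer above says it happens asynchronously.

```python
class Segment:
    """An immutable batch of documents: written once, never modified."""

    def __init__(self, docs):
        self.docs = tuple(docs)


class ToyIndex:
    """Toy append-only index: new documents never touch existing segments."""

    def __init__(self, buffer_size=2):
        self.buffer_size = buffer_size
        self.buffer = []    # documents not yet written to a segment
        self.segments = []  # write-once segments

    def add(self, doc):
        """Buffer a document; flush to a new segment when the buffer is full."""
        self.buffer.append(doc)
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        """Write buffered documents as a new segment."""
        if self.buffer:
            self.segments.append(Segment(self.buffer))
            self.buffer = []

    def merge(self):
        """Merge all segments into one by concatenating their documents."""
        merged = [doc for seg in self.segments for doc in seg.docs]
        self.segments = [Segment(merged)] if merged else []

    def search(self, term):
        """Scan every segment for documents containing the term."""
        return [
            doc
            for seg in self.segments
            for doc in seg.docs
            if term in doc.split()
        ]


index = ToyIndex(buffer_size=2)
for doc in ["red apple", "green pear", "red plum", "blue berry"]:
    index.add(doc)

print(len(index.segments))  # 2
index.merge()
print(len(index.segments))  # 1
print(index.search("red"))  # ['red apple', 'red plum']
```

Adding a document only ever appends to the buffer or creates a new segment, which is the property the quoted passage in [1] describes: no existing index structure is rebuilt on insert.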

Thoughts

Thanks for that Q&A, @surister and @romseygeek.
Slot into the "feature/storage" section, in one way or another.

References

Evolve "Features" and "Application Domains" sections #53 (comment)

amotl · 2024-07-22T20:11:47Z

At crate-clients-tools, specifically CI run #9710088632.

WARNING: undefined label: 'guide:metrics'

amotl · 2024-07-30T17:29:23Z

@surister suggested at #106 (comment):

At the beginning of the page about Hybrid Search, you talk about how vectors alone are not enough, hence we need to mix them with BM25. This is very well written in the description of https://haystackconf.com/us2023/talk-16/; maybe it can serve as an inspiration?

Thanks!
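
To make the idea of mixing vector results with BM25 results concrete, here is a minimal sketch of one common fusion technique, reciprocal rank fusion (RRF). Whether the Hybrid Search page uses RRF is an assumption, not something stated in this thread; the function name, the constant `k=60`, and the document IDs are illustrative only.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) per list it appears in, with
    ranks starting at 1. Ties are broken by document ID.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists: one from a BM25 full-text query,
# one from a vector similarity query.
bm25_ranking = ["doc1", "doc2", "doc3"]
vector_ranking = ["doc3", "doc1", "doc4"]

print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
# ['doc1', 'doc3', 'doc2', 'doc4']
```

Documents that rank well in both lists (`doc1`, `doc3`) end up ahead of documents found by only one of the two searches.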

amotl · 2024-08-01T13:52:40Z

My personal immediate favourite backlog items for the All Features at a Glance page would be:

Improve "Highlights" section: Add info cards about Hybrid Index and Hybrid Search. Add performance details, like the recent blog post by Henrik about it.
Think about renaming »Document Store« => »Document / JSON«.
Think about renaming »Relational / JOINs« => »Distributed Joins«.

amotl mentioned this issue Jul 20, 2024

Evolve "Features" and "Application Domains" sections #53

Merged

amotl mentioned this issue Jul 30, 2024

Feature/Search: Reorganize section, and add content about hybrid search #106

Merged
