
Slow writer performance with the current default heap size #118

Closed
cjrh opened this issue Sep 9, 2023 · 3 comments
cjrh commented Sep 9, 2023

@adamreichold Circling back to this discussion.

While upgrading another application to use the current head of tantivy-py, I am finding that the default heap limit of 3000000 seems to cause very frequent commits while adding documents; it just doesn't seem large enough. I can improve performance by increasing the heap size, but I'm thinking the current default is going to cause surprisingly poor performance for a lot of people once they upgrade.

What are your thoughts on this? Is there a more typical "good" value to use as a default? I am not familiar with the tantivy work between 0.19.2 and 0.20.1 that led to this apparent change in behaviour.

cjrh commented Sep 9, 2023

The tantivy docs for the writer settings don't describe the consequences of setting the heap larger or smaller. I'd be happy to make improvements to those docs once I understand those consequences myself ;)

cjrh commented Sep 9, 2023

Based on reading some threads on Discord, is this the same setting as Quickwit's, which currently defaults to 2 GB? https://quickwit.io/docs/configuration/index-config#indexer-memory-usage

@adamreichold
Collaborator

Please have a look at the thread over at quickwit-oss/tantivy#2156 (comment)

The main point is that the memory accounting got more accurate, meaning the indexer used to use more memory than configured via the buffer limit. Now it stays much closer to that limit, but this also means that the same nominal limit implies less buffering and more frequent commits, which is what you are experiencing.
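As a rough back-of-the-envelope illustration (hypothetical numbers, under the simplifying assumption that the writer flushes a segment each time its memory arena fills):

```python
def approx_flushes(total_indexed_bytes: int, heap_bytes: int) -> int:
    """Crude estimate: one segment flush each time the arena fills."""
    return -(-total_indexed_bytes // heap_bytes)  # ceiling division

old_default = 3_000_000    # the tantivy-py default discussed here (3 MB)
larger = 128_000_000       # a larger arena, e.g. 128 MB

workload = 1_000_000_000   # 1 GB of buffered index data (hypothetical)
print(approx_flushes(workload, old_default))  # → 334
print(approx_flushes(workload, larger))       # → 8
```

This ignores per-document overhead and multi-threaded arena splitting, but it shows why a tiny arena that used to be "effectively larger" under the old, looser accounting now produces many more flushes.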

I think the main thing here is that the Rust bindings force one to make a choice via the mandatory memory_arena_num_bytes parameter, whereas the Python bindings supply what is basically a minimum value as the default. So indeed I think it would make sense to increase this significantly to a reasonable default like 128 MB or even 1 GB. In addition, we should document that an actually helpful value needs to be measured, as it depends on the schema and the data.

(Additionally, I think the actual memory consumption has somewhat increased due to the new columnar fast field storage. But whether this really affects a given use case also depends on the schema and data in question.)
