Slow indexing #2156
Comments
Are you sure that there is no alternative explanation, like debug versus release builds? Personally, I cannot reproduce this, as indexing about 25,000 documents yields
> RAYON_NUM_THREADS=1 DATA_PATH=data hyperfine ./indexer-0.20.2 ./indexer-master
Benchmark 1: ./indexer-0.20.2
Time (mean ± σ): 482.3 ms ± 8.4 ms [User: 1952.9 ms, System: 552.3 ms]
Range (min … max): 470.8 ms … 494.1 ms 10 runs
Benchmark 2: ./indexer-master
Time (mean ± σ): 482.0 ms ± 5.6 ms [User: 1945.3 ms, System: 555.2 ms]
Range (min … max): 475.6 ms … 493.1 ms 10 runs
Summary
./indexer-master ran
1.00 ± 0.02 times faster than ./indexer-0.20.2
where the only difference is whether the Tantivy dependency is |
I don't think so. The only change I made is in the Cargo.toml file:
When I trace the time consumption through the code, the difference is in this command:
|
Is it on the commit or on acquiring the lock? |
@iinov Can you provide something to reproduce? |
Here is a little benchmark:
|
The flamegraph looks regular. Do you have a full example to reproduce? |
Here is the application: CioTantivy.tar.gz You can run the server like this:
And then, in another console, create an index:
Thank you. |
An example to reproduce should include code that creates the index and the data. |
You already have the code. Some data: Sandbox.tar.gz
|
Maybe this information can help you. I looked in my log and saw that I had gone from the main branch to 20.2, to work around the problem, on August 21. Another thing: you can focus on line 160 of src/engine/index_pass1.rs to see the slowdown. Thank you for your help. |
Can you provide a minimal example that demonstrates the slowdown?
This is running in debug mode without |
Sorry for the dependency. Here is a version without RXml: CioTantivyLite.tar.gz I have the same problem with 0.21.0 as with the main branch: very slow indexing compared to version 0.20.2. |
Still not compiling
This is running in debug mode without --release. Do you benchmark without |
Maybe to expand on why this is important: Even if you changed nothing but the Tantivy version, having had reasonably fast results with debug builds before does not mean anything. Completely unrelated code changes can produce slower debug builds, for example because some abstraction that is completely compiled away in release builds is newly used and massively slows down debug builds. Long story short, it does not make sense to measure the performance of debug builds, not even relative to each other. You must use release builds before and after the relevant change. |
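To make the point above concrete, here is a minimal sketch of a timing helper that labels every measurement with the build profile. It uses only the standard library. The names `is_debug_build`, `time_it` and `report` are hypothetical, and `cfg!(debug_assertions)` is only a proxy for "built without `--release`" (it is on by default in the dev profile and off in the release profile, but a custom profile can change that).

```rust
use std::time::{Duration, Instant};

/// Proxy for "this is a debug build": `debug_assertions` is enabled by
/// default in Cargo's dev profile and disabled in the release profile.
pub fn is_debug_build() -> bool {
    cfg!(debug_assertions)
}

/// Runs `work` once and returns its result together with the elapsed
/// wall-clock time.
pub fn time_it<T>(work: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = work();
    (out, start.elapsed())
}

/// Formats a measurement and states which profile produced it, so that
/// debug-build timings are not mistaken for comparable numbers.
pub fn report(label: &str, elapsed: Duration) -> String {
    let profile = if is_debug_build() {
        "debug build, not meaningful for comparison"
    } else {
        "release build"
    };
    format!("{}: {:.4}s [{}]", label, elapsed.as_secs_f64(), profile)
}
```

Wrapping the indexing call as `let (_, elapsed) = time_it(|| /* indexing work */ ());` and printing `report("0.20.2", elapsed)` for each Tantivy version would show at a glance whether the two timings came from release builds.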
Maybe this can help : https://crates.io/crates/graphicsmagick I tried in release mode. I have the same kind of difference: 0.05846s (0.20.2) vs 4.8779s (main branch). To reproduce the slowdown, the first thing to do is to launch the server:
Then, in another console:
Where Then, you do the same operations but change I think the problem is on line 160 of Thanks to both of you. |
I think your indexing buffers became too small for Tantivy 0.21: I replaced your tracing setup by
fn tracing_initialize(settings: &Settings) -> BoxDynResult<Option<WorkerGuard>> {
    use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};
    tracing_subscriber::registry()
        .with(tracing_subscriber::EnvFilter::from_default_env())
        .with(tracing_subscriber::fmt::layer())
        .init();
    Ok(None)
}
which together with 50 MB as Whereas setting As a side remark, this has been made much more difficult for us because of the non-minimal reproducer. Especially as the problem does not depend on the context of running inside your server application, i.e. extracting some code doing the actual indexing using the same parameters would have made it easier to diagnose this. |
I have just tried to set the buffer to 250 MB, even to 200 MB, and it works as with the 20.2 release. Many thanks for your efficiency... and sorry for the lack of simplicity in the provided application. |
Sorry, I still have a problem. Now, with 550 files, I have a small difference between 20.2 (0.07s) and the main branch (0.10s). But with 7500 files, I have a huge difference: 0.62s (20.2) vs 80.15s (main branch), even with a buffer of 1GB. |
Please provide a minimal reproducible example, that includes creating an index and some docs |
Here is a sample of 3645 files: Sandbox.tar.gz. In release mode, with a buffer of 1GB (you can change it in the configuration file
You already have a version of the application. If it is too complex for you to debug with this version, we can extract a simpler one. But it is a lot of work for us. We have another idea. We know approximately the date on which the problem appeared. We could try to find the exact release which produces the issue. What do you think about this? |
Bisecting it down to a specific commit might be helpful, yes. But the whole application is definitely not a minimal reproducer. |
Here is a very simple application showing the issue: CioDebug.tar.gz The problem seems to be the amount of memory needed by the new version of Tantivy. With 0.20.2, the standard amount of 50MB is sufficient to quickly index 5,000 documents. With the new version, even with double the buffer, indexing is very slow.
Of course, if you increase the size of the buffer, you can obtain almost the same performance as with the version 0.20.2. But, it's a never-ending race. I hope this small application can help you to find a solution... |
I think for the above test application, the problem is that for 100 MB buffer size, it will commit after each document, e.g. 2023-09-09T12:47:05.015812Z INFO tantivy::indexer::index_writer: Buffer limit reached, flushing segment with maxdoc=1. but even increasing this to just 150 MB means only a single commit is necessary, so this seems to be based on the fixed minimum size due to the given schema, not the number of documents. I suspect that these small defaults are just not reasonable any more (maybe due to the new columnar storage format) and significantly larger buffers are just necessary with Tantivy 0.20.x. Also note that the old memory limits were rather inaccurate due to the memory accounting fixes linked above, i.e. the indexer actually consumed more memory than configured. So increasing the buffer size now does not really increase the memory consumption of your service, it just makes it more explicit/controllable. |
With the fixed memory tracking of fast field buffers, the baseline memory consumption per thread is 13MB. Setting it to 150MB fixes the issue, so your setting in I'll prepare a commit, that enforces the minimum memory per thread to be at least 15MB. |
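The arithmetic behind this can be sketched as follows. This is a minimal illustration, not Tantivy code: it assumes the overall budget is divided evenly across indexing threads, and the thread count of 8 used in the usage note is hypothetical, since the thread does not state how many threads were in use. Only the 13MB baseline and the 15MB minimum are taken from the comment above.

```rust
/// Decimal megabyte, used for readability (assumption: the thread's
/// "MB" figures are decimal).
pub const MB: usize = 1_000_000;

/// Baseline memory consumption per thread, as reported in the thread.
pub const BASELINE_PER_THREAD: usize = 13 * MB;

/// Proposed enforced minimum per thread, as reported in the thread.
pub const PROPOSED_MIN_PER_THREAD: usize = 15 * MB;

/// Assumption: the overall budget is split evenly across threads.
pub fn per_thread_budget(total: usize, num_threads: usize) -> usize {
    total / num_threads.max(1)
}

/// Memory left for documents after the baseline is paid, or `None` if
/// the per-thread share does not even cover the baseline.
pub fn headroom(total: usize, num_threads: usize) -> Option<usize> {
    per_thread_budget(total, num_threads).checked_sub(BASELINE_PER_THREAD)
}

/// Smallest overall budget that satisfies the proposed per-thread minimum.
pub fn min_total_budget(num_threads: usize) -> usize {
    PROPOSED_MIN_PER_THREAD * num_threads.max(1)
}
```

Under the hypothetical 8 threads, `headroom(100 * MB, 8)` is `None` (12.5MB per thread is below the 13MB baseline), while `headroom(150 * MB, 8)` leaves 5.75MB per thread. That would be consistent with the observation that 100MB flushes after every document while 150MB does not, but the actual thread count on the reporter's machine is not known from this thread.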
With tantivy 0.20 the minimum memory consumption per SegmentWriter increased to 12MB. 7MB are for the different fast field collectors types (they could be lazily created). Increase the minimum memory from 3MB to 15MB. Change memory variable naming from arena to budget. closes #2156
I understand. I have increased the minimum amount of memory for the indexing buffer from 50MB to 150MB. Thanks a lot. |
Newer versions of tantivy require more memory during the indexing phase. Otherwise the indexing phase will be a lot slower than in previous versions. See quickwit-oss/tantivy#2156 for details.
Describe the bug
In recent days, indexing has become very slow. If I use release 20.2, it takes 0.07s to index 550 files. If I use the main branch, it takes about 5s to index the same files!
Which version of tantivy are you using?
master vs 20.2
To Reproduce
Index a few files with the main branch vs. the latest stable release.