Slow indexing #2156

Closed
iinov opened this issue Aug 28, 2023 · 27 comments · Fixed by #2176

Comments

@iinov

iinov commented Aug 28, 2023

Describe the bug
In recent days, indexing has become very slow. If I use release 0.20.2, it takes 0.07s to index 550 files. If I use the main branch, it takes about 5s to index the same files!

Which version of tantivy are you using?
main branch vs 0.20.2

To Reproduce
Index a few files with the main branch vs. the latest stable release.

@adamreichold
Collaborator

adamreichold commented Aug 28, 2023

Are you sure that there is no alternative explanation, like debug versus release builds?

Personally, I cannot reproduce this, as indexing about 25,000 documents yields

> RAYON_NUM_THREADS=1 DATA_PATH=data hyperfine ./indexer-0.20.2 ./indexer-master 
Benchmark 1: ./indexer-0.20.2
  Time (mean ± σ):     482.3 ms ±   8.4 ms    [User: 1952.9 ms, System: 552.3 ms]
  Range (min … max):   470.8 ms … 494.1 ms    10 runs
 
Benchmark 2: ./indexer-master
  Time (mean ± σ):     482.0 ms ±   5.6 ms    [User: 1945.3 ms, System: 555.2 ms]
  Range (min … max):   475.6 ms … 493.1 ms    10 runs
 
Summary
  ./indexer-master ran
    1.00 ± 0.02 times faster than ./indexer-0.20.2

where the only difference is whether the Tantivy dependency is version = "0.20" versus git = "https://github.com/quickwit-oss/tantivy.git".

@iinov
Author

iinov commented Aug 28, 2023

I don't think so. The only change I made is in the Cargo.toml file:

tantivy = "0.20.2" -> tantivy = { git = "https://github.com/quickwit-oss/tantivy.git" }

When I trace where the time is spent in the code, the difference is in this call:

   // Commit
   let mut index_writer_wlock = self.index_writer.write().await;
   index_writer_wlock.commit()?;

@adamreichold
Collaborator

Is it on the commit or on acquiring the lock?

@PSeitz
Contributor

PSeitz commented Aug 28, 2023

@iinov Can you provide something to reproduce the issue?
Or run it with a flamegraph, so we can see where the time is spent.

@iinov
Author

iinov commented Aug 28, 2023

Here is a little benchmark:

   let now = Instant::now();
   let mut index_writer_wlock = self.index_writer.write().await;
   info!("locked: {}s", now.elapsed().as_secs_f32());
   index_writer_wlock.commit()?;
   info!("commited: {}s", now.elapsed().as_secs_f32());
  • With 0.20.2:

2023-08-28T12:16:14.021024Z INFO index_pass1{tid="demo"}: locked: 0.00000593s
2023-08-28T12:16:14.080967Z INFO index_pass1{tid="demo"}: commited: 0.05994658s

  • With the main branch:

2023-08-28T12:14:27.494617Z INFO index_pass1{tid="demo"}: locked: 0.00000541s
2023-08-28T12:14:33.189301Z INFO index_pass1{tid="demo"}: commited: 5.694685s

@iinov
Author

iinov commented Aug 28, 2023

Here is the flamegraph SVG.
(flamegraph image)

@PSeitz
Contributor

PSeitz commented Aug 28, 2023

The flamegraph looks normal.

Do you have a full example to reproduce?

@iinov
Author

iinov commented Aug 28, 2023

Here is the application: CioTantivy.tar.gz

You can run the server like this:

cargo run serve

And then, in another console, create an index:

cargo run index demo /directory/of/files --no-thumbs --force

Thank you.

@PSeitz
Contributor

PSeitz commented Aug 29, 2023

An example to reproduce should include the code that creates the index and the data.

@iinov
Author

iinov commented Aug 29, 2023

You already have the code.

Some data: Sandbox.tar.gz

cargo run index demo Sandbox --no-thumbs --force

@iinov
Author

iinov commented Aug 31, 2023

Maybe this information can help you.

I looked in my logs and saw that I had switched from the main branch back to 0.20.2 on August 21 to work around the problem. Another thing: you can focus on line 160 of src/engine/index_pass1.rs to see the slowdown.

Thank you for your help.

@PSeitz
Contributor

PSeitz commented Aug 31, 2023

Updating git repository `ssh://git@github.com/iinov/RXml.git`
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
error: failed to get `rxml` as a dependency of package `ciotantivy v0.10.0 (/home/pascal/LinuxData/Development/CioTantivy)`

Can you provide a minimal example that demonstrates the slowdown?

cargo run index demo /directory/of/files --no-thumbs --force

This is running in debug mode without --release.

@iinov
Author

iinov commented Sep 1, 2023

Sorry for the dependency.

Here is a version without RXml: CioTantivyLite.tar.gz

I have the same problem with 0.21.0 as with the main branch: very slow indexing compared to version 0.20.2.

@PSeitz
Contributor

PSeitz commented Sep 1, 2023

Still not compiling

error: failed to run custom build command for `graphicsmagick-sys v0.6.0`

Caused by:
  process didn't exit successfully: `/home/pascal/LinuxData/Development/CioTantivyLite/target/debug/build/graphicsmagick-sys-61c1ed4d1786cb09/build-script-build` (exit status: 1)
  --- stdout
  cargo:rerun-if-changed=wrapper.h
  cargo:rerun-if-env-changed=GRAPHICS_MAGICK_WAND_CONFIG

  --- stderr
  Error: Run command `GraphicsMagickWand-config` failed, please check the `GraphicsMagickWand-config` is executable, or specify the environment variable `GRAPHICS_MAGICK_WAND_CONFIG` correctly.

  Caused by:
      No such file or directory (os error 2)
warning: build failed, waiting for other jobs to finish...

cargo run index demo /directory/of/files --no-thumbs --force

This is running in debug mode without --release. Do you benchmark without --release?

@adamreichold
Collaborator

adamreichold commented Sep 1, 2023

This is running in debug mode without --release. Do you benchmark without --release?

Maybe to expand on why this is important: even if you changed nothing but the Tantivy version, having had reasonably fast results with debug builds before does not mean anything. Completely unrelated code changes can produce slower debug builds, for example because some abstraction that is completely compiled away in release builds is newly used and massively slows down debug builds.

Long story short, it does not make sense to measure the performance of debug builds, not even relative to each other. You must use release builds before and after the relevant change.

@iinov
Author

iinov commented Sep 1, 2023

Maybe this can help: https://crates.io/crates/graphicsmagick

I tried in release mode. I have the same kind of difference: 0.05846s (0.20.2) vs 4.8779s (main branch).

To reproduce the slowdown, the first thing to do is to launch the server:

cargo run --release serve

Then, in another console:

cargo run --release index demo Sandbox --no-thumbs --force

Where Sandbox is the directory of data I sent you.

Then, you do the same operations but change 0.20.2 to git = "https://github.com/quickwit-oss/tantivy.git" in Cargo.toml.

I think the problem is on line 160 of src/engine/index_pass1.rs: index_writer_wlock.commit()?;

Thanks to both of you.

@adamreichold
Collaborator

I think your indexing buffer became too small for Tantivy 0.21:

I replaced your tracing setup with

fn tracing_initialize(settings: &Settings) -> BoxDynResult<Option<WorkerGuard>> {
    use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};

    tracing_subscriber::registry()
        .with(tracing_subscriber::EnvFilter::from_default_env())
        .with(tracing_subscriber::fmt::layer())
        .init();

    Ok(None)
}

which together with 50 MB as DEFAULT_INDEX_BUFFER yields a long log showing that Tantivy commits repeatedly as it runs out of buffer space: 50-mb-buffer-multiple-commits.log

Whereas setting DEFAULT_INDEX_BUFFER to 250 MB yields a short log containing a single commit which finishes in about 100ms: 250-mb-buffer-single-commit.log

As a side remark, this was made much more difficult for us by the non-minimal reproducer, especially since the problem does not depend on running inside your server application; extracting the code that does the actual indexing with the same parameters would have made this much easier to diagnose.
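
For reference, a minimal sketch of how such a memory budget is handed to a Tantivy index writer (assuming the writer is created via Index::writer; the schema and field names below are illustrative, not the application's actual ones):

use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    // Illustrative schema; the real application defines its own fields.
    let mut schema_builder = Schema::builder();
    let body = schema_builder.add_text_field("body", TEXT | STORED);
    let schema = schema_builder.build();

    let index = Index::create_in_ram(schema);

    // The argument is the overall indexing memory budget in bytes;
    // here 250 MB instead of the application's 50 MB DEFAULT_INDEX_BUFFER.
    let mut index_writer = index.writer(250 * 1024 * 1024)?;

    index_writer.add_document(doc!(body => "hello world"))?;
    index_writer.commit()?;
    Ok(())
}

With too small a budget, each segment writer runs out of space almost immediately and flushes a tiny segment, which is what the repeated commits in the 50 MB log show.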

@PSeitz
Contributor

PSeitz commented Sep 1, 2023

Yes, we recently fixed some memory tracking for fast fields that had been missing.

#2148
#2147

@iinov
Author

iinov commented Sep 6, 2023

I have just tried setting the buffer to 250 MB, and even to 200 MB, and it works just as it did with the 0.20.2 release.

Many thanks for your efficiency... and sorry for the lack of simplicity in the provided application.

iinov closed this as completed Sep 6, 2023
iinov reopened this Sep 6, 2023
@iinov
Author

iinov commented Sep 6, 2023

Sorry, I still have a problem.

Now, with 550 files I see a small difference between 0.20.2 (0.07s) and the main branch (0.10s). But with 7,500 files, I see a huge difference: 0.62s (0.20.2) vs 80.15s (main branch), even with a buffer of 1 GB.

@PSeitz
Contributor

PSeitz commented Sep 6, 2023

Please provide a minimal reproducible example that includes creating an index and some docs.

@iinov
Author

iinov commented Sep 7, 2023

Here is a sample of 3645 files: Sandbox.tar.gz.

In release mode, with a buffer of 1 GB (you can change it in the configuration file ciotantivy.toml, line 60), we get the following benchmark:

  • 0.20.2: 0.2075s
  • main branch: 32.8311s

You already have a version of the application. If it is too complex for you to debug with this version, we can extract a simpler one, but that is a lot of work for us.

We have another idea. We know approximately the date on which the problem appeared. We could try to find the exact release which produces the issue.

What do you think about this?

@adamreichold
Collaborator

Bisecting it down to a specific commit might be helpful, yes. But the whole application is definitely not a minimal reproducer.

@iinov
Author

iinov commented Sep 8, 2023

Here is a very simple application showing the issue: CioDebug.tar.gz

The problem seems to be the amount of memory needed by the new version of Tantivy. With 0.20.2, the default 50 MB is sufficient to quickly index 5,000 documents. With the new version, even with double the buffer, indexing is very slow.

  • 0.20.2, buffer = 50 MB -> 0.0674s
  • main branch, buffer = 100 MB -> 49.4469s

Of course, if you increase the size of the buffer, you can obtain almost the same performance as with version 0.20.2. But it's a never-ending race.

I hope this small application can help you to find a solution...

@adamreichold
Collaborator

I think that for the above test application, the problem is that with a 100 MB buffer size it commits after each document, e.g.

2023-09-09T12:47:05.015812Z  INFO tantivy::indexer::index_writer: Buffer limit reached, flushing segment with maxdoc=1.    

but even increasing this to just 150 MB means only a single commit is necessary, so this seems to be driven by a fixed minimum size implied by the given schema, not by the number of documents.

I suspect that these small defaults are just not reasonable any more (maybe due to the new columnar storage format) and significantly larger buffers are just necessary with Tantivy 0.20.x.

Also note that the old memory limits were rather inaccurate before the memory accounting fixes linked above, i.e. the indexer actually consumed more memory than configured. So increasing the buffer size now does not really increase the memory consumption of your service; it just makes it more explicit and controllable.

@PSeitz
Contributor

PSeitz commented Sep 9, 2023

With the fixed memory tracking of fast field buffers, the baseline memory consumption per thread is 13 MB.
100 MB / 8 threads = 12.5 MB, so it commits after every document.

Setting it to 150MB fixes the issue, so your setting in ciotantivy.toml is probably broken.

I'll prepare a commit that enforces a minimum memory budget of at least 15 MB per thread.
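
To illustrate the arithmetic, a sketch (assuming the writer is created with Index::writer_with_num_threads; the constant and helper names are hypothetical, not tantivy internals):

use tantivy::{Index, IndexWriter};

// Approximate per-thread baseline after the fast field memory tracking fix.
const BASELINE_PER_THREAD: usize = 13 * 1024 * 1024; // ~13 MB

fn open_writer(index: &Index) -> tantivy::Result<IndexWriter> {
    let num_threads = 8;

    // 100 MB / 8 threads = 12.5 MB per SegmentWriter, below the ~13 MB
    // baseline, so every segment is flushed almost immediately.
    let too_small: usize = 100 * 1024 * 1024;
    assert!(too_small / num_threads < BASELINE_PER_THREAD);

    // 150 MB / 8 threads = 18.75 MB per thread, comfortably above the baseline.
    index.writer_with_num_threads(num_threads, 150 * 1024 * 1024)
}

The commit referenced below raises the enforced minimum to 15 MB per thread.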

PSeitz added a commit that referenced this issue Sep 10, 2023
With tantivy 0.20 the minimum memory consumption per SegmentWriter increased to
12 MB. 7 MB are for the different fast field collector types (they could be
created lazily). Increase the minimum memory from 3 MB to 15 MB.

Change memory variable naming from arena to budget.

closes #2156
@iinov
Author

iinov commented Sep 11, 2023

I understand. I have increased the minimum amount of memory for the indexing buffer from 50 MB to 150 MB.

Thanks a lot.

iinov closed this as completed Sep 11, 2023
PSeitz added a commit that referenced this issue Sep 13, 2023
martinmr added a commit to trane-project/trane that referenced this issue Sep 18, 2023
Newer versions of tantivy require more memory during the indexing phase.
Otherwise the indexing phase will be a lot slower than in previous
versions.

See quickwit-oss/tantivy#2156 for details.
PSeitz added a commit that referenced this issue Oct 25, 2023