Slow indexing #2156

Closed
iinov opened this issue Aug 28, 2023 · 27 comments · Fixed by #2176

Comments

@iinov

iinov commented Aug 28, 2023

Describe the bug
In recent days, indexing has become very slow. If I use release 0.20.2, it takes 0.07s to index 550 files. If I use the main branch, it takes about 5s to index the same files!

Which version of tantivy are you using?
main branch vs 0.20.2

To Reproduce
Index a few files with the main branch vs. the latest stable release.

@adamreichold
Collaborator

adamreichold commented Aug 28, 2023

Are you sure that there is no alternative explanation, like debug versus release builds?

Personally, I cannot reproduce this, as indexing about 25,000 documents yields

> RAYON_NUM_THREADS=1 DATA_PATH=data hyperfine ./indexer-0.20.2 ./indexer-master 
Benchmark 1: ./indexer-0.20.2
  Time (mean ± σ):     482.3 ms ±   8.4 ms    [User: 1952.9 ms, System: 552.3 ms]
  Range (min … max):   470.8 ms … 494.1 ms    10 runs
 
Benchmark 2: ./indexer-master
  Time (mean ± σ):     482.0 ms ±   5.6 ms    [User: 1945.3 ms, System: 555.2 ms]
  Range (min … max):   475.6 ms … 493.1 ms    10 runs
 
Summary
  ./indexer-master ran
    1.00 ± 0.02 times faster than ./indexer-0.20.2

where the only difference is whether the Tantivy dependency is version = "0.20" versus git = "https://github.com/quickwit-oss/tantivy.git".

@iinov
Author

iinov commented Aug 28, 2023

I don't think so. The only change I made is in the Cargo.toml file:

tantivy = "0.20.2" -> tantivy = { git = "https://github.com/quickwit-oss/tantivy.git" }

When I trace where the time is spent in the code, the difference is in this call:

   // Commit
   let mut index_writer_wlock = self.index_writer.write().await;
   index_writer_wlock.commit()?;

@adamreichold
Collaborator

Is it on the commit or on acquiring the lock?

@PSeitz
Contributor

PSeitz commented Aug 28, 2023

@iinov Can you provide something to reproduce the issue?
Or run it with a flamegraph, so we can see where the time is spent.

@iinov
Author

iinov commented Aug 28, 2023

Here is a little benchmark:

   let now = Instant::now();
   let mut index_writer_wlock = self.index_writer.write().await;
   info!("locked: {}s", now.elapsed().as_secs_f32());
   index_writer_wlock.commit()?;
   info!("commited: {}s", now.elapsed().as_secs_f32());
  • With 0.20.2:

2023-08-28T12:16:14.021024Z INFO index_pass1{tid="demo"}: locked: 0.00000593s
2023-08-28T12:16:14.080967Z INFO index_pass1{tid="demo"}: commited: 0.05994658s

  • With the main branch:

2023-08-28T12:14:27.494617Z INFO index_pass1{tid="demo"}: locked: 0.00000541s
2023-08-28T12:14:33.189301Z INFO index_pass1{tid="demo"}: commited: 5.694685s

@iinov
Author

iinov commented Aug 28, 2023

Here is the flamegraph SVG.
(flamegraph image)

@PSeitz
Contributor

PSeitz commented Aug 28, 2023

The flamegraph looks normal.

Do you have a full example to reproduce?

@iinov
Author

iinov commented Aug 28, 2023

Here is the application: CioTantivy.tar.gz

You can run the server like this:

cargo run serve

And then, in another console, create an index:

cargo run index demo /directory/of/files --no-thumbs --force

Thank you.

@PSeitz
Contributor

PSeitz commented Aug 29, 2023

An example to reproduce should include the code that creates the index and the data.

@iinov
Author

iinov commented Aug 29, 2023

You already have the code.

Some data: Sandbox.tar.gz

cargo run index demo Sandbox --no-thumbs --force

@iinov
Author

iinov commented Aug 31, 2023

Maybe this information can help you.

I looked in my logs and saw that I had switched from the main branch back to 0.20.2 on August 21 to work around the problem. Another thing: you can focus on line 160 of src/engine/index_pass1.rs to see the slowdown.

Thank you for your help.

@PSeitz
Contributor

PSeitz commented Aug 31, 2023

Updating git repository `ssh://git@github.com/iinov/RXml.git`
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
error: failed to get `rxml` as a dependency of package `ciotantivy v0.10.0 (/home/pascal/LinuxData/Development/CioTantivy)`

Can you provide a minimal example that demonstrates the slowdown?

cargo run index demo /directory/of/files --no-thumbs --force

This is running in debug mode without --release.

@iinov
Author

iinov commented Sep 1, 2023

Sorry for the dependency.

Here is a version without RXml: CioTantivyLite.tar.gz

I have the same problem with 0.21.0 as with the main branch: very slow indexing compared to version 0.20.2.

@PSeitz
Contributor

PSeitz commented Sep 1, 2023

Still not compiling

error: failed to run custom build command for `graphicsmagick-sys v0.6.0`

Caused by:
  process didn't exit successfully: `/home/pascal/LinuxData/Development/CioTantivyLite/target/debug/build/graphicsmagick-sys-61c1ed4d1786cb09/build-script-build` (exit status: 1)
  --- stdout
  cargo:rerun-if-changed=wrapper.h
  cargo:rerun-if-env-changed=GRAPHICS_MAGICK_WAND_CONFIG

  --- stderr
  Error: Run command `GraphicsMagickWand-config` failed, please check the `GraphicsMagickWand-config` is executable, or specify the environment variable `GRAPHICS_MAGICK_WAND_CONFIG` correctly.

  Caused by:
      No such file or directory (os error 2)
warning: build failed, waiting for other jobs to finish...

cargo run index demo /directory/of/files --no-thumbs --force

This is running in debug mode without --release. Do you benchmark without --release?

@adamreichold
Collaborator

adamreichold commented Sep 1, 2023

This is running in debug mode without --release. Do you benchmark without --release?

Maybe to expand on why this is important: even if you changed nothing but the Tantivy version, having had reasonably fast results with debug builds before does not mean anything. Completely unrelated code changes can produce slower debug builds, for example because some abstraction that is completely compiled away in release builds is newly used and massively slows down debug builds.

Long story short, it does not make sense to measure the performance of debug builds, not even relative to each other. You must use release builds before and after the relevant change.

@iinov
Author

iinov commented Sep 1, 2023

Maybe this can help: https://crates.io/crates/graphicsmagick

I tried in release mode. I have the same kind of difference: 0.05846s (0.20.2) vs 4.8779s (main branch).

To reproduce the slowdown, the first thing to do is to launch the server:

cargo run --release serve

Then, in another console:

cargo run --release index demo Sandbox --no-thumbs --force

Where Sandbox is the directory of data I sent you.

Then, you do the same operations but change 0.20.2 to git = "https://github.com/quickwit-oss/tantivy.git" in Cargo.toml.

I think the problem is on line 160 of src/engine/index_pass1.rs: index_writer_wlock.commit()?;

Thanks to both of you.

@adamreichold
Collaborator

I think your indexing buffer became too small for Tantivy 0.21:

I replaced your tracing setup with

fn tracing_initialize(settings: &Settings) -> BoxDynResult<Option<WorkerGuard>> {
    use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};

    tracing_subscriber::registry()
        .with(tracing_subscriber::EnvFilter::from_default_env())
        .with(tracing_subscriber::fmt::layer())
        .init();

    Ok(None)
}

which together with 50 MB as DEFAULT_INDEX_BUFFER yields a long log showing that Tantivy commits repeatedly as it runs out of buffer space: 50-mb-buffer-multiple-commits.log

Whereas setting DEFAULT_INDEX_BUFFER to 250 MB yields a short log containing a single commit which finishes in about 100ms: 250-mb-buffer-single-commit.log

As a side remark, this was made much more difficult for us by the non-minimal reproducer, especially since the problem does not depend on running inside your server application; extracting the code that does the actual indexing with the same parameters would have made this much easier to diagnose.
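
For reference, a minimal sketch of how such a memory budget is handed to a Tantivy index writer (assuming the writer is created via Index::writer; the schema and field names below are illustrative, not the application's actual ones):

use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    // Illustrative schema; the real application defines its own fields.
    let mut schema_builder = Schema::builder();
    let body = schema_builder.add_text_field("body", TEXT | STORED);
    let schema = schema_builder.build();

    let index = Index::create_in_ram(schema);

    // The argument is the overall indexing memory budget in bytes;
    // here 250 MB instead of the application's 50 MB DEFAULT_INDEX_BUFFER.
    let mut index_writer = index.writer(250 * 1024 * 1024)?;

    index_writer.add_document(doc!(body => "hello world"))?;
    index_writer.commit()?;
    Ok(())
}

With too small a budget, each segment writer runs out of space almost immediately and flushes a tiny segment, which is what the repeated commits in the 50 MB log show.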

@PSeitz
Contributor

PSeitz commented Sep 1, 2023

Yes, we recently fixed some memory tracking for fast fields that had been missing.

#2148
#2147

@iinov
Author

iinov commented Sep 6, 2023

I have just tried setting the buffer to 250 MB, and even to 200 MB, and it works just as it did with the 0.20.2 release.

Many thanks for your efficiency... and sorry for the lack of simplicity in the provided application.

iinov closed this as completed Sep 6, 2023
iinov reopened this Sep 6, 2023
@iinov
Author

iinov commented Sep 6, 2023

Sorry, I still have a problem.

Now, with 550 files I see a small difference between 0.20.2 (0.07s) and the main branch (0.10s). But with 7,500 files, I see a huge difference: 0.62s (0.20.2) vs 80.15s (main branch), even with a buffer of 1 GB.

@PSeitz
Contributor

PSeitz commented Sep 6, 2023

Please provide a minimal reproducible example that includes creating an index and some docs.

@iinov
Author

iinov commented Sep 7, 2023

Here is a sample of 3645 files: Sandbox.tar.gz.

In release mode, with a buffer of 1 GB (you can change it in the configuration file ciotantivy.toml, line 60), we get the following benchmark:

  • 0.20.2: 0.2075s
  • main branch: 32.8311s

You already have a version of the application. If it is too complex for you to debug with this version, we can extract a simpler one, but that is a lot of work for us.

We have another idea. We know approximately the date on which the problem appeared. We could try to find the exact release which produces the issue.

What do you think about this?

@adamreichold
Collaborator

Bisecting it down to a specific commit might be helpful, yes. But the whole application is definitely not a minimal reproducer.

@iinov
Author

iinov commented Sep 8, 2023

Here is a very simple application showing the issue: CioDebug.tar.gz

The problem seems to be the amount of memory needed by the new version of Tantivy. With 0.20.2, the default 50 MB is sufficient to quickly index 5,000 documents. With the new version, even with double the buffer, indexing is very slow.

  • 0.20.2, buffer = 50 MB -> 0.0674s
  • main branch, buffer = 100 MB -> 49.4469s

Of course, if you increase the size of the buffer, you can obtain almost the same performance as with version 0.20.2. But it's a never-ending race.

I hope this small application can help you to find a solution...

@adamreichold
Collaborator

I think that for the above test application, the problem is that with a 100 MB buffer size it commits after each document, e.g.

2023-09-09T12:47:05.015812Z  INFO tantivy::indexer::index_writer: Buffer limit reached, flushing segment with maxdoc=1.    

but even increasing this to just 150 MB means only a single commit is necessary, so this seems to be driven by a fixed minimum size implied by the given schema, not by the number of documents.

I suspect that these small defaults are just not reasonable any more (maybe due to the new columnar storage format) and significantly larger buffers are just necessary with Tantivy 0.20.x.

Also note that the old memory limits were rather inaccurate before the memory accounting fixes linked above, i.e. the indexer actually consumed more memory than configured. So increasing the buffer size now does not really increase the memory consumption of your service; it just makes it more explicit and controllable.

@PSeitz
Contributor

PSeitz commented Sep 9, 2023

With the fixed memory tracking of fast field buffers, the baseline memory consumption per thread is 13 MB.
100 MB / 8 threads = 12.5 MB, so it commits after every document.

Setting it to 150MB fixes the issue, so your setting in ciotantivy.toml is probably broken.

I'll prepare a commit that enforces a minimum memory budget of at least 15 MB per thread.
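
To illustrate the arithmetic, a sketch (assuming the writer is created with Index::writer_with_num_threads; the constant and helper names are hypothetical, not tantivy internals):

use tantivy::{Index, IndexWriter};

// Approximate per-thread baseline after the fast field memory tracking fix.
const BASELINE_PER_THREAD: usize = 13 * 1024 * 1024; // ~13 MB

fn open_writer(index: &Index) -> tantivy::Result<IndexWriter> {
    let num_threads = 8;

    // 100 MB / 8 threads = 12.5 MB per SegmentWriter, below the ~13 MB
    // baseline, so every segment is flushed almost immediately.
    let too_small: usize = 100 * 1024 * 1024;
    assert!(too_small / num_threads < BASELINE_PER_THREAD);

    // 150 MB / 8 threads = 18.75 MB per thread, comfortably above the baseline.
    index.writer_with_num_threads(num_threads, 150 * 1024 * 1024)
}

The commit referenced below raises the enforced minimum to 15 MB per thread.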

PSeitz added a commit that referenced this issue Sep 10, 2023
With tantivy 0.20 the minimum memory consumption per SegmentWriter increased to
12 MB. 7 MB are for the different fast field collector types (they could be
created lazily). Increase the minimum memory from 3 MB to 15 MB.

Change memory variable naming from arena to budget.

closes #2156
@iinov
Author

iinov commented Sep 11, 2023

I understand. I have increased the minimum amount of memory for the indexing buffer from 50 MB to 150 MB.

Thanks a lot.

iinov closed this as completed Sep 11, 2023
PSeitz added a commit that referenced this issue Sep 13, 2023
martinmr added a commit to trane-project/trane that referenced this issue Sep 18, 2023
Newer versions of tantivy require more memory during the indexing phase.
Otherwise the indexing phase will be a lot slower than in previous
versions.

See quickwit-oss/tantivy#2156 for details.
PSeitz added a commit that referenced this issue Oct 25, 2023