Intermittent errors creating a collection #187

Open
jlkravitz opened this issue Sep 6, 2024 · 3 comments
jlkravitz commented Sep 6, 2024

Hi, I am getting intermittent, inconsistent errors when creating a collection. Any insight or assistance is appreciated. Thank you in advance!

Update: The issue seems to go away when I change the shard number from 4 to 2.

Issues

When I first run the script below, while the collection does not yet exist, I get Error: Error in the response: The operation was cancelled Timeout expired. Despite that, the qdrant_storage/MY_COLLECTION folder exists, and the collection shows up in the dashboard.

When I run it again, I get

collection exists, deleting...
Error: Error in the response: The operation was cancelled Timeout expired

I have also encountered issues where the client does not think the collection exists (so it does not try to delete it), but when it tries to create the collection, I get errors like

Error in the response: Client specified an invalid argument Wrong input: Can't create collection with name MY_COLLECTION. Collection data already exists at ./storage/collections/MY_COLLECTION

Sometimes, it works fine, but it's unclear what conditions lead to success.

Code

main.rs

use anyhow::{Context, Result};
use qdrant_client::qdrant::{
    CreateCollectionBuilder, Datatype, Distance, OptimizersConfigDiffBuilder, VectorParamsBuilder,
};
use qdrant_client::Qdrant;

const QDRANT_CONNECTION_URL: &str = "http://localhost:6334";

#[tokio::main]
async fn main() -> Result<()> {
    let client = Qdrant::from_url(QDRANT_CONNECTION_URL)
        .build()
        .context("Failed to build Qdrant Client")?;
    create_collection(&client, "MY_COLLECTION").await?;
    Ok(())
}

pub async fn create_collection(client: &Qdrant, name: &str) -> Result<()> {
    // Drop any existing collection with this name before recreating it.
    if client.collection_exists(name).await? {
        println!("collection exists, deleting...");
        client.delete_collection(name).await?;
    }

    // Recreate it: 1280-dim cosine vectors stored on disk as float16,
    // indexing disabled via indexing_threshold(0), and four shards.
    client
        .create_collection(
            CreateCollectionBuilder::new(name)
                .vectors_config(
                    VectorParamsBuilder::new(1280, Distance::Cosine)
                        .datatype(Datatype::Float16)
                        .on_disk(true),
                )
                .optimizers_config(OptimizersConfigDiffBuilder::default().indexing_threshold(0))
                .shard_number(4),
        )
        .await?;

    Ok(())
}

Cargo.toml

[package]
name = "upload-issue"
version = "0.1.0"
edition = "2021"

[dependencies]
anyhow = "1.0.83"
qdrant-client = "1.11.2"
tokio = { version = "1.37.0", features = ["rt-multi-thread"] }
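
As one additional diagnostic note: raising the client-side timeout may help distinguish an operation that is merely slow from one that is genuinely stuck. A minimal sketch, assuming the qdrant-client builder's timeout option accepts a Duration (the 60-second value is an arbitrary illustration, not a recommendation):

use std::time::Duration;

use anyhow::{Context, Result};
use qdrant_client::Qdrant;

// Sketch: raise the per-request timeout so that a slow multi-shard
// collection creation is not cancelled prematurely. The 60 s value
// here is arbitrary.
fn build_client_with_timeout() -> Result<Qdrant> {
    Qdrant::from_url("http://localhost:6334")
        .timeout(Duration::from_secs(60))
        .build()
        .context("Failed to build Qdrant client")
}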
timvisee (Member) commented Sep 9, 2024

Is it possible that your storage is slow? With slow storage, creating more shards on disk would take more time, which might be causing the timeout.

Could you elaborate on your cluster setup a bit? Ideally the following:

  • storage configuration (kind of storage)
  • achievable IOPS on the storage (ideally by doing a benchmark)
  • number of nodes in the cluster (if more than 1)

The error message you show above is definitely a bug. You should never be able to get into a state where both deleting and creating a collection fail. Hopefully the above gives us a bit more insight.

Note that I don't think this is specifically related to the Rust client, but rather to Qdrant itself.
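
In the meantime, a client-side retry may work around the transient "collection data already exists" state, though it would not fix the underlying bug. A hedged sketch (the retry count, backoff, and vector config are placeholders; tokio's "time" feature is assumed for sleep):

use std::time::Duration;

use anyhow::{bail, Result};
use qdrant_client::qdrant::{CreateCollectionBuilder, Distance, VectorParamsBuilder};
use qdrant_client::Qdrant;

// Workaround sketch: retry creation with a short backoff in case the
// server is still tearing down the old on-disk data when the first
// create request arrives.
async fn create_with_retry(client: &Qdrant, name: &str) -> Result<()> {
    for attempt in 1..=5 {
        match client
            .create_collection(
                CreateCollectionBuilder::new(name)
                    .vectors_config(VectorParamsBuilder::new(1280, Distance::Cosine)),
            )
            .await
        {
            Ok(_) => return Ok(()),
            Err(err) => {
                eprintln!("create attempt {attempt} failed: {err}");
                tokio::time::sleep(Duration::from_secs(2)).await;
            }
        }
    }
    bail!("collection creation kept failing after retries")
}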

jlkravitz (Author) commented Sep 9, 2024

Hi @timvisee! It's possible this is related to Qdrant itself, but I recall being able to run these commands successfully through the API directly. I can explore and confirm that if it would be helpful. If you'd like me to move this issue to the Qdrant repository, I can do that, too.

Answers to your questions below!

Storage Configuration

Our Qdrant data is stored in /mnt/raid0.

  1. Storage Type: RAID0 Array (/dev/md127):
    • Size: 43.94 TiB
    • File System: ext4
    • Mount Point: /mnt/raid0
    • RAID Level: RAID0 (striping)
  2. Component Disks in the RAID0 Array:
    • /dev/vdd: 10.74 TiB (Rotational Disk)
    • /dev/vde: 13.67 TiB (Rotational Disk)
    • /dev/vdf: 14.65 TiB (Rotational Disk)
    • /dev/vdg: 4.88 TiB (Rotational Disk)

IOPS

Running a fio benchmark gives me 164,000 IOPS (sequential write) and 133,000 IOPS (random write).

Full outputs below.

Sequential write benchmark
sudo fio --name=seqwrite --directory=/mnt/raid0/test-benchmark --ioengine=libaio --rw=write --bs=4k --numjobs=4 --size=1G --runtime=60 --time_based --group_reporting
seqwrite: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.28
Starting 4 processes
seqwrite: Laying out IO file (1 file / 1024MiB)
seqwrite: Laying out IO file (1 file / 1024MiB)
seqwrite: Laying out IO file (1 file / 1024MiB)
seqwrite: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [_(2),W(1),_(1)][100.0%][w=8KiB/s][w=2 IOPS][eta 00m:00s]
seqwrite: (groupid=0, jobs=4): err= 0: pid=3786002: Mon Sep  9 12:37:14 2024
  write: IOPS=164k, BW=641MiB/s (672MB/s)(39.0GiB/62302msec); 0 zone resets
    slat (usec): min=2, max=56956, avg= 5.02, stdev=27.95
    clat (nsec): min=344, max=2768.7k, avg=406.31, stdev=1003.99
     lat (usec): min=2, max=56970, avg= 5.53, stdev=27.98
    clat percentiles (nsec):
     |  1.00th=[  354],  5.00th=[  358], 10.00th=[  362], 20.00th=[  366],
     | 30.00th=[  370], 40.00th=[  374], 50.00th=[  394], 60.00th=[  398],
     | 70.00th=[  402], 80.00th=[  406], 90.00th=[  414], 95.00th=[  482],
     | 99.00th=[  948], 99.50th=[ 1048], 99.90th=[ 1336], 99.95th=[ 1640],
     | 99.99th=[17280]
   bw (  MiB/s): min=  117, max= 2892, per=100.00%, avg=2032.45, stdev=222.19, samples=157
   iops        : min=30121, max=740442, avg=520305.89, stdev=56881.14, samples=157
  lat (nsec)   : 500=95.65%, 750=2.52%, 1000=1.15%
  lat (usec)   : 2=0.64%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%
  cpu          : usr=3.79%, sys=32.84%, ctx=16887, majf=0, minf=54
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10223620,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=641MiB/s (672MB/s), 641MiB/s-641MiB/s (672MB/s-672MB/s), io=39.0GiB (41.9GB), run=62302-62302msec

Disk stats (read/write):
    md127: ios=0/122261, merge=0/0, ticks=0/11423608, in_queue=11423608, util=87.36%, aggrios=0/21458, aggrmerge=0/9117, aggrticks=0/1961161, aggrin_queue=1961191, aggrutil=86.22%
  vdf: ios=0/42905, merge=0/18168, ticks=0/3906814, in_queue=3906833, util=86.09%
  vdd: ios=0/34, merge=0/59, ticks=0/944, in_queue=1005, util=1.55%
  vdg: ios=0/10, merge=0/0, ticks=0/17, in_queue=32, util=0.10%
  vde: ios=0/42884, merge=0/18241, ticks=0/3936871, in_queue=3936895, util=86.22%
Random write benchmark
sudo fio --name=benchmark --directory=/mnt/raid0/test-benchmark --ioengine=libaio --rw=randwrite --bs=4k --numjobs=4 --size=1G --runtime=60 --time_based --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.28
Starting 4 processes
benchmark: Laying out IO file (1 file / 1024MiB)
benchmark: Laying out IO file (1 file / 1024MiB)
benchmark: Laying out IO file (1 file / 1024MiB)
benchmark: Laying out IO file (1 file / 1024MiB)
Jobs: 4 (f=4): [w(4)][100.0%][eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=3783988: Mon Sep  9 12:28:35 2024
  write: IOPS=133k, BW=519MiB/s (545MB/s)(32.0GiB/63087msec); 0 zone resets
    slat (usec): min=2, max=7302, avg= 5.43, stdev= 7.26
    clat (nsec): min=347, max=471828, avg=412.18, stdev=310.18
     lat (usec): min=2, max=7312, avg= 5.95, stdev= 7.30
    clat percentiles (nsec):
     |  1.00th=[  358],  5.00th=[  366], 10.00th=[  366], 20.00th=[  370],
     | 30.00th=[  378], 40.00th=[  394], 50.00th=[  402], 60.00th=[  406],
     | 70.00th=[  410], 80.00th=[  414], 90.00th=[  422], 95.00th=[  482],
     | 99.00th=[  948], 99.50th=[ 1064], 99.90th=[ 1128], 99.95th=[ 1352],
     | 99.99th=[16512]
   bw (  MiB/s): min=  160, max= 2614, per=100.00%, avg=1955.49, stdev=180.79, samples=134
   iops        : min=41043, max=669214, avg=500604.44, stdev=46282.85, samples=134
  lat (nsec)   : 500=95.58%, 750=2.84%, 1000=0.89%
  lat (usec)   : 2=0.67%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%
  cpu          : usr=3.46%, sys=32.04%, ctx=6230, majf=0, minf=47
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8388612,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=519MiB/s (545MB/s), 519MiB/s-519MiB/s (545MB/s-545MB/s), io=32.0GiB (34.4GB), run=63087-63087msec

Disk stats (read/write):
    md127: ios=0/82489, merge=0/0, ticks=0/7297240, in_queue=7297240, util=75.21%, aggrios=0/19970, aggrmerge=0/661, aggrticks=0/1739884, aggrin_queue=1739913, aggrutil=73.85%
  vdf: ios=0/39885, merge=0/1326, ticks=0/3462105, in_queue=3462149, util=73.85%
  vdd: ios=0/14, merge=0/3, ticks=0/571, in_queue=596, util=0.57%
  vdg: ios=0/9, merge=0/0, ticks=0/21, in_queue=43, util=0.09%
  vde: ios=0/39972, merge=0/1316, ticks=0/3496840, in_queue=3496865, util=73.59%

Number of nodes

We just have one node.
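
If it helps narrow this down, a timing loop along these lines could show whether shard count alone drives the slowdown on this storage (a sketch; the collection names, dimensions, and shard counts are arbitrary placeholders):

use std::time::Instant;

use anyhow::Result;
use qdrant_client::qdrant::{CreateCollectionBuilder, Distance, VectorParamsBuilder};
use qdrant_client::Qdrant;

// Diagnostic sketch: time collection creation at increasing shard counts
// to see how creation time scales on this storage. Each test collection
// is deleted after it is timed.
async fn time_shard_counts(client: &Qdrant) -> Result<()> {
    for shards in [1u32, 2, 4, 8] {
        let name = format!("shard_timing_{shards}");
        let started = Instant::now();
        client
            .create_collection(
                CreateCollectionBuilder::new(&name)
                    .vectors_config(VectorParamsBuilder::new(1280, Distance::Cosine))
                    .shard_number(shards),
            )
            .await?;
        println!("{shards} shard(s): created in {:?}", started.elapsed());
        client.delete_collection(&name).await?;
    }
    Ok(())
}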

jlkravitz (Author) commented
@timvisee Just following up here. Any thoughts?
