Upserts partially committed during node restart #4626

Closed
tekumara opened this issue Jul 6, 2024 · 5 comments · Fixed by qdrant/landing_page#1013
Labels
bug Something isn't working

Comments

tekumara commented Jul 6, 2024

In a 3-node cluster with replication factor 3, during upserts (via batch updates with ordering weak or strong):

  1. One node restarts.
  2. Upserts fail because the write consistency factor exceeds the number of active replicas, i.e. 3 != 2 (this is as expected).
  3. The other active nodes commit the upsert.
  4. The restarted node rejoins but doesn't sync the missed upsert.

Result: data inconsistency across nodes, i.e. missing points on the restarted node.
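
For reference, here is a minimal sketch of the kind of write involved: a batch update (upsert) sent with an explicit write ordering via the Qdrant REST API. The node URL and vector size are illustrative assumptions, not taken from the demo; the collection name comes from the logs below.

    # Minimal sketch of the failing write path: a batch update (upsert) with
    # explicit write ordering, sent to one node of the cluster via the REST API.
    # Node URL and vector size are illustrative.
    import requests

    QDRANT_URL = "http://localhost:6333"
    COLLECTION = "k6-perf-test"

    def batch_upsert(points, ordering="strong"):
        """Apply an upsert through the batch update endpoint with the given ordering."""
        resp = requests.post(
            f"{QDRANT_URL}/collections/{COLLECTION}/points/batch",
            params={"wait": "true", "ordering": ordering},
            json={"operations": [{"upsert": {"points": points}}]},
            timeout=10,
        )
        resp.raise_for_status()  # raises on the 500 seen in step 2 above
        return resp.json()

    batch_upsert([{"id": 1, "vector": [0.1] * 128, "payload": {"n": 1}}])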

Steps to Reproduce

  1. Check out tekumara/qdrant-demo and install the prerequisites
  2. Install using make all
  3. Run make restart-with-upserts
  4. Run make healthcheck and observe different counts across the nodes, e.g.:
     ❯ make restart-with-upserts
     infra/perf/perf-restart-check.sh upsert
     Start upsert workload
     
               /\      |‾‾| /‾‾/   /‾‾/   
          /\  /  \     |  |/  /   /  /    
         /  \/    \    |     (   /   ‾‾\  
        /          \   |  |\  \ |  (‾)  | 
       / __________ \  |__| \__\ \_____/ .io
     
          execution: local
             script: infra/perf/k6.js
             output: -
     
          scenarios: (100.00%) 1 scenario, 1 max VUs, 1m30s max duration (incl. graceful stop):
                   * default: 1 looping VUs for 1m0s (gracefulStop: 30s)
     
     INFO[0001] deleting existing collection k6-perf-test     source=console
     INFO[0001] delete existing collection k6-perf-test: {"result":true,"status":"ok","time":0.079671625}  source=console
     pod "qdrant-1" deleted
     ERRO[0014] GoError: http status: 500
     {"status":{"error":"Service internal error: 1 out of 3 shards failed to apply operation. First error captured: Service internal error: Failed to apply update with Strong ordering via leader peer 4681542363249938: Timeout error: Deadline Exceeded: status: DeadlineExceeded, message: \"Timeout: Timeout error: Deadline Exceeded: status: DeadlineExceeded, message: \\\"Healthcheck timeout 2000ms exceeded\\\", details: [], metadata: MetadataMap { headers: {} }\", details: [], metadata: MetadataMap { headers: {\"content-type\": \"application/grpc\", \"date\": \"Sat, 06 Jul 2024 12:28:04 GMT\", \"content-length\": \"0\"} }"},"time":6.789591462}
             at go.k6.io/k6/js/modules/k6.(*K6).Fail-fm (native)
             at file:///Users/oliver.mannion/code/qdrant-demo/infra/perf/k6.js:175:7(124)  executor=constant-vus scenario=default source=stacktrace
     stamina.retry_scheduled
     stamina.retry_scheduled
     count0=3840
     count1=554622.1s), 1/1 VUs, 455 complete and 0 interrupted iterations
     count2=5558===========>--------------------------] 1 VUs  0m19.5s/1m0s
     Counts not equal
     count0=6060
     count1=605723.5s), 1/1 VUs, 504 complete and 0 interrupted iterations
     count2=6070============>-------------------------] 1 VUs  0m20.9s/1m0s
     Counts not equal
     
          ✗ batch update points - status is 200
           ↳  99% — ✓ 512 / ✗ 1
          ✗ batch update points - is OK
           ↳  99% — ✓ 512 / ✗ 1
          ✗ batch update points - completed
           ↳  99% — ✓ 512 / ✗ 1
     
          █ setup
     
            ✓ create collection - status is 200
            ✓ create collection - is OK
            ✓ add points - status is 200
            ✓ add points - is OK
            ✓ add points - completed
     
          checks.........................: 99.81% ✓ 1658      ✗ 3  
          data_received..................: 172 kB 7.2 kB/s
          data_sent......................: 182 MB 7.7 MB/s
          http_req_blocked...............: avg=5.93µs   min=2µs     med=4µs     max=688µs   p(90)=5µs     p(95)=6µs     
          http_req_connecting............: avg=1.25µs   min=0s      med=0s      max=373µs   p(90)=0s      p(95)=0s      
          http_req_duration..............: avg=25.05ms  min=7.4ms   med=9.68ms  max=6.8s    p(90)=17.27ms p(95)=23.99ms 
            { expected_response:true }...: avg=12.83ms  min=7.4ms   med=9.68ms  max=445.6ms p(90)=17.16ms p(95)=23.8ms  
          http_req_failed................: 0.17%  ✓ 1         ✗ 555
          http_req_receiving.............: avg=81.88µs  min=36µs    med=57µs    max=4.91ms  p(90)=89µs    p(95)=112µs   
          http_req_sending...............: avg=145.47µs min=13µs    med=106µs   max=3.81ms  p(90)=167.5µs p(95)=196.24µs
          http_req_tls_handshaking.......: avg=0s       min=0s      med=0s      max=0s      p(90)=0s      p(95)=0s      
          http_req_waiting...............: avg=24.82ms  min=7.25ms  med=9.49ms  max=6.8s    p(90)=17.1ms  p(95)=23.81ms 
          http_reqs......................: 556    23.425145/s
          iteration_duration.............: avg=46.13ms  min=21.56ms med=24.77ms max=6.82s   p(90)=35.24ms p(95)=42.42ms 
          iterations.....................: 513    21.613488/s
          vus............................: 1      min=0       max=1
          vus_max........................: 1      min=1       max=1
     
     
     running (0m23.7s), 0/1 VUs, 513 complete and 1 interrupted iterations
     default ✗ [============>-------------------------] 1 VUs  0m21.1s/1m0s
     ERRO[0025] test run was aborted because k6 received a 'terminated' signal 
     k6 stopped
     make: *** [restart-with-upserts] Error 1
     ❯ make healthcheck
     .venv/bin/python -m src.demo.healthcheck
     count0=6140
     count1=6136
     count2=6140
     empty0=[]
     empty1=[]
     empty2=[]
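
A rough Python equivalent of what the healthcheck observes: query each node's HTTP API directly for an exact point count and compare the results. The node URLs are illustrative assumptions for a local 3-node setup.

    # Ask every node for an exact point count and compare; differing counts
    # reproduce the "Counts not equal" output above.
    import requests

    NODES = [
        "http://localhost:6333",
        "http://localhost:6334",
        "http://localhost:6335",
    ]
    COLLECTION = "k6-perf-test"

    counts = []
    for url in NODES:
        resp = requests.post(
            f"{url}/collections/{COLLECTION}/points/count",
            json={"exact": True},
            timeout=10,
        )
        resp.raise_for_status()
        counts.append(resp.json()["result"]["count"])

    for i, count in enumerate(counts):
        print(f"count{i}={count}")
    if len(set(counts)) > 1:
        print("Counts not equal")  # the inconsistency reported in this issue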
    

Expected Behavior

A node can be restarted during upserts, and when it rejoins the qdrant cluster it should be consistent with the rest of the cluster.

Context (Environment)

Observed in production: Qdrant node restarts are not uncommon on Kubernetes and may occur during a rolling upgrade of the pods, when pods are OOMKilled, or during general Kubernetes rescheduling/maintenance.

qdrant 1.10.0

tekumara added the bug label Jul 6, 2024
generall (Member) commented Jul 6, 2024

Hey @tekumara, since

Upserts fail because write consistency

Qdrant doesn't guarantee consistency of this specific operation. Qdrant expects an external process to re-apply the operation and fix the inconsistency without performing a potentially expensive data sync.

Internally, consistency is only guaranteed if the operation was accepted.

If the write_consistency_factor is low enough, the failed nodes will be marked as dead and re-synced on restart. So I believe you can achieve your desired behavior by using a lower write_consistency_factor.
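
A minimal sketch of this suggestion, using the REST API directly: keep replication_factor at 3 but lower write_consistency_factor to 2, and have the client stand in for the "external process" that re-applies a failed write. The node URL, collection name and vector params are illustrative assumptions.

    import time
    import requests

    QDRANT_URL = "http://localhost:6333"
    COLLECTION = "k6-perf-test"

    # replication_factor stays at 3, but write_consistency_factor is lowered to 2,
    # so a write is still accepted while one replica is down; the lagging replica
    # is marked dead and re-synced when it rejoins.
    requests.put(
        f"{QDRANT_URL}/collections/{COLLECTION}",
        json={
            "vectors": {"size": 128, "distance": "Cosine"},
            "replication_factor": 3,
            "write_consistency_factor": 2,
        },
        timeout=10,
    ).raise_for_status()

    def upsert_with_retry(points, attempts=5, delay=1.0):
        """Re-apply the upsert until it is accepted, standing in for the
        external process that fixes a failed operation."""
        for _ in range(attempts):
            resp = requests.put(
                f"{QDRANT_URL}/collections/{COLLECTION}/points",
                params={"wait": "true"},
                json={"points": points},
                timeout=10,
            )
            if resp.ok:
                return resp.json()
            time.sleep(delay)  # transient failure (e.g. a node restarting): retry
        resp.raise_for_status()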

generall closed this as completed Jul 6, 2024
tekumara (Author) commented Jul 6, 2024

Oh I see. What I expected here, but didn't make clear, is that when the write fails to apply to write_consistency_factor replicas and returns a failure to the client, it shouldn't be committed to any replica (rather than leaving a subset of nodes with the write).

Write operations will fail if the number of active replicas is less than the write_consistency_factor

generall (Member) commented Jul 6, 2024

then it shouldn't be committed to any replica (rather than leave a subset of nodes with the write)

This would require either a two-phase commit scheme or sequential writes. Both options would likely hurt performance, so we decided against it.

tekumara (Author) commented Jul 7, 2024

I can confirm that with write_consistency_factor = 2 the operation is accepted when 2 of 3 nodes are available (as mentioned here) and, more importantly, the restarted node rejoins the cluster with a consistent set of points, thank you!

Internally, consistency is only guaranteed if the operation was accepted.

As a suggestion, could the docs be updated with something along these lines (unless I've missed it somewhere)?

tekumara added a commit to tekumara/qdrant-demo that referenced this issue Jul 7, 2024
to ensure consistency when 1 node restarts see
qdrant/qdrant#4626 (comment)
timvisee (Member) commented Jul 8, 2024

As a suggestion, could the docs be updated with something along these lines (unless I've missed it somewhere)?

@tekumara I've added the following: qdrant/landing_page#1013

Please feel free to leave a review.
