Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Query fail when table has more than X number of rows #130

Open
Drzhivago264 opened this issue Aug 29, 2024 · 2 comments
Open

Bug: Query fail when table has more than X number of rows #130

Drzhivago264 opened this issue Aug 29, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@Drzhivago264
Copy link

Drzhivago264 commented Aug 29, 2024

Hello,
I am testing pgvectorscale and face very strange bug when querying
When the table has a small number of row, everything seem to work as expected, but when it has large number of row it failed to order by distance. More specifically, it will start ordering with distance > 0.4 which is very strange. It seems to me when the row get large the ANN index thing kick in and somehow fail to do the job.

There is a very high change of me failling to install the extension properly, can you please make a better documentation and show how to test the thing.

Here is the way to reproduce it if it is possible:
Insert:

  embedding_fn = SentenceTransformer("Snowflake/snowflake-arctic-embed-xs") #this model output 384 dimensions vectors
  docs = load_dataset(f"Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)
  dataset_1 = Dataset.objects.get(id=138)
  dataset_2 = Dataset.objects.get(id=140)
  for index, doc in enumerate(docs):
      if index < 5000: #you can try with 20000, 30000 ..., on my computer it beaks somewhere between 5000 and 10000 vectors
          title = doc['title']
          text = doc['text']
          embedding = embedding_fn.encode( f"{title}\n{text}", convert_to_numpy=True)
          EmbeddingDatasetRecord.objects.create(
              dataset=dataset_1,
              system_prompt=title,
              prompt=text,
              response=text,
              evaluation=list(),
              embedding=embedding #this is the VectorField of pgvector
          )
          EmbeddingDatasetRecord.objects.create(
              dataset=dataset_2,
              system_prompt=title,
              prompt=text,
              response=text,
              evaluation=list(),
              embedding=embedding
          )

Query:
SELECT * FROM (SELECT *, embedding <=> '[0.02193091,0.042891648,-0.03889269,-0.010822805,0.016204676,0.020508468,0.06264098,0.012757077,0.03802473,0.021543836,-0.0015017532,0.026285868,0.02930233,-0.07618598,0.0066387067,0.011965876,-0.00090104126,0.03431576,-0.1434653,-0.013353891,-0.0142628085,0.08972244,0.015904268,0.03440797,-0.014902867,-0.049964894,-0.007177208,-0.018914903,0.058478113,-0.16571018,0.07821424,-0.027141916,-0.06959594,0.021689195,-0.033039533,-0.037112832,-0.04045044,0.00524931,-0.075830236,0.0022174555,0.02905049,0.025736835,-0.08581663,-0.07312632,0.019102544,0.018688887,-0.036840633,0.0346396,0.071704365,-0.000683156,0.050018813,-0.036825076,-0.07345006,-0.00046478264,-0.07458941,-0.050934426,0.0074105645,-0.06144636,-0.0018082917,-0.019732352,-0.0044434587,0.032660086,-0.30123818,0.07985681,0.05700144,-0.061010137,0.08058075,0.038523242,-0.02049733,0.03476265,-0.038952805,0.01703532,-0.039286923,0.005417129,0.0035713362,-0.015754562,-0.018252777,-0.038186453,0.061294436,-0.009988788,-0.0077952193,0.018940236,0.074857995,-0.024276216,-0.06611401,0.067232944,-0.009044542,0.018880283,0.022442453,0.03348694,0.03546497,-0.028508255,0.015439893,-0.0042260466,-0.10470915,0.009079082,-0.018914754,-0.060381897,-0.039463013,0.260882,0.021050291,0.0069140904,0.09679024,-0.025222883,0.07485987,-0.05462765,-0.03691912,-0.010405504,0.0053140204,0.01349774,0.025099061,-0.0352193,-0.02903889,-0.042838812,-0.014827345,0.052259702,-0.06764409,0.003731194,0.0632815,-0.09021019,-0.009536956,-0.024215234,-0.027015597,-0.05654341,0.08860586,0.017230729,0.03783848,0.063414566,0.01253628,-0.008171457,0.06352262,-0.048330598,-0.10316764,0.043611977,-0.004886594,0.023158897,0.009817951,-0.032543857,0.012578825,-0.06373767,-0.046436813,-0.07931003,0.041994076,0.006629597,0.0711623,-0.055077206,-0.08750111,0.057514813,-0.035172384,0.012883965,-0.003357827,0.10649713,-0.016432667,-0.044449482,0.028183356,0.023586249,-0.037965026,-0.021858295,-0.029453054,0.023828743,0.075109184,0.02459477,0.06253792,0.051485267,0.026215613,-0.0508289,-0.0002768098,0.018913267,0.052425623,0.01037229,0.04074772,0.0050405124,-0.009029485,0.043011006,0.03507945,0.005140146,0.03994164,0.034858752,0.02864449,-0.07465696,-0.039593488,0.011193994,-0.045223888,-0.04533181,0.000794882,-0.024049869,0.04974576,-0.008402275,0.0642412,0.0860992,-0.018448683,0.13408884,-0.05960626,-0.08313907,0.005685682,-0.037685372,-0.018976886,0.005868965,0.118482165,-0.026211828,0.08038784,0.041568056,0.044295218,8.091203e-06,-0.054107122,-0.0475231,-0.032526504,-0.026934022,0.018252552,-0.00091886945,-0.025554238,0.04645355,0.05012507,0.0073790215,-0.0018393716,-0.1142013,0.02672147,0.033314034,0.010251653,0.03698369,0.007717058,0.009067949,0.022431094,-0.083041884,-0.06455144,-0.071205676,-0.01781124,-0.003421426,0.016667016,0.0067273746,0.019494766,0.017396988,0.0022625509,-0.02314172,0.10236078,-0.008294908,-0.025979547,0.06653698,0.024540726,-0.05252432,0.021665879,0.021032317,-0.08165393,0.001279748,0.06699123,-0.026111893,0.072303295,0.07320965,-0.023221226,0.12551287,0.08093811,-0.05006531,0.11876413,0.07408422,0.013937139,-0.005613495,-0.0871829,0.030889925,-0.06221935,-0.02001347,-0.0063102837,0.044966724,0.05657282,1.8559424e-05,0.016109515,0.013424583,-0.026304198,-0.01456574,-0.012557078,-0.012552112,0.01967654,0.0287791,0.00798412,-0.0032306109,0.0067705745,0.00038342623,0.036893047,-0.011694935,0.0049635707,-0.05141802,-0.0022991092,0.014651427,-0.009943845,0.0047945664,0.018711649,-0.026190145,-0.0075145443,0.008150758,0.0052204747,0.022564378,-0.010748652,-0.026413916,-0.081454195,-0.023446059,0.036070414,-0.004768759,0.02004118,0.03641233,0.025974445,0.052120164,-0.028162874,0.0073873987,0.0043698344,-0.0035585247,-0.08682441,0.012092856,0.03718829,-0.04445666,0.0074223983,0.06156349,-0.11375585,-0.03197469,0.03175704,-0.033154037,-0.0001340805,-0.04433351,0.043988787,-0.0046226187,-0.019598767,-0.14665441,0.0026385947,-0.01591672,0.101750836,-0.048469525,-0.039001442,0.027211009,0.055530112,-0.015535366,0.0057297572,0.09278935,-0.012531758,-0.0008913225,-0.050292224,-0.006884731,0.05952739,-0.031337157,-0.006922808,-0.04824615,-0.11116133,0.014352034,0.033950172,0.104058355,0.030547738,0.018658591,0.0066970093,0.051825885,-0.017486142,-0.08631315,-0.0013653825,-0.08353774,-0.026260817,0.011543581,-0.02731812,-0.050498325,0.044577196,0.036254194,0.005635558,0.02879708,-0.05271797,0.13288192,0.0089696655,0.07080006,0.020588873,0.019243795,-0.0071085696,0.026602784,0.006186525,-0.07297667,-0.00088424335,0.027795762,-0.030439496,-0.048103232,0.024694653,0.013473041,0.021120258,0.051860213,0.03817322,0.011151927,0.027352635,0.037186142,-0.06726837,-0.046897277,-0.018257236,0.021426922]' as distance FROM server_embeddingdatasetrecord where dataset_id=138 ORDER BY distance LIMIT 2000) as x;
When it works:

QUERY PLAN                                                                                                                                                                                                                                                     
Limit  (cost=1918.31..1923.31 rows=2000 width=1141) (actual time=31.286..31.795 rows=2000 loops=1)                                                                                                                                                             |
  ->  Sort  (cost=1918.31..1930.80 rows=4996 width=1141) (actual time=31.285..31.671 rows=2000 loops=1)                                                                                                                                                        |
        Sort Key: ((server_embeddingdatasetrecord.embedding <=> '[0.02193091,0.042891648,-0.03889269,-0.010822805,0.016204676,0.020508468,0.06264098,0.012757077,0.03802473,0.021543836,-0.0015017532,0.026285868,0.02930233,-0.07618598,0.0066387067,0.0119658|
        Sort Method: external merge  Disk: 5712kB                                                                                                                                                                                                              |
        ->  Seq Scan on server_embeddingdatasetrecord  (cost=0.00..1619.40 rows=4996 width=1141) (actual time=0.027..15.940 rows=5000 loops=1)                                                                                                                 |
              Filter: (dataset_id = 138)                                                                                                                                                                                                                       |
              Rows Removed by Filter: 5000                                                                                                                                                                                                                     |
Planning Time: 0.123 ms                                                                                                                                                                                                                                        |
Execution Time: 33.376 ms                                                                                                                                                                                                                                      |

When it failed:

QUERY PLAN                                                                                                                                                                                                                                                     |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Limit  (cost=805.50..3202.76 rows=2000 width=1129)                                                                                                                                                                                                             |
  ->  Index Scan using datasetrecord_embedding_idx on server_embeddingdatasetrecord  (cost=805.50..12796.61 rows=10004 width=1129)                                                                                                                             |
        Order By: (embedding <=> '[0.02193091,0.042891648,-0.03889269,-0.010822805,0.016204676,0.020508468,0.06264098,0.012757077,0.03802473,0.021543836,-0.0015017532,0.026285868,0.02930233,-0.07618598,0.0066387067,0.011965876,-0.00090104126,0.03431576,-0|
        Filter: (dataset_id = 138)                                                                                                                                                                                                                             |

BETTER DOCUMENTATION IS MUCH APPRECIATED!
Please help, thank you

@Drzhivago264
Copy link
Author

Drzhivago264 commented Aug 29, 2024

UPDATE:
After delete the ANN index and build a new one, the thing starts working again.
This is really strange.
Should I rebuilt the index every time I insert? If it is the case then I cannot see how pgvectorscale is usable.

Here is how it works again, it seems now working for larger number of row.

QUERY PLAN                                                                                                                                                                                                                                                     |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Limit  (cost=1400.02..3818.70 rows=2000 width=1139) (actual time=4.835..95.002 rows=2000 loops=1)                                                                                                                                                              |
  ->  Index Scan using embedding_idx on server_embeddingdatasetrecord  (cost=1400.02..33772.83 rows=26769 width=1139) (actual time=4.834..94.844 rows=2000 loops=1)                                                                                            |
        Order By: (embedding <=> '[0.0077130245,0.07420039,0.02256155,0.009903246,-0.02250973,-0.0130390115,0.041934833,0.007967299,0.04965609,0.031906717,-0.015018084,0.048861843,0.051051807,0.012919296,0.03707515,-0.008019336,-0.03178257,0.072138935,-0.|
        Filter: (dataset_id = 138)                                                                                                                                                                                                                             |
        Rows Removed by Filter: 2003                                                                                                                                                                                                                           |
Planning Time: 0.251 ms                                                                                                                                                                                                                                        |
Execution Time: 95.103 ms                                                                                                                                                                                                                                      |

@jgpruitt
Copy link

jgpruitt commented Sep 6, 2024

Should I rebuilt the index every time I insert?

No. It is not intended to work that way.

@cevian cevian added the bug Something isn't working label Sep 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants