Add similarity search batch processing and tests #437

unicoder88 · 2024-05-19T15:22:48Z

Hi! This pull request adds ability to pass multiple texts in single request, but also is compatible with single-text format.

Solves issues #431 and #430

Considerations taken:

max input items for Workers AI - https://developers.cloudflare.com/workers-ai/models/bge-base-en-v1.5/#api-schema

Max input tokens: 512

API Schema suggests "maxItems": 100
Uncertain which is correct, but this is configurable as an environment variable MAX_INPUT, leaving 100 as a default
Query vectors doesn't seem to provide a batch support, so I opted to run all requests in parallel with Promise.all - all requests must succeed. An alternative is Promise.allSettled that can tolerate individual request errors, but then we'll have to come up with error format per item, which isn't worth it for now

In addition, basic high level integration tests were added.

evgenydmitriev · 2024-05-20T09:12:21Z

@unicoder88, what's the impact of batching on performance/costs in duplicate-heavy use cases?

unicoder88 · 2024-05-20T18:50:04Z

@evgenydmitriev , I've added unique text processing and updated tests. Only unique texts are sent to workers, but results are assigned back to original inputs.
This way duplicate-heavy use cases should still consume inbound traffic, but not increase Cloudflare usage costs.

evgenydmitriev · 2024-05-21T11:47:00Z

Users creating batches of identical messages is a good consideration, but my question wasn't about duplicates within a batch. A much bigger duplicate problem might show up across multiple requests/users. Where do you think the biggest costs are? How does your batching method affect those costs (if at all)?

unicoder88 · 2024-05-22T07:14:28Z

Thanks @evgenydmitriev . Looks like the most expensive part is Vectorize calls. I'm intending to add Cloudflare cache API over each individual vector.

Price given 1000 requests:

Workers AI

https://developers.cloudflare.com/workers-ai/platform/pricing/

$0.011 / 1,000 Regular Twitch Neurons (also known as Neurons).

https://ai.cloudflare.com/#pricing-calculator

Price $0.011 / 1,000 Neurons
768 tokens * 1000 requests = 970 Neurons = $0,0011 when over free limit

Vectorize

https://developers.cloudflare.com/vectorize/platform/pricing/
Price $0.040 per million

Queried Vector Dimensions: The total number of vector dimensions queried. If you have 50,000 vectors with 768-dimensions in an index, and make 1000 queries against that index, your total queried vector dimensions would sum to ((50000 + 1000) * 768) = 39.168 million = $1,567 (UPD. meant price here is one and a half dollar per 1000 requests, so $1.567)

evgenydmitriev · 2024-05-22T07:38:59Z

@unicoder88 your pricing math is off, and I personally think that manually implementing caching is almost always an overkill, but you are thinking in the right direction. There might be an easier solution, though.

unicoder88 · 2024-05-22T12:27:54Z

Actually, you're right @evgenydmitriev ! Obviously, a database is a better and more straightforward solution than distributed cache here, so now I'm planning to use a D1 to create a schema, then query and insert results there.

D1 pricing looks tasty:

Rows written | 100,000 / day | First 50 million / month included + $1.00 / million rows

In worst case of all new vectors and free writes exhausted, 1000 writes will cost $0.001.

evgenydmitriev · 2024-05-22T13:15:56Z

I'm not sure D1 is necessarily cheaper than Cloudflare cache, but it might be, depending on the usage. My argument was about avoiding manual caching (whatever the data store you might want to choose). There's a much simpler solution that is already implemented for the worker, but will be broken with your batching approach.

unicoder88 · 2024-05-24T20:44:35Z

Hi @evgenydmitriev ! I have some updates to bring up to you.

Firstly, worker limits:

max 6 simultaneous outgoing fetch requests
128 MB memory
Duration - no limit
CPU 30 s HTTP request

Possible bottleneck could be 6 simultaneous requests, while this batching method is attempting to execute up to 100 requests, the rest should be throttled.

However, after I've built the whole system with same batching approach, and results look better than expected. Execution looks like this:

The setup:

took some 100k texts database https://www.kaggle.com/datasets/ashishkumarak/netflix-reviews-playstore-daily-updated
inserted 10k vectors into Vectorize index using same AI model
ran searches in different combinations

Processing times, however, do not show considerable slowdown no matter concurrency or duplicates:

Batch search 36 random texts - AI 900 ms, Vectorize 2 s
72 texts - AI 2 s, Vectorize 3 s
100 texts - AI 1-4 s, Vectorize 4-5 s

Parallel execution:

2 parallel executions x search 100 random texts - AI 2 s, vectorize 5 s
10 parallel executions x search 100 random texts - AI 0.6-3 s, vectorize 6-7 s
10 parallel executions x search 100 duplicate texts - AI 0.5-4, vectorize 6-7 s

CPU time also looks good:

I would remove any duplicates checking for simplicity, and any caching because 768 dimensional vectors look unlikely to match exactly between different texts. Also, I would improve input checks for non-empty text, because empty text in a batch breaks whole batch.

I would appreciate any further hints.

evgenydmitriev · 2024-05-24T21:30:08Z

Thanks for the detailed test results. You correctly identified the Workers AI embedding models being the cost bottleneck in one of your previous comments, even accounting for the pricing calculation error. You also correctly identified the need for a caching mechanism there. The question is finding an easier and cheaper way of doing it than manually attaching a data store and querying every incoming message against it. There's an easier solution that's already implemented in the single message approach, but will be broken if we were to implement your batching mechanism as is.

unicoder88 · 2024-05-27T08:15:51Z

@evgenydmitriev, It would be nice to have URL like this - GET https://www.example.com/some-namespace?text=some%20text, and then let Cloudflare CDN cache those.
I've tried this, and it doesn't work because workers are run before cache.

What I tried:

created Cache Rules

attaching the worker to a Custom Domain

adding a bunch of Cache-Control headers to response

c.res.headers.append("Cache-Control", "public, max-age=14400, s-maxage=84000, stale-if-error=60")
c.res.headers.append("Cloudflare-CDN-Cache-Control", "max-age=24400")
c.res.headers.append("CDN-Cache-Control", "max-age=18000")

evgenydmitriev · 2024-05-27T09:41:50Z

@unicoder88, could you clarify what it is that you are trying to cache? HTTP requests containing message batches?

unicoder88 · 2024-05-27T09:58:07Z

@evgenydmitriev, well, latest suggestion is to start small and create an endpoint that works on a single message, but is cacheable by URL.
Then either have clients initiate multiple parallel requests on their own if necessary (Cloudflare will take care of scaling), or on top of it come up with some sort of gateway that takes a batch and dispatches to this worker. Latter is unclear yet.

evgenydmitriev · 2024-05-27T10:12:05Z

well, latest suggestion is to start small and create an endpoint that works on a single message, but is cacheable by URL.

Isn't this what the current worker does? You do not need a custom domain to cache HTTP requests to Cloudflare Workers.

unicoder88 · 2024-05-27T10:14:48Z

Now it's a POST body, so shouldn't be cacheable by default. In this prototype, I converted to GET (with no luck admittedly 😀)

unicoder88 · 2024-05-29T07:40:53Z

Hi @evgenydmitriev , please review updates.

unicoder88 added 3 commits May 19, 2024 16:53

Add similarity search batch processing up to current CloudFlare limits

c0cff6d

Similarity search - add tests for single and batch message processing

6805cd8

Similarity search - add Prettier config, reformat src and test

691f847

unicoder88 changed the title ~~Add similarity search batch processing~~ Add similarity search batch processing and tests May 19, 2024

Similarity search - run only unique texts

de7261f

unicoder88 added 2 commits May 28, 2024 22:48

Add AI Gateway caching, CPU limit

ea985cc

Convert to async fn, add AI Gateway response types, catch errors

b145f87

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add similarity search batch processing and tests #437

Add similarity search batch processing and tests #437

unicoder88 commented May 19, 2024 •

edited

Loading

evgenydmitriev commented May 20, 2024

unicoder88 commented May 20, 2024 •

edited

Loading

evgenydmitriev commented May 21, 2024

unicoder88 commented May 22, 2024 •

edited

Loading

evgenydmitriev commented May 22, 2024

unicoder88 commented May 22, 2024 •

edited

Loading

evgenydmitriev commented May 22, 2024

unicoder88 commented May 24, 2024

evgenydmitriev commented May 24, 2024

unicoder88 commented May 27, 2024 •

edited

Loading

evgenydmitriev commented May 27, 2024

unicoder88 commented May 27, 2024

evgenydmitriev commented May 27, 2024 •

edited

Loading

unicoder88 commented May 27, 2024

unicoder88 commented May 29, 2024

Add similarity search batch processing and tests #437

Are you sure you want to change the base?

Add similarity search batch processing and tests #437

Conversation

unicoder88 commented May 19, 2024 • edited Loading

evgenydmitriev commented May 20, 2024

unicoder88 commented May 20, 2024 • edited Loading

evgenydmitriev commented May 21, 2024

unicoder88 commented May 22, 2024 • edited Loading

evgenydmitriev commented May 22, 2024

unicoder88 commented May 22, 2024 • edited Loading

evgenydmitriev commented May 22, 2024

unicoder88 commented May 24, 2024

evgenydmitriev commented May 24, 2024

unicoder88 commented May 27, 2024 • edited Loading

evgenydmitriev commented May 27, 2024

unicoder88 commented May 27, 2024

evgenydmitriev commented May 27, 2024 • edited Loading

unicoder88 commented May 27, 2024

unicoder88 commented May 29, 2024

unicoder88 commented May 19, 2024 •

edited

Loading

unicoder88 commented May 20, 2024 •

edited

Loading

unicoder88 commented May 22, 2024 •

edited

Loading

unicoder88 commented May 22, 2024 •

edited

Loading

unicoder88 commented May 27, 2024 •

edited

Loading

evgenydmitriev commented May 27, 2024 •

edited

Loading