Conversation
str = " words and some more words" | ||
|
||
assert FragmentationScheme.split(str, max_size: 15) == [ | ||
" words and ", |
Out of curiosity, what's the rationale behind keeping trailing \n\t, leading whitespace, etc. on the returned words in the list?
It's so that the original docs can be recreated from the fragments by simple concatenation. If further preprocessing that would discard some characters is needed, it can always be done on the fly just before embedding.
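For illustration, a minimal check of that invariant (reusing the split/2 call and string from the test above; the assertion is just an example, not from the PR):

str = " words and some more words"
fragments = FragmentationScheme.split(str, max_size: 15)

# No characters are dropped, so concatenation recreates the original doc
assert Enum.join(fragments) == str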
Got it, thanks!
test "when cannot split along whitespace, splits along grapheme boundaries" do | ||
str1 = "asdfghjkl" | ||
|
||
assert FragmentationScheme.split(str1, max_size: 5) == [ |
Out of curiosity, what max_size do you expect to use in production? Is max_size going to vary? If not, perhaps we don't need the --max-size CLI option?
For the small models used here it would be the context size of the models, so that information from the fragments isn't discarded when trimming the token tensor to the context size of the embedding model. In production the context size would probably be very large, so it would be more of a hyperparameter. From my previous experiments, the results of fragmentation can influence lookup precision in unexpected ways, so it will probably have to be benchmarked on a per-model basis. In general, tuning the max_size value provides control over the balance between how specific the search results are and how correct they are.
Got it, thanks!
lib/search/fragmentation_scheme.ex
Outdated
{[single_word], [acc_head | acc_tail]} ->
  # we can try extending the last word in accumulator
  if byte_size(acc_head) + byte_size(single_word) <= max_size do
    do_split(next_text, [acc_head <> single_word | acc_tail], max_size)
Here we are incrementally building the binaries, but ultimately what we want to do is split a binary into adjacent chunks, so it would be more memory-efficient to accumulate chunks as {offset, size} and then produce the corresponding slices at the end.
Also, this could be a recursion using String.next_grapheme/1 to iterate over all graphemes (avoiding Regex.run). One way to do it would be to track the current string, offset, chunk start position and previous word end position. This is a less important optimisation, but it may actually end up simpler, so I am just dropping the idea :)
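For illustration, a minimal sketch of the {offset, size} idea (the chunks_to_slices name and argument shape are assumptions, not code from this PR): descriptors are accumulated in reverse while walking the binary, and the slices are only materialized at the end, so no intermediate binaries are built with <>.

# Hypothetical helper: turn {offset, size} descriptors accumulated in reverse
# order into the actual chunk binaries in a single pass at the end.
# binary_part/3 returns sub-binaries referencing the original, so no bytes are copied.
defp chunks_to_slices(chunks, binary) do
  chunks
  |> Enum.reverse()
  |> Enum.map(fn {offset, size} -> binary_part(binary, offset, size) end)
end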
Did as advised, and the function is indeed simpler; I even discovered another edge case to be covered while redoing it. The function still seems rather ugly to me, though, so if you have any further suggestions about it I'd be more than happy to hear them.
…he end of processing and remove Regex.run
lib/search/fragmentation_scheme.ex
Outdated
defp split_binary([split_size | splits_tail], start_idx, binary, acc) do
  current_part = binary_slice(binary, start_idx, split_size)
  split_binary(splits_tail, start_idx + split_size, binary, [current_part | acc])
end
We can simplify this a bit with body-recursion:
defp split_binary([], ""), do: []
defp split_binary([size | sizes], string) do
<<chunk::binary-size(^size), rest::binary>> = string
[chunk | split_binary(sizes, rest)]
end
A body-recursive function is fine here, because we are building a list result. In such a setting, a tail-recursive function with a list accumulator uses basically the same amount of memory. More details in this paragraph :)
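As a quick illustration of the suggested clauses (the inputs are made up): the sizes must sum to the byte size of the string, otherwise the final clause will not match.

split_binary([3, 3, 3], "foobarbaz")
#=> ["foo", "bar", "baz"]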
lib/search/fragmentation_scheme.ex
Outdated
# graph is whitespace
if next_word_size == 0 do
  # we are still building the current chunk
  if whole_chunk_size + graph_size <= max_size do
I have a feeling we could reduce the number of conditionals, but I may be missing something. What I have in mind is this:
if chunk_size + grapheme_size > max_chunk_size do
  if "the chunk has at least one word + whitespace" do
    # emit chunk until the last seen word ending
  else
    # split at the current point (mid-word) and emit chunk
  end
else
  # if whitespace, record new word ending
end
The high-level idea is that we keep track of the current position, the current chunk start position and a "possible" chunk ending. We bump the "possible" chunk ending whenever we get a whitespace. Once we overflow the max size we need to emit a chunk, which is either:
- if "possible" chunk ending > chunk start, that's our chunk
- otherwise it means there was no whitespace, so the chunk is from chunk start all the way to the current position
I just thought about it again and realised I was unnecessarily thinking in terms of offsets, so this can be even simpler:
do_split(string, max_size, 0, nil, [])

defp do_split("", _, size, _, sizes), do: Enum.reverse(sizes, [size])

defp do_split(string, max_size, size, size_until_word, sizes) do
  {grapheme, string} = String.next_grapheme(string)
  grapheme_size = byte_size(grapheme)

  if size + grapheme_size > max_size do
    if size_until_word do
      # Split before the current unfinished word
      next = size - size_until_word
      do_split(string, max_size, next + grapheme_size, nil, [size_until_word | sizes])
    else
      # The current chunk has a single word, just split it
      do_split(string, max_size, grapheme_size, nil, [size | sizes])
    end
  else
    new_size = size + grapheme_size
    size_until_word = if whitespace?(grapheme), do: new_size, else: size_until_word
    do_split(string, max_size, new_size, size_until_word, sizes)
  end
end
Where size_until_word is the chunk size up to (and including) the last whitespace, or nil if there was no whitespace.
I think this is close to your implementation, just with the conditionals compressed :)
Sorry for the late reply, had to take a day off. You are right, the function does the same thing in 1/3 of the code, haha. The way I was looking at it had me convinced the problem was a lot more complex than it actually was... The wonders of code review.
Co-authored-by: Jonatan Kłosko <[email protected]>
Add basic preprocessing for the docs, which also removes the duplication of docs in the database.
The fragmentation scheme added is the simplest I could come up with that would guarantee an upper bound on sequence length while avoiding slicing inside words/graphemes.
The fragmentation also means that at the moment doc items with no documentation are not embedded, even if, say, the function name alone would be useful - this is definitely one of the things that will have to be addressed in the next steps of development.
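A rough sketch of that guarantee (illustrative only, reusing the "asdfghjkl" string from the tests above and assuming max_size is measured in bytes, as the byte_size checks in the diff suggest):

fragments = FragmentationScheme.split("asdfghjkl", max_size: 5)

# Every fragment respects the byte-size upper bound, splitting on whitespace
# where possible and falling back to grapheme boundaries otherwise
assert Enum.all?(fragments, &(byte_size(&1) <= 5))

# Nothing is discarded, so concatenation still reproduces the input
assert Enum.join(fragments) == "asdfghjkl"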