This repository has been archived by the owner on Sep 3, 2024. It is now read-only.

Add simple fragmentation scheme #7

Merged · karol-t-wilk merged 19 commits into main on Aug 12, 2024
Conversation

karol-t-wilk (Collaborator)

Add basic preprocessing for the docs, which also removes the duplication of docs in the database.

The fragmentation scheme added is the simplest I could come up with that would guarantee an upper bound on sequence length while avoiding slicing inside words/graphemes.

The fragmentation also means that at the moment doc items with no documentation are not embedded, even if, say, the function name alone would be useful - this is definitely one of the things that will have to be addressed in the next steps of development.
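The upper-bound guarantee described above can be expressed as a quick check (a minimal sketch, not code from this PR; the split/2 signature matches the tests below, the input string and max_size value are arbitrary):

text = "Computes the sum of two numbers."
fragments = FragmentationScheme.split(text, max_size: 10)

# every fragment respects the upper bound on byte size
true = Enum.all?(fragments, &(byte_size(&1) <= 10))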

karol-t-wilk requested a review from wojtekmach on August 5, 2024, 12:54
Review thread on lib/mix/tasks/search.add.ex (outdated, resolved)
str = " words and some more words"

assert FragmentationScheme.split(str, max_size: 15) == [
" words and ",
Member
Out of curiosity, what's the rationale behind keeping trailing \n/\t, leading whitespace, etc. on the returned words in the list?

karol-t-wilk (Collaborator, Author)
It's so that the original docs can be recreated from the fragments by simple concatenation. If further preprocessing that would discard some characters is needed, it can always be done on the fly just before embedding.
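A minimal sketch of this round-trip property, using the split/2 API and the test string from the tests in this PR:

str = " words and some more words"

# concatenating the fragments reproduces the original string exactly
^str =
  str
  |> FragmentationScheme.split(max_size: 15)
  |> Enum.join()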

Member
Got it, thanks!

test "when cannot split along whitespace, splits along grapheme boundaries" do
str1 = "asdfghjkl"

assert FragmentationScheme.split(str1, max_size: 5) == [
Member
Out of curiosity, what max_size do you expect to use in production? Is max_size going to vary? If not, perhaps we don't need the --max-size CLI option?

karol-t-wilk (Collaborator, Author) · Aug 5, 2024
For the small models used here it would be the context size of the model, so that information from the fragments isn't discarded when the token tensor is trimmed to the embedding model's context size. In production the context size would probably be very large, so max_size would be more of a hyperparameter. From my previous experiments, the results of fragmentation can influence lookup precision in unexpected ways, so it will probably have to be benchmarked on a per-model basis. In general, tuning the max_size value provides control over the balance between how specific the search results are and how correct they are.

Member
Got it, thanks!

{[single_word], [acc_head | acc_tail]} ->
  # we can try extending the last word in accumulator
  if byte_size(acc_head) + byte_size(single_word) <= max_size do
    do_split(next_text, [acc_head <> single_word | acc_tail], max_size)
jonatanklosko (Contributor)

Here we are incrementally building the binaries, but ultimately what we want to do is split a binary into adjacent chunks, so it would be more memory-efficient to accumulate chunks as {offset, size} and then produce the corresponding slices at the end.

Also, this could be a recursion using String.next_grapheme/1 to iterate over all graphemes (avoiding Regex.run). One way to do it would be to track the current string, offset, chunk start position and previous word end position. This is a less important optimisation, but it may actually end up simpler, so I am just dropping the idea :)
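A minimal sketch of the {offset, size} idea (not the code from this PR; the chunk boundaries are taken from the expected test output above):

str = " words and some more words"
chunks = [{0, 11}, {11, 15}]

# materialise the slices only at the end, instead of building
# intermediate binaries while accumulating
fragments = Enum.map(chunks, fn {offset, size} -> binary_part(str, offset, size) end)
# => [" words and ", "some more words"]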

karol-t-wilk (Collaborator, Author)

Did as advised, and the function is indeed simpler; I even discovered another edge case to cover while redoing it. The function still seems rather ugly to me, though, so if you have any further suggestions about it I'd be more than happy to hear them.

Comment on lines 31 to 34
defp split_binary([split_size | splits_tail], start_idx, binary, acc) do
  current_part = binary_slice(binary, start_idx, split_size)
  split_binary(splits_tail, start_idx + split_size, binary, [current_part | acc])
end
jonatanklosko (Contributor)

We can simplify this a bit with body-recursion:

defp split_binary([], ""), do: []

defp split_binary([size | sizes], string) do
  <<chunk::binary-size(^size), rest::binary>> = string
  [chunk | split_binary(sizes, rest)]
end

A body-recursive function is fine here, because we are building a list result. In such a setting, a tail-recursive function with a list accumulator uses basically the same amount of memory. More details in this paragraph :)
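A quick hypothetical call (split_binary/2 is private, and the sizes must sum to the byte size of the string):

split_binary([6, 5], "hello world")
# => ["hello ", "world"]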

# graph is whitespace
if next_word_size == 0 do
  # we are still building the current chunk
  if whole_chunk_size + graph_size <= max_size do
jonatanklosko (Contributor) · Aug 8, 2024

I have a feeling we could reduce the number of conditionals, but I may be missing something. What I have in mind is this:

if chunk_size + grapheme_size > max_chunk_size do
  if "the chunk has at least one word + whitespace" do
    # emit chunk until the last seen word ending
  else
    # split at the current point (mid-word) and emit chunk
  end
else
  # if whitespace, record new word ending
end

The high-level idea is that we keep track of the current position, the current chunk start position and a "possible" chunk ending. We bump the "possible" chunk ending whenever we get a whitespace. Once we overflow the max size we need to emit a chunk, which is either:

  • if "possible" chunk ending > chunk start, that's our chunk
  • otherwise it means there was no whitespace, so the chunk is from chunk start all the way to the current position

jonatanklosko (Contributor)

I just thought about it again and realised I was unnecessarily thinking in terms of offsets, so this can be even simpler:

do_split(string, max_size, 0, nil, [])

defp do_split("", _, size, _, sizes), do: Enum.reverse(sizes, [size])

defp do_split(string, max_size, size, size_until_word, sizes) do
  {grapheme, string} = String.next_grapheme(string)
  grapheme_size = byte_size(grapheme)

  if size + grapheme_size > max_size do
    if size_until_word do
      # Split before the current unfinished word
      next = size - size_until_word
      do_split(string, max_size, next + grapheme_size, nil, [size_until_word | sizes])
    else
      # The current chunk has a single word, just split it
      do_split(string, max_size, grapheme_size, nil, [size | sizes])
    end
  else
    new_size = size + grapheme_size
    size_until_word = if whitespace?(grapheme), do: new_size, else: size_until_word
    do_split(string, max_size, new_size, size_until_word, sizes)
  end
end

Where size_until_word is the chunk size up to (and including) the last whitespace, or nil if there was no whitespace.

I think this is close to your implementation, just with the conditionals compressed :)
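A concrete trace on the test string from earlier in the thread (assuming whitespace?/1 matches the space character):

# do_split(" words and some more words", 15, 0, nil, [])
#
# Consuming " words and some" gives size = 15, with size_until_word = 11
# (the chunk up to and including the space after "and"). The next
# grapheme, the space after "some", would overflow (15 + 1 > 15), so 11
# is emitted and the already-consumed "some " carries over as size = 5.
# "more words" then fits exactly, ending with size = 15:
#
# sizes = [11, 15]
# # => chunks " words and " and "some more words"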

karol-t-wilk (Collaborator, Author)

Sorry for the late reply, had to take a day off. You are right, the function does the same thing in a third of the code haha. The way I was looking at it had me convinced the problem was a lot more complex than it actually is... The wonders of code review

karol-t-wilk and others added 2 commits August 12, 2024 12:32
Co-authored-by: Jonatan Kłosko <[email protected]>
karol-t-wilk merged commit 9da417c into main on Aug 12, 2024
1 check passed