Conversation
str = " words and some more words" | ||
|
||
assert FragmentationScheme.split(str, max_size: 15) == [ | ||
" words and ", |
Out of curiosity, what's the rationale behind keeping trailing \n\t, leading whitespace, etc. on the returned words in the list?
It's so that the original docs can be recreated from the fragments by simple concatenation. If further preprocessing that would discard some characters is needed, it can always be done on the fly just before embedding.
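For illustration, a minimal check of that invariant (reusing the split/2 call and string from the test above; the assertion is just an example, not from the PR):

str = " words and some more words"
fragments = FragmentationScheme.split(str, max_size: 15)

# No characters are dropped, so concatenation recreates the original doc
assert Enum.join(fragments) == str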
Got it, thanks!
test "when cannot split along whitespace, splits along grapheme boundaries" do | ||
str1 = "asdfghjkl" | ||
|
||
assert FragmentationScheme.split(str1, max_size: 5) == [ |
Out of curiosity, what max_size do you expect to use in production? Is max_size going to vary? If not, perhaps we don't need the --max-size CLI option?
For the small models used here it would be the context size of the models, so that information from the fragments isn't discarded when trimming the token tensor to the context size of the embedding model. In production the context size would probably be very large, so it would be more of a hyperparameter. From my previous experiments, the results of fragmentation can influence lookup precision in unexpected ways, so it will probably have to be benchmarked on a per-model basis. In general, tuning the max_size value provides control over the balance between how specific the search results are and how correct they are.
Got it, thanks!
lib/search/fragmentation_scheme.ex
Outdated
{[single_word], [acc_head | acc_tail]} ->
  # we can try extending the last word in accumulator
  if byte_size(acc_head) + byte_size(single_word) <= max_size do
    do_split(next_text, [acc_head <> single_word | acc_tail], max_size)
Here we are incrementally building the binaries, but ultimately what we want to do is split a binary into adjacent chunks, so it would be more memory-efficient to accumulate chunks as {offset, size} and then produce the corresponding slices at the end.
Also, this could be a recursion using String.next_grapheme/1 to iterate over all graphemes (avoiding Regex.run). One way to do it would be to track the current string, offset, chunk start position and previous word end position. This is a less important optimisation, but it may actually end up simpler, so I am just dropping the idea :)
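For illustration, a minimal sketch of the {offset, size} idea (the chunks_to_slices name and argument shape are assumptions, not code from this PR): descriptors are accumulated in reverse while walking the binary, and the slices are only materialized at the end, so no intermediate binaries are built with <>.

# Hypothetical helper: turn {offset, size} descriptors accumulated in reverse
# order into the actual chunk binaries in a single pass at the end.
# binary_part/3 returns sub-binaries referencing the original, so no bytes are copied.
defp chunks_to_slices(chunks, binary) do
  chunks
  |> Enum.reverse()
  |> Enum.map(fn {offset, size} -> binary_part(binary, offset, size) end)
end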
Did as advised, and the function is indeed simpler; I even discovered another edge case to be covered while redoing it. The function still seems rather ugly to me, though, so if you have any further suggestions about it I'd be more than happy to hear them.
…he end of processing and remove Regex.run
lib/search/fragmentation_scheme.ex
Outdated
defp split_binary([split_size | splits_tail], start_idx, binary, acc) do
  current_part = binary_slice(binary, start_idx, split_size)
  split_binary(splits_tail, start_idx + split_size, binary, [current_part | acc])
end
We can simplify this a bit with body-recursion:
defp split_binary([], ""), do: []
defp split_binary([size | sizes], string) do
<<chunk::binary-size(^size), rest::binary>> = string
[chunk | split_binary(sizes, rest)]
end
A body-recursive function is fine here, because we are building a list result. In such a setting, a tail-recursive function with a list accumulator uses basically the same amount of memory. More details in this paragraph :)
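As a quick illustration of the suggested clauses (the inputs are made up): the sizes must sum to the byte size of the string, otherwise the final clause will not match.

split_binary([3, 3, 3], "foobarbaz")
#=> ["foo", "bar", "baz"]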
lib/search/fragmentation_scheme.ex
Outdated
# graph is whitespace
if next_word_size == 0 do
  # we are still building the current chunk
  if whole_chunk_size + graph_size <= max_size do
I have a feeling we could reduce the number of conditionals, but I may be missing something. What I have in mind is this:
if chunk_size + grapheme_size > max_chunk_size do
  if "the chunk has at least one word + whitespace" do
    # emit chunk until the last seen word ending
  else
    # split at the current point (mid-word) and emit chunk
  end
else
  # if whitespace, record new word ending
end
The high-level idea is that we keep track of the current position, the current chunk start position and a "possible" chunk ending. We bump the "possible" chunk ending whenever we get a whitespace. Once we overflow the max size we need to emit a chunk, which is either:
- if "possible" chunk ending > chunk start, that's our chunk
- otherwise it means there was no whitespace, so the chunk is from chunk start all the way to the current position
I just thought about it again and realised I was unnecessarily thinking in terms of offsets, so this can be even simpler:
do_split(string, max_size, 0, nil, [])

defp do_split("", _, size, _, sizes), do: Enum.reverse(sizes, [size])

defp do_split(string, max_size, size, size_until_word, sizes) do
  {grapheme, string} = String.next_grapheme(string)
  grapheme_size = byte_size(grapheme)

  if size + grapheme_size > max_size do
    if size_until_word do
      # Split before the current unfinished word
      next = size - size_until_word
      do_split(string, max_size, next + grapheme_size, nil, [size_until_word | sizes])
    else
      # The current chunk has a single word, just split it
      do_split(string, max_size, grapheme_size, nil, [size | sizes])
    end
  else
    new_size = size + grapheme_size
    size_until_word = if whitespace?(grapheme), do: new_size, else: size_until_word
    do_split(string, max_size, new_size, size_until_word, sizes)
  end
end
Where size_until_word is the chunk size up to (and including) the last whitespace, or nil if there was no whitespace.
I think this is close to your implementation, just with the conditionals compressed :)
Sorry for the late reply, had to take a day off. You are right, the function does the same thing in 1/3 of the code, haha. The way I was looking at it had me convinced the problem was a lot more complex than it actually was... The wonders of code review.
Co-authored-by: Jonatan Kłosko <[email protected]>
Add basic preprocessing for the docs, which also removes the duplication of docs in the database.
The fragmentation scheme added is the simplest I could come up with that would guarantee an upper bound on sequence length while avoiding slicing inside words/graphemes.
The fragmentation also means that at the moment doc items with no documentation are not embedded, even if, say, the function name alone would be useful - this is definitely one of the things that will have to be addressed in the next steps of development.
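A rough sketch of that guarantee (illustrative only, reusing the "asdfghjkl" string from the tests above and assuming max_size is measured in bytes, as the byte_size checks in the diff suggest):

fragments = FragmentationScheme.split("asdfghjkl", max_size: 5)

# Every fragment respects the byte-size upper bound, splitting on whitespace
# where possible and falling back to grapheme boundaries otherwise
assert Enum.all?(fragments, &(byte_size(&1) <= 5))

# Nothing is discarded, so concatenation still reproduces the input
assert Enum.join(fragments) == "asdfghjkl"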