Develop a better algorithm for splitting pdf text #270

Open
tim-macphail opened this issue Mar 25, 2023 · 0 comments
Labels: backend, enhancement

@tim-macphail (Collaborator)
Background

The gpt-3.5-turbo API can only handle 4096 tokens (roughly 3000 words) per completion request, and that limit includes all messages: the system prompt, the user request, and the model's response.

Note: it is possible to tokenize text precisely with the transformers.GPT2Tokenizer class.
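For example, a token count for a chunk could be obtained like this (a minimal sketch; the gpt2 vocabulary is close to, but not identical to, the one gpt-3.5-turbo uses, so counts should be treated as approximate):

from transformers import GPT2Tokenizer

# Minimal sketch: estimate the token count of a chunk before sending it to the API.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def count_tokens(text: str) -> int:
    # encode() returns the list of BPE token ids for the text
    return len(tokenizer.encode(text))

print(count_tokens("This outline covers the grading scheme and course schedule."))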

So we need to split the full text into chunks and process those chunks separately (and concurrently for performance).

Currently, the text is split with this method:

from typing import List


def split(text: str) -> List[str]:
    """Split the text into chunks of 50 sentences each."""
    sentences = text.split(". ")
    sentences_per_chunk = 50
    chunks = [
        ". ".join(sentences[i : i + sentences_per_chunk]) + ". "
        for i in range(0, len(sentences), sentences_per_chunk)
    ]
    return chunks
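For the concurrent processing mentioned above, something like the following could work (a minimal sketch; complete is a hypothetical function that sends one chunk to the gpt-3.5-turbo API and returns the completion text):

from concurrent.futures import ThreadPoolExecutor
from typing import List


def process_chunks(chunks: List[str]) -> List[str]:
    """Send each chunk to the API concurrently and collect the results in order."""
    # complete() is hypothetical: one API call per chunk (I/O-bound, so threads are fine)
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(complete, chunks))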

Task

Design a new splitting algorithm that, instead of splitting every 50 sentences, splits on logical boundaries in the full text. This should yield more accurate results from the text completions. It could also include trimming the text to exclude unimportant data (e.g. the last two pages of most U of C outlines).
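One simple direction (a sketch only, with an assumed word budget chosen to leave room for the prompt and response) is to group paragraphs rather than counting sentences, so each chunk ends on a blank-line boundary:

from typing import List


def split_on_paragraphs(text: str, max_words: int = 2000) -> List[str]:
    """Group blank-line-separated paragraphs into chunks under a rough word budget."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for paragraph in paragraphs:
        words = len(paragraph.split())
        # start a new chunk when adding this paragraph would exceed the budget
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks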

Getting started

A possible avenue for achieving this is using an NLP library. @harsweet seemed to have some ideas on this. There are established techniques for identifying logical boundaries in text that we could apply. This will be a challenging issue to take on!
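For the NLP route, NLTK's TextTiling implementation is one option to experiment with (a sketch, assuming the extracted PDF text keeps blank lines between paragraphs, which TextTiling requires):

import nltk
from nltk.tokenize import TextTilingTokenizer
from typing import List

nltk.download("stopwords")  # TextTiling uses the stopwords corpus


def split_on_topic_boundaries(text: str) -> List[str]:
    """Segment the text at topic shifts detected by TextTiling."""
    tt = TextTilingTokenizer()
    # tokenize() returns multi-paragraph segments split at detected topic boundaries;
    # it raises ValueError if the text contains no blank-line paragraph breaks
    return tt.tokenize(text)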
