Ability to chunk and embed multiple HTML pages #32

Open
anjani-pothula opened this issue Oct 16, 2024 · 0 comments
As of today, the oaim sandbox accepts a PDF file (from an OCI bucket or a local file) or a single HTML page. Often, the content we want to chunk and embed for a RAG chatbot is spread across several HTML pages rather than contained in one, which limits the usefulness of HTML input for ingestion. To overcome this limitation, the sandbox should do one of the following:

  1. Allow users to provide a list of HTML page URLs on the chunk and embed page, or
  2. Crawl all the URLs linked from the HTML page given as input. For example, if the TOC page is given as input, the sandbox should crawl every page listed in the TOC and include those pages in chunking.
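
To make option 2 concrete, here is a minimal sketch of how the list of pages could be built from a TOC page, using only the Python standard library. It is not the sandbox's code: the names `extract_links`, `fetch`, and `pages_to_ingest` are hypothetical, and the choices to follow only same-host links, drop `#fragment` anchors, and crawl a single level deep are assumptions for illustration. The resulting list has the same shape as what a user would supply by hand under option 1.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class _LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute, de-duplicated http(s) links on the same host as base_url."""
    parser = _LinkParser()
    parser.feed(html)
    host = urlparse(base_url).netloc
    seen, links = set(), []
    for href in parser.hrefs:
        # Resolve relative links and drop "#section" anchors.
        url = urldefrag(urljoin(base_url, href))[0]
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or parts.netloc != host:
            continue  # skip mailto:, javascript:, and off-site links (assumption)
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


def fetch(url):
    """Download a page as text (hypothetical helper, no retries or robots.txt handling)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def pages_to_ingest(toc_url, fetch_page=fetch):
    """Return the TOC page plus every same-host page it links to (one level deep)."""
    toc = urldefrag(toc_url)[0]
    linked = [u for u in extract_links(fetch_page(toc), toc) if u != toc]
    return [toc] + linked
```

Each URL returned by `pages_to_ingest` could then be passed through whatever single-page HTML ingestion the sandbox already performs. A real implementation would likely also need a page limit, a crawl-depth setting, and error handling for pages that fail to load.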

anjani-pothula changed the title from "Crawl HTML pages during ingestion (chunk & embed)" to "Ability to chunk and embed multiple HTML pages" on Oct 16, 2024