Ability to chunk and embed multiple HTML pages #32

Open
anjani-pothula opened this issue Oct 16, 2024 · 1 comment

Comments

@anjani-pothula

As of today, the oaim sandbox accepts either a PDF file (from an OCI bucket or a local file) or a single HTML page. Often, the content we want to chunk and embed for a RAG chatbot is spread across several HTML pages rather than contained in one, which limits the usefulness of HTML input for ingestion. To overcome this limitation, the sandbox must do one of the following:

  1. Allow users to provide a list of HTML URLs on the chunk and embed page. (OR)
  2. Crawl all the URLs linked from the HTML page given as input. For example, if a table of contents (TOC) page is given as input, the sandbox must be able to crawl every page listed in the TOC and chunk it.
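
To make option 2 concrete, here is a minimal sketch of the link-discovery step only, using the Python standard library. It is not the sandbox's implementation: the names `extract_page_urls`, `toc_html`, and `toc_url` are hypothetical, and restricting the result to same-host links is an assumption made to keep a crawl bounded. Fetching, chunking, and embedding are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_page_urls(toc_html, toc_url):
    """Return absolute, de-duplicated, same-host URLs linked from a TOC page.

    Hypothetical helper: the same-host filter is an assumption, not a
    requirement stated in this issue.
    """
    collector = _LinkCollector()
    collector.feed(toc_html)

    host = urlparse(toc_url).netloc
    # Seed with the TOC URL so in-page anchors such as "#top" are skipped.
    seen = {urldefrag(toc_url)[0]}
    urls = []
    for href in collector.hrefs:
        # Resolve relative links and drop any "#fragment" suffix.
        absolute = urldefrag(urljoin(toc_url, href))[0]
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # ignore mailto:, javascript:, and similar links
        if parts.netloc != host:
            continue  # assumption: stay on the TOC page's own host
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls
```

The returned list has the same shape as the user-supplied URL list in option 1, so both options could feed one ingestion path that loops over the URLs and chunks each page in turn.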
@anjani-pothula changed the title from "Crawl HTML pages during ingestion (chunk & embed)" to "Ability to chunk and embed multiple HTML pages" on Oct 16, 2024
@gotsysdba
Contributor

Like #33

2 participants