Ability to chunk and embed multiple HTML pages #32

anjani-pothula · 2024-10-16T18:11:47Z

As of today, the oaim sandbox accepts a PDF file (from an OCI bucket or local storage) and a single HTML page. Often, the content we would like to chunk and embed for a RAG chatbot is not all on a single HTML page, which limits how useful HTML input is for ingestion. To overcome this limitation, the sandbox must do one of the following:

Allow users to provide a list of HTML URLs on the chunk and embed page. (OR)
Crawl all the URLs linked from the HTML page given as input. For example, if we give a table of contents (TOC) page as input, the sandbox must be able to crawl all the pages listed in the TOC and chunk them.
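
To make the second option concrete, here is a minimal sketch of the link-discovery step in Python, using only the standard library. It is not taken from the sandbox code: `extract_page_urls`, `_LinkCollector`, and the `same_host_only` flag are hypothetical names chosen for illustration, and the sketch assumes the TOC page's HTML has already been downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_page_urls(toc_html, toc_url, same_host_only=True):
    """Return absolute, de-duplicated URLs linked from a TOC page, in document order.

    Assumption: restricting the crawl to the TOC's own host is a sensible
    default so that ingestion does not wander off to external sites.
    """
    collector = _LinkCollector()
    collector.feed(toc_html)
    toc_host = urlparse(toc_url).netloc

    seen, urls = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" so that two anchors
        # on the same page count as one page.
        absolute, _fragment = urldefrag(urljoin(toc_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if same_host_only and parsed.netloc != toc_host:
            continue
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls
```

The resulting list is the same shape as what the first option would take directly from the user, so both options could presumably share one code path: loop over the URLs and hand each one to whatever single-page HTML ingestion already exists.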

gotsysdba · 2024-12-20T15:41:17Z

Like #33

anjani-pothula changed the title ~~Crawl HTML pages during ingestion (chunk & embed)~~ Ability to chunk and embed multiple HTML pages Oct 16, 2024
