As of today, the oaim sandbox accepts a PDF file (from an OCI bucket or local storage) and a single HTML page. Often, the content we would like to chunk and embed for a RAG chatbot is spread across several HTML pages rather than contained in one. This limits the usefulness of HTML input for ingestion. To overcome this limitation, the sandbox must do one of the following:
Allow users to provide a list of HTML URLs on the embed and chunk page, OR
Crawl all the URLs linked from the HTML page given as input. For example, if we give a table of contents (TOC) page as the input, the sandbox must be able to crawl every page listed in the TOC and chunk each of them.
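To make the second option concrete, here is a minimal sketch of how links could be collected from a TOC page before chunking. It is not the sandbox's actual implementation: the names `extract_links` and `collect_pages`, the same-host filter, and the `max_pages` cap are all assumptions made for illustration, and `fetch` is a hypothetical callable that returns the HTML for a URL. Only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class _LinkCollector(HTMLParser):
    """Collects the raw href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute, de-duplicated http(s) links on the same host as base_url.

    Hypothetical helper: the same-host restriction is an assumption meant to
    keep a TOC crawl from wandering off to external sites.
    """
    parser = _LinkCollector()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    seen, links = set(), []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" so anchors on the
        # same page are not treated as separate documents.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue
        if parsed.netloc != base_host:
            continue
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


def collect_pages(toc_url, fetch, max_pages=50):
    """Fetch the TOC page plus every page it links to (one level deep).

    `fetch` is a hypothetical callable mapping a URL to its HTML text.
    Returns a dict of URL -> HTML, ready to hand to a chunker.
    """
    toc_html = fetch(toc_url)
    urls = [toc_url] + [u for u in extract_links(toc_html, toc_url) if u != toc_url]
    pages = {}
    for url in urls[:max_pages]:
        pages[url] = toc_html if url == toc_url else fetch(url)
    return pages
```

The first option (a user-supplied list of URLs) would skip `extract_links` entirely and fetch each provided URL in turn. Passing `fetch` as a parameter keeps the sketch independent of any particular HTTP client.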
anjani-pothula changed the title from "Crawl HTML pages during ingestion (chunk & embed)" to "Ability to chunk and embed multiple HTML pages" on Oct 16, 2024