I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
Please do not modify this template :) and fill in all the required fields.
1. Is this request related to a challenge you're experiencing? Tell me about your story.
I am developing a chatbot using Dify that identifies and presents relevant sections of manuals (created in HTML format) to answer business-related inquiries.
When creating the knowledge base, I converted 700 manuals written in HTML into PDF files via web rendering (each page's footer includes a URL) and loaded these files into the system.
With this approach, each chunk also included its corresponding URL.
As a result, the chatbot returned both a summary of the answer and the relevant URL, which was a very satisfactory outcome.
Looking ahead, the cloud version of Dify has a 1,000-page limit, and the manuals I need to process clearly exceed it. I therefore merged the 700 HTML files (including their URLs) into a single Markdown file using Python and created the knowledge base in parent-child split mode.
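To make the merged layout concrete, here is a minimal sketch of such a merge step. It assumes each manual has already been reduced to a title, a URL, and a plain-text body; the names `Article` and `merge_articles`, the `URL:` line, and the example.com addresses are hypothetical and not necessarily what the actual script uses.

```python
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    url: str
    body: str


def merge_articles(articles):
    """Join articles into one Markdown string, one '# ' section per article."""
    sections = []
    for article in articles:
        # Assumed layout: header line, URL line, blank line, then the body.
        sections.append(
            f"# {article.title}\nURL: {article.url}\n\n{article.body.strip()}"
        )
    # A blank line between sections keeps each article visually separate.
    return "\n\n".join(sections) + "\n"


merged = merge_articles([
    Article("Reset password", "https://example.com/manual/reset",
            "Open the settings page.\n\nClick Reset."),
    Article("Change email", "https://example.com/manual/email",
            "Open the profile page."),
])
print(merged)
```

With this layout every URL sits directly under the title of the article it belongs to, so a splitter that cuts only at `# ` headers should never place two URLs in the same section.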
However, this approach did not generate the expected chunk structure.
Ideally, I expected parent chunks to contain only titles and URLs, while child chunks would contain the article's main content.
Instead, some chunks mixed URLs from different pages under a single title.
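The structure I was expecting can be sketched as follows. This is only an illustration of the desired output, not how Dify splits documents internally. It assumes every article starts with a `# ` header followed by a `URL:` line, and that no body line begins with `# `; the function name `split_parent_child` and the sample text are hypothetical.

```python
import re


def split_parent_child(markdown):
    """Return one record per '# ' section: parent = title + URL, children = paragraphs."""
    chunks = []
    # Split just before every line that starts a level-1 header (Python 3.7+).
    for section in re.split(r"(?m)^(?=# )", markdown):
        if not section.strip():
            continue
        lines = section.strip().splitlines()
        title = lines[0][2:].strip()
        url = ""
        if len(lines) > 1 and lines[1].startswith("URL:"):
            url = lines[1][len("URL:"):].strip()
        body = "\n".join(lines[2:])
        # Each blank-line-separated paragraph becomes one child chunk.
        children = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
        chunks.append({"parent": {"title": title, "url": url}, "children": children})
    return chunks


sample = (
    "# Reset password\nURL: https://example.com/manual/reset\n\n"
    "Open the settings page.\n\nClick Reset.\n\n"
    "# Change email\nURL: https://example.com/manual/email\n\n"
    "Open the profile page.\n"
)
chunks = split_parent_child(sample)
for chunk in chunks:
    print(chunk["parent"], chunk["children"])
```

In this sketch each parent carries exactly one title and one URL, and every child belongs to a single article, which is the property the PDF-based knowledge base had.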
Therefore, I would like to ask:
Is there a way to process a Markdown file (formatted with titles, URLs, and content separated by # headers) in a way that produces well-structured chunk data, similar to the successful PDF-based approach?
Any guidance would be greatly appreciated.
2. Additional context or comments
No response