While spot-checking the samples we create from knowledge documents, now that we send everything (including markdown) through Docling, I realized that Docling treats every line break inside a markdown paragraph as a paragraph break, so each wrapped line ends up as its own paragraph.
Here's a short example to illustrate:
Knowledge doc
**Phoenix** is a minor [constellation](constellation "wikilink") in the
[southern sky](southern_sky "wikilink"). Named after the mythical
[phoenix](Phoenix_(mythology) "wikilink"), it was first depicted on a
Docling-generated markdown output
Convert that input phoenix.md to markdown with: docling --from md --to md input/phoenix.md
Phoenix is a minor constellation in the
southern sky. Named after the mythical
phoenix, it was first depicted on a
Docling-generated json output
Convert that input phoenix.md to json with: docling --from md --to json input/phoenix.md
{
"self_ref": "#/texts/1",
"parent": {
"$ref": "#/body"
},
"children": [],
"label": "paragraph",
"prov": [],
"orig": "Phoenix is a minor constellation in the",
"text": "Phoenix is a minor constellation in the"
},
{
"self_ref": "#/texts/2",
"parent": {
"$ref": "#/body"
},
"children": [],
"label": "paragraph",
"prov": [],
"orig": "southern sky. Named after the mythical",
"text": "southern sky. Named after the mythical"
},
{
"self_ref": "#/texts/3",
"parent": {
"$ref": "#/body"
},
"children": [],
"label": "paragraph",
"prov": [],
"orig": "phoenix, it was first depicted on a",
"text": "phoenix, it was first depicted on a"
},
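The fragmentation can be checked on the JSON export with a few lines of Python. This is only a sketch: it assumes the export has a top-level `texts` list (suggested by the `#/texts/N` references) whose items carry `label` and `text` keys as in the excerpt, and `paragraph_texts` is a hypothetical helper name, not part of Docling.

```python
def paragraph_texts(doc: dict) -> list[str]:
    """Return the text of every item labeled 'paragraph'.

    Assumption: items live in a top-level "texts" list, as the
    "#/texts/N" self references in the excerpt suggest.
    """
    return [
        item["text"]
        for item in doc.get("texts", [])
        if item.get("label") == "paragraph"
    ]


# Hand-built stand-in for the JSON export, reduced to the relevant keys.
sample = {
    "texts": [
        {"label": "paragraph", "text": "Phoenix is a minor constellation in the"},
        {"label": "paragraph", "text": "southern sky. Named after the mythical"},
        {"label": "paragraph", "text": "phoenix, it was first depicted on a"},
    ]
}

paragraphs = paragraph_texts(sample)
# One source paragraph shows up as three separate paragraph items.
print(len(paragraphs))
# Joining the items with blank lines shows where the extra newlines come from.
print("\n\n".join(paragraphs))
```

With a real export, `sample` would be replaced by the result of `json.load` on the file Docling wrote.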
Input samples created for knowledge pipeline
This ends up creating additional newlines in our jsonl samples created as input to our knowledge pipeline. Here's an excerpt from that:
{"document": "Phoenix is a minor constellation in the\n\n**southern sky** Named after the mythical\n\nphoenix, it was first depicted on a\n\n"}
I'm not entirely sure what impact all the additional newlines will have on the quality of the generated data or on model fine-tuning, but it's probably worth investigating.
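One possible stopgap, if the extra newlines do turn out to matter, would be to unwrap soft line breaks in the markdown before handing it to Docling. The following is a minimal sketch, not a fix in Docling itself: `unwrap_soft_breaks` and `BLOCK_MARKER` are hypothetical names, and the heuristic deliberately leaves headings, list items, block quotes, table rows, indented code, and fenced code untouched. It does not handle every markdown construct (setext headings and hard line breaks made with trailing spaces, for example, would be flattened).

```python
import re

# Lines that begin a block-level construct are never merged into a paragraph.
BLOCK_MARKER = re.compile(r"^(\s{4,}|\t|#{1,6}\s|[-*+]\s|\d+[.)]\s|>|\||`{3}|~{3})")
FENCES = ("`" * 3, "~" * 3)


def unwrap_soft_breaks(markdown: str) -> str:
    """Join consecutive plain-text lines into a single line per paragraph."""
    out: list[str] = []
    buffer: list[str] = []  # lines of the paragraph currently being collected
    in_fence = False

    def flush() -> None:
        if buffer:
            out.append(" ".join(buffer))
            buffer.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCES):
            # Fence delimiter: close any open paragraph and toggle fence state.
            flush()
            in_fence = not in_fence
            out.append(line)
        elif in_fence or not stripped or BLOCK_MARKER.match(line):
            # Code inside a fence, blank lines, and block markers pass through.
            flush()
            out.append(line)
        else:
            buffer.append(stripped)
    flush()
    return "\n".join(out)


source = (
    "Phoenix is a minor constellation in the\n"
    "southern sky. Named after the mythical\n"
    "phoenix, it was first depicted on a\n"
    "\n"
    "Second paragraph."
)
unwrapped = unwrap_soft_breaks(source)
print(unwrapped)
```

The three wrapped lines come out as one line, while the blank line that separates real paragraphs is preserved.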