Docling treating every line break in markdown files as new paragraph #514

Open
bbrowning opened this issue Jan 28, 2025 · 2 comments

@bbrowning
Contributor

While spot-checking the samples we're creating from knowledge documents, now that we're sending everything (including markdown) through Docling, I realized that Docling treats every line break in a markdown file as the start of a new paragraph.

Here's a short example to illustrate:

Knowledge doc

**Phoenix** is a minor [constellation](constellation "wikilink") in the
[southern sky](southern_sky "wikilink"). Named after the mythical
[phoenix](Phoenix_(mythology) "wikilink"), it was first depicted on a

Docling-generated markdown output

Convert that input phoenix.md to markdown with: docling --from md --to md input/phoenix.md

Phoenix is a minor constellation in the

southern sky. Named after the mythical

phoenix, it was first depicted on a
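
For comparison, here is a minimal sketch of the behavior I would expect: a single line break inside a paragraph is a soft break, and only a blank line starts a new paragraph. The helper name `split_paragraphs` is made up for illustration and is not part of Docling; it only handles plain paragraphs and ignores lists, code blocks, and hard line breaks.

```python
import re


def split_paragraphs(markdown_text: str) -> list[str]:
    """Split on blank lines and join soft-wrapped lines with a single space.

    Hypothetical helper for illustration only; not Docling's implementation.
    """
    blocks = re.split(r"\n\s*\n", markdown_text.strip())
    return [
        " ".join(line.strip() for line in block.splitlines())
        for block in blocks
        if block.strip()
    ]


# The three wrapped lines from the knowledge doc above.
sample = (
    '**Phoenix** is a minor [constellation](constellation "wikilink") in the\n'
    '[southern sky](southern_sky "wikilink"). Named after the mythical\n'
    '[phoenix](Phoenix_(mythology) "wikilink"), it was first depicted on a\n'
)

paragraphs = split_paragraphs(sample)
print(len(paragraphs))
```

With no blank lines in the input, the three wrapped lines come back as one paragraph rather than three.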

Docling-generated json output

Convert that input phoenix.md to json with: docling --from md --to json input/phoenix.md

    {
      "self_ref": "#/texts/1",
      "parent": {
        "$ref": "#/body"
      },
      "children": [],
      "label": "paragraph",
      "prov": [],
      "orig": "Phoenix is a minor constellation in the",
      "text": "Phoenix is a minor constellation in the"
    },
    {
      "self_ref": "#/texts/2",
      "parent": {
        "$ref": "#/body"
      },
      "children": [],
      "label": "paragraph",
      "prov": [],
      "orig": "southern sky. Named after the mythical",
      "text": "southern sky. Named after the mythical"
    },
    {
      "self_ref": "#/texts/3",
      "parent": {
        "$ref": "#/body"
      },
      "children": [],
      "label": "paragraph",
      "prov": [],
      "orig": "phoenix, it was first depicted on a",
      "text": "phoenix, it was first depicted on a"
    },
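
A rough sketch of how the JSON output could be spot-checked for this problem. It assumes the items above live under a top-level `texts` list (suggested by the `#/texts/N` refs) and uses a simple heuristic that I am making up here: a paragraph that does not end in sentence punctuation is probably a fragment of a wrapped line. The function names are hypothetical.

```python
def paragraph_texts(doc: dict) -> list[str]:
    """Collect the text of every item labeled 'paragraph' in the export."""
    return [
        item["text"]
        for item in doc.get("texts", [])
        if item.get("label") == "paragraph"
    ]


def likely_fragments(texts: list[str]) -> list[str]:
    """Heuristic: paragraphs without closing punctuation look like split lines."""
    return [t for t in texts if not t.rstrip().endswith((".", "!", "?", ":"))]


# Trimmed-down stand-in for the JSON excerpt above (only the fields used here).
doc = {
    "texts": [
        {"label": "paragraph", "text": "Phoenix is a minor constellation in the"},
        {"label": "paragraph", "text": "southern sky. Named after the mythical"},
        {"label": "paragraph", "text": "phoenix, it was first depicted on a"},
    ]
}

texts = paragraph_texts(doc)
fragments = likely_fragments(texts)
print(f"{len(fragments)} of {len(texts)} paragraphs look like fragments")
```

The heuristic will produce false positives on headings mislabeled as paragraphs and on text that legitimately lacks end punctuation, so treat it as a quick smoke test only.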

Input samples created for knowledge pipeline

This ends up adding extra newlines to the jsonl samples we create as input to our knowledge pipeline. Here's an excerpt:

{"document": "Phoenix is a minor constellation in the\n\nsouthern sky. Named after the mythical\n\nphoenix, it was first depicted on a\n\n"}

I'm not entirely sure what impact all the additional newlines will have on the quality of the generated data or on model fine-tuning, but it's probably worth investigating.

@bbrowning
Contributor Author

Also filed DS4SD/docling#822 to track this behavior in Docling itself, as I was able to reproduce it with the latest version.

@bbrowning
Contributor Author

This is fixed in docling 2.17.0, but leaving this issue open until SDG upgrades to that version or newer.
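
Until the upgrade lands, something like the following could be used to check whether the installed Docling already includes the fix. This is a sketch: the helper names are hypothetical, the version parsing is deliberately simple (it reads only the leading numeric components, so a pre-release such as `2.17.0rc1` would be treated as `2.17.0`), and it assumes the distribution is installed under the name `docling`.

```python
from importlib.metadata import PackageNotFoundError, version

# First release reported to contain the fix, per the comment above.
FIXED_IN = (2, 17, 0)


def parse_release(version_string: str) -> tuple[int, ...]:
    """Parse the leading numeric 'X.Y.Z' components, padding to three parts."""
    parts: list[int] = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_line_break_fix(version_string: str) -> bool:
    """True if the given version is 2.17.0 or newer."""
    return parse_release(version_string) >= FIXED_IN


try:
    installed = version("docling")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("docling is not installed")
else:
    print(f"docling {installed}: fix included = {has_line_break_fix(installed)}")
```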
