bug/quotes from markdown are stripped out #3309

Open
gaspardpetit opened this issue Jun 27, 2024 · 4 comments · May be fixed by #3218
Labels
bug (Something isn't working), html

Comments

@gaspardpetit
Describe the bug
In markdown documents, unstructured strips out block quotes.

To Reproduce
Save the following to test.md:

# Hello World!!!

Number 1
>1

Number 2
> 2

Number 3
>  3

Number 4
>   4

Number 5
>    5

Number 6
>     6

Number 7
>      7

Run the following code:

from unstructured.partition.auto import partition
elements = partition("test.md")
print("\n\n".join([str(el) for el in elements]))

Observe that the output does not contain any of the markdown quotes:

Hello World!!!

Number 1

Number 2

Number 3

Number 4

Number 5

Number 6

Number 7

Expected behavior
I expect the following output:

Hello World!!!

Number 1
1
Number 2
2
Number 3
3
Number 4
4
Number 5
5
Number 6
6
Number 7
7

@scanny
Collaborator

scanny commented Jun 27, 2024

@gaspardpetit I believe the problem here is the single-character paragraphs. If you make the block quotes two or more characters long you should get the behavior you're looking for.

from unstructured.partition.md import partition_md

md_text = """
# Hello World!!!
Number 1
> Lorem
Number 2
> Ipsum
Number 3
> Dolor
"""
elements = partition_md(text=md_text)
print("\n".join(e.text for e in elements))

produces

Hello World!!!
Number 1
Lorem
Number 2
Ipsum
Number 3
Dolor

Any "regular" text element with only a single character of text is dropped:
https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/html/parser.py#L661-L662

I'm not sure what the original provenance of that rule is, but it makes sense to me that a single-character paragraph is either likely noise of some kind or at least not meaningful downstream.
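
To make the rule described above concrete, here is a rough sketch of that kind of filter. It is only an illustration of the behavior as stated in this comment, not the code in parser.py; the function name and the list-of-strings input are hypothetical.

```python
def drop_single_char_paragraphs(paragraphs):
    """Keep only paragraphs whose stripped text is longer than one character.

    Hypothetical helper: it mimics the described rule, not the library's code.
    """
    return [p for p in paragraphs if len(p.strip()) > 1]


# The "1" paragraph from the original report is dropped; longer text survives.
print(drop_single_char_paragraphs(["Number 1", "1", "Number 2", "Lorem"]))
```

This would explain why the first reproduction loses every quote, since each quoted paragraph there is a single digit.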

scanny closed this as completed Jun 27, 2024
@gaspardpetit
Author

gaspardpetit commented Jun 27, 2024

Thanks for looking at this - I guess I oversimplified the bug. Here is one with more than one character:

# Hello World!!!

Number 1
>...Number 1

Number 2
> ...Number 2

Number 3
>  ...Number 3

Number 4
>   ...Number 4

Number 5
>    ...Number 5

Number 6
>     ...Number 6

Number 7
>      ...Number 7

produces:

Hello World!!!

Number 1

...Number 1

Number 2

...Number 2

Number 3

...Number 3

Number 4

...Number 4

Number 5

...Number 5

Number 6

Number 7

Notice how "...Number 6" and "...Number 7" have been removed.

This seems to happen systematically on quotes ('>') followed by 5 or more spaces. My documents are all formatted like this, so they get completely destroyed by this framework :)
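
Until a fix is released, one possible workaround is to pre-process the markdown so that every quote marker is followed by exactly one space, since the output above shows quotes with fewer than 5 spaces surviving. The sketch below uses only the standard library; `normalize_blockquotes` is a hypothetical helper name, and it assumes the extra indentation inside quotes is not meaningful (it would flatten intentionally indented code inside a quote).

```python
import re

# A quote marker at the start of a line (up to 3 leading spaces),
# followed by any run of spaces or tabs.
_QUOTE_PREFIX = re.compile(r"^( {0,3}>)[ \t]*", re.MULTILINE)


def normalize_blockquotes(md_text: str) -> str:
    """Collapse the whitespace after each leading '>' to a single space."""
    return _QUOTE_PREFIX.sub(r"\1 ", md_text)


sample = "Number 6\n\n>     ...Number 6\n"
print(normalize_blockquotes(sample))
```

The normalized text could then be passed to the partitioner as a string (for example via the `text=` argument of `partition_md`) instead of a file path.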

scanny reopened this Jun 27, 2024
@scanny
Collaborator

scanny commented Jun 27, 2024

Hmmmm, very interesting! I notice that removing the blank line between the Number N\n> ...Number N pairs produces the expected output. I did that the first time just for compactness.

Okay, so this is actually a bug in the HTML parser. The good news is it's fixed in #3218, which should be out in a few days. I'll add this to the list of bugs it fixes :)

scanny added the html label Jun 27, 2024
scanny linked a pull request Jun 27, 2024 that will close this issue
@gaspardpetit
Author

Brilliant, thanks a lot!