bug/quotes from markdown are stripped out #3309
Comments
@gaspardpetit I believe the problem here is the single-character paragraphs. If you make the block quotes two or more characters long, you should get the behavior you're looking for.

    from unstructured.partition.md import partition_md

    md_text = """
    # Hello World!!!
    Number 1
    > Lorem
    Number 2
    > Ipsum
    Number 3
    > Dolor
    """
    elements = partition_md(text=md_text)
    print("\n".join(e.text for e in elements))

produces
Any "regular" text element with only a single character of text is dropped. I'm not sure what the original provenance of that rule is, but it makes sense to me that a single-character paragraph is likely either noise of some kind or at least not meaningful downstream.
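To make the rule concrete, here is a minimal sketch of a filter that behaves the way the comment describes. It is not the library's actual code: the function name `drop_single_char_paragraphs` and the choice to strip surrounding whitespace before measuring length are assumptions made for illustration only.

```python
def drop_single_char_paragraphs(paragraphs):
    """Keep paragraphs whose stripped text is longer than one character.

    Hypothetical helper: it mimics the rule described in the thread,
    not the real implementation inside unstructured.
    """
    return [p for p in paragraphs if len(p.strip()) > 1]


paragraphs = ["Number 1", "A", "Lorem", " B ", "Number 2"]
kept = drop_single_char_paragraphs(paragraphs)
print(kept)  # ['Number 1', 'Lorem', 'Number 2']
```

Under this rule a quote such as `> A` would vanish while `> Lorem` would survive, which matches the suggestion to use block quotes of two or more characters.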
Thanks for looking at this. I guess I oversimplified the bug. Here is one with more than one character:
produces:
Notice how "...Number 6" and "...Number 7" have been removed. This seems to happen systematically on quotes ('>') followed by 5 spaces or more. My documents are all formatted like this, so they get completely destroyed by this framework :)
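One possible explanation for the 5-space threshold, offered as an assumption rather than something confirmed in this thread: under CommonMark, the `>` marker consumes one optional space, and four or more remaining spaces of indentation start an indented code block inside the quote. Until a fix is available, a stopgap could be to collapse the extra whitespace after the marker before partitioning. The helper below is a hypothetical sketch (`normalize_block_quotes` is not part of unstructured), and it has not been verified against the library.

```python
import re

# A block-quote marker with at most 3 leading spaces, followed by 2+ spaces/tabs.
_QUOTE_PREFIX = re.compile(r"^( {0,3}>)[ \t]{2,}")


def normalize_block_quotes(md_text: str) -> str:
    """Collapse runs of whitespace after a leading '>' to a single space.

    Hypothetical workaround: lines that do not start with a quote marker
    are returned unchanged.
    """
    lines = md_text.split("\n")
    return "\n".join(_QUOTE_PREFIX.sub(r"\1 ", line) for line in lines)


sample = "Number 6\n>      Lorem ipsum\n> already fine"
print(normalize_block_quotes(sample))
```

The normalized string could then be passed to `partition_md(text=...)` as in the earlier comment. Note that this also flattens any intentionally indented code inside block quotes, so it only suits documents that use the extra spaces purely for layout.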
Hmmmm, very interesting! I notice that removing the blank line between the

Okay, so this is actually a bug in the HTML parser. The good news is that it's fixed in #3218, which should be out in a few days. I'll add this to the list of bugs it fixes :)
Brilliant, thanks a lot!
Describe the bug
In markdown documents, unstructured strips out block quotes.
To Reproduce
Save the following to `test.md`:

Run the following code:
Observe that the output does not contain any of the markdown quotes:
Expected behavior
I expect the following output:
Screenshots
![image](https://private-user-images.githubusercontent.com/9883156/343592231-07b69b47-352f-4fe6-aa73-37ee93033fe7.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAxMTY3MDEsIm5iZiI6MTcyMDExNjQwMSwicGF0aCI6Ii85ODgzMTU2LzM0MzU5MjIzMS0wN2I2OWI0Ny0zNTJmLTRmZTYtYWE3My0zN2VlOTMwMzNmZTcucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDcwNCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA3MDRUMTgwNjQxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MDY4MTI1NWUyODQ4YTdkYjYxZDc3MTE4ZWYzNmE0NWE3NTlkOWI0ZjJhMGEyNjNhODViMGE4Y2Q0MmI2Yjc0OSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.kwF7wUDLuFNcjM-3OaRHjEwWR-MeVh7Et9UHVC825HY)
![image](https://private-user-images.githubusercontent.com/9883156/343592264-ec0b2884-4a88-4124-a6d5-0520a82f9222.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAxMTY3MDEsIm5iZiI6MTcyMDExNjQwMSwicGF0aCI6Ii85ODgzMTU2LzM0MzU5MjI2NC1lYzBiMjg4NC00YTg4LTQxMjQtYTZkNS0wNTIwYTgyZjkyMjIucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDcwNCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA3MDRUMTgwNjQxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9ZTQ1OTVhNTIzMTM0MzY5YTdkOTQ5MzQ0ZDFkNGNkZTk5NWM0YjRkZGM4ZTMyMTU2MDk4MTc2MjNhYjM3YWU5MCZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.8CLN0PyY3TtzI9lZktGgty9KACXc28PARL_EEq-lxW8)