[Feat] Strip non-content tags, headers, footers #1
Comments
So, we've defaulted towards removing less because, as you said, highly opinionated removal is risky and it's easy to do further cleaning on the output with regex. I like the idea of Readability as an option. Great suggestion!
@oliviermills thank you for this. Just merged an option to remove non-content tags (#14). This is just a start, and I think there is room for other improvements here.
Let me know if you have any feedback!
Awesome, thanks @oliviermills! Will be checking it out soon.
Closing this one (#273 solves this issue).
The markdown would be much more useful if you stripped headers, footers, and other elements such as filters that are not core content (i.e. low value for RAG/context). This could be done with tag- or class-based removal from the HTML, with something like Mozilla's Readability, or both! Highly opinionated class-based removal is risky, but it produces higher-value content with less noise.
For example, a language selector in a header ends up in the output and should be stripped:
Here is a starter list. It should probably be tested against a couple thousand random pages, using an LLM like Haiku with vision as judge.
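To make the tag-based idea concrete, here is a rough sketch using only Python's standard `html.parser`. It is not how the project implements the option. The `NON_CONTENT_TAGS` set, the `ContentExtractor` class, and the `strip_non_content` function are hypothetical names chosen for illustration, and the tag set is not the starter list mentioned above. The sketch also assumes reasonably well-formed markup in which every non-void element is explicitly closed.

```python
from html.parser import HTMLParser

# Hypothetical exclusion set for illustration only; not the issue's starter list.
NON_CONTENT_TAGS = {"header", "footer", "nav", "aside", "script", "style"}

# Void elements never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContentExtractor(HTMLParser):
    """Collects text that is not nested inside any excluded tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside an excluded subtree
        self.chunks = []      # text fragments that survive the filter

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Entering an excluded tag, or descending further inside one.
        if self.skip_depth or tag in NON_CONTENT_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def strip_non_content(html: str) -> str:
    """Return the visible text of `html` with excluded subtrees removed."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><body>"
    "<header><select><option>EN</option><option>FR</option></select></header>"
    "<main><h1>Title</h1><p>Body text.</p></main>"
    "<footer>Copyright notice</footer>"
    "</body></html>"
)
print(strip_non_content(page))
```

In this sketch the language selector inside `<header>` and the footer text are dropped, while the heading and paragraph are kept. Class-based removal or a Readability-style pass would be a separate, more opinionated layer on top of this.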