Web Scraping pipeline slightly updated #44

Merged: 7 commits, Oct 22, 2024

rmusser01
Collaborator

The actual parsing pipeline is not adjusted yet, but I did add some extra features to it in anticipation:

Custom cookie support
Bookmark file upload/parsing for URLs to scrape
Refactored the scraping into one library
Redid the summarization so it's now using the normal summarization/analysis pipeline.
It actually works now :p
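
For anyone reading along, here is a rough sketch of what the bookmark parsing and cookie pieces could look like. It is not the code from this PR: `BookmarkLinkParser`, `extract_bookmark_urls`, and `cookie_header` are hypothetical names, and it assumes the bookmark file is a browser HTML export where each bookmark is an `<A HREF="...">` tag.

```python
from html.parser import HTMLParser


class BookmarkLinkParser(HTMLParser):
    """Collects href values from <a> tags in an HTML bookmark export (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> arrives as a/href.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only web URLs; skip javascript: bookmarklets and other schemes.
        if href and href.startswith(("http://", "https://")):
            self.urls.append(href)


def extract_bookmark_urls(html_text):
    """Returns the unique http(s) URLs found in the export, in file order."""
    parser = BookmarkLinkParser()
    parser.feed(html_text)
    parser.close()
    # dict.fromkeys deduplicates while preserving first-seen order.
    return list(dict.fromkeys(parser.urls))


def cookie_header(cookies):
    """Builds a Cookie request header value from a name -> value mapping."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())
```

The resulting URL list could then be handed to whatever scraping function the library exposes, with the cookie string sent as the `Cookie` request header.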

Next up for it will be improving the pipeline, plus spoofing the UA and making it easier to bypass bot blocks.
For improving the pipeline, I'm looking at using https://github.com/rmusser01/dom-to-semantic-markdown , once I've done some more testing and confirmed it's working as expected.

rmusser01 merged commit 5532fa2 into the-crypt-keeper:main on Oct 22, 2024
1 check passed