Web Scraping pipeline slightly updated #44

rmusser01 · 2024-10-22T03:11:10Z

Actual parsing pipeline not adjusted, but I did add some extra features to it in anticipation:

Custom cookie support
Bookmark file upload/parsing for URLs to scrape
Refactored the scraping into one library
Re-did the summarization so it's now using the normal summarization/analysis pipeline.
It actually works now :p
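
For reference, bookmark parsing for URLs can be sketched roughly as below. This is a minimal illustration, not the code in this PR: it assumes the uploaded file is a browser HTML bookmarks export (links stored as `<A HREF="...">` tags), and the names `BookmarkLinkParser` and `extract_bookmark_urls` are hypothetical.

```python
from html.parser import HTMLParser


class BookmarkLinkParser(HTMLParser):
    """Collects http(s) links from <A HREF="..."> tags in a bookmarks export."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip javascript:, file:, and other non-web bookmark entries.
        if href and href.startswith(("http://", "https://")):
            self.urls.append(href)


def extract_bookmark_urls(html_text):
    """Return the unique http(s) URLs found in html_text, in file order."""
    parser = BookmarkLinkParser()
    parser.feed(html_text)
    # dict.fromkeys drops duplicates while preserving the original order.
    return list(dict.fromkeys(parser.urls))
```

The returned list could then be handed to whatever scraping function takes a batch of URLs.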

Next up for it will be looking at improving the pipeline and also spoofing the UA/making it easier to bypass bot blocks.
For improving the pipeline, looking at using https://github.com/rmusser01/dom-to-semantic-markdown once I do some more testing and confirm it's working as expected.
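
As a rough idea of how custom cookies and a spoofed UA could be attached to a request, here is a minimal sketch using only the standard library. It is an assumption about one possible approach, not how this project does it; `build_request` and its parameters are hypothetical names.

```python
import urllib.request


def build_request(url, cookies=None, user_agent=None):
    """Build a urllib Request carrying optional cookies and a User-Agent."""
    headers = {}
    if user_agent:
        # Replaces urllib's default "Python-urllib/x.y" User-Agent.
        headers["User-Agent"] = user_agent
    if cookies:
        # The Cookie header is a "; "-separated list of name=value pairs.
        headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return urllib.request.Request(url, headers=headers)


req = build_request(
    "https://example.com/page",
    cookies={"session": "abc", "theme": "dark"},
    user_agent="Mozilla/5.0 (X11; Linux x86_64)",
)
# urllib.request.urlopen(req) would send it; omitted here to avoid network access.
```

Setting headers alone will not get past every bot block, so this only covers the simplest case.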

rmusser01 added 7 commits October 21, 2024 18:22

Create Extract_Bookmark_URLs.py

bf72d1f

saving progress

ea10a80

more

95f8f1f

more

c1da921

Update Website_scraping_tab.py

81625d3

Delete Extract_Bookmark_URLs.py

978d4ec

Merge pull request #386 from rmusser01/dev

785b12a

rmusser01 merged commit 5532fa2 into the-crypt-keeper:main Oct 22, 2024
1 check passed
