CRQ-056: Implement Web Spider and Corpus Builder #27
base: feature/CRQ-059-wikipedia-wikidata-extractor
User description
This PR implements the web spider and corpus builder for project analysis as part of CRQ-056.
PR Type
Enhancement
Description
Implement web spider tool for corpus building
Add URL extraction from markdown files
Create HTML content scraping functionality
Include integration tests and documentation
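As a rough illustration of the "URL extraction from markdown files" step, the sketch below pulls http(s) links out of a markdown string using only the standard library. The function name `extract_urls`, the set of delimiter characters, and the de-duplication behaviour are assumptions made for this example; the PR's actual `main.rs` is not shown here and may work differently (for instance, by using a regex or a markdown parser).

```rust
/// Hypothetical helper: collect unique http(s) URLs from markdown text.
/// This is a std-only sketch, not the implementation from the PR.
fn extract_urls(markdown: &str) -> Vec<String> {
    let mut urls: Vec<String> = Vec::new();
    let mut rest = markdown;
    while let Some(start) = rest.find("http") {
        let candidate = &rest[start..];
        if candidate.starts_with("http://") || candidate.starts_with("https://") {
            // Assume a URL ends at whitespace or at a common markdown delimiter.
            let end = candidate
                .find(|c: char| {
                    c.is_whitespace() || matches!(c, ')' | ']' | '>' | '<' | '"' | '\'')
                })
                .unwrap_or(candidate.len());
            // Drop sentence punctuation that trails the link.
            let url = candidate[..end]
                .trim_end_matches(|c: char| matches!(c, '.' | ',' | ';' | ':' | '!' | '?'));
            // Skip bare schemes and URLs that were already seen.
            if !url.ends_with("//") && !urls.iter().any(|u| u == url) {
                urls.push(url.to_string());
            }
            rest = &candidate[end..];
        } else {
            // "http" appeared without a scheme separator; move past it.
            rest = &candidate[4..];
        }
    }
    urls
}

fn main() {
    let md = "See [example](https://example.com/) and <http://foo.org/a>.";
    for url in extract_urls(md) {
        println!("{}", url);
    }
}
```

A scanner like this is easy to test but will miss edge cases such as URLs containing parentheses; a dedicated markdown parser would be the more robust choice.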
File Walkthrough
main.rs
Core web spider implementation
tools/web_spider_corpus_builder/src/main.rs
integration_test.rs
Integration test suite
tools/web_spider_corpus_builder/tests/integration_test.rs
CRQ-056-implement-web-spider-and-corpus-builder-for-project-analysis.md
CRQ specification document
docs/crq_standardized/CRQ-056-implement-web-spider-and-corpus-builder-for-project-analysis.md
SOP_Web_Spider_Corpus_Building.md
Web spider SOP documentation
docs/sops/SOP_Web_Spider_Corpus_Building.md
Cargo.toml
Package configuration file
tools/web_spider_corpus_builder/Cargo.toml
example.com_.txt
Sample corpus output
tools/web_spider_corpus_builder/corpus/web_sources/example.com_.txt
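The sample output name `example.com_.txt` suggests that each scraped URL is mapped to a flat file name. The sketch below shows one naming scheme that would produce that name for `https://example.com/`; the function name `corpus_file_name` and the exact sanitising rules are assumptions inferred from the single sample file, not taken from the PR's code.

```rust
/// Hypothetical helper: derive a corpus file name from a URL.
/// Assumed rules: drop the scheme, replace every character other than
/// ASCII letters, digits, '.' and '-' with '_', then append ".txt".
fn corpus_file_name(url: &str) -> String {
    let without_scheme = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))
        .unwrap_or(url);
    let sanitized: String = without_scheme
        .chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || c == '.' || c == '-' {
                c
            } else {
                '_'
            }
        })
        .collect();
    format!("{}.txt", sanitized)
}

fn main() {
    // Under the assumed rules this prints a name matching the sample file.
    println!("{}", corpus_file_name("https://example.com/"));
}
```

Note that a scheme like this is lossy: distinct URLs can collapse to the same file name, so a real tool might append a hash to keep outputs unique.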