
Conversation

@jmikedupont2 (Member) commented on Sep 13, 2025

User description

This PR implements the web spider and corpus builder for project analysis as part of CRQ-056.


PR Type

Enhancement


Description

  • Implement web spider tool for corpus building

  • Add URL extraction from markdown files

  • Create HTML content scraping functionality

  • Include integration tests and documentation


Diagram Walkthrough

flowchart LR
  A["Markdown Files"] --> B["URL Extractor"]
  B --> C["Web Spider"]
  C --> D["HTML Scraper"]
  D --> E["Corpus Files"]
  F["URL File"] --> C

File Walkthrough

Relevant files
Enhancement
main.rs
Core web spider implementation                                                     

tools/web_spider_corpus_builder/src/main.rs

  • Implement command-line argument parsing for markdown files and URL
    files
  • Add URL extraction from markdown using regex patterns
  • Create web scraping functionality with HTML content parsing
  • Include rate limiting and error handling for web requests
+108/-0 
Tests
integration_test.rs
Integration test suite                                                                     

tools/web_spider_corpus_builder/tests/integration_test.rs

  • Add integration test for corpus creation from markdown files
  • Test HTML content fetching and file generation
  • Include cleanup and assertion logic for test validation
+59/-0   
Documentation
CRQ-056-implement-web-spider-and-corpus-builder-for-project-analysis.md
CRQ specification document                                                             

docs/crq_standardized/CRQ-056-implement-web-spider-and-corpus-builder-for-project-analysis.md

  • Document CRQ requirements for web spider implementation
  • Define proposed solution with multiple Rust crates
  • Specify report generation and integration goals
+41/-0   
SOP_Web_Spider_Corpus_Building.md
Web spider SOP documentation                                                         

docs/sops/SOP_Web_Spider_Corpus_Building.md

  • Create standard operating procedure for web spider usage
  • Document tool functionality and configuration options
  • Include build instructions and testing information
+45/-0   
Configuration changes
Cargo.toml
Package configuration file                                                             

tools/web_spider_corpus_builder/Cargo.toml

  • Define package configuration for web spider tool
  • Add dependencies for HTTP requests, HTML parsing, and CLI
+13/-0   
Miscellaneous
example.com_.txt
Sample corpus output                                                                         

tools/web_spider_corpus_builder/corpus/web_sources/example.com_.txt

  • Sample output file from web spider execution
  • Contains extracted text content from example.com
+3/-0     
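
The main.rs entry above describes command-line argument parsing. A hedged reconstruction of the CLI surface it implies, based on the -m and -o flags used in the integration test and the args.markdown_files, args.url_file, and args.output_dir fields referenced in the review snippets below, might look like the following; every flag and field name other than -m and -o is an assumption, not the PR's actual code.

use std::path::PathBuf;
use clap::Parser;

/// Hypothetical clap derive struct; the real definition in main.rs may differ.
#[derive(Parser)]
#[command(name = "web_spider_corpus_builder")]
struct Args {
    /// Markdown files to scan for URLs
    #[arg(short = 'm', num_args = 1..)]
    markdown_files: Vec<PathBuf>,

    /// Optional file with one URL per line; takes precedence over markdown extraction
    #[arg(short = 'u')]
    url_file: Option<PathBuf>,

    /// Directory where corpus text files are written
    #[arg(short = 'o')]
    output_dir: PathBuf,
}

fn main() {
    let args = Args::parse();
    println!("Writing corpus files to: {}", args.output_dir.display());
}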

coderabbitai bot commented Sep 13, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 Security concerns

Resource exhaustion (DoS) risk:
The HTTP client is created without a timeout, so requests may hang indefinitely. Additionally, responses are fetched without size limits and then written to disk, which could allow large downloads to exhaust disk space. Consider setting a reasonable reqwest timeout, enforcing a maximum content length, and validating/limiting saved output sizes.
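
A minimal sketch of that hardening, assuming the same reqwest blocking client the PR already uses; the 15-second timeout and 2 MiB cap are illustrative values, not part of the PR:

use std::io::Read;
use std::time::Duration;
use reqwest::blocking::Client;

const MAX_BODY_BYTES: u64 = 2 * 1024 * 1024; // arbitrary cap on how much of a response is kept

fn fetch_bounded(client: &Client, url: &str) -> Result<Option<Vec<u8>>, Box<dyn std::error::Error>> {
    let response = client.get(url).send()?;
    // Reject oversized bodies up front when the server declares a Content-Length.
    if matches!(response.content_length(), Some(len) if len > MAX_BODY_BYTES) {
        return Ok(None);
    }
    // blocking::Response implements Read, so `take` bounds the download even
    // when no Content-Length header is sent.
    let mut body = Vec::new();
    response.take(MAX_BODY_BYTES).read_to_end(&mut body)?;
    Ok(Some(body))
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::builder()
        .timeout(Duration::from_secs(15)) // bound each request instead of hanging forever
        .build()?;
    if let Some(body) = fetch_bounded(&client, "https://example.com/")? {
        println!("Fetched {} bytes", body.len());
    }
    Ok(())
}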

⚡ Recommended focus areas for review

Non-portable Test

The integration test uses a hard-coded absolute path to the built binary, which will fail on most environments/CI. Use cargo-run or assert_cmd/cargo_bin to locate and run the binary portably.

let run_output = Command::new("/data/data/com.termux.nix/files/home/pick-up-nix/source/github/meta-introspector/submodules/target/debug/web_spider_corpus_builder")
    .arg("-m")
    .arg(&test_md_file)
    .arg("-o")
    .arg(&test_output_dir)
    .output()
    .expect("Failed to run web_spider_corpus_builder");
println!("web_spider_corpus_builder stdout: {}", String::from_utf8_lossy(&run_output.stdout));
println!("web_spider_corpus_builder stderr: {}", String::from_utf8_lossy(&run_output.stderr));
assert!(run_output.status.success(), "Spider run failed: {:?}", run_output);
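
For reference, a portable version using assert_cmd's cargo_bin helper could look like the sketch below. assert_cmd is not currently a dev-dependency of this tool, so this is a hedged illustration rather than a drop-in patch, and the fixture path is made up:

use assert_cmd::Command;

#[test]
fn spider_builds_corpus_from_markdown() {
    // cargo_bin resolves the binary Cargo built for this package, regardless of where
    // the workspace's target directory lives on the host machine.
    let mut cmd = Command::cargo_bin("web_spider_corpus_builder").unwrap();
    cmd.arg("-m")
        .arg("tests/fixtures/sample.md") // hypothetical fixture path, for illustration only
        .arg("-o")
        .arg("tests/output");
    cmd.assert().success();
}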
Data Overwrite

Files are written with fs::write to a deterministic name without checking for existence, contradicting the docs promise that existing files are never modified. This risks overwriting prior corpus files.

let file_name = sanitize(url.host_str().unwrap_or("unknown").to_string() + &url.path().replace('/', "_")) + ".txt";
let output_path = args.output_dir.join(file_name);

if let Some(ct) = content_type {
    if ct.contains("text/html") {
        let html_content = response.text()?;
        let document = Html::parse_document(&html_content);
        let selector = Selector::parse("p, h1").unwrap(); // Extract text from paragraph and heading tags
        let text_content: String = document.select(&selector)
            .map(|element| element.text().collect::<String>())
            .collect::<Vec<String>>()
            .join("\n");
        fs::write(&output_path, text_content)?;
        println!("Successfully wrote HTML content to: {}", output_path.display());
    } else if ct.contains("application/pdf") {
        // Handle PDF by just noting it, as direct text extraction is complex
        println!("  Skipping PDF: {}", url_str);
        fs::write(&output_path, "PDF content from: ".to_string() + &url_str)?;
    } else {
        // For other content types, just save raw bytes if desired, or skip
        println!("  Skipping unsupported content type ({}) : {}", ct, url_str);
        fs::write(&output_path, format!("Unsupported content type ({}) from: {}", ct, url_str))?;
    }
} else {
    println!("  No content type, skipping: {}", url_str);
    fs::write(&output_path, format!("No content type from: {}", url_str))?;
URL Extraction

The Markdown URL regex is overly restrictive (e.g., limits TLD length, struggles with parentheses/queries), likely missing valid URLs. Consider a more robust parser or relaxed patterns.

// Regex to find URLs in Markdown reference links: [text](url) or raw URLs
let re = Regex::new(r"(?i)\[[^\]]+\]\((https?://[^)]+\.[a-z]{2,6}(?:/[^)]*)?)\)|(https?://[^\s)]+\.[a-z]{2,6}(?:/[^\s)]*)?)")?;

for cap in re.captures_iter(&content) {
    if let Some(url_match) = cap.get(1).or_else(|| cap.get(2)) {
        urls.push(url_match.as_str().to_string());
    }
}
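
One robust-parser option is pulldown-cmark, which resolves inline and reference-style links without guessing at TLDs. A hedged sketch, assuming pulldown-cmark 0.9 (where Tag::Link is a tuple variant) were added as a dependency; bare URLs outside link syntax would still need a relaxed regex or a crate such as linkify:

use pulldown_cmark::{Event, Parser, Tag};

fn extract_urls_from_markdown_str(content: &str) -> Vec<String> {
    let mut urls = Vec::new();
    for event in Parser::new(content) {
        // Both [text](url) and [text][ref] links surface here with the resolved destination.
        if let Event::Start(Tag::Link(_link_type, dest, _title)) = event {
            if dest.starts_with("http://") || dest.starts_with("https://") {
                urls.push(dest.to_string());
            }
        }
    }
    urls
}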


PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Make integration test portable

Remove the hardcoded binary path and rely on the test-time binary path that Cargo provides, making the test portable. Use the CARGO_BIN_EXE_web_spider_corpus_builder environment variable to run the built binary and drop the separate build step. This fixes failures on other machines and CI environments.

tools/web_spider_corpus_builder/tests/integration_test.rs [23-39]

-// Build the spider executable
-let build_output = Command::new("cargo")
-    .arg("build")
-    .arg("-p")
-    .arg("web_spider_corpus_builder")
-    .output()
-    .expect("Failed to build web_spider_corpus_builder");
-assert!(build_output.status.success(), "Build failed: {:?}", build_output);
-
-// Run the spider executable with the dummy markdown file
-let run_output = Command::new("/data/data/com.termux.nix/files/home/pick-up-nix/source/github/meta-introspector/submodules/target/debug/web_spider_corpus_builder")
+// Run the spider executable with the dummy markdown file (portable path provided by Cargo)
+let bin_path = env!("CARGO_BIN_EXE_web_spider_corpus_builder");
+let run_output = Command::new(bin_path)
     .arg("-m")
     .arg(&test_md_file)
     .arg("-o")
     .arg(&test_output_dir)
     .output()
     .expect("Failed to run web_spider_corpus_builder");
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a hardcoded, user-specific path to the binary, which would cause the integration test to fail on any other machine, and provides the standard, portable solution.

Impact: High
Add timeout and sanitize input URLs

Prevent hangs by adding a reasonable HTTP timeout on the blocking client. Also
trim and skip empty/comment lines from URL files to avoid parse errors on blanks
and make input robust.

tools/web_spider_corpus_builder/src/main.rs [46-60]

 if let Some(url_file_path) = args.url_file {
     println!("Reading URLs from: {}", url_file_path.display());
     let content = fs::read_to_string(&url_file_path)?;
-    all_urls.extend(content.lines().map(|s| s.to_string()));
+    all_urls.extend(
+        content
+            .lines()
+            .map(str::trim)
+            .filter(|l| !l.is_empty() && !l.starts_with('#'))
+            .map(str::to_owned),
+    );
 } else {
     for md_file in &args.markdown_files {
         println!("Extracting URLs from: {}", md_file.display());
         let extracted = extract_urls_from_markdown(md_file)?;
         all_urls.extend(extracted);
     }
 }
 
 fs::create_dir_all(&args.output_dir)?;
 
-let client = Client::new();
+let client = Client::builder()
+    .timeout(std::time::Duration::from_secs(15))
+    .build()?;
Suggestion importance[1-10]: 8


Why: The suggestion correctly points out that the HTTP client lacks a timeout, which can cause the program to hang indefinitely, and also improves input robustness by handling empty or commented-out lines.

Impact: Medium
Ensure unique output filenames

Avoid overwriting existing files to honor the "never modify existing files"
requirement. Generate a unique output path by appending a numeric suffix when a
filename already exists. This prevents data loss across multiple runs or
duplicate URLs.

tools/web_spider_corpus_builder/src/main.rs [72-73]

-let file_name = sanitize(url.host_str().unwrap_or("unknown").to_string() + &url.path().replace('/', "_")) + ".txt";
-let output_path = args.output_dir.join(file_name);
+let mut base_name = sanitize(url.host_str().unwrap_or("unknown").to_string() + &url.path().replace('/', "_"));
+if base_name.is_empty() {
+    base_name = "unknown_".to_string();
+}
+let mut counter = 0usize;
+let mut output_path = loop {
+    let candidate = if counter == 0 {
+        args.output_dir.join(format!("{base_name}.txt"))
+    } else {
+        args.output_dir.join(format!("{base_name}__{counter}.txt"))
+    };
+    if !candidate.exists() {
+        break candidate;
+    }
+    counter += 1;
+};
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that the current code can overwrite files, which contradicts the documented requirement in SOP_Web_Spider_Corpus_Building.md and could lead to silent data loss.

Impact: Medium
