
Conversation

@jmikedupont2 (Member) commented on Sep 13, 2025

User description

This PR implements the web spider and corpus builder for project analysis as part of CRQ-056.


PR Type

Enhancement


Description

  • Implement web spider tool for corpus building

  • Add URL extraction from markdown files

  • Create HTML content scraping functionality

  • Include integration tests and documentation


Diagram Walkthrough

flowchart LR
  A["Markdown Files"] --> B["URL Extractor"]
  B --> C["Web Spider"]
  C --> D["HTML Scraper"]
  D --> E["Corpus Files"]
  F["URL File"] --> C

File Walkthrough

Relevant files
Enhancement
main.rs
Core web spider implementation                                                     

tools/web_spider_corpus_builder/src/main.rs

  • Implement command-line argument parsing for markdown files and URL
    files
  • Add URL extraction from markdown using regex patterns
  • Create web scraping functionality with HTML content parsing
  • Include rate limiting and error handling for web requests
+108/-0 
Tests
integration_test.rs
Integration test suite                                                                     

tools/web_spider_corpus_builder/tests/integration_test.rs

  • Add integration test for corpus creation from markdown files
  • Test HTML content fetching and file generation
  • Include cleanup and assertion logic for test validation
+59/-0   
Documentation
CRQ-056-implement-web-spider-and-corpus-builder-for-project-analysis.md
CRQ specification document                                                             

docs/crq_standardized/CRQ-056-implement-web-spider-and-corpus-builder-for-project-analysis.md

  • Document CRQ requirements for web spider implementation
  • Define proposed solution with multiple Rust crates
  • Specify report generation and integration goals
+41/-0   
SOP_Web_Spider_Corpus_Building.md
Web spider SOP documentation                                                         

docs/sops/SOP_Web_Spider_Corpus_Building.md

  • Create standard operating procedure for web spider usage
  • Document tool functionality and configuration options
  • Include build instructions and testing information
+45/-0   
Configuration changes
Cargo.toml
Package configuration file                                                             

tools/web_spider_corpus_builder/Cargo.toml

  • Define package configuration for web spider tool
  • Add dependencies for HTTP requests, HTML parsing, and CLI
+13/-0   
Miscellaneous
example.com_.txt
Sample corpus output                                                                         

tools/web_spider_corpus_builder/corpus/web_sources/example.com_.txt

  • Sample output file from web spider execution
  • Contains extracted text content from example.com
+3/-0     
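
The main.rs entry above describes command-line argument parsing. A hedged reconstruction of the CLI surface it implies, based on the -m and -o flags used in the integration test and the args.markdown_files, args.url_file, and args.output_dir fields referenced in the review snippets below, might look like the following; every flag and field name other than -m and -o is an assumption, not the PR's actual code.

use std::path::PathBuf;
use clap::Parser;

/// Hypothetical clap derive struct; the real definition in main.rs may differ.
#[derive(Parser)]
#[command(name = "web_spider_corpus_builder")]
struct Args {
    /// Markdown files to scan for URLs
    #[arg(short = 'm', num_args = 1..)]
    markdown_files: Vec<PathBuf>,

    /// Optional file with one URL per line; takes precedence over markdown extraction
    #[arg(short = 'u')]
    url_file: Option<PathBuf>,

    /// Directory where corpus text files are written
    #[arg(short = 'o')]
    output_dir: PathBuf,
}

fn main() {
    let args = Args::parse();
    println!("Writing corpus files to: {}", args.output_dir.display());
}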

coderabbitai bot commented Sep 13, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 Security concerns

Resource exhaustion (DoS) risk:
The HTTP client is created without a timeout, so requests may hang indefinitely. Additionally, responses are fetched without size limits and then written to disk, which could allow large downloads to exhaust disk space. Consider setting a reasonable reqwest timeout, enforcing a maximum content length, and validating/limiting saved output sizes.
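
A minimal sketch of that hardening, assuming the same reqwest blocking client the PR already uses; the 15-second timeout and 2 MiB cap are illustrative values, not part of the PR:

use std::io::Read;
use std::time::Duration;
use reqwest::blocking::Client;

const MAX_BODY_BYTES: u64 = 2 * 1024 * 1024; // arbitrary cap on how much of a response is kept

fn fetch_bounded(client: &Client, url: &str) -> Result<Option<Vec<u8>>, Box<dyn std::error::Error>> {
    let response = client.get(url).send()?;
    // Reject oversized bodies up front when the server declares a Content-Length.
    if matches!(response.content_length(), Some(len) if len > MAX_BODY_BYTES) {
        return Ok(None);
    }
    // blocking::Response implements Read, so `take` bounds the download even
    // when no Content-Length header is sent.
    let mut body = Vec::new();
    response.take(MAX_BODY_BYTES).read_to_end(&mut body)?;
    Ok(Some(body))
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::builder()
        .timeout(Duration::from_secs(15)) // bound each request instead of hanging forever
        .build()?;
    if let Some(body) = fetch_bounded(&client, "https://example.com/")? {
        println!("Fetched {} bytes", body.len());
    }
    Ok(())
}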

⚡ Recommended focus areas for review

Non-portable Test

The integration test uses a hard-coded absolute path to the built binary, which will fail on most environments/CI. Use cargo-run or assert_cmd/cargo_bin to locate and run the binary portably.

let run_output = Command::new("/data/data/com.termux.nix/files/home/pick-up-nix/source/github/meta-introspector/submodules/target/debug/web_spider_corpus_builder")
    .arg("-m")
    .arg(&test_md_file)
    .arg("-o")
    .arg(&test_output_dir)
    .output()
    .expect("Failed to run web_spider_corpus_builder");
println!("web_spider_corpus_builder stdout: {}", String::from_utf8_lossy(&run_output.stdout));
println!("web_spider_corpus_builder stderr: {}", String::from_utf8_lossy(&run_output.stderr));
assert!(run_output.status.success(), "Spider run failed: {:?}", run_output);
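
For reference, a portable version using assert_cmd's cargo_bin helper could look like the sketch below. assert_cmd is not currently a dev-dependency of this tool, so this is a hedged illustration rather than a drop-in patch, and the fixture path is made up:

use assert_cmd::Command;

#[test]
fn spider_builds_corpus_from_markdown() {
    // cargo_bin resolves the binary Cargo built for this package, regardless of where
    // the workspace's target directory lives on the host machine.
    let mut cmd = Command::cargo_bin("web_spider_corpus_builder").unwrap();
    cmd.arg("-m")
        .arg("tests/fixtures/sample.md") // hypothetical fixture path, for illustration only
        .arg("-o")
        .arg("tests/output");
    cmd.assert().success();
}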
Data Overwrite

Files are written with fs::write to a deterministic name without checking for existence, contradicting the docs promise that existing files are never modified. This risks overwriting prior corpus files.

let file_name = sanitize(url.host_str().unwrap_or("unknown").to_string() + &url.path().replace('/', "_")) + ".txt";
let output_path = args.output_dir.join(file_name);

if let Some(ct) = content_type {
    if ct.contains("text/html") {
        let html_content = response.text()?;
        let document = Html::parse_document(&html_content);
        let selector = Selector::parse("p, h1").unwrap(); // Extract text from paragraph and heading tags
        let text_content: String = document.select(&selector)
            .map(|element| element.text().collect::<String>())
            .collect::<Vec<String>>()
            .join("\n");
        fs::write(&output_path, text_content)?;
        println!("Successfully wrote HTML content to: {}", output_path.display());
    } else if ct.contains("application/pdf") {
        // Handle PDF by just noting it, as direct text extraction is complex
        println!("  Skipping PDF: {}", url_str);
        fs::write(&output_path, "PDF content from: ".to_string() + &url_str)?;
    } else {
        // For other content types, just save raw bytes if desired, or skip
        println!("  Skipping unsupported content type ({}) : {}", ct, url_str);
        fs::write(&output_path, format!("Unsupported content type ({}) from: {}", ct, url_str))?;
    }
} else {
    println!("  No content type, skipping: {}", url_str);
    fs::write(&output_path, format!("No content type from: {}", url_str))?;
URL Extraction

The Markdown URL regex is overly restrictive (e.g., limits TLD length, struggles with parentheses/queries), likely missing valid URLs. Consider a more robust parser or relaxed patterns.

// Regex to find URLs in Markdown reference links: [text](url) or raw URLs
let re = Regex::new(r"(?i)\[[^\]]+\]\((https?://[^)]+\.[a-z]{2,6}(?:/[^)]*)?)\)|(https?://[^\s)]+\.[a-z]{2,6}(?:/[^\s)]*)?)")?;

for cap in re.captures_iter(&content) {
    if let Some(url_match) = cap.get(1).or_else(|| cap.get(2)) {
        urls.push(url_match.as_str().to_string());
    }
}
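
One robust-parser option is pulldown-cmark, which resolves inline and reference-style links without guessing at TLDs. A hedged sketch, assuming pulldown-cmark 0.9 (where Tag::Link is a tuple variant) were added as a dependency; bare URLs outside link syntax would still need a relaxed regex or a crate such as linkify:

use pulldown_cmark::{Event, Parser, Tag};

fn extract_urls_from_markdown_str(content: &str) -> Vec<String> {
    let mut urls = Vec::new();
    for event in Parser::new(content) {
        // Both [text](url) and [text][ref] links surface here with the resolved destination.
        if let Event::Start(Tag::Link(_link_type, dest, _title)) = event {
            if dest.starts_with("http://") || dest.starts_with("https://") {
                urls.push(dest.to_string());
            }
        }
    }
    urls
}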


PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Make integration test portable

Remove the hardcoded binary path and rely on the test-time binary path that Cargo provides, making the test portable. Use the CARGO_BIN_EXE_web_spider_corpus_builder environment variable to run the built binary and drop the separate build step. This fixes failures on other machines and CI environments.

tools/web_spider_corpus_builder/tests/integration_test.rs [23-39]

-// Build the spider executable
-let build_output = Command::new("cargo")
-    .arg("build")
-    .arg("-p")
-    .arg("web_spider_corpus_builder")
-    .output()
-    .expect("Failed to build web_spider_corpus_builder");
-assert!(build_output.status.success(), "Build failed: {:?}", build_output);
-
-// Run the spider executable with the dummy markdown file
-let run_output = Command::new("/data/data/com.termux.nix/files/home/pick-up-nix/source/github/meta-introspector/submodules/target/debug/web_spider_corpus_builder")
+// Run the spider executable with the dummy markdown file (portable path provided by Cargo)
+let bin_path = env!("CARGO_BIN_EXE_web_spider_corpus_builder");
+let run_output = Command::new(bin_path)
     .arg("-m")
     .arg(&test_md_file)
     .arg("-o")
     .arg(&test_output_dir)
     .output()
     .expect("Failed to run web_spider_corpus_builder");
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a hardcoded, user-specific path to the binary, which would cause the integration test to fail on any other machine, and provides the standard, portable solution.

Impact: High
Add timeout and sanitize input URLs

Prevent hangs by adding a reasonable HTTP timeout on the blocking client. Also
trim and skip empty/comment lines from URL files to avoid parse errors on blanks
and make input robust.

tools/web_spider_corpus_builder/src/main.rs [46-60]

 if let Some(url_file_path) = args.url_file {
     println!("Reading URLs from: {}", url_file_path.display());
     let content = fs::read_to_string(&url_file_path)?;
-    all_urls.extend(content.lines().map(|s| s.to_string()));
+    all_urls.extend(
+        content
+            .lines()
+            .map(str::trim)
+            .filter(|l| !l.is_empty() && !l.starts_with('#'))
+            .map(str::to_owned),
+    );
 } else {
     for md_file in &args.markdown_files {
         println!("Extracting URLs from: {}", md_file.display());
         let extracted = extract_urls_from_markdown(md_file)?;
         all_urls.extend(extracted);
     }
 }
 
 fs::create_dir_all(&args.output_dir)?;
 
-let client = Client::new();
+let client = Client::builder()
+    .timeout(std::time::Duration::from_secs(15))
+    .build()?;
Suggestion importance[1-10]: 8


Why: The suggestion correctly points out that the HTTP client lacks a timeout, which can cause the program to hang indefinitely, and also improves input robustness by handling empty or commented-out lines.

Impact: Medium
Ensure unique output filenames

Avoid overwriting existing files to honor the "never modify existing files"
requirement. Generate a unique output path by appending a numeric suffix when a
filename already exists. This prevents data loss across multiple runs or
duplicate URLs.

tools/web_spider_corpus_builder/src/main.rs [72-73]

-let file_name = sanitize(url.host_str().unwrap_or("unknown").to_string() + &url.path().replace('/', "_")) + ".txt";
-let output_path = args.output_dir.join(file_name);
+let mut base_name = sanitize(url.host_str().unwrap_or("unknown").to_string() + &url.path().replace('/', "_"));
+if base_name.is_empty() {
+    base_name = "unknown_".to_string();
+}
+let mut counter = 0usize;
+let mut output_path = loop {
+    let candidate = if counter == 0 {
+        args.output_dir.join(format!("{base_name}.txt"))
+    } else {
+        args.output_dir.join(format!("{base_name}__{counter}.txt"))
+    };
+    if !candidate.exists() {
+        break candidate;
+    }
+    counter += 1;
+};
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that the current code can overwrite files, which contradicts the documented requirement in SOP_Web_Spider_Corpus_Building.md and could lead to silent data loss.

Impact: Medium
