Queued links completely ignored. #256

Open

shroom00 opened this issue Jan 30, 2025 · 3 comments

@shroom00
With the following example, only the initial URL (wikipedia.org) is crawled. I've used the blacklist to avoid crawling the entirety of Wikipedia (technically no pages except the initial one). However, the URLs being sent to the queue are ignored, and I'm unsure why.

extern crate env_logger;
extern crate spider;

use env_logger::Env;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
pub async fn main() {
    let env = Env::default()
        .filter_or("RUST_LOG", "info")
        .write_style_or("RUST_LOG_STYLE", "always");

    env_logger::init_from_env(env);

    let urls = [
        "https://wikipedia.org",
        "https://google.com",
        "https://facebook.com",
    ];
    // Crawl starting from the first URL; the "wik" blacklist entry keeps any
    // further Wikipedia pages beyond the initial one from being crawled.
    let mut website = Website::new(urls[0])
        .with_depth(2)
        .with_limit(10)
        .with_external_domains(Some([String::from("*")].into_iter()))
        .with_blacklist_url(Some(vec![
            "wik".into()
        ]))
        .build()
        .unwrap();

    // Page guard, page subscription, and the queue the extra URLs are sent to.
    let mut g = website.subscribe_guard().unwrap();
    let mut rx2: spider::tokio::sync::broadcast::Receiver<spider::page::Page> =
        website.subscribe(1).unwrap();
    let q = website.queue(100).unwrap();

    let _ = tokio::spawn(async move {
        let mut first = true;

        while let Ok(res) = rx2.recv().await {
            // When the first page comes in, push the remaining URLs to the queue.
            if first {
                first = false;
                for url in &urls[1..] {
                    q.send(url.to_string()).unwrap();
                    println!("sent {} to queue.", &url);
                }
            }
            g.inc();
        }
    });

    website.crawl().await;
    website.unsubscribe();
}
@shroom00 (Author)

Due to the queue working in #257, I wonder whether this only happens when links are found on the initial page, or if it's some sort of race condition? Either way, it seems conditional. It's also interesting to note that in the referenced issue, the first URL sent to the queue is skipped entirely, but the one after it is visited.

@j-mendez (Member)

@shroom00 I would try to remove the println! and use stdout async from tokio. Can you try to add a small tokio::sleep before g.inc?
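
For reference, a rough sketch of what that suggestion might look like inside the spawned task from the example above. The 25 ms value is arbitrary, and tokio's async stdout requires the io-std feature (the sleep needs the time feature), which the follow-up below notes is not enabled in spider's re-exported tokio:

use std::time::Duration;
use spider::tokio::io::AsyncWriteExt;

while let Ok(res) = rx2.recv().await {
    if first {
        first = false;
        for url in &urls[1..] {
            q.send(url.to_string()).unwrap();
            // async stdout instead of the blocking println!
            tokio::io::stdout()
                .write_all(format!("sent {} to queue.\n", url).as_bytes())
                .await
                .unwrap();
        }
    }
    // small pause before releasing the page guard
    tokio::time::sleep(Duration::from_millis(25)).await;
    g.inc();
}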

@shroom00 (Author) commented Feb 1, 2025

> @shroom00 I would try to remove the println! and use stdout async from tokio. Can you try to add a small tokio::sleep before g.inc?

I removed the print entirely (spider's tokio doesn't have the io-std feature enabled, and it wasn't necessary enough to justify adding tokio to my Cargo.toml) and added a delay of 25 milliseconds, but it still didn't work. My terminal output is as follows:
[2025-02-01T22:50:37Z INFO spider::utils] fetch https://wikipedia.org
