Queued links completely ignored. #256

Open

shroom00 opened this issue Jan 30, 2025 · 3 comments

@shroom00
With the following example, only the initial URL (wikipedia.org) is crawled. I've used the blacklist to avoid crawling the entirety of Wikipedia (technically no pages except the initial one). However, the URLs being sent to the queue are ignored, and I'm unsure why.

extern crate env_logger;
extern crate spider;

use env_logger::Env;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
pub async fn main() {
    let env = Env::default()
        .filter_or("RUST_LOG", "info")
        .write_style_or("RUST_LOG_STYLE", "always");

    env_logger::init_from_env(env);

    let urls = [
        "https://wikipedia.org",
        "https://google.com",
        "https://facebook.com",
    ];
    // Crawl starting from the first URL; the "wik" blacklist entry keeps any
    // further Wikipedia pages beyond the initial one from being crawled.
    let mut website = Website::new(urls[0])
        .with_depth(2)
        .with_limit(10)
        .with_external_domains(Some([String::from("*")].into_iter()))
        .with_blacklist_url(Some(vec![
            "wik".into()
        ]))
        .build()
        .unwrap();

    // Page guard, page subscription, and the queue the extra URLs are sent to.
    let mut g = website.subscribe_guard().unwrap();
    let mut rx2: spider::tokio::sync::broadcast::Receiver<spider::page::Page> =
        website.subscribe(1).unwrap();
    let q = website.queue(100).unwrap();

    let _ = tokio::spawn(async move {
        let mut first = true;

        while let Ok(res) = rx2.recv().await {
            // When the first page comes in, push the remaining URLs to the queue.
            if first {
                first = false;
                for url in &urls[1..] {
                    q.send(url.to_string()).unwrap();
                    println!("sent {} to queue.", &url);
                }
            }
            g.inc();
        }
    });

    website.crawl().await;
    website.unsubscribe();
}
@shroom00 (Author)

Due to the queue working in #257, I wonder whether this only happens when links are found on the initial page, or if it's some sort of race condition? Either way, it seems conditional. It's also interesting to note that in the referenced issue, the first URL sent to the queue is skipped entirely, but the one after it is visited.

@j-mendez (Member)

@shroom00 I would try to remove the println! and use stdout async from tokio. Can you try to add a small tokio::sleep before g.inc?
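
For reference, a rough sketch of what that suggestion might look like inside the spawned task from the example above. The 25 ms value is arbitrary, and tokio's async stdout requires the io-std feature (the sleep needs the time feature), which the follow-up below notes is not enabled in spider's re-exported tokio:

use std::time::Duration;
use spider::tokio::io::AsyncWriteExt;

while let Ok(res) = rx2.recv().await {
    if first {
        first = false;
        for url in &urls[1..] {
            q.send(url.to_string()).unwrap();
            // async stdout instead of the blocking println!
            tokio::io::stdout()
                .write_all(format!("sent {} to queue.\n", url).as_bytes())
                .await
                .unwrap();
        }
    }
    // small pause before releasing the page guard
    tokio::time::sleep(Duration::from_millis(25)).await;
    g.inc();
}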

@shroom00 (Author) commented Feb 1, 2025

> @shroom00 I would try to remove the println! and use stdout async from tokio. Can you try to add a small tokio::sleep before g.inc?

I removed the print entirely (spider's tokio doesn't have the io-std feature enabled, and it wasn't necessary enough to justify adding tokio to my Cargo.toml) and added a delay of 25 milliseconds, but it still didn't work. My terminal output is as follows:
[2025-02-01T22:50:37Z INFO spider::utils] fetch https://wikipedia.org
