With the following example, only the initial URL (wikipedia.org) is crawled. I've used the blacklist so that the entirety of Wikipedia isn't crawled (technically no pages except the initial one). However, the URLs being sent to the queue are ignored, and I'm not sure why.
```rust
extern crate env_logger;
extern crate spider;

use env_logger::Env;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
pub async fn main() {
    let env = Env::default()
        .filter_or("RUST_LOG", "info")
        .write_style_or("RUST_LOG_STYLE", "always");
    env_logger::init_from_env(env);

    let urls = [
        "https://wikipedia.org",
        "https://google.com",
        "https://facebook.com",
    ];

    // Blacklist anything containing "wik" so only the initial wikipedia.org
    // page is crawled; external domains are allowed so the queued URLs can
    // be visited.
    let mut website = Website::new(urls[0])
        .with_depth(2)
        .with_limit(10)
        .with_external_domains(Some([String::from("*")].into_iter()))
        .with_blacklist_url(Some(vec!["wik".into()]))
        .build()
        .unwrap();

    let mut g = website.subscribe_guard().unwrap();
    let mut rx2: spider::tokio::sync::broadcast::Receiver<spider::page::Page> =
        website.subscribe(1).unwrap();
    let q = website.queue(100).unwrap();

    // After the first page comes through the subscription, push the
    // remaining URLs onto the queue.
    let _ = tokio::spawn(async move {
        let mut first = true;
        while let Ok(res) = rx2.recv().await {
            if first {
                first = false;
                for url in &urls[1..] {
                    q.send(url.to_string()).unwrap();
                    println!("sent {} to queue.", &url);
                }
            }
            g.inc();
        }
    });

    website.crawl().await;
    website.unsubscribe();
}
```
Since the queue does work in #257, I wonder whether this only happens when links are found on the initial page, or whether it's some sort of race condition? Either way, it seems conditional. It's also worth noting that in the referenced issue, the first URL sent to the queue is skipped entirely, but the one after it is visited.
@shroom00 I would try removing the println! and using tokio's async stdout instead. Can you also try adding a small tokio::sleep before g.inc?
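For reference, a minimal sketch of what that suggestion might look like in the spawned task above. This assumes the `io-std`/`io-util` tokio features are available for the async stdout part (which may not be the case via spider's re-exported tokio, as noted in the next reply) and uses `tokio::time::sleep` for the delay; the 25 ms value is just an illustration.

```rust
use std::time::Duration;
use tokio::io::AsyncWriteExt; // requires tokio's io-util feature

let _ = tokio::spawn(async move {
    let mut first = true;
    // requires tokio's io-std feature
    let mut stdout = tokio::io::stdout();
    while let Ok(_res) = rx2.recv().await {
        if first {
            first = false;
            for url in &urls[1..] {
                q.send(url.to_string()).unwrap();
                // Async write instead of the blocking println! macro.
                let _ = stdout
                    .write_all(format!("sent {} to queue.\n", url).as_bytes())
                    .await;
            }
        }
        // Small pause before marking the page as processed, to rule out a
        // race between queueing the URLs and the crawler draining its tasks.
        tokio::time::sleep(Duration::from_millis(25)).await;
        g.inc();
    }
});
```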
I removed the print entirely (spider's re-exported tokio doesn't have the io-std feature enabled, and it wasn't important enough to justify adding tokio to my own Cargo.toml) and added a delay of 25 milliseconds, but it still didn't work. My terminal output is as follows:

```
[2025-02-01T22:50:37Z INFO spider::utils] fetch https://wikipedia.org
```