
Add recursive option #78

Open
styfle opened this issue Dec 14, 2020 · 35 comments · May be fixed by #1603

@styfle

styfle commented Dec 14, 2020

It would be nice to pass a URL and have it crawl the entire website recursively looking for dead links.

In order to avoid crawling the entire internet, it should stop recursing once a request no longer matches the original domain.

@mre
Member

mre commented Dec 14, 2020

Yeah, it has been discussed a few times already, and it's high on the to-do list.
It will require quite some restructuring, I guess.
Right now the flow is

main -> extractor -> channel -> client (link checker) -> main

but there is no connection back to the extractor.
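
For anyone new to the discussion, the missing piece is essentially a back edge from the checker to the extractor. A minimal, simplified sketch of that feedback loop (hypothetical names, not lychee's actual code):

use std::collections::{HashSet, VecDeque};

// Hypothetical, simplified sketch: recursion is essentially a work queue
// ("frontier") with a back edge that the checker can push newly extracted
// links onto, which the current one-way
// main -> extractor -> channel -> client -> main flow lacks.
fn crawl(start: String) {
    let mut frontier: VecDeque<String> = VecDeque::from([start]);
    let mut seen: HashSet<String> = HashSet::new();

    while let Some(url) = frontier.pop_front() {
        if !seen.insert(url.clone()) {
            continue; // already checked this URL
        }
        // Placeholder for the existing client (check) and extractor (parse) stages.
        for link in check_and_extract(&url) {
            frontier.push_back(link); // the feedback edge into the frontier
        }
    }
}

// Placeholder: check `url` and return the links found in its body.
fn check_and_extract(_url: &str) -> Vec<String> {
    Vec::new()
}

fn main() {
    crawl("https://example.com/".to_string());
}

The real implementation would of course keep the channel-based concurrency; the point is only the extra edge from the checker's output back into the extractor's input.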

@frederickjh

The lack of recursive spidering makes this project unusable for my purpose of checking all links, internal and external, on a website. I am trying to find a replacement for michaeltelford/broken_link_finder. It is written in Ruby, and without superuser access at the new place where this will run, it is impossible to install. I am looking for a "portable" replacement. I worked with michaeltelford to get his project into a much more usable state; check out that project's issue queue for some of the reasoning that went into its development.

In any case, regarding this issue: the spidering should stop at links found on the current domain, but links to external sources should still be checked.
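
In other words, every discovered link is checked, but only links on the starting domain are recursed into. A tiny sketch of that policy, assuming the url crate (should_recurse is a hypothetical helper, not lychee's API):

use url::Url;

// Hypothetical policy helper: external links are still checked for liveness,
// but only links on the same host as the root are fed back into the crawl.
fn should_recurse(root: &Url, link: &Url) -> bool {
    root.host_str() == link.host_str()
}

fn main() {
    let root = Url::parse("https://example.com/").unwrap();
    let internal = Url::parse("https://example.com/about").unwrap();
    let external = Url::parse("https://other.org/page").unwrap();

    // Both links would be checked; only the internal one would be recursed into.
    assert!(should_recurse(&root, &internal));
    assert!(!should_recurse(&root, &external));
}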

@mre
Member

mre commented Mar 24, 2021

#165 is getting very close to completion. It implements the functionality described here. If you would like to help, please build the version from that branch and test it. Feedback on the pull request is appreciated.

@frederickjh

@mre I am new to Rust, but it seemed pretty straightforward how to build it, following "Working on an Existing Cargo Package". However, I have run into an issue. At first I thought it was a credential issue, but it looks like a 404 issue.


Caused by:
Unable to update https://github.com/amaurym/async-smtp?branch=am-fast-socks#eac57391

That GitHub URL returns a 404, so I am not sure how to proceed with the build.

Rust information:

$ rustc --version
rustc 1.50.0 (cb75ad5db 2021-02-10)
$ rustup --version
rustup 1.23.1 (3df2264a9 2020-11-30)
info: This is the version for the rustup toolchain manager, not the rustc compiler.
info: The currently active `rustc` version is `rustc 1.50.0 (cb75ad5db 2021-02-10)`
$ cargo --version
cargo 1.50.0 (f04e7fab7 2021-02-04)

The GitHub repository amaurym/async-smtp does not seem to exist anymore.
Not sure how to proceed. Please advise. Thanks!

@frederickjh

I dug through the Cargo.lock file for reacherhq/check-if-email-exists and found this line with the source for async-smtp:

source = "git+https://github.com/async-email/async-smtp?branch=master#0f1c4c6a565833f8c7fc314de84c4cbbc8da2b4a"

So it looks like the source for async-smtp has moved to async-email/async-smtp.

@frederickjh

Just found #189, which reports the same build failure I described here.

@mre
Member

mre commented Mar 25, 2021

Yeah. Related:

We are blocked by upstream at the moment. 😕

@frederickjh

Ah, upstream reacherhq/check-if-email-exists was updated to use the upstream async-smtp instead of his fork three days ago. I am guessing he also deleted the fork then, but this project is still using it.

See here: chore: Update wording around licenses #892

This repository still has references to that now-deleted fork in the Cargo.lock files on both the master and simple-recursion branches. There is a comment in the Cargo.toml file that says:

# Switch back to version on crates.io after
# https://github.com/async-email/async-smtp/pull/36
# is merged and a new version of check-if-email-exists is released

So it looks like pull request 36 upstream is closed, but a new version of check-if-email-exists has not been published, as the newest release is dated January 10.

@mre let me know if there is any movement on this and I will then try to build from the simple-recursion branch and test.

@frederickjh

Upstream is reporting:

This is fixed in 0.8.21

but I still cannot build from the simple-recursion branch, so I think something there needs work before this will build.

@frederickjh

So, I think that the version of async-email/async-smtp needs to be upgraded from 0.8.19 to 0.8.21 for the simple-recursion branch to build.

@mre
Member

mre commented Apr 14, 2021

Thanks for the info. I'll tackle that once #208 is merged. 😄

@frederickjh

@mre I see that #208 got merged back in April. Let me know if you get this branch to the point where it will build and I can then test it.

@frederickjh

@mre I am still willing to test this, but I will be finished with my current job in the second week of June and may not need it after that for a while. I would like to set this up to replace the current program we are using to check for broken links, which I cannot easily move to a shared server because it is written in Ruby. Let me know if you get the dependency version on simple-recursion changed so that the branch builds, and I will test it.

@mre
Member

mre commented May 19, 2021

Thanks for your patience. Want to work on this as soon as I find the time. No guarantees this will be soon, though. 😅

@frederickjh

@mre Patience I have, but time is running out. I finish at my current workplace on June 9. I had hoped to use this to replace a Ruby broken-link checker that is running on an in-house server I need to decommission. I need something I can run on shared hosting without installing a bunch of dependencies, which I don't have permission to do.
So, if you could find a little time this week to get the branch into a shape where it builds, I could build and test it.

@untitaker
Collaborator

untitaker commented Nov 14, 2021

I found that muffet and linkcheck serve the recursive use case best right now, and muffet in particular is very fast at this. What neither does is opportunistically check /sitemap.xml to traverse the site faster and reach efficient parallelization sooner. Lychee could one-up them on performance if that were done by default.
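
For illustration, opportunistically seeding the crawl from /sitemap.xml could be as simple as fetching it and pulling out the <loc> entries before the recursive walk starts. A rough sketch, assuming the reqwest crate with its blocking feature and deliberately naive parsing:

// Rough, illustrative sketch: try /sitemap.xml and use its <loc> entries to
// seed the crawl frontier; fall back to the root page if the sitemap is
// missing. Real code would use a proper XML parser and handle sitemap indexes.
fn seed_from_sitemap(origin: &str) -> Vec<String> {
    let sitemap_url = format!("{origin}/sitemap.xml");
    let body = reqwest::blocking::get(&sitemap_url)
        .and_then(|resp| resp.error_for_status())
        .and_then(|resp| resp.text());
    match body {
        Ok(xml) => xml
            .split("<loc>")
            .skip(1)
            .filter_map(|chunk| chunk.split("</loc>").next())
            .map(|loc| loc.trim().to_string())
            .collect(),
        Err(_) => vec![origin.to_string()], // no sitemap: start from the root page
    }
}

fn main() {
    for url in seed_from_sitemap("https://example.com") {
        println!("{url}");
    }
}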

@mre
Member

mre commented Feb 4, 2022

New PR which tackles this: #465
It will probably go through another round of refactoring before it's ready, but I'm on it.

@cipriancraciun

I'm unsure how this is actually implemented, so perhaps what I am about to say is already covered; sorry for the duplication.

Recursion is also very important to me, but I would like lychee to allow the user to specify a list of origins (scheme+host+port) for which recursion is allowed, or a list of regular expressions.

Say, for example, one has both a www and a blog site, but also a docs site. One would like to primarily check www and blog (thus specifying them as arguments), but also to recursively check anything on any of the three that links towards the docs site, plus the docs pages reached from those links.

@mre
Member

mre commented Mar 10, 2022

Good point. It's not implemented and wasn't mentioned before.

The way I envisioned it was that all links which belong to the same input URI would be followed recursively, while the rest would not. So you could do lychee --recursive www blog docs, but it sounds like you only want to check the links pointing to docs, not all of docs. I wonder what the issue with checking all links in docs is, though. Is the site too big?
If you want to exclude some URI patterns for docs, you could do lychee --exclude docs/foo --recursive www blog docs.

@cipriancraciun

So you could do lychee --recursive www blog docs, but it sounds like you only want to check the links pointing to docs, not all of docs. I wonder what the issue with checking all links in docs is, though. Is the site too big?

Imagine that instead of docs there is actually an assets domain, which might not even have an index to start from; however, this assets domain could contain some HTML files that are included via <iframe> inside www or blog, and some of these HTML files are somewhat self-contained, so starting from them one wouldn't reach the entire assets collection. Now, if one of these HTML files actually contains broken links, that could affect the initial www and blog sites.

Or, as another example, that extra domain could be hosting example HTML pages that are linked from the main site, and one would like to make sure that every example works as expected.


Or, if the above reasons don't seem convincing enough (granted, they are quite extreme): I assume that inside the code there already exists a set of "allowed" domains or origins for recursion, filled in at startup based on the starting links; allowing the user to manipulate that set wouldn't be much of a burden, but it would increase flexibility.

@mre
Member

mre commented Mar 11, 2022

Hm... the main question is always how to wire that up in the CLI without adding extra mental overhead.
Following our --exclude/--include patterns, we could add an --include-recursive parameter:

lychee --recursive --include-recursive docs -- www blog docs
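
Just to make the semantics concrete, here is a sketch of what such a proposed flag could map to internally: the input sites are always eligible for recursion, and --include-recursive widens the allowlist of hosts without making them inputs (hypothetical types, assuming the url crate; not a committed design):

use std::collections::HashSet;
use url::Url;

// Sketch only: one possible internal shape for the proposed --include-recursive
// flag. Inputs are always eligible for recursion; the extra list widens
// recursion to additional hosts without turning them into inputs.
struct RecursionPolicy {
    allowed_hosts: HashSet<String>,
}

impl RecursionPolicy {
    fn new(inputs: &[&str], include_recursive: &[&str]) -> Self {
        let allowed_hosts = inputs
            .iter()
            .chain(include_recursive)
            .filter_map(|s| Url::parse(s).ok())
            .filter_map(|u| u.host_str().map(str::to_owned))
            .collect();
        Self { allowed_hosts }
    }

    fn should_recurse(&self, link: &Url) -> bool {
        link.host_str()
            .map(|host| self.allowed_hosts.contains(host))
            .unwrap_or(false)
    }
}

fn main() {
    // Roughly: lychee --recursive --include-recursive https://docs.example.com \
    //     -- https://www.example.com https://blog.example.com
    let policy = RecursionPolicy::new(
        &["https://www.example.com", "https://blog.example.com"],
        &["https://docs.example.com"],
    );
    let docs_page = Url::parse("https://docs.example.com/guide").unwrap();
    let elsewhere = Url::parse("https://assets.example.com/x.html").unwrap();
    assert!(policy.should_recurse(&docs_page));
    assert!(!policy.should_recurse(&elsewhere));
}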

@frittentheke

frittentheke commented Aug 11, 2023

@mre could you maybe give an update on support for traversing / recursive link checking to tackle a whole website?
From what I could find, #465 was the most recent attempt to get the design for this feature down?

@mre
Member

mre commented Aug 11, 2023

Yes, sure.

There were a few attempts, but there were always issues with the design. It's a feature which touches on almost all parts of the code and we have to get this right.

I'd love to dedicate more time to it, but it's hard to fit this feature in next to other responsibilities. I am currently looking into companies that might be willing to sponsor the feature, as I expect it will be quite some work, but it would have a very positive impact on the tool's usefulness for all users. I know there are companies out there that would really like to have it, but so far there hasn't been much traction with regard to sponsoring.
My hope is to still find the free time to work on it at some point, but I wouldn't hold my breath right now unless there's a way to fund this. In the meantime, I encourage others to take a stab at it as well.

@styfle
Author

styfle commented Aug 11, 2023

I'll close this since I already built a solution.

https://github.com/styfle/links-awakening

https://www.npmjs.com/package/links-awakening

@styfle styfle closed this as completed Aug 11, 2023
@mre
Member

mre commented Aug 11, 2023

Nice package.
I would still like to keep the issue open, as I'd like to add recursion support to lychee at some point as well.

@Alseenrodelap

Still no recursive option in the main branch since 2020? I'm trying to run this great program via Docker but really miss the recursive option...

@lfrancke

lfrancke commented Jan 5, 2024

I'm happy to offer a bounty of sorts: 100€ (payable via PayPal or SEPA) for whoever implements this. If multiple people work on it, I'm happy to split the money.

I know this won't cover the whole development of this feature.

@ewen-lbh

@lfrancke hiya, does your offer still stand?

@lfrancke

Thanks for checking. Yes, it does.

@lukehsiao

lukehsiao commented Jan 1, 2025

Somewhat tangential: I'm a big fan of lychee, and my workaround is to just link-check all the links listed in a sitemap.

I have a cli tool for this: https://github.com/lukehsiao/sitemap2urllist

sitemap2urllist https://www.example.com/sitemap.xml --cache | xargs lychee --cache

which seems to serve me pretty well.

Likely will be obsolete if/once lychee adds recursive support, but perhaps useful nonetheless.

@ewen-lbh

ewen-lbh commented Jan 3, 2025

Thanks for checking. Yes, it does.

I saw that even people familiar with the codebase failed after three attempts because of design issues, so I must admit I'm a bit intimidated haha ^^

I guess a massively parallel recursive walk is the goal? @mre, was the v3 attempt closed because of a design issue, or just because it still had bugs and had drifted too much from main?

@mre
Member

mre commented Jan 3, 2025

Yeah, the third attempt changed too many things at the same time; I ran into issues and let the branch diverge too much, which made progress harder.

In my opinion, the first attempt, while simplistic, had the best chance to get merged. With that first attempt, I ran into a weird edge case where the pipeline would not terminate. It simply got stuck somewhere. Probably because my count of outstanding requests was off.

If you are willing to give it a try, feel free to take a look at the different versions to see which one you like best and pick up the work from there. Of course, a "clean room" implementation might also work. In fact, too much familiarity with the codebase might be a hindrance for finding a good solution.

In any case, thanks for looking into that! Good luck and if you have any questions, feel free to reach out here or send me an email so that we can chat about ideas or brainstorm a bit.
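
For anyone picking this up: the non-termination mre mentions usually boils down to knowing when no work is left anywhere in the pipeline. A hedged sketch of one common pattern, an explicit in-flight counter in the coordinator (assuming the tokio crate; hypothetical names, not any of the existing branches):

use tokio::sync::mpsc;

// Sketch: the coordinator counts outstanding requests. It increments when it
// dispatches a URL and decrements when a result comes back; when the count is
// zero and the frontier is empty, nothing can produce new work, so the loop
// can exit instead of waiting on the channel forever.
#[tokio::main]
async fn main() {
    let (result_tx, mut result_rx) = mpsc::channel::<Vec<String>>(64);
    let mut frontier = vec!["https://example.com/".to_string()];
    let mut in_flight: usize = 0;

    loop {
        // Dispatch everything currently queued.
        while let Some(url) = frontier.pop() {
            in_flight += 1;
            let tx = result_tx.clone();
            tokio::spawn(async move {
                // Placeholder for "check this URL and extract its links".
                let discovered = check_and_extract(&url).await;
                let _ = tx.send(discovered).await;
            });
        }

        if in_flight == 0 {
            break; // frontier empty and nothing outstanding: done
        }

        // Wait for one result; its links become new frontier entries.
        if let Some(links) = result_rx.recv().await {
            in_flight -= 1;
            frontier.extend(links);
        }
    }
}

// Placeholder for the client + extractor stages.
async fn check_and_extract(_url: &str) -> Vec<String> {
    Vec::new()
}

Deduplication and recursion filters are omitted for brevity; the point is only the shutdown condition.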

@ewen-lbh

ewen-lbh commented Jan 4, 2025

OK, so I implemented something. It's more of a PoC for now, but it works.

I just, uuuuh, left the implementation of a recursion depth limit for later and, for funsies, tried it on en.wikipedia.org and then std.rust-lang.org with --max-concurrency 4096... I am currently on mobile data tethering while my wifi box reboots, because I literally kicked myself off the internet (literally: Spotify did not load anymore, I couldn't ssh, couldn't push to GitHub, etc.).

The implementation might be a little too "efficient" lmao.

EDIT: the wifi box reboot was successful; I might've overwhelmed its router lmao. I got my connection back up.

@ewen-lbh

ewen-lbh commented Jan 4, 2025

There's also an issue with tokio never unblocking sends on the responses channel once it fills up. A band-aid is to bump --max-concurrency, but it's not a real solution.

@ewen-lbh ewen-lbh linked a pull request Jan 4, 2025 that will close this issue
@mre
Member

mre commented Jan 6, 2025

Great!

The implementation might be a little too "efficient" lmao.

haha. 😆

A band-aid is to bump --max-concurrency, but it's not a real solution.

You're right. Simply increasing max-concurrency is indeed just masking the underlying issue by giving more room before it manifests, rather than addressing why the channel isn't properly draining in the first place.

I wonder if it's just backpressure.
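
If it is backpressure, the classic failure mode is a worker awaiting send() into a full frontier channel while the coordinator is itself waiting on that worker's result, so nothing ever drains. One illustrative mitigation (a sketch assuming tokio, not a review of the linked PR) is to never await on the back edge and park overflow locally instead:

use std::collections::VecDeque;
use tokio::sync::mpsc;

// Illustrative sketch: discovered links are pushed with try_send, and anything
// that doesn't fit in the bounded channel waits in a local overflow buffer,
// so a worker never blocks on a full channel while the coordinator is in turn
// waiting on that worker.
fn push_back_edge(
    tx: &mpsc::Sender<String>,
    overflow: &mut VecDeque<String>,
    links: Vec<String>,
) {
    overflow.extend(links);
    while let Some(link) = overflow.pop_front() {
        match tx.try_send(link) {
            Ok(()) => {}
            Err(mpsc::error::TrySendError::Full(link)) => {
                overflow.push_front(link); // channel full: keep it for later
                break;
            }
            Err(mpsc::error::TrySendError::Closed(_)) => break, // receiver gone
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<String>(1);
    let mut overflow = VecDeque::new();

    // Two links but capacity for one: the second waits in `overflow`
    // instead of blocking the caller.
    push_back_edge(&tx, &mut overflow, vec!["a".into(), "b".into()]);
    assert_eq!(overflow.len(), 1);
    assert_eq!(rx.recv().await.as_deref(), Some("a"));
}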

@mre mre added this to the v1.0 milestone Jan 6, 2025