Option --crawl-replace-urls does not replace the crawled URLs #131

Open
VAdri opened this issue Oct 17, 2024 · 3 comments
@VAdri

VAdri commented Oct 17, 2024

The option --crawl-replace-urls is documented as:

Replace URLs of saved pages with relative paths of saved pages on the filesystem

So, if I understand correctly, the HTML saved by single-file should have every URL crawled via the --crawl-links option replaced by the path of the file that page was saved to.

However, when I try this command I get only the original URLs:

./single-file-x86_64-linux https://example.com --crawl-links=true --crawl-max-depth=1 --crawl-inner-links-only=false --crawl-replace-urls=true

I also tried this command from the README using the option --crawl-rewrite-rule but it did not work either:

./single-file-x86_64-linux https://www.wikipedia.org --crawl-links=true --crawl-inner-links-only=true --crawl-max-depth=1 --crawl-rewrite-rule="^(.*)\\?.*$ $1"
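For readers unfamiliar with the rule in that command, here is a minimal sketch of what the pattern `^(.*)\?.*$` with replacement `$1` is meant to do: drop the query string from a URL. It applies the same regular expression with `sed`, outside of single-file. The helper name `strip_query` and the sample URLs are hypothetical, and the sketch only illustrates the regex, not how single-file applies rewrite rules internally.

```shell
# Hypothetical helper: strip the query string from a URL using the same
# pattern as the rewrite rule (everything before the last "?" is kept).
strip_query() {
  printf '%s\n' "$1" | sed -E 's/^(.*)\?.*$/\1/'
}

# A URL with a query string loses everything from "?" onward.
strip_query 'https://www.wikipedia.org/page?uselang=en'

# A URL without a query string does not match and is printed unchanged.
strip_query 'https://www.wikipedia.org/page'
```

Note that in a POSIX shell, `$1` inside double quotes is expanded by the shell before the program receives the argument, so quoting the rule with single quotes may be worth trying. Whether that affects the behaviour reported here is not established in this thread.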

I was able to make it work on v2.0.0 but not since v2.0.2.

@gildas-lormeau
Owner

In the first example, there are no inner links. The second example no longer works (I'm pretty sure it used to work in the past) because the page contains no link whose resolved URL starts with "https://www.wikipedia.org/".
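The "starts with" condition described in this comment can be sketched as a plain prefix test. This is only an illustration of that sentence, assuming an inner link is one whose resolved URL begins with the start URL. The helper name `is_inner_link` and the sample links are hypothetical, and single-file's actual check may differ.

```shell
# Hypothetical helper: succeeds when the link URL ($2) starts with the base URL ($1).
is_inner_link() {
  case "$2" in
    "$1"*) return 0 ;;
    *) return 1 ;;
  esac
}

base='https://www.wikipedia.org/'
for link in 'https://www.wikipedia.org/portal' 'https://en.wikipedia.org/'; do
  if is_inner_link "$base" "$link"; then
    echo "inner: $link"
  else
    echo "outer: $link"
  fi
done
```

Under this assumption, a link to a different host, such as a language subdomain, does not count as an inner link, so nothing would be crawled when --crawl-inner-links-only=true is set.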

@VAdri
Author

VAdri commented Oct 22, 2024

Is it supposed to work only for inner links? I ask because I did set the option --crawl-inner-links-only=false in my first example.

But even with inner links only, it apparently still doesn't work:

./single-file https://matklad.github.io/2024/09/23/what-is-io-uring.html --crawl-links=true --crawl-max-depth=1 --crawl-inner-links-only=true --crawl-replace-urls=true

@gildas-lormeau
Owner

That failed because the --crawl-replace-urls option did not work when written in lowercase. I fixed this issue in the latest version, which I've just published.
