Cache expiration #99

Open
anewuser opened this issue Jul 23, 2022 · 4 comments

Comments

@anewuser
Contributor

anewuser commented Jul 23, 2022

When pipes have download blocks or are too slow to process, consider caching the output feed for a (much) longer time. You could also add these tags to such feeds to suggest that feed readers not update them too often.

I'd be fine with it if my feeds that fit this description were automatically cached for three days or even longer to save everyone's bandwidth and server resources.

This could also be added as an option for us to manually mark pipes that don't need to be updated for a long time. I have some pipes with download blocks that only really need to be checked once a month.

@onli
Member

onli commented Jul 29, 2022

Hm, I like the idea. We would need another caching layer (the third), this one for a specific download block (so not per URL, as in https://github.com/pipes-digital/pipes/blob/master/downloader.rb), and then a way to invalidate that cache.

Or indeed as a per-pipe option, changing the cache logic in pipes/pipe.rb (line 85 in ea379d5):

def run(mode: :xml)
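
To make the per-pipe option concrete, here is a minimal sketch of an output cache whose lifetime is chosen per pipe. It is not taken from the Pipes codebase: the class name `PipeOutputCache`, both TTL constants, and the in-memory hash are assumptions for illustration, and a real version would hook into the existing cache logic in `run`.

```ruby
# Hypothetical sketch: none of these names come from the Pipes codebase.
class PipeOutputCache
  DEFAULT_TTL = 10 * 60            # assumed default: ten minutes
  LONG_TTL    = 3 * 24 * 60 * 60   # three days, as suggested earlier in the thread

  Entry = Struct.new(:body, :stored_at)

  # The clock is injectable so expiry can be tested without sleeping.
  def initialize(clock: -> { Time.now.to_f })
    @clock = clock
    @entries = {}
  end

  # Returns the cached output for pipe_id while it is younger than ttl;
  # otherwise runs the block, stores its result, and returns it.
  def fetch(pipe_id, ttl: DEFAULT_TTL)
    now = @clock.call
    entry = @entries[pipe_id]
    return entry.body if entry && now - entry.stored_at < ttl

    body = yield
    @entries[pipe_id] = Entry.new(body, now)
    body
  end

  # Manual invalidation, e.g. after the pipe has been edited.
  def invalidate(pipe_id)
    @entries.delete(pipe_id)
  end
end
```

A pipe marked as slow-changing would then call `cache.fetch(id, ttl: PipeOutputCache::LONG_TTL) { ... }` around its run, while `invalidate` covers the "way to invalidate that cache" part.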

@anewuser
Contributor Author

anewuser commented Aug 7, 2022

Something else related to this: when a pipe is configured to download two or more URLs from the same domain in a row (as in a ForEach+Download block), it seems that all connections are made as fast as possible, which can trigger DoS protections automatically. It'd be interesting to add a timer between each connection to avoid that. The pipe requests would look even more organic if the timer changed randomly, as in this Tab Reloader option:

[image: ad]
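
A randomized delay of the kind described here could look roughly like the following sketch. It uses only the Ruby standard library and is not how Pipes schedules downloads; `JitteredHostDelay`, its parameters, and the 1 to 5 second default range are all assumptions.

```ruby
# Hypothetical sketch, not the Pipes implementation.
class JitteredHostDelay
  def initialize(min_delay: 1.0, max_delay: 5.0,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) },
                 rng: Random.new)
    @min_delay = min_delay
    @max_delay = max_delay
    @clock = clock
    @sleeper = sleeper
    @rng = rng
    @next_allowed = Hash.new(0.0) # host => earliest time of the next request
    @mutex = Mutex.new
  end

  # Waits until the host may be contacted again, then runs the block
  # and returns its result.
  def request(host)
    wait = @mutex.synchronize do
      now = @clock.call
      start = [now, @next_allowed[host]].max
      # Reserve this slot and push the next one a random interval later.
      @next_allowed[host] = start + @rng.rand(@min_delay..@max_delay)
      start - now
    end
    @sleeper.call(wait) if wait > 0
    yield
  end
end
```

Each host gets its own schedule, so a ForEach over one domain is spread out while requests to other domains are not held up.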

@onli
Member

onli commented Aug 7, 2022

It's not supposed to be as fast as possible. The downloads are put into a ThrottleQueue, divided by domain. See https://github.com/pipes-digital/pipes/blob/master/downloader.rb#L21-L25:

@@limiters[url.host] = ThrottleQueue.new 0.4 if ! @@limiters[url.host]
result = ""
@@limiters[url.host].foreground(rand) {
    result = _get(url, js)
}

If that leads to as fast as possible in a foreach, that would be a bug :/
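
One way to check whether a ForEach really bypasses the throttle would be to log a timestamp per request to a host and look at the smallest gap. The helpers below are a hypothetical diagnostic, not part of Pipes, and `min_interval` is an assumed threshold that would have to be derived from the limiter's actual configuration.

```ruby
# Hypothetical diagnostic helpers, not part of Pipes.

# Smallest gap in seconds between consecutive request timestamps,
# or nil when fewer than two timestamps were recorded.
def min_gap(timestamps)
  timestamps.sort.each_cons(2).map { |a, b| b - a }.min
end

# True when no two requests were closer together than min_interval.
def throttled?(timestamps, min_interval)
  gap = min_gap(timestamps)
  gap.nil? || gap >= min_interval
end
```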

@anewuser
Contributor Author

anewuser commented Aug 7, 2022

This is one of the blogs that stopped my pipe with captchas when I was trying different combinations of Pipes blocks: https://pastebin.mozilla.org/kN5HgRk6

The problematic pipe was downloading the latest 3 or 4 posts. As I kept making changes and previewing them, the blog rightfully detected the pipe as a bot.

Another note on caching everything more aggressively: this doesn't need to be done with pipes that only connect to feedburner.com or youtube.com, since Google is unlikely to limit Pipes or suffer because of it. I've also started creating FeedBurner proxies for all of my feeds that go through download blocks as a way to lower the number of Pipes requests to their domains.

This site, on the other hand, can barely handle its human visitors, so the less often Pipes downloads its front page with the pipe I have for it, the better:

[image: error]
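
The distinction between robust hosts and fragile ones could be sketched as a TTL chooser like the one below. The domain list comes from this comment; the function names and both TTL values are assumptions, and the sketch presumes the pipe's source URLs are known up front.

```ruby
require "uri"

# Hosts that do not need protecting, per the comment above.
ROBUST_DOMAINS = %w[feedburner.com youtube.com].freeze
SHORT_TTL = 10 * 60            # assumed regular cache time: ten minutes
LONG_TTL  = 3 * 24 * 60 * 60   # assumed aggressive cache time: three days

# True when host is one of the robust domains or a subdomain of one.
def robust_host?(host)
  ROBUST_DOMAINS.any? { |domain| host == domain || host.end_with?(".#{domain}") }
end

# Picks the long TTL as soon as any URL points at a host outside the list.
def cache_ttl_for(urls)
  hosts = urls.map { |url| URI.parse(url).host.to_s.downcase }
  hosts.all? { |host| robust_host?(host) } ? SHORT_TTL : LONG_TTL
end
```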
