
[Bug]: Error creating WACZ #2095

Open
prestonvanloon opened this issue Oct 1, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@prestonvanloon

Browsertrix Version

v1.11.7-7a61568

What did you expect to happen? What happened instead?

I am having some DNS issues, probably from resource exhaustion. (Also filed #2094 to allow cpu_limits on crawler)

Error: getaddrinfo EAI_AGAIN my-minio-domain
    at GetAddrInfoReqWrap.onlookupall [as oncomplete] (node:dns:120:26)

When I see this error, the entire crawl is lost, which is frustrating when the crawl has run for 24 hours. I wish the WACZ upload were attempted multiple times until it eventually completes or some retry threshold is met.

Reproduction instructions

Not sure. I'm using kind 0.24.0. The cluster config is standard; it just opens the NodePort.

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 30870
    hostPort: 8080
    listenAddress: "0.0.0.0"
    protocol: TCP

I'm using an external MinIO S3 instance. The MinIO instance has to be behind HTTPS for replays to work, so I cannot simply use its IP address.

Screenshots / Video

No response

Environment

No response

Additional details

I've tried every workaround that I could imagine.

@prestonvanloon prestonvanloon added the bug Something isn't working label Oct 1, 2024
@prestonvanloon
Author

#1137 might be a solution, if that feature request was implemented.

@prestonvanloon
Author

prestonvanloon commented Oct 2, 2024

After a bit of reverse engineering, I found an undocumented s3 field, access_endpoint_url. First I tried setting endpoint_url to http://$IP:$PORT so that DNS does not need to be resolved when uploading WACZ files. Of course, this is incompatible with replays since it is not HTTPS, and it's not feasible to obtain an SSL cert for a non-public $IP:$PORT. Then I set access_endpoint_url to the domain name, https://domain:port/bucket, and that is a sufficient workaround for me.
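For anyone else hitting this, a hypothetical sketch of what my storage config now looks like (all keys, IPs, and domains below are placeholders; verify the field names against your chart's values.yaml):

```yaml
storages:
  - name: default
    access_key: "<access-key>"      # placeholder
    secret_key: "<secret-key>"      # placeholder
    bucket_name: btrix-data
    # uploads go directly to the IP, so no DNS resolution is needed
    endpoint_url: "http://10.0.0.5:9000/"
    # replays are served through the HTTPS domain instead
    access_endpoint_url: "https://my-minio-domain:9000/btrix-data/"
```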

I think there should be more than one attempt to upload the WACZ, and if the upload ultimately fails, the rest of the crawl should be aborted since the crawl data is lost anyway.
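The behavior I'm asking for could be sketched as a retry loop with exponential backoff (uploadWacz here is a hypothetical stand-in for the crawler's real upload function):

```javascript
// Sketch of retry-with-backoff for the WACZ upload.
// uploadWacz is a hypothetical async function that throws on failure.
async function uploadWithRetry(uploadWacz, maxAttempts = 5, baseDelayMs = 1000) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await uploadWacz();
    } catch (err) {
      if (attempt === maxAttempts) {
        throw err; // give up; caller can then abort the crawl
      }
      const delay = baseDelayMs * 2 ** (attempt - 1);
      console.warn(
        `upload attempt ${attempt} failed (${err.code || err}); retrying in ${delay}ms`,
      );
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

A transient EAI_AGAIN would then only lose the crawl if it persisted across every attempt.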

@ikreymer
Member

ikreymer commented Oct 3, 2024

Yes, the access_endpoint_url is designed for something like this. It is odd that the minio instance is not being found while the crawler is able to run.

Re: the DNS issue, I'd be surprised if it's anything related to resource exhaustion — the upload generally happens when the browser is already shut down. Can the crawler resolve the DNS name when it starts running? You can exec into the crawler and see if it can reach the minio node. Probably what we should do is check that the upload endpoint is available when starting the crawl, and fail immediately if it is not — we'll probably add this (in the crawler repo).

I believe the crawler pod already retries a few times automatically — most likely the DNS issue persists, so it keeps failing.
