
[Bug]: Error creating WACZ #2095

Open
prestonvanloon opened this issue Oct 1, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@prestonvanloon

Browsertrix Version

v1.11.7-7a61568

What did you expect to happen? What happened instead?

I am having some DNS issues, probably from resource exhaustion. (Also filed #2094 to allow cpu_limits on crawler)

Error: getaddrinfo EAI_AGAIN my-minio-domain
    at GetAddrInfoReqWrap.onlookupall [as oncomplete] (node:dns:120:26)

When I see this error, the entire crawl is lost, which is frustrating when the crawl has run for 24 hours. I wish the WACZ upload were attempted multiple times until it eventually completes or some retry threshold is met.

Reproduction instructions

Not sure. I'm using kind 0.24.0. The cluster config is standard; it just opens the NodePort.

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 30870
    hostPort: 8080
    listenAddress: "0.0.0.0"
    protocol: TCP

I'm using an external MinIO S3 instance. The MinIO instance has to be behind HTTPS for replays to work, so I cannot simply use its IP address.

Screenshots / Video

No response

Environment

No response

Additional details

I've tried every workaround that I could imagine.

@prestonvanloon prestonvanloon added the bug Something isn't working label Oct 1, 2024
@prestonvanloon
Author

#1137 might be a solution, if that feature request was implemented.

@prestonvanloon
Author

prestonvanloon commented Oct 2, 2024

After a bit of reverse engineering, I found an undocumented s3 field, access_endpoint_url. First I tried setting endpoint_url to http://$IP:$PORT so that DNS does not need to be resolved when uploading WACZ files. Of course, this is incompatible with replays since it is not HTTPS, and it's not feasible to obtain an SSL cert for a non-public $IP:$PORT. Then I set access_endpoint_url to the domain name, https://domain:port/bucket, and that is a sufficient workaround for me.
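For anyone else hitting this, a hypothetical sketch of what my storage config now looks like (all keys, IPs, and domains below are placeholders; verify the field names against your chart's values.yaml):

```yaml
storages:
  - name: default
    access_key: "<access-key>"      # placeholder
    secret_key: "<secret-key>"      # placeholder
    bucket_name: btrix-data
    # uploads go directly to the IP, so no DNS resolution is needed
    endpoint_url: "http://10.0.0.5:9000/"
    # replays are served through the HTTPS domain instead
    access_endpoint_url: "https://my-minio-domain:9000/btrix-data/"
```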

I think there should be more than one attempt to upload the WACZ, and if the upload ultimately fails, the rest of the crawl should be aborted since the crawl data is lost anyway.
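The behavior I'm asking for could be sketched as a retry loop with exponential backoff (uploadWacz here is a hypothetical stand-in for the crawler's real upload function):

```javascript
// Sketch of retry-with-backoff for the WACZ upload.
// uploadWacz is a hypothetical async function that throws on failure.
async function uploadWithRetry(uploadWacz, maxAttempts = 5, baseDelayMs = 1000) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await uploadWacz();
    } catch (err) {
      if (attempt === maxAttempts) {
        throw err; // give up; caller can then abort the crawl
      }
      const delay = baseDelayMs * 2 ** (attempt - 1);
      console.warn(
        `upload attempt ${attempt} failed (${err.code || err}); retrying in ${delay}ms`,
      );
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

A transient EAI_AGAIN would then only lose the crawl if it persisted across every attempt.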

@ikreymer
Member

ikreymer commented Oct 3, 2024

Yes, the access_endpoint_url is designed for something like this. It is odd that the minio instance is not being found while the crawler is able to run.

Re: the DNS issue, I'd be surprised if it's anything related to resource exhaustion — the upload generally happens when the browser is already shut down. Can the crawler resolve the DNS name when it starts running? You can exec into the crawler and see if it can reach the minio node. Probably what we should do is check that the upload endpoint is available when starting the crawl, and fail immediately if it is not — we'll probably add this (in the crawler repo).

I believe the crawler pod already retries a few times automatically — most likely the DNS issue persists, so it keeps failing.
