Hackney checkout_failure which was working before 1.16.0 #689
Comments
did you try latest master?
Yes, tried.
Can you send a trace of the request? Can you share a snippet? Also, how many connections are in the pool?
We started seeing this too, with a connection pool size set to 400. We've been running f876781 since it was merged into master. Unfortunately I don't have much useful information to share yet, but if we come up with something useful I will post it here.

So far the only interesting thing was noticing that the message queue of the hackney_pool process started growing quickly at a linear rate, so something was stopping it from responding to incoming messages. We saw memory usage on the node increase as well, but that might just be the message queue size increasing.

[Graph omitted: message queue length (y axis) over time (x axis).]

The only other piece of vague data we have is that this might correlate with having very slow hosts responding to us. It's not consistent though; it only happens in select cases, and not on all nodes. If we learn more we will let you know!

Thanks for all your hard work on this library @benoitc, I know being a sole maintainer is a very hard job. Hopefully I will have something helpful after we try to reproduce the issue.
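For anyone wanting to reproduce this kind of observation, a one-off check from a remote shell might look roughly like the following. This is only a sketch: the registered name hackney_pool is taken from the comment above and is an assumption; if whereis/1 returns undefined on your node, locate the pid via observer instead.

```erlang
%% From a remote shell on the affected node. Adjust the registered name to
%% whatever your pool process is actually registered as; process_info/2 will
%% badarg if Pid is undefined.
1> Pid = erlang:whereis(hackney_pool).
2> erlang:process_info(Pid, [message_queue_len, memory, current_function]).
```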
I was facing this issue with the default hackney settings.
Yeah, unfortunately we need the connection pool.
@mattbaker Can you share a trace and a snippet of the code doing the request?
Yeah, I'll work on finding an example; there's some abstraction that makes a simple copy-paste example hard.
@SoniCoder @mattbaker If you've not already solved this, you may want to check whether you have somehow pinned an older version of hackney. I was running into exactly this issue when upgrading hackney on a project, and after debugging I discovered I had a manual include of an older version, which was causing exactly this. I temporarily added an extra trace point on the try/catch in hackney_pool:checkout/4, and the stack trace showed the missing module.
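For anyone else who wants the same kind of visibility without editing the source, OTP's dbg can trace calls to hackney_pool:checkout/4 directly. A minimal sketch; note that dbg:p(all, call) traces every process and can be noisy on a busy node, so stop the tracer as soon as you've reproduced the failure:

```erlang
%% Trace calls to hackney_pool:checkout/4, including return values and
%% exceptions, using OTP's dbg. Run in a (remote) shell on the node.
1> dbg:tracer().
2> dbg:p(all, call).
3> dbg:tpl(hackney_pool, checkout, 4, [{'_', [], [{exception_trace}]}]).
%% ...reproduce the failing request, then stop the tracer and clear patterns:
4> dbg:stop_clear().
```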
I don't :/ Thanks though! We ended up switching to Finch.
We also saw checkout_failure when traffic goes through a proxy.
Just bumping this up. Started encountering this after an Elixir upgrade.
Seeing this now after upgrading to OTP 24.3. Maybe related to erlang/otp#5783?
I have been looking into this for the last couple of weeks for a project at work. We are using hackney, and the issue seems to be related to the hackney_connections process. I can start a new ticket if needed, but can you, @benoitc, describe what exactly this process is meant for? I can't seem to find any reference to it. We are compiling against OTP 25.1.2. I would be more than happy to run any tests in order to help find and fix this issue. The problem is that I do not know how to trigger it.
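As a general sanity check (nothing hackney-specific, just a sketch using only the stdlib), you can confirm which process on the node is actually accumulating a backlog before pointing at any particular one:

```erlang
%% Top 5 processes by message queue length, with registered name if any.
%% Processes that die while being inspected are skipped by the pattern match.
1> lists:sublist(
       lists:reverse(lists:sort(
           [{Len, Pid, erlang:process_info(Pid, registered_name)}
            || Pid <- erlang:processes(),
               {message_queue_len, Len} <-
                   [erlang:process_info(Pid, message_queue_len)]])),
       5).
```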
I believe we fixed our issue. We had a function that looked something like this:

```erlang
foo(A, B) ->
    case parse_string(A, B) of
        {ok, Val} ->
            Val;
        {error, {_, reason}} ->
            hackney:request(...),
            parse_string(?DEFAULT_VALUE, B)
    end.
```

Essentially, if the function returned the error variant, we were not matching on the actual error we cared about for the retry. So when we received a different error message, we were calling this function infinitely, and because of that we were sending an infinite number of requests to hackney. Once we had multiple processes calling this function, it was filling up the hackney_connections mailbox.

For anyone that may be having this issue, the following is how I was able to find out where it was coming from. To start, I needed to crawl through hackney's code in order to figure out what was actually calling the hackney_connections process:

```erlang
1> Pid = erlang:whereis(hackney_connections).
2> erlang:process_info(Pid, message_queue_len).
```

Here I found out the process had over 500k messages in the mailbox. So I took a look into the mailbox to see what those messages were:

```erlang
3> {messages, Messages} = erlang:process_info(Pid, messages).
```

That dumped out the mailbox at that particular point in time. I then used the following (this may not be exact, I don't have the code on my machine and I may be getting the format slightly wrong here):

```erlang
4> Domains = lists:foldl(
       fun({'$gen_call', _, {lookup, Domain, _, _}}, Acc) ->
               case maps:find(Domain, Acc) of
                   error -> maps:put(Domain, 1, Acc);
                   {ok, Val} -> maps:put(Domain, Val + 1, Acc)
               end;
          (_, Acc) ->
               Acc
       end, #{}, Messages).
```

This gave me a map with the domains as keys and the number of requests in the mailbox for each domain as values. At this point I had an idea of where the requests were going. So I wrote a little gen server that could run and constantly check the message queue length of the process. I hope this can help someone else find issues in the future.
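A monitor like the one described can fit in a few lines. Below is a minimal sketch, not the poster's actual code: the module name queue_len_monitor, the target process name, and the sampling interval are all placeholders, and it simply logs via logger.

```erlang
%% Minimal sketch of a periodic mailbox monitor (not the poster's actual code).
%% Logs the message queue length of a registered process every IntervalMs.
-module(queue_len_monitor).
-behaviour(gen_server).

-export([start_link/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link(TargetName, IntervalMs) ->
    gen_server:start_link(?MODULE, {TargetName, IntervalMs}, []).

init({TargetName, IntervalMs}) ->
    erlang:send_after(IntervalMs, self(), sample),
    {ok, #{target => TargetName, interval => IntervalMs}}.

handle_info(sample, #{target := Target, interval := Interval} = State) ->
    case erlang:whereis(Target) of
        undefined ->
            logger:warning("~p is not registered", [Target]);
        Pid ->
            case erlang:process_info(Pid, message_queue_len) of
                {message_queue_len, Len} ->
                    logger:info("~p message_queue_len=~p", [Target, Len]);
                undefined ->
                    ok   %% process died between whereis/1 and process_info/2
            end
    end,
    erlang:send_after(Interval, self(), sample),
    {noreply, State};
handle_info(_Other, State) ->
    {noreply, State}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

It could be started as, for example, queue_len_monitor:start_link(hackney_connections, 5000). to sample every five seconds.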
```
2021-06-14 13:47:38.633 [warning] <0.1002.0>@butler_client_hackney:request:121 HTTP Request unsuccessful: Reason checkout_failure, Retrying after 4000
Request: {"http://192.168.5.107:8080/api-gateway/auth-service/platform-auth/oauth/token",
          [{"Authorization","Basic ABCD"},{"Cache-Control","no-cache"}],
          "application/x-www-form-urlencoded",
          "grant_type=client_credentials&client_id=butlerx&client_secret=butlerx"}
Method: post
```
This error goes away when I use: {use_default_pool, false}
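For reference, that tuple is normally set as a hackney application environment setting; pooling can also be skipped for a single request via the pool request option, which, as I understand it, accepts false. The sketch below assumes a plain hackney:request/5 call; the URL, headers, and body are placeholders.

```erlang
%% In sys.config (or via application:set_env/3 at startup), disable hackney's
%% default pool globally -- this is the {use_default_pool, false} setting above:
%%   [{hackney, [{use_default_pool, false}]}].

%% Or bypass the pool for one request with the pool option (placeholder data):
{ok, Status, RespHeaders, ClientRef} =
    hackney:request(post,
                    <<"http://example.com/oauth/token">>,
                    [{<<"Content-Type">>, <<"application/x-www-form-urlencoded">>}],
                    <<"grant_type=client_credentials">>,
                    [{pool, false}]).
```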