[Bug] chia says it fails to start daemon, but the process is running. chia can't connect to it. #11390

Open
cross opened this issue May 2, 2022 · 12 comments
Labels: bug, daemon

@cross
Contributor

cross commented May 2, 2022

What happened?

After upgrading a number of my Linux (Ubuntu 20.04.4) systems to 1.3.4, a few of my harvesters fail to start. When I try to start the harvester:

$ chia start harvester
Daemon not started yet
Starting daemon
Daemon not started yet
Failed to create the chia daemon
$

The error is reported after only a couple of seconds. If I check the process table afterwards, I see chia_daemon, and netstat shows a listener on port 55400. Maybe something isn't waiting long enough?

These same systems were working with 1.3.3, and other systems are working with 1.3.4, so I don't know what to look at.

The debug.log shows:

2022-05-01T20:51:47.333 harvester chia.rpc.rpc_server     : WARNING  Cannot connect to daemon at ws://localhost:55400
2022-05-01T20:51:49.529 harvester chia.rpc.rpc_server     : WARNING  Cannot connect to daemon at ws://localhost:55400
2022-05-01T20:51:51.719 harvester chia.rpc.rpc_server     : WARNING  Cannot connect to daemon at ws://localhost:55400

but the above are from an hour or two ago, and new invocations aren't producing more of the same.

The systems displaying this behavior are older systems, and all of the similar systems (3-5 of them) show the same problem. I rebooted one completely and it's still misbehaving in this way.

Let me know what other diagnostics or information I can gather.

Version

1.3.4

What platform are you using?

Linux

What ui mode are you using?

CLI

Relevant log output

2022-05-01T20:51:47.333 harvester chia.rpc.rpc_server     : WARNING  Cannot connect to daemon at ws://localhost:55400
2022-05-01T20:51:49.529 harvester chia.rpc.rpc_server     : WARNING  Cannot connect to daemon at ws://localhost:55400
2022-05-01T20:51:51.719 harvester chia.rpc.rpc_server     : WARNING  Cannot connect to daemon at ws://localhost:55400
cross added the bug label May 2, 2022
@cross
Contributor Author

cross commented May 2, 2022

Oh!!! I just noticed something else different about this set of hosts. They are older hardware but, more importantly, they don't have IPv4 internet connectivity (or a local IPv4 address, except for 127.0.0.1).

I think this is likely the core issue. I can see the daemon listening on 127.0.0.1:55400; the fact that it doesn't also listen on IPv6 is, I suspect, a separate defect. And I can reach 127.0.0.1 just fine on these systems.

$ telnet localhost 55400
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
asjkdfhsajkd
Connection closed by foreign host.
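
The telnet check above only exercises the first address that localhost resolves to. A minimal sketch of a broader check, assuming only the Python standard library, tries every address the resolver returns for a host and reports each result separately. The name probe_all_addresses is made up for this illustration and is not part of chia.

```python
import socket


def probe_all_addresses(host: str, port: int, timeout: float = 2.0) -> dict:
    """Try a TCP connect to every address that `host` resolves to.

    Returns a mapping of "<family> <address>" to "ok" or the error text.
    (Hypothetical helper for illustration; not chia code.)
    """
    results = {}
    # AF_UNSPEC asks the resolver for both IPv4 and IPv6 candidates.
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, family=socket.AF_UNSPEC, type=socket.SOCK_STREAM
    ):
        label = f"{family.name} {sockaddr[0]}"
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            results[label] = "ok"
        except OSError as exc:
            # Covers refused connections, timeouts and unsupported families.
            results[label] = f"failed: {exc}"
    return results
```

Calling probe_all_addresses("localhost", 55400) on an affected host should show whether ::1 is among the candidates at all, and whether anything accepts connections on it.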

@emlowe
Contributor

emlowe commented May 2, 2022

By default the daemon listens on "localhost". It does a DNS lookup on the name localhost, which should return 127.0.0.1. Since the daemon seems to be listening, that part appears to be OK.

Note, you could try running the daemon in the foreground with chia run_daemon

@cross
Contributor Author

cross commented May 2, 2022

Actually, with prefer_ipv6 set to True, which it is for me, it might resolve to ::1. But I tried it with prefer_ipv6 set to False as well, and it still fails in the same way.

Running the daemon in the foreground has the same problem. The daemon is running fine, but chia still thinks it's not.

I did find something that will work. I looked through the code to see how the daemon connection is made to test that it's up, and can see it uses self_hostname and daemon_port. I changed self_hostname to:

self_hostname: &self_hostname "127.0.0.1"

and with no other changes it all works now. So something is limited in either (a) the daemon code (which should listen on ::1 IMO, but that's likely a different bug) or (b) the daemon client code, which should try to connect to 127.0.0.1 when self_hostname is "localhost", at least when prefer_ipv6 is false. I don't know whether the code in that area uses the resolver logic that I implemented last fall, which I think would have that effect, but clearly something isn't.
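
The client behavior asked for in (b) can be sketched roughly as follows: resolve the hostname, put the preferred address family first, and fall back to the other family instead of giving up. This is only an illustration using the standard library. The names order_candidates and connect_with_fallback are hypothetical, the prefer_ipv6 argument merely mirrors the config option discussed here, and none of this is chia's or aiohttp's actual code.

```python
import socket


def order_candidates(addrinfos, prefer_ipv6: bool):
    """Sort getaddrinfo() results so the preferred family comes first.

    Nothing is dropped: the other family stays in the list as a fallback.
    """
    preferred = socket.AF_INET6 if prefer_ipv6 else socket.AF_INET
    # sorted() is stable, so resolver order is kept within each family.
    return sorted(addrinfos, key=lambda info: info[0] != preferred)


def connect_with_fallback(host, port, prefer_ipv6=False, timeout=2.0):
    """Return a connected socket, trying each resolved address in turn."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    last_error = None
    for family, socktype, proto, _canon, sockaddr in order_candidates(
        infos, prefer_ipv6
    ):
        sock = None
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            return sock
        except OSError as exc:
            # Remember the failure and move on to the next candidate.
            last_error = exc
            if sock is not None:
                sock.close()
    raise ConnectionError(
        f"no address for {host}:{port} accepted a connection"
    ) from last_error
```

The standard library's socket.create_connection already walks every resolved address in resolver order; the sketch only adds the explicit family preference on top of that.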

@cross
Contributor Author

cross commented May 2, 2022

aiohttp.client.ClientSession.ws_connect("wss://localhost:55400") raises an exception:

Cannot connect to host localhost:55400 ssl:<ssl.SSLContext object at 0x7f68c0fab4c0> [Connect call failed ('::1', 55400, 0, 0)]

As expected, since this happens inside aiohttp, the chia prefer_ipv6 setting has no effect on it. I'm not sure why it tries IPv6 on these hosts of mine but not on others. Perhaps the lack of an external IPv4 address or default route influences it internally? That's a question for aiohttp.

@cross
Contributor Author

cross commented May 2, 2022

Also, I suggest that the Exception occurring in DaemonProxy.start() should cause a message to be logged. I'm not sure why it isn't ending up anywhere higher up the chain.

@cross
Contributor Author

cross commented May 2, 2022

> Also, I suggest that the Exception occurring in DaemonProxy.start() should cause a message to be logged. I'm not sure why it isn't ending up anywhere higher up the chain.

It looks like connect_to_daemon_and_validate() is catching the exception and, apart from printing a simple message (if not quiet), just returns None. I feel that if connect_to_daemon raises an exception, it should be logged...
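
A rough sketch of the suggested behavior, under the assumption that the function keeps returning None on failure: catch the exception, log it with its traceback, and only then print the short message. The function name connect_and_validate, its connect argument, and the logger name are invented for this illustration and do not reflect chia's actual signatures.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

# Logger name is an assumption made for this sketch.
log = logging.getLogger("daemon_client_sketch")


async def connect_and_validate(
    connect: Callable[[], Awaitable[object]], quiet: bool = False
) -> Optional[object]:
    """Return the connection, or None on failure, but always log the cause."""
    try:
        return await connect()
    except Exception as exc:
        # Record the underlying error (with traceback) instead of dropping it.
        log.warning("Cannot connect to daemon: %r", exc, exc_info=True)
        if not quiet:
            print("Daemon not started yet")
        return None


if __name__ == "__main__":
    async def _always_fails():
        # Stand-in for a real connection attempt that fails.
        raise OSError("Connect call failed ('::1', 55400, 0, 0)")

    logging.basicConfig(level=logging.WARNING)
    asyncio.run(connect_and_validate(_always_fails))
```

With something along these lines, the "Connect call failed ('::1', 55400, 0, 0)" detail would have shown up in the log on the first failed start rather than requiring a dig through the client code.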

@emlowe
Contributor

emlowe commented May 2, 2022

I do think there are some bugs with listening on ::1, and it's possible connect_to_daemon_and_validate has issues as well. Thanks for the detailed debugging.

@cross
Contributor Author

cross commented May 2, 2022

Yup. (as is likely obvious) I cloned the tree onto one of these machines, so I can help probe more if you have suggestions/questions. For now I have the config set to self_hostname: 127.0.0.1 so I'm working, but happy to change that around to test things.
Thanks.

@emlowe
Contributor

emlowe commented May 2, 2022

You are correct that the daemon doesn't use prefer_ipv6 or any of that name resolution code.
It passes the hostname from the config directly into TCPSite (see daemon/server.py).
It is strange that passing the same value into TCPSite and ws_connect doesn't just work.

@cross
Contributor Author

cross commented May 2, 2022

> You are correct that the daemon doesn't use prefer_ipv6 or any of that name resolution code.
> It passes the hostname from the config directly into TCPSite (see daemon/server.py).
> It is strange that passing the same value into TCPSite and ws_connect doesn't just work.

Stranger yet, I later realized that on this Ubuntu system the hosts file lists ::1 as ip6-localhost. Using command-line queries, I couldn't even figure out how aiohttp ended up trying ::1 starting from "wss://localhost:55400". I kept getting back 127.0.0.1, including from calls to socket.gethostbyname and socket.getaddrinfo inside daemon/client.py.
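
One way to narrow this down is to compare what the different resolver calls return side by side. Two facts matter here: socket.gethostbyname only supports IPv4, so it can never return ::1, and the address family and flags passed to socket.getaddrinfo change the answer. The sketch below also tries the AI_ADDRCONFIG flag, which per the getaddrinfo documentation returns IPv4 addresses only when the system has an IPv4 address configured (glibc does not count loopback). Whether aiohttp passes that flag is an assumption, not something established in this thread. The name resolution_report is hypothetical.

```python
import socket


def resolution_report(host: str, port: int) -> dict:
    """Collect what different resolver calls return for `host`.

    (Hypothetical diagnostic helper; not chia or aiohttp code.)
    """
    report = {}
    # gethostbyname() is IPv4-only, so it cannot reveal an IPv6 result.
    try:
        report["gethostbyname"] = [socket.gethostbyname(host)]
    except OSError as exc:
        report["gethostbyname"] = [f"error: {exc}"]

    variants = (
        ("AF_UNSPEC", socket.AF_UNSPEC, 0),
        ("AF_INET", socket.AF_INET, 0),
        ("AF_INET6", socket.AF_INET6, 0),
        # Assumption: a client library might resolve with this flag set.
        ("AF_UNSPEC+AI_ADDRCONFIG", socket.AF_UNSPEC, socket.AI_ADDRCONFIG),
    )
    for name, family, flags in variants:
        try:
            infos = socket.getaddrinfo(
                host, port, family=family, type=socket.SOCK_STREAM, flags=flags
            )
            # Each entry's sockaddr starts with the textual address.
            report[name] = [info[4][0] for info in infos]
        except OSError as exc:
            report[name] = [f"error: {exc}"]
    return report
```

Running resolution_report("localhost", 55400) on an affected host and on a working one, then comparing the rows, may show which combination of family and flags yields ::1.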

@github-actions
Contributor

This issue has not been updated in 14 days and is now flagged as stale. If this issue is still affecting you and in need of further review, please comment on it with an update to keep it from auto closing in 7 days.

github-actions bot added the stale-issue label May 20, 2022
@cross
Contributor Author

cross commented May 21, 2022

This should remain open. This issue should track the fact that connect_to_daemon_and_validate() does not log or display an error when something fails.

cross changed the title from "[Bug] chia says it fails to start daemon, but the process is running. chia isn't able to recognize it" to "[Bug] chia says it fails to start daemon, but the process is running. chia can't connect to it." May 21, 2022
github-actions bot removed the stale-issue label May 21, 2022
emlowe self-assigned this May 24, 2022
emlowe removed the harvester label Jun 30, 2022