This repository has been archived by the owner on Sep 10, 2020. It is now read-only.

Crawler is running into terminal connection refused socket failures #4

psivesely opened this issue May 18, 2016 · 5 comments

@psivesely
Contributor

psivesely commented May 18, 2016

Edit: see #4 (comment) for a better explanation and traceback. Don't know why this original report was so half-assed and lacked even the full traceback.

So the crawler is for the most part working very well. Where it runs into problems is with what seems to be a Python IO/socket exception (Errno 111). Once it hits this error, it fails the rest of the way through the crawl almost instantaneously. See the log at the bottom of this post.

I believe that this is actually caused by a bug in Python 3.5--see https://bugs.python.org/issue26402, but this warrants further testing. The PPA we've been using at https://launchpad.net/~fkrull/+archive/ubuntu/deadsnakes?field.series_filter=trusty has not seen an updated version of Python 3.5 since December for Ubuntu 14.04 (trusty). This is about our only choice for newer Python versions, and I've already done the work to migrate this script to Python 3.5 so we could use a single virtual environment for both the HS sorting and crawling scripts. Since at this point in our research we don't really need to run the sorting script, I think I'll just break compatibility with it by making the necessary changes in the Ansible roles to install and use Python 3.3, and that should hopefully fix things.

♫ Truckin' ♫
...
06:51:26 http://maghreb2z2zua2up.onion: exception: Remote end closed connection without response
06:51:26 http://radiohoodxwsn4es.onion: loading...
06:51:26 http://radiohoodxwsn4es.onion: exception: [Errno 111] Connection refused
06:51:26 http://tqjftqibbwtm4wmg.onion: loading...
06:51:26 http://tqjftqibbwtm4wmg.onion: exception: [Errno 111] Connection refused
06:51:26 http://newstarhrtqt6ua7.onion: loading...
06:51:26 http://newstarhrtqt6ua7.onion: exception: [Errno 111] Connection refused
...
And so on (it fails through the rest of the URLs almost instantly).

https://bugs.python.org/issue26402

@psivesely psivesely changed the title Crawler is running into failures Crawler is running into terminal connection refused socket failures May 18, 2016
@psivesely
Contributor Author

Testing 5802bd3 to address this.

@psivesely
Contributor Author

Crawls in progress. Will check on them tomorrow morning to see if they failed part-way through or not.

@psivesely
Contributor Author

@redshiftzero found that cubie3atuvex2gdw.onion, which redirects to https://another6nnp2ehkn.onion/ (self-signed cert), reproduces the error. I'm in the process of refactoring the crawler, but I have a couple of URLs to add to the "known to have crashed the crawler" list, which I should post here soon. These might help in testing/debugging this problem. There's also been a good amount of discussion in FPF's Slack about this bug and plans to figure it out that I should copy over here.

@psivesely
Contributor Author

Copying my comments from external discussions about this:

Here's the breakdown of what happens: after establishing a connection to a peer on a socket that is bound to a local address, we send a well-formed GET request to that peer (an onion service). If the remote end closes the connection without sending a response (i.e., the first line we try to read is empty), then http.client.RemoteDisconnected is raised. This exception is caught in my crawler here. (I realized I need to add a continue statement to the end of that except block, and it would also be better to move the circuit cleanup code to a finally block, but I don't think this is the cause of the problem. I'm going to push a fix so we can see, but either way let me know what you think.) After this error happens, the crawler does not seem to be able to recover. Instead, every site fails the same way.
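For reference, the change I have in mind looks roughly like this (a minimal sketch only; the loop, driver wiring, and cleanup_circuits() are hypothetical stand-ins, not the crawler's actual code):

import http.client

def crawl(urls, driver, cleanup_circuits):
    # driver is the Tor Browser webdriver; cleanup_circuits() stands in for the
    # crawler's circuit cleanup code.
    for url in urls:
        try:
            driver.get(url)  # may raise http.client.RemoteDisconnected
        except http.client.RemoteDisconnected as exc:
            print('%s: exception: %s' % (url, exc))
            continue  # move on to the next URL instead of falling through
        finally:
            cleanup_circuits()  # runs whether or not the page load succeeded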

What happens to the rest of the sites is as follows. A well-formed GET request is drafted and the socket.connect() method is called to try to connect to the remote onion service. However, socket.connect() raises ConnectionRefusedError, which corresponds to errno ECONNREFUSED (errno 111). The docs say this happens when a connection attempt is refused by the peer, but I don't think that is the case here unless the logic ensuring the GET request is well-formed somehow gets screwed up by improper handling of the http.client.RemoteDisconnected exception in CPython. That doesn't seem to be the issue, though, because a new instance of the HTTPConnection class is created for each connection... so I feel like I need to look at how sockets are implemented in CPython to figure out what's going wrong.
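To make the failure mode concrete, here is what a refused connection looks like at the socket level (a standalone sketch, assuming nothing is listening on the chosen local port; this is not the crawler's code path):

import errno
import socket

try:
    # 127.0.0.1:1 is assumed to have no listener, so the kernel refuses the connection.
    socket.create_connection(('127.0.0.1', 1), timeout=5)
except ConnectionRefusedError as exc:
    assert exc.errno == errno.ECONNREFUSED  # errno 111 on Linux
    print('[Errno %d] %s' % (exc.errno, exc.strerror))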

# The crawler keeps crashing after it hits exception 1 a single time. Then it hits
# exception 2 for every single connection thereafter. I've broken down the
# tracebacks with comments. They both start off the same for the first 4 frames or
# so, where Selenium works out the calls it's going to make to the standard Python
# libraries.
​
# ~*~*~*~* 1 *~*~*~*~
​
"./crawl_hidden_services.py", line 163, in crawl_class
    driver.get(url)
​
"/home/noah/FingerprintSecureDrop/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 245, in get
    self.execute(Command.GET, {'url': url})
​
"/home/noah/FingerprintSecureDrop/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 231, in execute
    response = self.command_executor.execute(driver_command, params)
​
"/home/noah/FingerprintSecureDrop/lib/python3.5/site-packages/selenium/webdriver/remote/remote_connection.py", line 395, in execute
    return self._request(command_info[0], url, body=data)
​
# RemoteConnection._request() sends an HTTP request to the remote server.
# self.keepalive is true so we send that in our request. keep_alive being true
# also means we set self._conn = httplib.HTTPConnection(args) (line 188) in the
# __init__ of our RemoteConnection object instance.
​
# The request goes through okay, so self.__state should be _CS_REQ_SENT.
​
"/home/noah/FingerprintSecureDrop/lib/python3.5/site-packages/selenium/webdriver/remote/remote_connection.py", line 426, in _request
    resp = self._conn.getresponse()
​
# This is really calling httplib.HTTPConnection.getresponse(). (Note: debug
# option present here). response = HTTPResponse(self.sock, [self.debuglevel,]
# method=self._method). Then we call the begin() method on our HTTPResponse object.
​
"/usr/lib/python3.5/http/client.py", line 1174, in getresponse
    response.begin()
​
# self.headers is None at this point (begin() returns early only if the headers
# have already been read), so we continue. The first call we make is to
# self._read_status(), to try to read the first line, which should include the
# status information.
​
"/usr/lib/python3.5/http/client.py", line 282, in begin
    version, status, reason = self._read_status()
​
# self._read_status() tries to read the first line of the response, but it is
# empty, so we assume that the remote end closed the connection without a
# response. It shouldn't be able to know we're a crawler because we're using Tor
# Browser...
​
"/usr/lib/python3.5/http/client.py", line 251, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
​
http.client.RemoteDisconnected: Remote end closed connection without response
​
# "A subclass of ConnectionResetError and BadStatusLine. Raised by
# HTTPConnection.getresponse() when the attempt to read the response results in no
# data read from the connection, indicating that the remote end has closed the
# connection."
​
# ~*~*~*~* 2 *~*~*~*~
​
"./crawl_hidden_services.py", line 163, in crawl_class
    driver.get(url)
​
"/home/noah/FingerprintSecureDrop/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 245, in get
    self.execute(Command.GET, {'url': url})
​
"/home/noah/FingerprintSecureDrop/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 231, in execute
    response = self.command_executor.execute(driver_command, params)
​
"/home/noah/FingerprintSecureDrop/lib/python3.5/site-packages/selenium/webdriver/remote/remote_connection.py", line 395, in execute
    return self._request(command_info[0], url, body=data)
​
"/home/noah/FingerprintSecureDrop/lib/python3.5/site-packages/selenium/webdriver/remote/remote_connection.py", line 425, in _request
    self._conn.request(method, parsed_url.path, body, headers)
​
# httplib.HTTPConnection.request() wraps self._send_request()
​
"/usr/lib/python3.5/http/client.py", line 1083, in request
    self._send_request(method, url, body, headers)
​
# This method calls self.putrequest(), which may warrant further investigation
# (line 915), and then calls self.endheaders().
​
"/usr/lib/python3.5/http/client.py", line 1128, in _send_request
    self.endheaders(body)
​
# This method sends the request to the server. The state is set to
# _CS_REQ_STARTED and then self._send_output() is called
​
"/usr/lib/python3.5/http/client.py", line 1079, in endheaders
    self._send_output(message_body)
​
# This method calls self.send()
​
# You might also ensure self.debuglevel > 0 for more information, if it proves
# necessary. Then the data is read in blocks and each block is sent with
# self.sock.sendall(datablock).
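# Aside: one way to get that extra debug output, assuming direct access to the
# underlying HTTPConnection (the host/port below are hypothetical placeholders),
# is the stdlib set_debuglevel() call:

import http.client

conn = http.client.HTTPConnection('127.0.0.1', 4444)
conn.set_debuglevel(1)  # any level > 0 prints the request/response exchange to stdout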
​
"/usr/lib/python3.5/http/client.py", line 911, in _send_output
    self.send(msg)
​
# which first may auto_open a socket. To do so it calls self.connect()
​
"/usr/lib/python3.5/http/client.py", line 854, in send
    self.connect()
​
#  which sets self.sock = self._create_connection()
​
"/usr/lib/python3.5/http/client.py", line 826, in connect
    (self.host,self.port), self.timeout, self.source_address)
​
# which is defined as socket.create_connection in the class instance __init__
# block. socket.create_connection() connects to an address and returns the socket
# object. Note self.source_address is defined on our HTTPConnection object, so
# socket.create_connection() will try to bind it as the source address before
# making the connection. First socket.getaddrinfo() is called on the address,
# which is a tuple: (self.host, self.port). socket.getaddrinfo() translates the
# host/port pair into a sequence of 5-tuples (family, type, proto, canonname,
# sockaddr) that contain all the necessary args for creating a socket connected
# to that service.
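# For reference, a simplified paraphrase of socket.create_connection() (not the
# verbatim 3.5 source): bind to source_address if one is given, try to connect to
# each resolved address, remember the last error, and re-raise it if every
# candidate fails ("raise err" and "sock.connect(sa)" in the frames that follow).

import socket

def create_connection_sketch(address, timeout=None, source_address=None):
    host, port = address
    err = None
    for af, socktype, proto, _canonname, sa in socket.getaddrinfo(
            host, port, 0, socket.SOCK_STREAM):
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)
            if timeout is not None:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)  # the frame that raises ECONNREFUSED
            return sock
        except OSError as exc:
            err = exc
            if sock is not None:
                sock.close()
    if err is not None:
        raise err
    raise OSError('getaddrinfo returned an empty list')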
​
"/usr/lib/python3.5/socket.py", line 711, in create_connection
    raise err
​
# After successfully binding to the local address w/ socket.bind(),
# socket.create_connection() tries to connect to the peer with socket.connect()
​
"/usr/lib/python3.5/socket.py", line 702, in create_connection
    sock.connect(sa)
​
# This method fails--the socket class is built-in, so can't debug further.
​
ConnectionRefusedError: [Errno 111] Connection refused
​
# "A subclass of ConnectionError, raised when a connection attempt is refused by
# the peer. Corresponds to errno ECONNREFUSED." Since this is a builtin
# exception, we can't get much more info about it

URLs known to be causing the problem: http://money2mxtcfcauot.onion and http://22222222aziwzse2.onion. (There were more, but I was negligent about saving them.)

@psivesely
Contributor Author

One idea is to basically restart Tor and Tor Browser when this happens. It's a hack, but it isn't my fault that one can't simply except and continue past this error, and finding/resolving it upstream has proved to be quite difficult. I'm in the process of implementing that for the refactored crawl_onions.py.
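The restart workaround would look roughly like this (a sketch only; make_driver() and restart_tor_and_browser() are hypothetical stand-ins, not the actual crawl_onions.py code):

def crawl_with_restarts(urls, make_driver, restart_tor_and_browser):
    # make_driver() returns a fresh Tor Browser webdriver session;
    # restart_tor_and_browser() bounces the tor process and the browser.
    driver = make_driver()
    for url in urls:
        try:
            driver.get(url)
        except ConnectionRefusedError:
            # Once errno 111 shows up, every later connection fails the same way,
            # so tear everything down and start fresh before moving on.
            driver.quit()
            restart_tor_and_browser()
            driver = make_driver()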

psivesely pushed a commit that referenced this issue Jun 28, 2016
After beginning to install pip packages to root to "simplify" things, I started
noticing some very weird errors when working from within the Ubuntu VMs. Turns
out that the command-not-found package Ubuntu installs and runs as a daemon  was
dependent on python3.4, and would print scary traceback warnings if you typed a
wrong command. The tracebacks were pretty terse, so at first look it seemed to
me like the pip installation had just gone totally haywire.

Setting `update-alternatives --install /usr/bin/python3 python3
/usr/bin/python3.4 2` (i.e., setting Python 3.4 as the second option when
`/usr/bin/python3` is called) resolved the problem described above. However,
a new scary warning just took its place, which was caused by the inability to
read the `sources.list.d/` file generated by the Ansible `apt_repository` module
when adding the deadsnakes PPA. I set the same permissions on this file as
`/etc/apt/sources.list` has (`chmod 0644`) and then things seemed to work
smoothly.

While doing this Ansible Python install work, I also experimented with getting
a version of Python 3.5.2 running in a VM. Python 3.5.2 includes bug fixes for
Python 3.5.1, some of which are related to exceptions that have been crashing
our Crawler (see https://bugs.python.org/issue26402 and
#4). These fixes
may help us avoid the hacky workarounds I've been trying to implement for the
crawler refactor.

Unfortunately, installing 3.5.2 proved more difficult than I imagined because
there seems to be no easy way to do apt-pinning with Ansible, and none of the
Debian-derivative boxes from trustworthy, well-known sources for distro versions
that shipped 3.5.2 (namely, Debian stretch and Ubuntu yakkety) had the vboxfs
kernel module installed, which is essential for development. Doing a
dist-upgrade is not only time-prohibitive, but it seems Ansible can't even
handle doing so.

As far as apt-pinning goes I tried both `command: aptitude install -y
python3.5/unstable` and using the `force: yes` directive in conjunction with
the `apt` Ansible module.

I am committing the commented-out lines just to include some options, since it's
still unknown what's causing this crash and how effective the workarounds
mentioned in
#4 (comment)
will be.
@psivesely psivesely added the bug label Sep 15, 2016
@psivesely psivesely removed their assignment Sep 3, 2019