The last approach did not work well: a timeout of 0.1 seconds was too
short. A lot has to happen within the timeout period inside of
rethinkdb.connect(), and it doesn't offer a way to set only the socket
timeout. Even a timeout of 0.5 seconds results in a noticeable error
rate.
The new approach is to put a server in the penalty box for 5 minutes
when it errors. While the server is in the penalty box, we don't try to
connect to it, unless all the servers are in the penalty box, in which
case we try the server that errored least recently.
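
A minimal sketch of the penalty box selection described above, not the
actual implementation. The names PenaltyBox, mark_error and choose are
hypothetical, and the injectable clock is only there to make the sketch
testable. It assumes servers are tried in list order when several are
available, which the description above does not specify.

```python
import time

PENALTY_SECONDS = 5 * 60  # five minutes in the penalty box


class PenaltyBox:
    """Hypothetical helper that remembers when each server last errored."""

    def __init__(self, servers, clock=time.monotonic):
        self.servers = list(servers)
        self.clock = clock
        self.last_error = {}  # server -> time of its most recent error

    def mark_error(self, server):
        # Put the server in the penalty box, or restart its penalty.
        self.last_error[server] = self.clock()

    def in_penalty_box(self, server):
        errored_at = self.last_error.get(server)
        if errored_at is None:
            return False
        return self.clock() - errored_at < PENALTY_SECONDS

    def choose(self):
        # Prefer any server that is not currently penalized
        # (assumption: first one in list order).
        available = [s for s in self.servers if not self.in_penalty_box(s)]
        if available:
            return available[0]
        # Every server is penalized, so each has an entry in last_error.
        # Pick the one that errored least recently (oldest error time).
        return min(self.servers, key=lambda s: self.last_error[s])
```

The caller would call mark_error() whenever a connection attempt fails
and choose() before each new attempt.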
Another tweak to that end: we have observed that when a rethinkdb server
is offline, an attempt to connect to it takes a second or two to time
out. On the other hand, if the host is up but the port is not open
(for example, because rethinkdb is not running), the connection
failure happens very quickly.
To achieve good performance in case a rethinkdb server is down, we are
now setting a timeout on the connect() call. The timeout starts at
0.1 sec, for quick retry, and backs off up to 10 sec in case of repeated
failures.
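
A rough sketch of the backoff described above, under stated assumptions.
The doubling factor, the max_attempts limit, and the function names are
illustrative and not taken from the actual change. Only the 0.1 sec
start and the 10 sec cap come from the description. The connect argument
is any callable that accepts a timeout in seconds, for example a wrapper
around rethinkdb.connect().

```python
import itertools


def timeouts(initial=0.1, maximum=10.0, factor=2.0):
    """Yield connect timeouts: 0.1, 0.2, 0.4, ... capped at 10 seconds.

    The doubling factor is an assumption made for this sketch.
    """
    timeout = initial
    while True:
        yield timeout
        timeout = min(timeout * factor, maximum)


def connect_with_backoff(connect, max_attempts=10):
    """Call connect(timeout) with a growing timeout until it succeeds.

    max_attempts is a hypothetical safety limit so the sketch cannot
    loop forever.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for timeout in itertools.islice(timeouts(), max_attempts):
        try:
            return connect(timeout)
        except Exception as e:
            # Real code should catch only the driver's connection
            # errors; Exception keeps the sketch self-contained.
            last_error = e
    raise last_error
```

The first attempt fails fast when the server is down, and later attempts
give a slow but healthy server more time to answer.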
It turns out that errors which are recoverable when running a query are
sometimes (always?) not recoverable when iterating over results, so
we've been ending up in infinite loops.