Sunday 29 November 2020

Elasticsearch/Dataflow - connection timeout after ~60 concurrent connections

We host an Elasticsearch cluster on Elastic Cloud and call it from Dataflow (GCP). The job works fine in dev, but when we deploy to prod we see many connection timeouts on the client side.

Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1213, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 570, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "main.py", line 159, in process
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/client/utils.py", line 152, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/client/__init__.py", line 1617, in search
    body=body,
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/transport.py", line 390, in perform_request
    raise e
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/transport.py", line 365, in perform_request
    timeout=timeout,
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 258, in perform_request
    raise ConnectionError("N/A", str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPSConnection object at 0x7fe5d04e5690>: Failed to establish a new connection: [Errno 110] Connection timed out) caused by: NewConnectionError(<urllib3.connection.HTTPSConnection object at 0x7fe5d04e5690>: Failed to establish a new connection: [Errno 110] Connection timed out)

I increased the timeout setting in the Elasticsearch client to 300s as shown below, but it didn't seem to help.

self.elasticsearch = Elasticsearch([es_host], http_auth=http_auth, timeout=300)

Looking at the deployment at https://cloud.elastic.co/deployments//metrics, CPU and memory usage are very low (below 10%) and search response time is on the order of 200ms. What could be the bottleneck here, and how can we avoid such timeouts?

As seen in the log below, most requests fail with a connection timeout, while successful requests receive a response very quickly:

[screenshot of the request log]

I SSHed into the VM where we experience the connection error. netstat showed about 60 ESTABLISHED connections to the Elasticsearch IP address. When I curl the Elasticsearch address from the VM, I can reproduce the timeout. I can curl other URLs fine. I can also curl Elasticsearch fine from my local machine, so the issue is only the connection between the VM and the Elasticsearch server.
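For anyone who wants to repeat the netstat check on several workers, here is a minimal sketch that counts ESTABLISHED connections per remote endpoint from `netstat -tn` output. The function name `count_established` and the sample text are made up for illustration (the addresses are documentation-range IPs, not the real cluster), and it assumes the usual Linux six-column layout.

```python
from collections import Counter


def count_established(netstat_output):
    """Count ESTABLISHED TCP connections per remote (host, port).

    Assumes Linux `netstat -tn` columns:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip banner/header lines and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "ESTABLISHED":
            continue
        # rpartition keeps IPv6 addresses intact and splits off the port.
        host, _, port = fields[4].rpartition(":")
        counts[(host, port)] += 1
    return counts


# Made-up sample output, for illustration only.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.5:41000          203.0.113.10:9243       ESTABLISHED
tcp        0      0 10.0.0.5:41002          203.0.113.10:9243       ESTABLISHED
tcp        0      1 10.0.0.5:41004          203.0.113.10:9243       SYN_SENT
tcp        0      0 10.0.0.5:52000          198.51.100.7:443        ESTABLISHED
"""

sample_counts = count_established(SAMPLE)
```

On the VM itself, the input could come from something like `subprocess.run(["netstat", "-tn"], capture_output=True, text=True).stdout`. Rows stuck in SYN_SENT are worth watching too, since they would correspond to handshakes that never complete.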

Does Dataflow (Compute Engine) or Elasticsearch have a limit on the number of concurrent connections? I could not find any information online.
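One way to probe for such a limit from the affected VM is to open plain TCP connections one at a time, hold them open, and see at which count a new connection stops succeeding. The sketch below uses only the standard library; `open_until_failure` is a hypothetical helper name, and the host and port you pass in are whatever your deployment uses. It only exercises the TCP handshake (no TLS or HTTP), so it cannot tell which side imposes a limit, only whether one shows up.

```python
import socket


def open_until_failure(host, port, max_conns=100, timeout=5.0):
    """Open TCP connections one by one, keeping each open.

    Returns (opened, error): the number of connections that succeeded and
    the exception that stopped the loop, or None if max_conns was reached.
    """
    held = []
    error = None
    try:
        for _ in range(max_conns):
            try:
                sock = socket.create_connection((host, port), timeout=timeout)
            except OSError as exc:
                # OSError covers timeouts, refusals and unreachable hosts.
                error = exc
                break
            held.append(sock)
        return len(held), error
    finally:
        # Always release the sockets so the probe does not leak connections.
        for sock in held:
            sock.close()
```

If the count consistently stops near the same number as the ESTABLISHED connections seen in netstat, that would point to a per-host or per-source connection cap somewhere on the path rather than to load on the cluster.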


