Monday, 31 July 2023

Running Ray on top of AWS Batch multi-node?

I am interested in running Ray on AWS Batch multi-node. As far as I can tell, this pattern hasn't been tried with Ray before, so there's no documentation on it. I'd still really like to try it, since Ray can also be installed on-premises.

I stood up the AWS Batch multi-node gang-scheduled cluster and ran the following commands:

  1. For the head node:
     import subprocess
     subprocess.Popen(f"ray start --head --node-ip-address {current.parallel.main_ip} --port {master_port} --block", shell=True).wait()
  2. For the worker nodes:
     import subprocess
     import ray
     node_ip_address = ray._private.services.get_node_ip_address()
     subprocess.Popen(f"ray start --node-ip-address {node_ip_address} --address {current.parallel.main_ip}:{master_port} --block", shell=True).wait()
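For reference, the two launch commands above can be sketched as one helper that builds the `ray start` argument list for either role. This is only a sketch: the function names `build_ray_start_command` and `run_ray_node` are made up for illustration, the IPs and port in the example are placeholders, and it uses only the `ray start` flags already shown above.

```python
import shlex
import subprocess
from typing import List


def build_ray_start_command(node_ip: str, head_ip: str, port: int, is_head: bool) -> List[str]:
    """Return the argv list for `ray start` on a head or worker node (hypothetical helper)."""
    cmd = ["ray", "start", "--node-ip-address", node_ip, "--block"]
    if is_head:
        # The head node opens the port that workers connect to.
        cmd += ["--head", "--port", str(port)]
    else:
        # Workers join the cluster at <head_ip>:<port>.
        cmd += ["--address", f"{head_ip}:{port}"]
    return cmd


def run_ray_node(node_ip: str, head_ip: str, port: int, is_head: bool) -> int:
    """Launch `ray start` and block until it exits; returns the exit code."""
    cmd = build_ray_start_command(node_ip, head_ip, port, is_head)
    # Log the exact command so head and worker invocations can be compared.
    print("Launching:", shlex.join(cmd))
    return subprocess.run(cmd, check=False).returncode


# Placeholder addresses, only to show the commands that would be run.
print(shlex.join(build_ray_start_command("10.0.0.1", "10.0.0.1", 6379, is_head=True)))
print(shlex.join(build_ray_start_command("10.0.0.2", "10.0.0.1", 6379, is_head=False)))
```

Passing a list of arguments instead of a single `shell=True` string avoids shell quoting issues, and printing the command on each node makes it easier to see which IP each node actually advertised.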

The head node seems to be working, but the worker nodes are not syncing with it.

I get the following output in stderr:

[2023-07-28 09:25:55,500 I 427 427] global_state_accessor.cc:356: This node has an IP address of 10.14.52.21, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
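Since the message points at an IP address mismatch, one thing that might help narrow it down is to log, on each node, the address the container reports for outbound traffic and compare it with the address passed to `--node-ip-address`. The sketch below is only a diagnostic idea, not a confirmed fix: `detect_outbound_ip` and `same_address` are hypothetical helper names, and the probe target `8.8.8.8:53` is an assumption (a UDP `connect` does not send any packets, it only selects a local interface).

```python
import ipaddress
import socket


def detect_outbound_ip(probe_host: str = "8.8.8.8", probe_port: int = 53) -> str:
    """Return the local IP the OS would use to reach probe_host (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Connecting a UDP socket sends nothing; it only picks a route/interface.
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No route available (e.g. an isolated container): fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


def same_address(a: str, b: str) -> bool:
    """Compare two IP strings after normalising them."""
    return ipaddress.ip_address(a) == ipaddress.ip_address(b)


detected = detect_outbound_ip()
print(f"Detected outbound IP: {detected}")
# Compare against the address from the log line above.
print("Matches 10.14.52.21:", same_address(detected, "10.14.52.21"))
```

If the detected address on a worker differs from the one Ray reports in the log, that would suggest the container sees a different interface than the one the rest of the cluster uses.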

Any insight on how I can get Ray working on AWS Batch multi-node would be much appreciated!


