I am interested in running Ray on AWS Batch multi-node. As far as I can tell, this pattern hasn't been tried with Ray before, so there's no documentation on it. Still, I'd really like to try it, since Ray can be installed on-premises as well.
I stood up the AWS Batch multi-node gang-scheduled cluster and ran the following commands:
- For the head node:
import subprocess
subprocess.Popen(f"ray start --head --node-ip-address {current.parallel.main_ip} --port {master_port} --block", shell=True).wait()
- For the worker nodes:
import subprocess
import ray
# Ask Ray which IP address it detects for this node
node_ip_address = ray._private.services.get_node_ip_address()
subprocess.Popen(f"ray start --node-ip-address {node_ip_address} --address {current.parallel.main_ip}:{master_port} --block", shell=True).wait()
The head node seems to be working, but there's some issue with the worker nodes not syncing with the head node.
I get the following output in stderr:
[2023-07-28 09:25:55,500 I 427 427] global_state_accessor.cc:356: This node has an IP address of 10.14.52.21, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
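Since the message points at an IP mismatch, one check that might help is to log, on each worker, the source address the container would actually use to reach the head node, and compare it with the value passed to --node-ip-address. Below is a minimal sketch using only the standard library. `local_ip_towards` and `build_worker_command` are hypothetical helper names, not part of Ray, and the head IP and port are assumed to come from `current.parallel.main_ip` and `master_port` as in the commands above. This is only a diagnostic aid under those assumptions, not a confirmed fix.

```python
import socket


def local_ip_towards(remote_ip: str, remote_port: int = 6379) -> str:
    """Return the local source address the OS would pick to reach remote_ip.

    Hypothetical helper: connecting a UDP socket sends no packets,
    it only asks the kernel to select a route and a source address.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.connect((remote_ip, remote_port))
        return sock.getsockname()[0]
    finally:
        sock.close()


def build_worker_command(node_ip: str, head_ip: str, head_port: int) -> list:
    """Build the same worker `ray start` command as above, as an argument list.

    Hypothetical helper: passing a list to subprocess avoids shell=True.
    """
    return [
        "ray", "start",
        "--node-ip-address", node_ip,
        "--address", f"{head_ip}:{head_port}",
        "--block",
    ]
```

On a worker, something like `print(local_ip_towards(current.parallel.main_ip, master_port))` could then be compared with the address reported in the stderr message, and the list from `build_worker_command` passed to `subprocess.Popen(...).wait()` in place of the shell string.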
Any insight on how I can get Ray working on AWS Batch multi-node would be much appreciated!
from Running Ray on top of AWS Batch multi-node?