I am creating an inference service with torch, gunicorn, and flask that should use CUDA. To reduce resource requirements, I use gunicorn's preload option so the model is shared between the worker processes. However, this leads to an issue with CUDA. The following code snippet shows a minimal reproducible example:
from flask import Flask, request
import torch

app = Flask('dummy')

# Loaded once at import time; with --preload this runs in the gunicorn
# master process, before the workers are forked.
model = torch.rand(500)
model = model.to('cuda:0')

@app.route('/', methods=['POST'])
def f():
    data = request.get_json()
    # Creating a CUDA tensor inside the forked worker triggers the error.
    x = torch.rand((data['number'], 500))
    x = x.to('cuda:0')
    res = x * model
    return {
        "result": res.sum().item()
    }
Starting the server with CUDA_VISIBLE_DEVICES=1 gunicorn -w 3 -b $HOST_IP:8080 --preload run_server:app lets the service start successfully. However, as soon as the first request is made (curl -X POST -H 'Content-Type: application/json' -d '{"number": 1}' http://$HOST_IP:8080/), the worker throws the following error:
[2022-06-28 09:42:00,378] ERROR in app: Exception on / [POST]
Traceback (most recent call last):
File "/home/user/.local/lib/python3.6/site-packages/flask/app.py", line 2447, in wsgi_app
response = self.full_dispatch_request()
File "/home/user/.local/lib/python3.6/site-packages/flask/app.py", line 1952, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/home/user/.local/lib/python3.6/site-packages/flask/app.py", line 1821, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/home/user/.local/lib/python3.6/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/home/user/.local/lib/python3.6/site-packages/flask/app.py", line 1950, in full_dispatch_request
rv = self.dispatch_request()
File "/home/user/.local/lib/python3.6/site-packages/flask/app.py", line 1936, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/home/user/project/run_server.py", line 14, in f
x = x.to('cuda:0')
File "/home/user/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 195, in _lazy_init
"Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
I load the model in the parent process, and it is accessible to each forked worker process. The problem occurs when a CUDA-backed tensor is first created in a worker: this re-initializes the CUDA context in the worker, which fails because a CUDA context was already initialized in the parent before the fork. If we set x = data['number'] and remove x = x.to('cuda:0'), the inference succeeds.
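For reference, the failure is not specific to Flask or gunicorn at all. The following minimal sketch (assuming a single visible GPU) reproduces the same error with a bare os.fork():

import os
import torch

# Touch CUDA in the parent first, just as --preload does when the model
# is moved to the GPU at import time.
model = torch.rand(500).to('cuda:0')

pid = os.fork()
if pid == 0:
    # Child process: the first CUDA call after the fork raises
    # "RuntimeError: Cannot re-initialize CUDA in forked subprocess."
    x = torch.rand(500).to('cuda:0')
    os._exit(0)
else:
    os.waitpid(pid, 0)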
Adding torch.multiprocessing.set_start_method('spawn') or multiprocessing.set_start_method('spawn') does not change anything, presumably because gunicorn creates its workers with a plain os.fork() and never goes through Python's multiprocessing module, so the start method is never consulted.
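For completeness, this is roughly what I tried, placed at the top of run_server.py before the model is created (force=True only avoids a "context has already been set" error on reload):

import torch.multiprocessing as mp

# Has no visible effect here: gunicorn forks its workers directly and
# does not respect Python's multiprocessing start method.
mp.set_start_method('spawn', force=True)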
One solution would be to drop the --preload option, but that leads to multiple copies of the model in memory and on the GPU, which is exactly what I am trying to avoid.
Is there any possibility to overcome this issue without loading the model separately in each worker process?
from GUnicorn + CUDA: Cannot re-initialize CUDA in forked subprocess