Tuesday, 7 May 2019

Worker process crashes on requests.get() when data is put into input queue before the worker process starts

On macOS High Sierra (version 10.13.6), I run a Python program that does the following:

  • Launches a worker process that consumes data (URL strings) from a multiprocessing.Queue.
  • The worker process sends HTTP requests with the requests package, i.e., it makes requests.get() calls.
  • Some data (a URL string) is fed to the queue even before the worker process is started.

A program satisfying the above conditions leads to the worker process crashing with this error:

objc[24250]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[24250]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.

I have read the following threads:

These threads focus on a workaround for the user. The workaround is defining this environment variable:

OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

In this question, I would like to understand two things. First, why do only certain conditions reproduce the error while others do not? Second, how can the issue be resolved without putting the burden of defining the environment variable OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES on the user?

Minimal example of the issue

import multiprocessing as mp
import requests


def worker(q):
    print('worker: starting ...')

    while True:
        url = q.get()
        if url is None:
            print('worker: exiting ...')
            break

        print('worker: fetching', url)
        response = requests.get(url)
        print('worker: response:', response.status_code)


def master():
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    q.put('https://www.example.com/')

    p.start()
    print('master: started worker')

    q.put('https://www.example.org/')
    q.put('https://www.example.net/')
    q.put(None)
    print('master: sent data')

    print('master: waiting for worker to exit')
    p.join()
    print('master: exiting ...')


master()

Here is the output with the error:

$ python3 foo.py 
master: started worker
master: sent data
master: waiting for worker to exit
worker: starting ...
worker: fetching https://www.example.com/
objc[24250]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[24250]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
master: exiting ...

Resolutions

Here are a few independent changes I have seen resolve the issue. Any one of them alone is sufficient:

  1. The issue seems to occur only when the requests package is used. Commenting out these two lines in worker() resolves the issue.

        # response = requests.get(url)
        # print('worker: response:', response.status_code)
    
    
  2. The issue seems to occur only if the q.put('https://www.example.com/') statement comes before the p.start() statement. Moving that statement after p.start() resolves the issue.

        p.start()
        print('master: started worker')
    
        q.put('https://www.example.com/')
    
    
  3. Setting the environment variable OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES resolves the issue.

    OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES python3 foo.py
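
A detail that may be relevant to item 2 above: in CPython, multiprocessing.Queue starts a background feeder thread on the first put() call, so when q.put() runs before p.start(), the parent process already has a second thread at the moment fork() is called. The sketch below only makes that thread visible. It does not reproduce the crash, and the helper name feeder_thread_demo is made up for illustration.

```python
import multiprocessing as mp
import threading


def thread_names():
    """Return the names of all threads currently alive in this process."""
    return sorted(t.name for t in threading.enumerate())


def feeder_thread_demo():
    """Hypothetical helper: list threads before and after the first q.put()."""
    q = mp.Queue()
    before = thread_names()

    # The first put() makes the queue start its background feeder thread.
    q.put('https://www.example.com/')
    after = thread_names()

    # Drain the queue and shut the feeder thread down cleanly.
    q.get(timeout=5)
    q.close()
    q.join_thread()
    return before, after


if __name__ == '__main__':
    before, after = feeder_thread_demo()
    print('threads before put():', before)
    print('threads after put(): ', after)
```

If the second list contains one more thread than the first, then a p.start() issued after the put() forks a process that is no longer single-threaded, which is the situation the objc error message describes.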
    
    

Non-Resolution

I do not want my users to have to set an environment variable like this in order to use my tool or API, so I tried to find out whether setting it from within my program would resolve the issue. I found that adding this to my code does not resolve the issue:

import os
os.environ['OBJC_DISABLE_INITIALIZE_FORK_SAFETY'] = 'YES'
# Does not resolve the issue!
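
One possible explanation, which I have not confirmed, is that the Objective-C runtime reads this variable only once, when the process starts, so changing os.environ in an already running interpreter comes too late. If that assumption holds, a workaround that might be worth trying is to restart the interpreter with the variable already set. The sketch below shows the idea. The function name reexec_with_fork_safety_disabled and its environ, execve and platform parameters are hypothetical and exist only so that the logic can be exercised without actually replacing the process.

```python
import os
import sys

VAR = 'OBJC_DISABLE_INITIALIZE_FORK_SAFETY'


def reexec_with_fork_safety_disabled(environ=None, execve=os.execve,
                                     platform=sys.platform):
    """Hypothetical helper: restart the interpreter with VAR set to YES.

    Returns False when nothing needs to be done. With the real os.execve,
    the call does not return on success because the process is replaced.
    """
    environ = os.environ if environ is None else environ

    # Only macOS is affected, and the variable may already be set.
    if platform != 'darwin' or environ.get(VAR) == 'YES':
        return False

    env = dict(environ)
    env[VAR] = 'YES'

    # Assumption: re-running sys.executable with sys.argv is good enough.
    # Interpreter flags and "python -m" style invocations are not preserved.
    execve(sys.executable, [sys.executable] + sys.argv, env)
    return True  # reached only when execve is a stub


if __name__ == '__main__':
    reexec_with_fork_safety_disabled()
    print(VAR, '=', os.environ.get(VAR))
```

Restarting the whole process is intrusive for a library to do on behalf of its caller, so this would only be reasonable at the very top of a command-line tool's entry point.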

Questions

  1. Why exactly does this issue occur only under the given conditions, i.e., requests.get() and q.put() before p.start()? In other words, why does the issue disappear if either of these conditions is not met?

  2. If we were to expose something like the minimal example as an API function that another developer might call from their code, is there any clever way to resolve this issue in our code, so that the other developer does not have to set OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES in their shell before running their program that uses our function?

Of course, one possible solution is to redesign the program so that we don't have to feed data into the queue before the worker process starts. The scope of this question, though, is to discuss why this issue occurs only when we feed data into the queue before the worker process starts.
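
Regarding question 2, one option that may be worth trying is the 'spawn' start method, which launches the worker as a fresh interpreter instead of a fork()ed copy of the parent. The sketch below is a variant of the minimal example under that assumption. I have not verified that it removes the crash on High Sierra. The helper spawn_context and the urls parameter of master() are made up for illustration.

```python
import multiprocessing as mp


def spawn_context():
    """Hypothetical helper: a multiprocessing context that uses 'spawn'."""
    return mp.get_context('spawn')


def worker(q):
    # Imported inside the function so that it is loaded only in the worker.
    import requests

    while True:
        url = q.get()
        if url is None:
            break
        response = requests.get(url)
        print('worker: response:', response.status_code)


def master(urls):
    ctx = spawn_context()
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(q,))

    # Data is queued before the worker starts, as in the failing example.
    for url in urls:
        q.put(url)

    p.start()
    q.put(None)
    p.join()
    return p.exitcode


if __name__ == '__main__':
    master(['https://www.example.com/'])
```

With 'spawn', the child re-imports the main module, so the `if __name__ == '__main__':` guard is required, and everything passed to Process must be picklable.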



