Sunday 26 December 2021

Running two TensorFlow trainings in parallel using joblib and Dask

I have the following code, which runs two TensorFlow trainings in parallel on Dask workers running in Docker containers.

To that end, I do the following:

  • I use joblib.delayed to spawn the two training processes.
  • Within each process, I wrap the fit/training logic in with joblib.parallel_backend('dask'): so that each training fans out to N Dask workers.

The problem is that I don't know whether the entire process is thread-safe. Are there any concurrency elements that I'm missing?

import joblib
import tensorflow as tf
from dask.distributed import Client
from scikeras.wrappers import KerasClassifier  # assuming SciKeras, since loss= is passed to the constructor

client = None  # lazily created Dask client, shared via the global in train()

# Second, each submitted call runs this training function
def train(sub_task):

    global client
    if client is None:
        print('connecting')
        client = Client()

    data = some_data_to_train  # placeholder for the real training data

    # Third, run the training itself across the N Dask workers
    with joblib.parallel_backend('dask'):
        X = data[columns]
        y = data[label]

        niceties = dict(verbose=False)
        model = KerasClassifier(build_fn=build_layers,
                loss=tf.keras.losses.MeanSquaredError(), **niceties)
        model.fit(X, y, epochs=500, verbose=0)

# First, submit the function twice using joblib.delayed
delayed_funcs = [joblib.delayed(train)(sub_task) for sub_task in [123, 456]]
parallel_pool = joblib.Parallel(n_jobs=2)
parallel_pool(delayed_funcs)
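
As far as I can tell, joblib.Parallel(n_jobs=2) defaults to the process-based loky backend, so each train() call should get its own copy of the module-level client. If the two trainings were instead run as threads (for example with backend='threading'), the lazy creation of the global client would be a race. Below is a minimal sketch of one way to guard it in that case; the names get_dask_client and _client_lock are illustrative, not part of the original code.

import threading
from dask.distributed import Client

_client = None
_client_lock = threading.Lock()

def get_dask_client():
    # Double-checked locking: only one thread ever constructs the Client,
    # and later callers return the shared instance without taking the lock.
    global _client
    if _client is None:
        with _client_lock:
            if _client is None:
                _client = Client()
    return _client

Inside train(), client = Client() would then become client = get_dask_client().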


from Running two TensorFlow trainings in parallel using joblib and Dask
