Hemant Vishwakarma: KERAS stuck randomly while adding first layer inside docker container

I have created a classification model using Python 3.9.5, Keras 2.4.3 and tensorflow-cpu 2.5.0. The model works fine in my Windows 10 development environment, but when I deploy it in a Docker container the script stops executing and hangs. It becomes unresponsive at the step where I add the first layer. Nothing is printed in the logs either.

This behavior is random while training: it has happened on the 4th run, on the 20th run, and at any number of runs in between. For reproducible results, I train the model in a separate process because of the randomness introduced by the third-party libraries used in my FastAPI application. I also do not see anything out of the ordinary when I run docker ps.
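Because nothing reaches the logs when the process hangs, one way to find out where it is stuck is the standard-library faulthandler module, which can dump the Python stack of every thread once a timeout expires. The following is a minimal sketch, not part of the original application: the name hang_watchdog and the timeout values are illustrative assumptions, and the guarded block would be wherever the model is built.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_seconds=60.0, stream=sys.stderr):
    """Dump every thread's traceback to `stream` if the guarded block
    is still running after `timeout_seconds` (hypothetical helper)."""
    # dump_traceback_later arms a watchdog thread; exit=False leaves the
    # process running after the dump so it can still be inspected.
    faulthandler.dump_traceback_later(
        timeout_seconds, repeat=False, file=stream, exit=False
    )
    try:
        yield
    finally:
        # The block finished (or raised) in time: disarm the pending dump.
        faulthandler.cancel_dump_traceback_later()


# Assumed usage inside the training process:
#
#     with hang_watchdog(timeout_seconds=120.0):
#         model = create_training_model(train_x, train_y, intent_tags)
```

If the dump fires, the traceback shows which call each thread was blocked in, which is usually more informative than the last log line. The stream must be a real file with a file descriptor, since faulthandler writes to it directly.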

Source code / logs

Model Structure

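import logging

import tensorflow as tf
from tensorflow.keras import initializers
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential

# Imports and logger setup are assumed; the original snippet omitted them.
log = logging.getLogger(__name__)


def create_training_model(train_x, train_y, intent_tags):
    try:
        log.info("Initializing Sequential Model")
        model = Sequential()

        log.info("Initializing GlorotNormal")
        initializer = initializers.GlorotNormal()

        # The container hangs here, while the first layer is being added.
        log.info("Adding LSTM as input layer")
        model.add(LSTM(100, input_shape=train_x.shape[1:], return_sequences=False))

        log.info("Adding hidden dense layer")
        model.add(Dense(64, activation='selu', name="layer2",
                        kernel_initializer=initializer))

        log.info("Adding Dropout")
        model.add(Dropout(rate=0.5))

        log.info("Adding Output layer")
        model.add(Dense(len(intent_tags), activation='softmax', name="layer3"))

        log.info("Generating model Summary")
        model.summary()

        log.info("Compiling model")
        model.compile(loss='categorical_crossentropy',
                      optimizer=tf.keras.optimizers.Adamax(learning_rate=0.005),
                      metrics=['accuracy'])

        log.info("Model Compiled successfully")
        return model
    except Exception:
        # Assumed handler; the original except clause was not shown.
        log.exception("Model creation failed")
        raise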

Model fit:

# Assumed import path; the original snippet omitted it.
from tensorflow.keras.callbacks import EarlyStopping, LambdaCallback

model: Sequential = create_training_model(train_x, train_y, intent_tags)
log.info("Model Created")

# Report progress to the parent process after every epoch.
add_into_queue: LambdaCallback = LambdaCallback(
    on_epoch_end=lambda epoch, _: queue.put({
        "type": "progress",
        "sub_type": "training_progress",
        "progress": f"EPOCHS: {epoch + 1}/{configuration_epochs}",
    })
)
es: EarlyStopping = EarlyStopping(monitor='loss', mode='min', verbose=1, patience=30, min_delta=1)
log.info("fitting Training")

history: object = model.fit(train_x, train_y, epochs=200, batch_size=5,
                  verbose=1, validation_data=(test_x, test_y), 
                  callbacks=[es, add_into_queue])

 
if es.stopped_epoch:
    training_completed_message: str = f"Training completed {es.stopped_epoch}/{configuration_epochs} Epoch, Early Stopping applied"
    log.info(training_completed_message)

    progress_data: dict = {"type": "progress", "sub_type": "training_completed", "progress": str(training_completed_message)}
    queue.put(progress_data)
else:
    progress_data: dict = {"type": "progress", "sub_type": "training_completed", "progress": str(configuration_epochs)}
    queue.put(progress_data)
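The two branches above differ only in the progress text, so the message can be built in one place and tested without running Keras. This is a small sketch under that assumption; training_completed_payload is a hypothetical helper name, and the dictionary keys are copied from the snippet above.

```python
from typing import Dict


def training_completed_payload(stopped_epoch: int, configured_epochs: int) -> Dict[str, str]:
    """Build the "training_completed" progress message for the queue."""
    if stopped_epoch:
        # EarlyStopping leaves stopped_epoch at 0 unless it ended training early.
        progress = (
            f"Training completed {stopped_epoch}/{configured_epochs} Epoch, "
            "Early Stopping applied"
        )
    else:
        # Training ran for the full configured number of epochs.
        progress = str(configured_epochs)
    return {"type": "progress", "sub_type": "training_completed", "progress": progress}
```

The caller would then do queue.put(training_completed_payload(es.stopped_epoch, configuration_epochs)) after model.fit returns.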

FastAPI websocket code snippet for training the model:

# Imports assumed; the original snippet omitted them.
import threading
from multiprocessing import Process

try:
    configuration["TRAINING_COUNT"] += 1
    log.info(f"Training Count: {configuration['TRAINING_COUNT']}")
    log.info("Starting training in a separate process")
    multi_process = Process(target=chatbot_training, args=(qestions_answers, training_type, client_id, saved_file_path, queue), name=f"training_process_{client_id}")
    multi_process.start()

    log.info("Initializing thread to send training progress")
    data_progress_thread = threading.Thread(target=send_data_progress_call, args=[websocket, queue], name="data_progress_thread")
    data_progress_thread.daemon = True
    data_progress_thread.start()
except Exception:
    # Assumed handler; the original except clause was not shown.
    log.exception("Failed to start training")
    raise
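One difference between the two environments is worth checking. On Windows, multiprocessing always starts children with the spawn method, while on Linux (and so inside the container) Python 3.9 defaults to fork. Forking a parent that already has running threads can leave the child holding locks that will never be released. Whether that is what happens here is an assumption, not something the logs confirm. The sketch below shows how the start method can be chosen explicitly and how the parent can avoid waiting forever. The names build_training_process and wait_or_terminate are hypothetical.

```python
import multiprocessing as mp


def build_training_process(target, args, name, start_method="spawn"):
    """Create (but do not start) a child process with an explicit start method.

    Returns the context as well, so that queues shared with the child can be
    created from the same context via ctx.Queue().
    """
    # get_context selects the start method for this context only; it does not
    # change the interpreter-wide default.
    ctx = mp.get_context(start_method)
    process = ctx.Process(target=target, args=args, name=name)
    return ctx, process


def wait_or_terminate(process, timeout_seconds):
    """Wait for the child; terminate it if it is still alive after the timeout.

    Returns True if the child finished on its own, False if it was terminated.
    """
    process.join(timeout_seconds)
    if process.is_alive():
        # The child did not finish in time: stop it so the caller can retry.
        process.terminate()
        process.join()
        return False
    return True
```

With spawn, the target must be an importable module-level function and its arguments must be picklable, because the child starts a fresh interpreter instead of inheriting the parent's memory.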

Dockerfile

FROM python:3.9.5-slim-buster
COPY ./ /app
WORKDIR /app
RUN pip install -r requirements.txt && \
python -m nltk.downloader punkt && \
python -m nltk.downloader wordnet && \
python -m nltk.downloader averaged_perceptron_tagger && \
python -m pip cache purge
ENV PYTHONHASHSEED=100
CMD ["python", "./starfighter/app.py"]

Docker Container logs: error logs

Result of docker stats container_name: Container stats

Results of docker top container_name: top command results

Logs on development environment

Steps to reproduce:

Train the model 20-40 times to reproduce the error. To save time, use a small dataset.

Environment information

Server OS = CentOS 7
docker base image = python:3.9.5-slim-buster
Python Version = 3.9.5
tensorflow-cpu==2.5.0
keras==2.4.3
nltk==3.5
pyspellchecker==0.6.2
pandas==1.2.4
fastapi==0.65.1
aiofiles==0.7.0
openpyxl==3.0.7
websockets==9.0.2
numpy==1.19.5
strictyaml
uvicorn==0.13.4
PyYAML==5.4.1

Saturday, 19 November 2022
