I have created a classification model using Python 3.9.5, Keras 2.4.3 and tensorflow-cpu 2.5.0. The model works fine on my Windows 10 development environment, but when I deploy it in a Docker container the script stops executing and hangs. It gets stuck and becomes unresponsive at the step where I add the first layer. Nothing is printed in the logs either.
This behavior is random while training: it has happened on the 4th run, on the 20th run, and at various points in between. For reproducible results I train the model in a separate process, because of the randomness produced by third-party libraries used in my FastAPI application. I also do not see anything out of the ordinary when I run docker ps.
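Since the hang leaves nothing in the logs, one possible way to see where the training process is stuck is the standard-library faulthandler module, which can dump the Python stack of every thread if a block of code runs longer than a timeout. The following is only a sketch: the helper name dump_stacks_if_stuck and the timeout value are assumptions for illustration, not part of the application above.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def dump_stacks_if_stuck(timeout_seconds, file=None):
    """Dump all thread stacks to `file` if the block runs past the timeout.

    Hypothetical helper: `file` must be a real file with a file descriptor
    (stderr by default). The process keeps running after the dump.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_seconds, exit=False, file=target)
    try:
        yield
    finally:
        # Disarm the watchdog once the block has finished (or raised).
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the model-building code in `with dump_stacks_if_stuck(60):` would then write a traceback to stderr (and so to the container logs) if adding a layer takes longer than 60 seconds. The dump only shows Python frames, so a hang inside native TensorFlow code would appear as the last Python call that entered it.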
Source code / logs
Model Structure
try:
    log.info("Initializing Sequential Model")
    model = Sequential()
    log.info("Initializing GlorotNormal")
    initializer = initializers.GlorotNormal()
    log.info("Adding LSTM as input layer")
    # The hang occurs on this call, when the first layer is added
    model.add(LSTM(100, input_shape=train_x.shape[1:],
                   return_sequences=False))
    log.info("Adding hidden dense layer")
    model.add(Dense(64, activation='selu', name="layer2",
                    kernel_initializer=initializer))
    log.info("Adding Dropout")
    model.add(Dropout(rate=0.5))
    log.info("Adding Output layer")
    model.add(Dense(len(intent_tags), activation='softmax', name="layer3"))
    log.info("Generating model Summary")
    model.summary()
    log.info("Compiling model")
    model.compile(loss='categorical_crossentropy',
                  optimizer=tf.keras.optimizers.Adamax(learning_rate=0.005),
                  metrics=['accuracy'])
    log.info("Model Compiled successfully")
Model fit:
model: Sequential = create_training_model(train_x, train_y, intent_tags)
log.info("Model Created")
# Push per-epoch progress onto the queue read by the websocket thread
add_into_queue: LambdaCallback = LambdaCallback(
    on_epoch_end=lambda epoch, _: queue.put({
        "type": "progress",
        "sub_type": "training_progress",
        "progress": f"EPOCHS: {epoch + 1}/{configuration_epochs}",
    }))
es: EarlyStopping = EarlyStopping(monitor='loss', mode='min', verbose=1,
                                  patience=30, min_delta=1)
log.info("fitting Training")
history: object = model.fit(train_x, train_y, epochs=200, batch_size=5,
                            verbose=1, validation_data=(test_x, test_y),
                            callbacks=[es, add_into_queue])
if es.stopped_epoch:
    training_completed_message: str = f"Training completed {es.stopped_epoch}/{configuration_epochs} Epoch, Early Stopping applied"
    log.info(training_completed_message)
    progress_data: dict = {"type": "progress", "sub_type": "training_completed", "progress": str(training_completed_message)}
    queue.put(progress_data)
else:
    progress_data: dict = {"type": "progress", "sub_type": "training_completed", "progress": str(configuration_epochs)}
    queue.put(progress_data)
Fastapi websocket code snippet for training model:
try:
    configuration["TRAINING_COUNT"] += 1
    log.info(f"Training Count: {configuration['TRAINING_COUNT']}")
    log.info("Starts training on separate process")
    multi_process = Process(target=chatbot_training, args=(qestions_answers, training_type, client_id, saved_file_path, queue), name=f"training_process_{client_id}")
    multi_process.start()
    log.info("Initializing thread to send training progress")
    data_progress_thread = threading.Thread(target=send_data_progress_call, args=[websocket, queue], name="data_progress_thread")
    data_progress_thread.daemon = True
    data_progress_thread.start()
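One difference between the two environments that might be relevant here: on Windows, multiprocessing always starts child processes with the "spawn" method, while on Linux (and therefore inside the container) Python 3.9 defaults to "fork". The sketch below shows how a start method can be chosen explicitly through multiprocessing.get_context and how progress messages can be read with a timeout so the parent never blocks forever. The names train_stub and run_training are hypothetical stand-ins and do not come from the application; whether switching the start method has any effect on the hang is an assumption that would need to be tested.

```python
import multiprocessing as mp
import queue as queue_module


def train_stub(progress_queue, epochs):
    """Hypothetical stand-in for chatbot_training: only reports progress."""
    for epoch in range(epochs):
        progress_queue.put({
            "type": "progress",
            "sub_type": "training_progress",
            "progress": f"EPOCHS: {epoch + 1}/{epochs}",
        })
    progress_queue.put({
        "type": "progress",
        "sub_type": "training_completed",
        "progress": str(epochs),
    })


def run_training(epochs, start_method="spawn", timeout=60.0):
    """Run train_stub in a child process and collect its progress messages."""
    ctx = mp.get_context(start_method)  # "spawn" matches the Windows behavior
    progress_queue = ctx.Queue()
    process = ctx.Process(target=train_stub, args=(progress_queue, epochs),
                          name="training_process_demo")
    process.start()
    messages = []
    while True:
        try:
            message = progress_queue.get(timeout=timeout)
        except queue_module.Empty:
            break  # no message within the timeout: the child may be stuck
        messages.append(message)
        if message["sub_type"] == "training_completed":
            break
    process.join(timeout)
    if process.is_alive():
        # Child did not exit in time; stop it instead of waiting forever.
        process.terminate()
        process.join()
    return messages
```

With "spawn", the call to run_training has to sit under an `if __name__ == "__main__":` guard in the entry script, and the target function and its arguments must be picklable.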
Dockerfile
FROM python:3.9.5-slim-buster
COPY ./ /app
WORKDIR /app
RUN pip install -r requirements.txt && \
    python -m nltk.downloader punkt && \
    python -m nltk.downloader wordnet && \
    python -m nltk.downloader averaged_perceptron_tagger && \
    python -m pip cache purge
ENV PYTHONHASHSEED=100
CMD ["python", "./starfighter/app.py"]
Docker Container logs: error logs
Result of docker stats container_name: Container stats
Results of docker top container_name: top command results
Logs on development environment
Steps to reproduce:
Train the model 20 to 40 times to reproduce the error. To save time, use a small dataset.
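The reproduction steps could be automated along the following lines. This is a sketch under assumptions: finished_in_time and first_hanging_run are hypothetical helper names, the 300-second timeout is arbitrary, and the training function passed in would be the application's own. Leaving start_method as None uses the platform default, so inside the container the child is started the same way as in the deployment.

```python
import multiprocessing as mp


def finished_in_time(target, args=(), timeout=300.0, start_method=None):
    """Run `target` in a child process; return False if it outlives `timeout`."""
    ctx = mp.get_context(start_method)  # None selects the platform default
    process = ctx.Process(target=target, args=args)
    process.start()
    process.join(timeout)
    if process.is_alive():
        # Treat a run that is still alive after the timeout as hung.
        process.terminate()
        process.join()
        return False
    return True


def first_hanging_run(run_once, max_runs=40):
    """Call `run_once` up to `max_runs` times.

    Returns the 1-based number of the first run that returned False,
    or None if every run returned True.
    """
    for run_number in range(1, max_runs + 1):
        if not run_once():
            return run_number
    return None
```

For example, `first_hanging_run(lambda: finished_in_time(train, args), max_runs=40)`, with `train` and `args` standing for the real training function and its arguments, would report which repetition first exceeded the timeout.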
Environment information
Server OS = CentOS 7
docker base image = python:3.9.5-slim-buster
Python Version = 3.9.5
tensorflow-cpu==2.5.0
keras==2.4.3
nltk==3.5
pyspellchecker==0.6.2
pandas==1.2.4
fastapi==0.65.1
aiofiles==0.7.0
openpyxl==3.0.7
websockets==9.0.2
numpy==1.19.5
strictyaml
uvicorn==0.13.4
PyYAML==5.4.1