Friday 23 October 2020

Google Colaboratory session abruptly ends when filling up shuffle buffer

I am using Google Colaboratory to train an image recognition model with TensorFlow 1.15. I have uploaded all the required files to Google Drive, and the code runs until the shuffle buffer has almost finished filling. At that point a "^C" appears in the output and the process stops, and I cannot figure out what is going on.

Note: I previously trained the model on my PC and did not delete the checkpoint files generated by that training session. Could that be the problem?
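To see which leftover checkpoints a training directory actually contains, something like the following sketch could be run before starting train.py. It assumes the usual TensorFlow 1.x naming scheme for checkpoint files (`model.ckpt-<step>.index`); the helper name `find_checkpoint_steps` is made up for illustration and is not part of TensorFlow or the Object Detection API.

```python
import os
import re


def find_checkpoint_steps(train_dir):
    """Return the sorted global-step numbers of checkpoints found in train_dir.

    Assumption: checkpoints follow the TF 1.x pattern 'model.ckpt-<step>.index'.
    A missing directory is treated as "no checkpoints".
    """
    pattern = re.compile(r"^model\.ckpt-(\d+)\.index$")
    steps = []
    if not os.path.isdir(train_dir):
        return steps
    for name in os.listdir(train_dir):
        match = pattern.match(name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)


if __name__ == "__main__":
    # 'training/' is the --train_dir value used in the question.
    found = find_checkpoint_steps("training/")
    if found:
        print("Existing checkpoints at steps:", found)
    else:
        print("No checkpoints found; training would start from scratch.")
```

If this lists steps from the earlier PC run, training would most likely try to resume from the latest one, so moving those files aside is a cheap way to rule the checkpoints in or out as a factor.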

Code:

!pip install --upgrade pip
!pip install --upgrade protobuf

!pip install tensorflow-gpu==1.15
import tensorflow as tf
print(tf.__version__)

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at {}'.format(device_name))

!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
gpu = GPUs[0]
def printm():
  process = psutil.Process(os.getpid())
  print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available ), " | Proc size: " + humanize.naturalsize( process.memory_info().rss))
  print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm()

from google.colab import drive
#Mount the drive
drive.mount('/content/gdrive')

#Change to working tensorflow directory on the drive
%cd '/content/gdrive/My Drive/weeds/tensorflow_models/models/research/object_detection/'

!apt-get install protobuf-compiler python-pil python-lxml python-tk
!pip install Cython
%cd /content/gdrive/My Drive/weeds/tensorflow_models/models/research/
!protoc object_detection/protos/*.proto --python_out=.
import os
os.environ['PYTHONPATH'] += ':/content/gdrive/My Drive/weeds/tensorflow_models/models/research/:/content/gdrive/My Drive/weeds/tensorflow_models/models/research/slim'
!python setup.py build
!python setup.py install

import time, psutil
Start = time.time() - psutil.boot_time()
Left = 12*3600 - Start
print('Time remaining for this session is: ', Left/3600)

!pip install tf_slim
%cd /content/gdrive/My Drive/weeds/tensorflow_models/models/research/object_detection/
os.environ['PYTHONPATH'] += ':/content/gdrive/My Drive/weeds/tensorflow_models/models/research/:/content/gdrive/My Drive/weeds/tensorflow_models/models/research/slim'

!python train.py --train_dir=training/ --pipeline_config_path=training/ssd_mobilenet_v1_coco.config --logtostderr

The process terminates here, although it should go on to train the model and report global steps:

2020-10-18 22:42:45.587477: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 168 of 2048
2020-10-18 22:42:55.668973: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 334 of 2048
2020-10-18 22:43:06.067869: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 379 of 2048
2020-10-18 22:43:15.705090: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 503 of 2048
2020-10-18 22:43:26.781151: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 576 of 2048
2020-10-18 22:43:38.120069: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 640 of 2048
2020-10-18 22:43:45.813089: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 708 of 2048
2020-10-18 22:43:58.071040: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 752 of 2048
2020-10-18 22:44:07.506961: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 828 of 2048
2020-10-18 22:44:16.355753: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 908 of 2048
2020-10-18 22:44:25.922348: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 960 of 2048
INFO:tensorflow:global_step/sec: 0
I1018 22:44:34.783342 140291121678080 supervisor.py:1099] global_step/sec: 0
2020-10-18 22:44:36.327813: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1036 of 2048
2020-10-18 22:44:45.651473: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1151 of 2048
2020-10-18 22:44:55.554234: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1186 of 2048
2020-10-18 22:45:05.648568: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1242 of 2048
2020-10-18 22:45:15.644396: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1313 of 2048
2020-10-18 22:45:25.551708: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1386 of 2048
2020-10-18 22:45:35.549003: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1458 of 2048
2020-10-18 22:45:45.648835: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1531 of 2048
2020-10-18 22:45:55.643920: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1602 of 2048
2020-10-18 22:46:05.559702: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1674 of 2048
2020-10-18 22:46:15.547609: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1746 of 2048
2020-10-18 22:46:25.645939: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1819 of 2048
INFO:tensorflow:global_step/sec: 0
I1018 22:46:35.052108 140291121678080 supervisor.py:1099] global_step/sec: 0
2020-10-18 22:46:35.645583: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1891 of 2048
2020-10-18 22:46:45.553851: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1962 of 2048
^C

What can I do to fix this? Training works fine on my PC (NVIDIA GeForce RTX), but I need more computational power, which is why I am using Google Colab.
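If host memory is a suspect (an assumption; nothing in the log above confirms it), a rough back-of-the-envelope estimate of the shuffle buffer's footprint can be compared against the free RAM reported by printm(). The sketch below simply multiplies the buffer size from the log (2048) by an average per-example size. The 2 MB figure is a hypothetical placeholder, not a measured value, and the function names are illustrative only.

```python
def estimate_shuffle_buffer_bytes(buffer_size, bytes_per_example):
    """Rough upper bound on the memory held by a full shuffle buffer.

    Assumes every buffered element occupies about bytes_per_example bytes;
    real usage also depends on how the input pipeline stores elements.
    """
    if buffer_size < 0 or bytes_per_example < 0:
        raise ValueError("sizes must be non-negative")
    return buffer_size * bytes_per_example


def to_gib(num_bytes):
    """Convert a byte count to gibibytes."""
    return num_bytes / float(1024 ** 3)


if __name__ == "__main__":
    # 2048 is the buffer size shown in the log; the per-example size is a
    # hypothetical placeholder and should be replaced with a measured average.
    HYPOTHETICAL_BYTES_PER_EXAMPLE = 2 * 1000 * 1000
    total = estimate_shuffle_buffer_bytes(2048, HYPOTHETICAL_BYTES_PER_EXAMPLE)
    print("Estimated shuffle buffer size: {:.2f} GiB".format(to_gib(total)))
```

If the estimate comes close to the "Gen RAM Free" value printed earlier, trying a smaller shuffle buffer in the pipeline config would be one way to test whether memory is involved.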


