Monday 25 February 2019

Understanding tf.contrib.lite.TFLiteConverter quantization parameters

I'm trying to use UINT8 quantization while converting a TensorFlow model to a TFLite model.

If I set post_training_quantize = True, the model is 4x smaller than the original fp32 model, so I assume the weights are stored as uint8. But when I load the model and check the input type via interpreter_aligner.get_input_details()[0]['dtype'], it is float32. The outputs of the quantized model are about the same as those of the original model.

converter = tf.contrib.lite.TFLiteConverter.from_frozen_graph(
converter.post_training_quantize = True
tflite_model = converter.convert()

Input/output of converted model:

[{'name': 'input_1_1', 'index': 47, 'shape': array([  1, 128, 128,   3], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}]
[{'name': 'global_average_pooling2d_1_1/Mean', 'index': 45, 'shape': array([  1, 156], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}]

Another option is to specify more parameters explicitly. The model is again 4x smaller than the original fp32 model and the input type is now uint8, but the outputs are essentially garbage.

converter = tf.contrib.lite.TFLiteConverter.from_frozen_graph(
converter.post_training_quantize = True
converter.inference_type = tf.contrib.lite.constants.QUANTIZED_UINT8
converter.quantized_input_stats = {input_node_names[0]: (0.0, 255.0)}  # (mean, stddev)
converter.default_ranges_stats = (-100, +100)
tflite_model = converter.convert()

Input/output of converted model:

[{'name': 'input_1_1', 'index': 47, 'shape': array([  1, 128, 128,   3], dtype=int32), 'dtype': <class 'numpy.uint8'>, 'quantization': (0.003921568859368563, 0)}]
[{'name': 'global_average_pooling2d_1_1/Mean', 'index': 45, 'shape': array([  1, 156], dtype=int32), 'dtype': <class 'numpy.uint8'>, 'quantization': (0.7843137383460999, 128)}]
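
For reference, here is a minimal sketch of how I currently read those tuples. It assumes each one is (scale, zero_point) under the affine convention real_value = scale * (quantized_value - zero_point), that quantized_input_stats means real_value = (quantized_value - mean) / std, and that default_ranges_stats is spread over the full uint8 range. The helper names are mine and are not part of the converter API.

```python
import math


def dequantize(q, scale, zero_point):
    """Map a quantized value back to a real one: real = scale * (q - zero_point)."""
    return scale * (q - zero_point)


def params_from_input_stats(mean, std):
    """Hypothetical helper: (scale, zero_point) implied by (mean, std),
    assuming real = (q - mean) / std."""
    # Assumption: the zero point is the mean rounded half up to an integer.
    return 1.0 / std, int(math.floor(mean + 0.5))


def params_from_range(rmin, rmax, qmin=0, qmax=255):
    """Hypothetical helper: (scale, zero_point) mapping [rmin, rmax] onto [qmin, qmax]."""
    scale = (rmax - rmin) / (qmax - qmin)
    # Multiply before dividing so the half-way case (127.5 here) stays exact.
    zero_point = int(math.floor(qmin - rmin * (qmax - qmin) / (rmax - rmin) + 0.5))
    return scale, zero_point


# Values taken from the converter settings above.
in_scale, in_zp = params_from_input_stats(0.0, 255.0)
out_scale, out_zp = params_from_range(-100.0, 100.0)

print("input :", in_scale, in_zp)
print("output:", out_scale, out_zp)
print("output range:", dequantize(0, out_scale, out_zp), dequantize(255, out_scale, out_zp))
```

If this reading is right, the input tuple is just 1/255 with zero point 0, and the output tuple is my (-100, +100) range divided into 256 levels, i.e. a step of roughly 0.78 per level, which may be too coarse for the real activations.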

So my questions are:

  1. What happens when only post_training_quantize = True is set? I.e., why does the first case work fine while the second doesn't?
  2. How do I estimate the mean, std and range parameters for the second case?
  3. Inference looks faster in the second case. Is that because the model input is uint8?
  4. What does 'quantization': (0.0, 0) mean in the first case, and what do 'quantization': (0.003921568859368563, 0) and 'quantization': (0.7843137383460999, 128) mean in the second?
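
To make question 2 concrete, here is a sketch of the relationship I suspect, assuming the converter interprets quantized_input_stats as real_value = (quantized_value - mean) / std. The function names are hypothetical, and the [-1, 1] range is only an illustration, not my actual preprocessing.

```python
def input_stats_for_range(rmin, rmax, qmin=0, qmax=255):
    """Hypothetical helper: (mean, std) such that (q - mean) / std
    maps qmin to rmin and qmax to rmax."""
    std = (qmax - qmin) / (rmax - rmin)
    mean = qmin - rmin * std
    return mean, std


def to_real(q, mean, std):
    """Real value the model would see for a uint8 input q (assumed convention)."""
    return (q - mean) / std


# Float inputs normalized to [0, 1] would correspond to the stats I used.
stats_unit = input_stats_for_range(0.0, 1.0)
# Float inputs normalized to [-1, 1], shown only for comparison.
stats_sym = input_stats_for_range(-1.0, 1.0)

print(stats_unit, stats_sym)
print(to_real(0, *stats_sym), to_real(255, *stats_sym))
```

Under this assumption, the (mean, std) pair should follow from whatever float range the original model was trained on, so (0.0, 255.0) would only be right if the fp32 model expects inputs in [0, 1].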

