Wednesday, 28 November 2018

PyTesseract call working very slow when used along with multiprocessing

I've a function that takes in a list of images and produces the output, in a list, after applying OCR to the image. I have an another function that controls the input to this function, by using multiprocessing. So, when I have a single list (i.e. no multiprocessing), each image of the list took ~ 1s, but when I increased the lists that had to be processed parallely to 4, each image took an astounding 13s.

To understand where the problem really is, I tried to create a minimal working example of the problem. Here I have two functions eat25 and eat100 which open an image name and feed it to the OCR, that uses the API pytesseract. eat25 does it 25 times, and eat100 does it 100 times.

My aim here is to run eat100 without multiprocessing, and eat25 with multiprocessing (with 4 processes). This, theoretically, should take 4 times less time that eat100 if I have 4 separate processors (I have 2 cores with 2 threads per core, thus CPU(s) = 4 (correct me if I'm wrong here)).

But all theory laid wasted when I saw that the code didn't even respond after printing "Processing 0" 4 times. The single processor function eat100 worked fine though.

I had tested a simple range cubing function, and it did work well with multiprocessing, so my processors do work well for sure. The only culprits here could be:

  • pytesseract: See this
  • Bad code? Something I am not doing right.

`

from pathos.multiprocessing import ProcessingPool
from time import time 
from PIL import Image
import pytesseract as pt
def eat25(name):
    for i in range(25):
        print('Processing :'+str(i))
        pt.image_to_string(Image.open(name),lang='hin+eng',config='--psm 6')
def eat100(name):
    for i in range(100):
        print('Processing :'+str(i))
        pt.image_to_string(Image.open(name),lang='hin+eng',config='--psm 6')
st = time()
eat100('normalBox.tiff')
en = time()
print('Direct :'+str(en-st))
#Using pathos
def caller():
    pool = ProcessingPool()
    pool.map(eat25,['normalBox.tiff','normalBox.tiff','normalBox.tiff','normalBox.tiff'])
if (__name__=='__main__'):
    caller()
en2 = time()

print('Pathos :'+str(en2-en))

So, where the problem really is? Any help is appreciated!

EDIT: The image normalBox.tiff can be found here. I would be glad if people reproduce the code and check if the problem continues.



from PyTesseract call working very slow when used along with multiprocessing

No comments:

Post a Comment