I have a function that takes a list of images, applies OCR to each image, and returns the results as a list. Another function controls the input to this function using multiprocessing. With a single list (i.e. no multiprocessing), each image took about 1 s, but when I increased the number of lists processed in parallel to 4, each image took an astounding 13 s.
To understand where the problem really is, I tried to create a minimal working example. Here I have two functions, eat25 and eat100, which open the image name and feed it to the OCR engine through the pytesseract API. eat25 does this 25 times, and eat100 does it 100 times.
My aim here is to run eat100 without multiprocessing, and eat25 with multiprocessing (with 4 processes). In theory, this should take a quarter of the time that eat100 takes if I have 4 separate processors (I have 2 cores with 2 threads per core, thus CPU(s) = 4; correct me if I'm wrong here).
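The CPU count and the best-case figure can be checked from Python. This is only a rough sketch: ideal_time is an illustrative helper, not part of any library, and it assumes the work splits evenly across workers. Two hyperthreads on one core share execution units, so 4 logical CPUs on 2 physical cores will usually give less than a 4x speedup in practice.

```python
import os


def ideal_time(serial_seconds, workers, logical_cpus=None):
    """Best-case wall time if the work splits evenly across the workers.

    The speedup is capped by the number of logical CPUs, because extra
    workers beyond that just take turns on the same hardware.
    """
    cpus = logical_cpus or os.cpu_count() or 1
    return serial_seconds / min(workers, cpus)


if __name__ == "__main__":
    # Logical CPUs as seen by the OS (cores x threads per core).
    print("Logical CPUs:", os.cpu_count())
    # 100 images at ~1 s each, split over 4 workers on 4 logical CPUs.
    print("Best case (s):", ideal_time(100.0, workers=4, logical_cpus=4))
```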
But all that theory was laid to waste when the code stopped responding after printing "Processing :0" 4 times. The single-process function eat100 worked fine, though.
I had tested a simple function that cubes a range of numbers, and it worked well with multiprocessing, so my processors are fine. The only culprits here could be:
1. pytesseract: see this.
2. Bad code: something I am not doing right.
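A sanity check of that kind might look like the sketch below. It is a reconstruction for illustration, not the exact script: it uses the standard library multiprocessing.Pool rather than pathos, and the names cube, serial_cubes and parallel_cubes are made up here.

```python
from multiprocessing import Pool


def cube(x):
    """Pure CPU work with no external program involved."""
    return x ** 3


def serial_cubes(n):
    """Reference result computed in a single process."""
    return [cube(i) for i in range(n)]


def parallel_cubes(n, workers=4):
    """Same computation spread over a pool of worker processes."""
    with Pool(workers) as pool:
        return pool.map(cube, range(n))


if __name__ == "__main__":
    # If the pool itself were broken, this comparison would fail or hang.
    print(parallel_cubes(10) == serial_cubes(10))
```

If this runs and finishes promptly, the pool machinery is fine, which points the suspicion at what happens inside each OCR call.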
    from pathos.multiprocessing import ProcessingPool
    from time import time
    from PIL import Image
    import pytesseract as pt

    def eat25(name):
        # OCR the same image 25 times (a quarter of the work)
        for i in range(25):
            print('Processing :' + str(i))
            pt.image_to_string(Image.open(name), lang='hin+eng', config='--psm 6')

    def eat100(name):
        # OCR the same image 100 times (the full workload)
        for i in range(100):
            print('Processing :' + str(i))
            pt.image_to_string(Image.open(name), lang='hin+eng', config='--psm 6')

    # Baseline: single process
    st = time()
    eat100('normalBox.tiff')
    en = time()
    print('Direct :' + str(en - st))

    # Using pathos: 4 tasks of 25 images each
    def caller():
        pool = ProcessingPool()
        pool.map(eat25, ['normalBox.tiff', 'normalBox.tiff', 'normalBox.tiff', 'normalBox.tiff'])

    if __name__ == '__main__':
        caller()
    en2 = time()
    print('Pathos :' + str(en2 - en))
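One experiment that might help separate the two suspects is sketched below. It rests on an assumption I have not verified for this setup: that the installed tesseract binary is a 4.x build with OpenMP enabled, in which case each OCR call may start several threads of its own and four worker processes would compete for the same 4 logical CPUs. OMP_THREAD_LIMIT is a standard OpenMP environment variable, and pytesseract runs tesseract as a subprocess that inherits the environment. The helpers ocr_repeatedly, timed and run_parallel are illustrative names, and the sketch uses the standard library Pool instead of pathos.

```python
import os

# Assumption: tesseract was built with OpenMP. Cap its internal threads
# before any OCR call so each worker process uses a single thread.
os.environ["OMP_THREAD_LIMIT"] = "1"

from multiprocessing import Pool
from time import perf_counter


def ocr_repeatedly(args):
    """Run OCR on the same image `repeats` times; args is (name, repeats)."""
    name, repeats = args
    # Imported here so the helpers below can be used without pytesseract.
    import pytesseract as pt
    from PIL import Image

    for _ in range(repeats):
        with Image.open(name) as img:
            pt.image_to_string(img, lang="hin+eng", config="--psm 6")
    return repeats


def timed(fn, *args):
    """Return (result, elapsed wall-clock seconds) for fn(*args)."""
    start = perf_counter()
    result = fn(*args)
    return result, perf_counter() - start


def run_parallel(name, workers=4, repeats=25):
    """Split the workload over `workers` processes, `repeats` images each."""
    with Pool(workers) as pool:
        return pool.map(ocr_repeatedly, [(name, repeats)] * workers)
```

Calling timed(ocr_repeatedly, ('normalBox.tiff', 100)) and timed(run_parallel, 'normalBox.tiff') from under an if __name__ == '__main__': guard gives the two figures to compare. If the parallel run is still slow with the thread limit set, thread oversubscription is probably not the cause.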
So, where does the problem really lie? Any help is appreciated!
EDIT: The image normalBox.tiff can be found here. I would be glad if people could reproduce the code and check whether the problem persists.
from PyTesseract call working very slow when used along with multiprocessing