Hemant Vishwakarma: Preserving original text indentation/structure with Tesseract OCR 4.x

Monday, 6 January 2020

I'm struggling with Tesseract OCR. I have an image of a blood examination report that contains an indented table. Although Tesseract recognizes the characters very well, the table's structure isn't preserved in the final output.

I read the related discussions and found the option preserve_interword_spaces=1. The result improved slightly, but as you can see, it still isn't perfect.

Any suggestions?

Update:

I tried Tesseract v5.0 and the result is the same.

Code:

Tesseract version is 4.0.0.20190314

from PIL import Image
import pytesseract

# preserve_interword_spaces=1 keeps the spacing between words,
# --oem 1 selects the LSTM engine,
# --psm 1 is automatic page segmentation with OSD (orientation and script detection)

custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'

# default_config = r'-l eng+ita'

extracted_text = pytesseract.image_to_string(Image.open('referto-1.jpg'), config=custom_config)

print(extracted_text)

# Save the result to a text file (UTF-8, since the Italian text may contain accented characters)

with open("referto.txt", "w", encoding="utf-8") as text_file:
    text_file.write(extracted_text)
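
One direction that might be worth exploring, sketched here and not tested on the report image: instead of relying on image_to_string to keep the spacing, take the word bounding boxes from pytesseract.image_to_data and place each word at a text column proportional to its x position. The helper names rebuild_layout and ocr_with_layout, and the char_width and row_tolerance values (both in pixels), are assumptions for illustration and would need tuning for the actual scan.

```python
def rebuild_layout(data, char_width=10, row_tolerance=10):
    """Rebuild a column-aligned text layout from word bounding boxes.

    `data` is a dict of parallel lists with the keys "text", "left",
    "top" and "height", as returned by
    pytesseract.image_to_data(..., output_type=Output.DICT).
    `char_width` (assumed pixels per output character) and
    `row_tolerance` (assumed max vertical distance within one row)
    are illustrative defaults, not values taken from the question.
    """
    # Collect non-empty words with their vertical center and left edge.
    words = []
    for i, text in enumerate(data["text"]):
        if not str(text).strip():
            continue  # skip page/block/line entries, which carry no text
        center_y = data["top"][i] + data["height"][i] / 2
        words.append((center_y, data["left"][i], str(text)))
    words.sort()

    # Group words into rows by vertical position, top to bottom.
    rows = []
    row_y = None
    for center_y, left, text in words:
        if row_y is None or center_y - row_y > row_tolerance:
            rows.append([])
            row_y = center_y
        rows[-1].append((left, text))

    # Place each word at the character column matching its x position.
    lines = []
    for row in rows:
        line = ""
        for left, text in sorted(row):
            col = left // char_width
            if line and col <= len(line):
                col = len(line) + 1  # keep at least one space between words
            line = line.ljust(col) + text
        lines.append(line)
    return "\n".join(lines)


def ocr_with_layout(image_path,
                    config="-c preserve_interword_spaces=1 --oem 1 --psm 1",
                    lang="eng+ita"):
    """Run Tesseract on an image and return text with rebuilt columns."""
    # Imported here so rebuild_layout can be used without Tesseract installed.
    from PIL import Image
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(
        Image.open(image_path), lang=lang, config=config,
        output_type=Output.DICT,
    )
    return rebuild_layout(data)
```

Calling print(ocr_with_layout('referto-1.jpg')) would then print the rebuilt text; the columns only line up when it is viewed in a monospaced font, and a multi-page input would need the rows grouped per page as well.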

Result of the image_to_string call, with comparison: