I'm struggling with Tesseract OCR. I have a blood examination image, it has a table with indentation. Although tesseract recognizes the characters very well, its structure isn't preserved in the final output.
I read the other related discussions and I found the option preserve_interword_spaces=1. The result became slightly better but as you can see, it isn't perfect.
Any suggestions?
Update:
I tried Tesseract v5.0 and the result is the same.
Code:
Tesseract version is 4.0.0.20190314
from PIL import Image
import pytesseract
# Preserve interword spaces is set to 1, oem = 1 is LSTM,
# PSM = 1 is Automatic page segmentation with OSD - Orientation and script detection
custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
# default_config = r'-c -l eng+ita'
extracted_text = pytesseract.image_to_string(Image.open('referto-1.jpg'), config=custom_config)
print(extracted_text)
# saving to a txt file
with open("referto.txt", "w") as text_file:
text_file.write(extracted_text)
Result with comparison:
GITHUB:
I have created a GitHub repository if you want to try it yourself.
Thanks for your help and your time
from Preserving original text indentation/structure with Tesseract OCR 4.x

I'm having the same problem but no luck in solving it.
ReplyDelete