I've written a script in Python using PyPDF2, PIL and pytesseract to extract the text from the first page of a scanned PDF file. However, when I run the script below, it throws the following error when it reaches the line img = Image.open(pdfReader.getPage(0)).convert('L').
Script I have tried so far:
import PyPDF2
import pytesseract
from PIL import Image
pdfFileObj = open(r'C:\Users\WCS\Desktop\Scan project\Scanned.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
img = Image.open(pdfReader.getPage(0)).convert('L')
imagetext = pytesseract.image_to_string(img)
print(imagetext)
pdfFileObj.close()
Error I'm getting:
Traceback (most recent call last):
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\SO.py", line 8, in <module>
    img = Image.open(pdfReader.getPage(0)).convert('L')
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\lib\site-packages\PIL\Image.py", line 2554, in open
    fp = io.BytesIO(fp.read())
AttributeError: 'PageObject' object has no attribute 'read'
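As far as I can tell from the last frame of the traceback, Image.open ends up calling read() on whatever it is given, so it seems to expect a file name or a file-like object rather than a PDF page. The stdlib-only sketch below reproduces the same kind of failure. FakePage and open_like_pillow are hypothetical stand-ins I made up for illustration; they are not PyPDF2 or Pillow code.

```python
import io


class FakePage:
    """Hypothetical stand-in for a PDF page object: it has no read() method."""


def open_like_pillow(fp):
    # Simplified imitation of the line shown in the traceback:
    # the argument is treated as a file-like object and read in full.
    return io.BytesIO(fp.read())


# A file-like object works, because it provides read().
buf = open_like_pillow(io.BytesIO(b"raw image bytes"))
print(buf.getvalue())

# An object without read() fails the same way the page object does.
try:
    open_like_pillow(FakePage())
except AttributeError as exc:
    message = str(exc)
    print(message)
```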
How can I make this work?
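One direction that might work is to render the page to an image first and pass that image to pytesseract, instead of handing the page object to Image.open. The sketch below is only an assumption on my part: it relies on the third-party pdf2image package (which needs Poppler installed) in addition to pytesseract (which needs the Tesseract binary), and the function name ocr_first_page and the dpi default are made up for illustration.

```python
from pathlib import Path


def ocr_first_page(pdf_path, dpi=300):
    """Render page 1 of a scanned PDF to an image and return its OCR text."""
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(pdf_path)

    # Third-party imports are kept local so the path check above runs
    # even on a machine where the OCR dependencies are not installed.
    from pdf2image import convert_from_path  # assumption: requires Poppler
    import pytesseract  # assumption: requires the Tesseract binary

    # Rasterize only the first page; the result is a list of PIL images.
    pages = convert_from_path(str(pdf_path), dpi=dpi, first_page=1, last_page=1)
    img = pages[0].convert('L')  # grayscale, as in the original script
    return pytesseract.image_to_string(img)
```

It would be called with the same path as in my script, for example print(ocr_first_page(r'C:\Users\WCS\Desktop\Scan project\Scanned.pdf')).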
from Can't execute the following script successfully