Sunday, 21 April 2019

pdf2image: how to read PDFs that need "Enable All Features" (Windows)

I have a PDF that I would like to read in Python. When I open it on my machine in Acrobat, I get the message below, and when I click "Enable All Features", the file shows its actual content.

When I read the file in Python, how can I achieve the same thing, so that Python reads the actual text instead of the text below?

"Please wait... If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document. You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by visiting http://www.adobe.com/go/reader_download. For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader. Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries. "

My code is below:

import os

from pdf2image import convert_from_path
from PIL import Image
import pytesseract

homepath = r'C:\Users\xxxx'
files = "bbbb.pdf"
PDFfilename = os.path.join(homepath, files)

# Render every page of the PDF to a PIL image at 500 DPI
pages = convert_from_path(PDFfilename, 500)

for i, page in enumerate(pages, start=1):
    # Save the page as a JPEG, then run OCR on the saved file
    out_path = os.path.join(homepath, 'out' + str(i) + '.jpg')
    page.save(out_path, 'JPEG')
    text = pytesseract.image_to_string(Image.open(out_path))
    print(text)
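
As a small aid while debugging, here is a minimal sketch that flags OCR output which is only the "Please wait..." page quoted above. It does not make pdf2image render the real content; it only detects the problem so the file can be handled some other way. The helper name `is_placeholder_text` and the marker list are my own assumptions, not part of pdf2image or pytesseract. As far as I know, this message is usually the static fallback page of a dynamic (XFA) form, but that is an assumption about this particular file.

```python
# Distinctive fragments of the Adobe placeholder page quoted in the question.
PLACEHOLDER_MARKERS = (
    "please wait",
    "proper contents of the document",
)


def is_placeholder_text(text):
    """Return True if OCR output looks like the 'Please wait...' placeholder.

    Hypothetical helper: whitespace is collapsed and case is ignored so that
    line breaks introduced by OCR do not prevent a match.
    """
    normalized = " ".join(text.lower().split())
    return all(marker in normalized for marker in PLACEHOLDER_MARKERS)


# Sample input copied from the message quoted in the question.
sample = (
    "Please wait...\n"
    "If this message is not eventually replaced by the proper\n"
    "contents of the document, your PDF viewer may not be able to\n"
    "display this type of document."
)
print(is_placeholder_text(sample))
print(is_placeholder_text("Invoice number 12345"))
```

In the loop above, calling `is_placeholder_text(text)` right after the OCR step would show which pages came back as the placeholder rather than real content.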


