Saturday 13 March 2021

How to remove watermark from PDF file using Python's PyPDF2 lib

I have wrote a code that extracts the text from PDF file with Python and PyPDF2 lib. Code works good for most docs but sometimes it returns some strange characters. I think thats because PDF has watermark over the page so it does not recognise the text:

import requests
from io import StringIO, BytesIO
import PyPDF2

def pdf_content_extraction(pdf_link):

    all_pdf_content = ''

    #sending requests
    response = requests.get(pdf_link)
    my_raw_data = response.content


    pdf_file_text = 'PDF File: ' + pdf_link + '\n\n'
    #extract text page by page
    with BytesIO(my_raw_data) as data:
        read_pdf = PyPDF2.PdfFileReader(data)

        #looping trough each page
        for page in range(read_pdf.getNumPages()):
            page_content = read_pdf.getPage(page).extractText()
            page_content = page_content.replace("\n\n\n", "\n").strip()

            #store data into variable for each page
            pdf_file_text += page_content + '\n\nPAGE '+ str(page+1) + '/' + str(read_pdf.getNumPages()) +'\n\n\n'

    all_pdf_content += pdf_file_text + "\n\n"
        
    return all_pdf_content



pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'

print(pdf_content_extraction(pdf_link))

This is the result that I'm getting:

#$%˘˘
&'(˝˙˝˙)*+"*˜
˜*
,*˜*˜ˆ+-*˘!(
.˜($*%(#%*˜-/
"*
*˜˜0!0˘˘*˜˘˜ˆ
+˜(%
*
*(+%*˜+"*˜'
$*1˜ˆ
...
...

My question is, how can I fix this problem? Is there a way to remove watermark from page or something like that? I mean, maybe this problem can be fixed in some other way, maybe the problem is not in that watermark/logo?



from How to remove watermark from PDF file using Python's PyPDF2 lib

No comments:

Post a Comment