Hemant Vishwakarma: How to remove watermark from PDF file using Python's PyPDF2 lib

Saturday 13 March 2021

How to remove watermark from PDF file using Python's PyPDF2 lib

I have written code that extracts the text from a PDF file with Python and the PyPDF2 library. The code works well for most documents, but sometimes it returns strange characters. I think that's because the PDF has a watermark over the page, so the text is not recognised:

import requests
from io import BytesIO
import PyPDF2  # PyPDF2 1.x API (PdfFileReader, getPage, extractText)

def pdf_content_extraction(pdf_link):
    """Download the PDF at pdf_link and return its text, page by page."""

    # download the PDF into memory
    response = requests.get(pdf_link)
    my_raw_data = response.content

    pdf_file_text = 'PDF File: ' + pdf_link + '\n\n'

    # extract text page by page
    with BytesIO(my_raw_data) as data:
        read_pdf = PyPDF2.PdfFileReader(data)
        num_pages = read_pdf.getNumPages()

        # loop through each page
        for page in range(num_pages):
            page_content = read_pdf.getPage(page).extractText()
            page_content = page_content.replace("\n\n\n", "\n").strip()

            # append the page text followed by a "PAGE n/total" marker
            pdf_file_text += page_content + '\n\nPAGE ' + str(page + 1) + '/' + str(num_pages) + '\n\n\n'

    return pdf_file_text + "\n\n"



pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'

print(pdf_content_extraction(pdf_link))

This is the result that I'm getting:

#$%˘˘
&'(˝˙˝˙)*+"*˜
˜*
,*˜*˜ˆ+-*˘!(
.˜($*%(#%*˜-/
"*
*˜˜0!0˘˘*˜˘˜ˆ
+˜(%
*
*(+%*˜+"*˜'
$*1˜ˆ
...
...

My question is: how can I fix this problem? Is there a way to remove the watermark from the page? Or can the problem be fixed in some other way, and perhaps the watermark/logo is not the cause at all?
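
To narrow down which documents are affected, a rough check on the extracted text can help. The sketch below is not part of the original code and does not fix anything: it only flags page text that consists mostly of symbols rather than letters and digits. The function names readable_ratio and looks_garbled and the 0.5 threshold are assumptions chosen for illustration.

```python
def readable_ratio(text):
    """Return the share of non-whitespace characters that are letters or digits."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        # nothing was extracted at all, so treat it as unreadable
        return 0.0
    readable = sum(ch.isalnum() for ch in visible)
    return readable / len(visible)


def looks_garbled(text, threshold=0.5):
    """Flag text whose readable share is below the (assumed) threshold."""
    return readable_ratio(text) < threshold
```

Calling looks_garbled(page_content) inside the page loop would show whether every page of a problem PDF is affected or only some of them, which may hint at whether the watermark is really the cause.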
