I have wrote a code that extracts the text from PDF file with Python and PyPDF2 lib. Code works good for most docs but sometimes it returns some strange characters. I think thats because PDF has watermark over the page so it does not recognise the text:
import requests
from io import StringIO, BytesIO
import PyPDF2
def pdf_content_extraction(pdf_link):
all_pdf_content = ''
#sending requests
response = requests.get(pdf_link)
my_raw_data = response.content
pdf_file_text = 'PDF File: ' + pdf_link + '\n\n'
#extract text page by page
with BytesIO(my_raw_data) as data:
read_pdf = PyPDF2.PdfFileReader(data)
#looping trough each page
for page in range(read_pdf.getNumPages()):
page_content = read_pdf.getPage(page).extractText()
page_content = page_content.replace("\n\n\n", "\n").strip()
#store data into variable for each page
pdf_file_text += page_content + '\n\nPAGE '+ str(page+1) + '/' + str(read_pdf.getNumPages()) +'\n\n\n'
all_pdf_content += pdf_file_text + "\n\n"
return all_pdf_content
pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
print(pdf_content_extraction(pdf_link))
This is the result that I'm getting:
#$%˘˘
&'(˝˙˝˙)*+"*˜
˜*
,*˜*˜ˆ+-*˘!(
.˜($*%(#%*˜-/
"*
*˜˜0!0˘˘*˜˘˜ˆ
+˜(%
*
*(+%*˜+"*˜'
$*1˜ˆ
...
...
My question is, how can I fix this problem? Is there a way to remove watermark from page or something like that? I mean, maybe this problem can be fixed in some other way, maybe the problem is not in that watermark/logo?
from How to remove watermark from PDF file using Python's PyPDF2 lib
No comments:
Post a Comment