Sunday, 6 October 2019

Python Data Extraction from an Encrypted PDF

I am a recent graduate in pure mathematics who has taken only a few basic programming courses. I am doing an internship and have an internal data analysis project: I have to analyze the internal PDFs from the last few years. The PDFs are "secured"; in other words, they are encrypted. However, we own all of these documents and can read them manually. The goal is to read them with Python because it is the language that most members of my team, who are mostly scientists, have some familiarity with.

First, I tried to read the PDFs with some Python libraries. However, the Python libraries that I found do not read encrypted PDFs. At that time, I could not export the information using Adobe Reader either.

Second, I decided to decrypt the PDFs. I was successful using the Python library pikepdf. Pikepdf works very well! However, the decrypted PDFs cannot be read with the Python libraries from the previous point (PyPDF2 and Tabula) either. We have made some progress, because with Adobe Reader I can now export the information from the decrypted PDFs, but the goal is to do everything with Python.

The code that I am showing works perfectly with unencrypted PDFs, but not with encrypted PDFs. It does not work with the decrypted PDFs produced by pikepdf either.

I did not write the code. I found it in the documentation of the Python libraries pikepdf and Tabula. The PyPDF2 solution was written by Al Sweigart in his book "Automate the Boring Stuff with Python," which I highly recommend. I also checked that the code works, with the limitations that I explained above.

First question: why can't I read the decrypted files, if the programs work with files that have never been encrypted?

Second question: can we read the decrypted files with Python somehow? Which library can do it, or is it impossible? Are all decrypted PDFs extractable?

Thank you for your time and help!!!

I found these results using Python 3.7, Windows 10, Jupyter Notebooks, and Anaconda 2019.07.

Python

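import pikepdf

# Opening without a password works only if the PDF has no user (open) password.
with pikepdf.open("encrypted.pdf") as pdf:
    num_pages = len(pdf.pages)
    # Copied from the pikepdf docs: this drops the last page. Remove it to keep every page.
    del pdf.pages[-1]
    # By default, save() writes the copy without encryption.
    pdf.save("decrypted.pdf")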

import tabula  # the module name is lowercase
tabula.read_pdf("decrypted.pdf", stream=True)

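import PyPDF2

# PyPDF2 API as used in "Automate the Boring Stuff with Python"
pdfFileObj = open("decrypted.pdf", "rb")
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfReader.numPages                # number of pages
pageObj = pdfReader.getPage(0)    # first page
pageObj.extractText()             # text of the first page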

With Tabula, I am getting the message "the output file is empty."

With PyPDF2, I am getting only '\n'
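Before comparing libraries, it may help to confirm what state each file is actually in. The following is only a sketch, assuming a pikepdf release that exposes the `is_encrypted` flag and the `allow` permissions; the helper names `is_blank` and `pdf_status` are hypothetical and not part of any library.

```python
def is_blank(text):
    """Return True when extracted text holds nothing but whitespace."""
    return text is None or text.strip() == ""


def pdf_status(path, password=""):
    """Report page count, encryption flag, and extraction permission of a PDF."""
    import pikepdf  # imported lazily so is_blank() can be used without pikepdf

    with pikepdf.open(path, password=password) as pdf:
        return {
            "pages": len(pdf.pages),
            "encrypted": pdf.is_encrypted,
            "extract_allowed": pdf.allow.extract,
        }


# Example with the file names used in this question:
# print(pdf_status("encrypted.pdf"))
# print(pdf_status("decrypted.pdf"))
```

If the decrypted file reports `encrypted` as False and the text is still blank, the cause is probably not the encryption itself.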

UPDATE 10/3/2019 Pdfminer.six (Version November 2018)

I got better results using the solution posted by DuckPuncher. For the decrypted file, I got the labels but not the data. The same happens with the encrypted file. For the file that has never been encrypted, it works perfectly. As I need both the data and the labels from encrypted or decrypted files, this code does not work for me. For that analysis, I used pdfminer.six, a Python library; the version I used was released in November 2018. Pdfminer.six depends on the library pycryptodome. According to its documentation, "PyCryptodome is a self-contained Python package of low-level cryptographic primitives."

The code is in the Stack Exchange question: Extracting text from a PDF file using PDFMiner in python?
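Getting the labels but not the data is what one would expect if the values are stored in interactive form fields rather than in the page text, since text extractors read only the page content. The sketch below is one way to check that, assuming the file uses a standard AcroForm and that pikepdf is installed; `collect_fields` and `read_form_fields` are hypothetical helper names, not part of pikepdf.

```python
def collect_fields(nodes, prefix=""):
    """Walk an AcroForm field tree and return {full_field_name: value_as_text}."""
    values = {}
    for node in nodes:
        partial = node.get("/T")  # partial field name, may be missing on widgets
        if partial is None:
            name = prefix
        elif prefix:
            name = prefix + "." + str(partial)
        else:
            name = str(partial)

        value = node.get("/V")  # field value, missing when the field is empty
        if value is not None and name:
            values[name] = str(value)

        kids = node.get("/Kids")
        if kids is not None:
            # Child fields extend the parent's name; widgets without /T reuse it.
            values.update(collect_fields(kids, name))
    return values


def read_form_fields(path, password=""):
    """Open a PDF with pikepdf and return its form field values, if any."""
    import pikepdf  # imported lazily so collect_fields() works on plain dicts

    with pikepdf.open(path, password=password) as pdf:
        acroform = pdf.Root.get("/AcroForm")
        if acroform is None:
            return {}
        return collect_fields(acroform.get("/Fields", []))


# Example with the file name used in this question:
# print(read_form_fields("decrypted.pdf"))
```

Forms built with Adobe LiveCycle may keep their values in an `/XFA` entry of the AcroForm dictionary instead, which this sketch does not read.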

I would love it if you repeated my experiment. Here is the description:

1) Run the code mentioned in this question with any PDF that has never been encrypted.

2) Do the same with a "Secure" PDF (this is a term that Adobe uses); I am calling it the encrypted PDF. Use a generic form that you can find using Google. After you download it, you need to fill in the fields. Otherwise, you would be checking for labels but not fields. The data is in the fields.

3) Decrypt the encrypted PDF using pikepdf. This will be the decrypted PDF.

4) Run the code again using the decrypted PDF.
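The four steps above can be sketched as a small comparison harness. This is an illustration only: `compare_extraction` and `pdfminer_text` are hypothetical helpers, and `pdfminer_text` assumes a pdfminer.six release newer than the November 2018 one, since it relies on `pdfminer.high_level.extract_text`.

```python
def compare_extraction(extractor, paths):
    """Run one extractor over several PDFs and count non-whitespace characters."""
    report = {}
    for label, path in paths.items():
        try:
            text = extractor(path) or ""
            report[label] = len("".join(text.split()))
        except Exception as error:  # keep going so every file gets an entry
            report[label] = f"failed: {error}"
    return report


def pdfminer_text(path):
    """Extractor based on pdfminer.six (assumes high_level.extract_text exists)."""
    from pdfminer.high_level import extract_text  # lazy import

    return extract_text(path)


# Example with the three files of the experiment:
# paths = {
#     "never encrypted": "unencrypted.pdf",
#     "encrypted": "encrypted.pdf",
#     "decrypted": "decrypted.pdf",
# }
# print(compare_extraction(pdfminer_text, paths))
```

Any function that takes a path and returns text can be passed as the extractor, so the same harness can be reused for PyPDF2 or another library.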

UPDATE 10/4/2019 Camelot (Version July 2019)

I found the Python library Camelot.

It is very powerful and works with Python 3.7. It is also very easy to use. However, you also need to install Ghostscript; otherwise, it will not work. You also need to install Pandas.

The author of the program, Vinayak Mehta, shares this code in a YouTube video that you should watch if you are interested in the topic.

I checked the code and it works with unencrypted files. However, it does not work with encrypted or decrypted files, and that is my goal.

Camelot is designed to extract tables from PDFs.

Here is the code:

Python

import camelot
import pandas

name_table = camelot.read_pdf("unencrypted.pdf")
type(name_table)  # a camelot TableList

# Each item in the list is a camelot Table object
name_table[0]

first_table = name_table[0]

# Convert the camelot Table object to a pandas DataFrame
first_table.df

# This creates an Excel file.
# The same can be done with csv, json, html, or sqlite.
first_table.to_excel("unencrypted.xlsx")

# To get all the tables of the PDF, loop over the list.
for table in name_table:
    print(table.df)

