Saturday, 20 May 2023

Redacted / highlighted PDF becomes too big with this script. Can it be improved?

A few years ago I asked a question about this: I wanted to extract my Kindle annotations from the MyClippings.txt file and use them to annotate a PDF version of the original text. This is very useful for academic reading (e.g., an annotated original PDF is easier to skim and to cite). A few months ago I found a solution in the following script, which uses PyMuPDF (imported as fitz).

import fitz

# the document to annotate
doc = fitz.open("text_to_highlight.pdf")

# the text to be marked
text_list = [
    "first piece of text",
    "second piece of text",
    "third piece of text",
]

for page in doc:
    for text in text_list:
        # find every occurrence of the text on this page, as quads
        rl = page.search_for(text, quads=True)
        page.add_highlight_annot(rl)

# save to a new PDF
doc.save("text_annotated.pdf")
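
For context, text_list in the script above is hard-coded, while in practice it would come from the clippings file. A minimal sketch of that step follows. It assumes the usual English-language layout of the file: entries separated by a line of ten equals signs, a title line, a metadata line containing the word "Highlight", and then the highlighted text. The function name parse_clippings is hypothetical and not part of any library.

```python
def parse_clippings(raw):
    """Return (title, text) pairs for the highlight entries in a clippings dump.

    Assumed layout (not verified against every Kindle firmware):
    title line, metadata line, blank line, highlighted text, then '=========='.
    """
    highlights = []
    for entry in raw.split("=========="):
        # keep only the non-empty lines of this entry
        lines = [ln.strip() for ln in entry.strip().splitlines()]
        lines = [ln for ln in lines if ln]
        if len(lines) < 3:
            continue  # bookmarks and empty trailing chunks carry no text
        title, meta, text = lines[0], lines[1], " ".join(lines[2:])
        if "Highlight" not in meta:
            continue  # skip notes and anything that is not a highlight
        # some files prefix titles with a byte-order mark; drop it
        highlights.append((title.lstrip("\ufeff"), text))
    return highlights
```

The result could then be filtered by title to build text_list for a single book, e.g. `text_list = [text for title, text in parse_clippings(raw) if title.startswith("Book A")]`, where "Book A" stands in for the real title.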

I have since run into a new problem, however. On a 700-page book, the output PDF becomes extremely large (more than 500 MB). (I also had to run the script several times, because applying all the annotations at once made it crash; this is not necessarily a problem in itself, but it suggests inefficiency.) Is there an approach, presumably Python-based, that would avoid such a bloated output?
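
One direction that might be worth trying is sketched below. It is an untested assumption, not a confirmed fix for the 500 MB output. It removes duplicate or empty search strings, so the same passage is not highlighted twice, and it saves with PyMuPDF's garbage and deflate options, which ask the library to drop unused objects and compress streams. The helper names unique_snippets and highlight_compact are hypothetical.

```python
def unique_snippets(snippets):
    """Drop empty strings and duplicates while keeping the original order."""
    seen = set()
    result = []
    for s in snippets:
        s = s.strip()
        if s and s not in seen:
            seen.add(s)
            result.append(s)
    return result


def highlight_compact(src_path, dst_path, snippets):
    """Highlight each snippet and save with cleanup and compression enabled."""
    import fitz  # PyMuPDF; imported here so unique_snippets works without it

    wanted = unique_snippets(snippets)
    doc = fitz.open(src_path)
    for page in doc:
        for text in wanted:
            quads = page.search_for(text, quads=True)
            if quads:  # only annotate pages where the snippet occurs
                page.add_highlight_annot(quads)
    # garbage=4 removes unused and duplicate objects; deflate=True compresses streams
    doc.save(dst_path, garbage=4, deflate=True)
    doc.close()
```

It would be called as `highlight_compact("text_to_highlight.pdf", "text_annotated.pdf", text_list)`. Whether this brings the file down to a reasonable size depends on what is actually inflating it, which the question does not establish.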



