Friday, 1 April 2022

How to trim (crop) bottom whitespace of a PDF document, in memory

I am using wkhtmltopdf to render a (Django-templated) HTML document to a single-page PDF file. I would like to either render it with the correct height straight away (which I've failed to do so far) or render it too tall and trim it afterwards. I'm using Python.

Attempt type 1

  • Use wkhtmltopdf with --page-height to render a very long single-page PDF with a lot of extra space
  • Use pdfCropMargins to trim: crop(["-p4", "100", "0", "100", "100", "-a4", "0", "-28", "0", "0", "input.pdf"])

The PDF is cropped perfectly, leaving 28 units of margin at the bottom, but I had to go through the file system to run the crop command. The tool seems to expect an input file and an output file, and it also creates temporary files along the way, so I can't use it.
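To make the crop arguments above easier to follow, here is a small sketch of the margin arithmetic. It assumes my reading of the pdfCropMargins options is right: -p4 gives the percentage of each existing margin to keep, -a4 gives an absolute offset subtracted from each margin (so a negative value adds margin back), and both list the sides as left, bottom, right, top. The function name and the sample margin sizes are made up for illustration.

```python
def retained_margin(margin, percent_retain, absolute_offset):
    """New margin size for one side, in points.

    Assumed reading of -p4/-a4: keep `percent_retain` percent of the
    existing margin, then subtract `absolute_offset`.
    """
    return margin * percent_retain / 100.0 - absolute_offset


# Sides in the assumed order: left, bottom, right, top.
# Hypothetical margins around the content of a very tall page.
margins = [40.0, 5000.0, 40.0, 30.0]
percent_retain = [100, 0, 100, 100]   # from "-p4 100 0 100 100"
absolute_offset = [0, -28, 0, 0]      # from "-a4 0 -28 0 0"

new_margins = [
    retained_margin(m, p, a)
    for m, p, a in zip(margins, percent_retain, absolute_offset)
]
print(new_margins)  # [40.0, 28.0, 40.0, 30.0]
```

Under that reading, only the bottom margin changes: all of its whitespace is dropped and 28 points are added back, which matches the 28 units seen in the output.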

Attempt type 2

  • Use wkhtmltopdf with default parameters to render a multi-page PDF
  • Use PyPDF4 (or PyPDF2) to read the file and combine pages into a long, single page

The PDF renders acceptably in most cases. However, if the last page happens to have very little content, a lot of extra white space is left at the bottom.
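The leftover space follows from the geometry of stacking pages, which can be sketched without any PDF library. The helpers below are hypothetical, and so are the page height of 842 points and the position of the last page's content. PDF coordinates start at the bottom-left corner with y increasing upward, so the first page gets the largest offset.

```python
def stack_pages(page_heights):
    """Return (total_height, y_offsets) for stacking pages top to bottom.

    Each offset is the y coordinate at which the bottom edge of that page
    is placed on the combined page.
    """
    total = sum(page_heights)
    offsets = []
    remaining = total
    for height in page_heights:
        remaining -= height
        offsets.append(remaining)
    return total, offsets


def trimmed_height(page_heights, last_page_content_bottom, margin=28.0):
    """Combined height if the page is cut `margin` points below the content.

    `last_page_content_bottom` is the y coordinate of the lowest content on
    the last page, measured from that page's own bottom edge.
    """
    unused = max(0.0, last_page_content_bottom - margin)
    return sum(page_heights) - unused


heights = [842, 842, 842]
total, offsets = stack_pages(heights)
print(total, offsets)                  # 2526 [1684, 842, 0]
print(trimmed_height(heights, 700.0))  # 1854.0
```

In this example the last page only uses its top 142 points, yet it still contributes its full 842 points to the combined page. Cutting 28 points below the content would remove the other 672 points.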

Ideal scenario

The ideal scenario would involve a function that takes HTML and renders it into a single-page PDF with the expected amount of white space at the bottom. I would be happy to render the PDF with wkhtmltopdf, since it returns bytes, and then process those bytes to remove any extra white space. But I don't want to involve the file system; I want to perform all operations in memory. Perhaps I can somehow inspect the PDF directly and remove the white space manually, or do some HTML magic to determine the render height beforehand?
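As a rough illustration of the bytes-in, bytes-out part only, the sketch below wraps the PDF bytes in io.BytesIO and raises the bottom edge of the first page's MediaBox. It assumes the pypdf package (the current successor of PyPDF2) rather than the PyPDF4 API mentioned above, and it assumes the page has no separate CropBox. The function name is hypothetical, and `content_bottom` is an input the caller must supply: finding where the content actually ends is the open part of the question and is not solved here.

```python
import io


def crop_bottom_in_memory(pdf_bytes, content_bottom, margin=28.0):
    """Return new PDF bytes whose first page ends `margin` points below the content.

    `content_bottom` is the y coordinate (in points, measured from the page's
    bottom edge) of the lowest content; how to detect it is left open.
    """
    # Imported lazily; assumes the pypdf package is installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(io.BytesIO(pdf_bytes))
    page = reader.pages[0]

    # Raise the bottom edge of the page, but never below its current position.
    box = page.mediabox
    new_bottom = max(float(box.bottom), content_bottom - margin)
    box.lower_left = (float(box.left), new_bottom)

    # Write the modified page to an in-memory buffer instead of a file.
    writer = PdfWriter()
    writer.add_page(page)
    out = io.BytesIO()
    writer.write(out)
    return out.getvalue()
```

With this shape, the bytes returned by wkhtmltopdf could be passed straight in and the result returned to the caller, with no temporary files on either side.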



