For any given page of a PDF file, it is possible to list the fonts used (here, page 10):
$ pdffonts -f 10 -l 10 file.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
[none]                               Type 3            Custom           yes no  no      12  0
DIIDPF+ArialMT                       CID TrueType      Identity-H       yes yes yes     95  0
DIIEDH+Arial                         CID TrueType      Identity-H       yes yes no     101  0
DIIEBG+TimesNewRomanPSMT             CID TrueType      Identity-H       yes yes yes    106  0
DIIEDG+Arial                         CID TrueType      Identity-H       yes yes no     112  0
Arial                                TrueType          WinAnsi          yes no  no     121  0
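As an aside, that check can also be scripted. Below is a minimal sketch, assuming pdffonts (from poppler-utils) is on PATH; the "likely problematic" heuristic (Type 3 fonts, or fonts without a ToUnicode map, i.e. uni = no) and the whitespace-based parsing (which assumes font names contain no spaces) are my assumptions, not anything pdffonts guarantees:

import subprocess

def list_problematic_fonts(pdf_filepath, page):
    """Flag fonts on one page that commonly break text extraction.
    Heuristic (assumption): Type 3 fonts and fonts without a
    ToUnicode map (uni = no) are treated as likely problematic."""
    result = subprocess.run(
        ["pdffonts", "-f", str(page), "-l", str(page), pdf_filepath],
        capture_output=True, text=True, check=True,
    )
    problematic = []
    for line in result.stdout.splitlines()[2:]:  # skip the two header lines
        tokens = line.split()
        if len(tokens) < 8:  # malformed or empty line
            continue
        # Counting from the right: uni, sub, emb sit at -3, -4, -5, since
        # the type column may itself contain spaces ("CID TrueType", "Type 3").
        name, uni = tokens[0], tokens[-3]
        font_type = " ".join(tokens[1:-6])
        if font_type == "Type 3" or uni == "no":
            problematic.append(name)
    return problematic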
I need to identify likely problematic fonts based on the pdffonts output, and count the characters on a page by font. I achieved that with the following snippet:
from collections import Counter

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer


def count_font_occurrences_by_page(pdf_filepath):
    # Layout of the first page only; extract_pages() is a generator.
    page_layout = next(extract_pages(pdf_filepath))
    fonts = []
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        fonts.append(character.fontname)
    return Counter(fonts)
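Per-font percentages then follow directly from the Counter:

counts = count_font_occurrences_by_page("file.pdf")
total = sum(counts.values())
for fontname, n in counts.most_common():
    print(f"{fontname}: {n} chars ({n / total:.1%})")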
I'm looking for a straightforward way to do the same (or close to it; I only need something like the percentage of font usage on a single PDF page) without iterating over every character, if possible, or without pulling in a whole module like pdfminer just for one function and one PDF page at a time. It would also be helpful if I could do something similar by (re)using the minimum amount of code from pdfminer, as it is built in a modular way.
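One possibility is to aggregate per text span instead of per character. A minimal sketch, assuming PyMuPDF (fitz) is an acceptable substitute for pdfminer: page.get_text("dict") reports one font name per span, so the loop touches far fewer objects, and span-length counting approximates (rather than exactly reproduces) a per-character tally:

from collections import Counter

import fitz  # PyMuPDF


def font_usage_percentages(pdf_filepath, page_number=0):
    """Approximate per-font character share for one page,
    iterating spans (one font per span) instead of characters."""
    with fitz.open(pdf_filepath) as doc:
        page = doc[page_number]
        counts = Counter()
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):  # image blocks have no lines
                for span in line["spans"]:
                    counts[span["font"]] += len(span["text"])
    total = sum(counts.values()) or 1
    return {font: n / total for font, n in counts.items()}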
from How to count characters based on its font?