Wednesday, 14 July 2021

How to count characters by font?

For every page in a given PDF file it's possible to list the fonts used:

$ pdffonts -f 10 -l 10 file.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
[none]                               Type 3            Custom           yes no  no      12  0
DIIDPF+ArialMT                       CID TrueType      Identity-H       yes yes yes     95  0
DIIEDH+Arial                         CID TrueType      Identity-H       yes yes no     101  0
DIIEBG+TimesNewRomanPSMT             CID TrueType      Identity-H       yes yes yes    106  0
DIIEDG+Arial                         CID TrueType      Identity-H       yes yes no     112  0
Arial                                TrueType          WinAnsi          yes no  no     121  0
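The table above is fixed-width, so one possible way to turn it into data is to slice every row at the column boundaries given by the dashed ruler line. The sketch below does that. The names `parse_pdffonts`, `likely_problematic` and `run_pdffonts` are made up for illustration, and the rule "not embedded, or no Unicode map" is only an assumed heuristic for "likely problematic", not something pdffonts itself reports. Very long font names may overflow their column, which this sketch does not handle.

```python
import re
import subprocess

# Keys assigned to the seven columns, in the order pdffonts prints them.
KEYS = ["name", "type", "encoding", "emb", "sub", "uni", "object_id"]


def parse_pdffonts(output):
    """Parse the text printed by pdffonts into a list of dicts, one per font."""
    lines = [line for line in output.splitlines() if line.strip()]
    # The ruler is the line made only of dashes and spaces; it marks column widths.
    ruler_idx = next(
        (i for i, line in enumerate(lines) if set(line.strip()) <= {"-", " "}),
        None,
    )
    if ruler_idx is None:
        return []
    spans = [m.span() for m in re.finditer(r"-+", lines[ruler_idx])]
    fonts = []
    for line in lines[ruler_idx + 1:]:
        values = [line[start:end].strip() for start, end in spans]
        fonts.append(dict(zip(KEYS, values)))
    return fonts


def likely_problematic(fonts):
    """Assumed heuristic: flag fonts that are not embedded or lack a Unicode map."""
    return [f for f in fonts if f["emb"] == "no" or f["uni"] == "no"]


def run_pdffonts(pdf_path, page):
    """Run pdffonts for a single page (requires the pdffonts binary on PATH)."""
    result = subprocess.run(
        ["pdffonts", "-f", str(page), "-l", str(page), pdf_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

With these helpers, `likely_problematic(parse_pdffonts(run_pdffonts("file.pdf", 10)))` would return the rows of the listing above whose `emb` or `uni` column is `no`.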

I need to identify likely problematic fonts from the pdffonts output and count the characters set in each font. I achieved the counting with the following snippet:

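from collections import Counter

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer


def count_fonts_ocurrencies_by_page(pdf_filepath):
    # extract_pages() yields one layout object per page; next() takes only the first page.
    page_layout = next(extract_pages(pdf_filepath))

    fonts = []

    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    # Skip LTAnno objects (inserted spaces/newlines), which carry no font.
                    if isinstance(character, LTChar):
                        fonts.append(character.fontname)

    # Map each font name to the number of characters set in it.
    return Counter(fonts)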

I'm looking for a straightforward way to do the same, or something close: I only need something like the percentage of font usage on a single PDF page. Ideally it would avoid iterating over every character, or avoid pulling in a whole module such as pdfminer for just one function and one PDF page at a time. It would also help if I could do something similar by reusing a minimal amount of pdfminer code, since it's built in a modular way.
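To make the "percentage of font usage" concrete, here is a minimal sketch that converts per-font character counts into percentage shares. It assumes the counts are already in a `Counter` keyed by font name; the function name `font_shares` and the sample counts are hypothetical.

```python
from collections import Counter


def font_shares(font_counts):
    """Return each font's share of all counted characters, as a percentage."""
    total = sum(font_counts.values())
    if total == 0:
        # No characters were counted, so there is nothing to report.
        return {}
    # most_common() orders the result from the most used font to the least used.
    return {font: 100.0 * n / total for font, n in font_counts.most_common()}


# Hypothetical counts, for illustration only.
example = Counter({"DIIDPF+ArialMT": 3, "Arial": 1})
print(font_shares(example))  # {'DIIDPF+ArialMT': 75.0, 'Arial': 25.0}
```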


