Hemant Vishwakarma: read single page .tif files as multipage.tiff from filename

Monday, 18 October 2021

UPDATE: I found out that creating PDF files from the OCRed files is not reasonable, so it would be better to leave the files as they are, without conversion. I still have the problem that some images belong together as pages of one document, while others are one-pagers.

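import glob

import pandas as pd
import pytesseract
from PIL import Image

data = []
listOfPages = glob.glob(r"C:/Users/name/test/*.tif")
for entry in listOfPages:
    # Tesseract language codes are three letters: "eng", not "en"
    text = pytesseract.image_to_string(Image.open(entry), lang="eng")
    data.append(text)
df0 = pd.DataFrame(data, columns=["raw_text"])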

This creates a pandas DataFrame in which each row holds the text of one single-page .tif file. How can I concatenate the .tif files (see the original question) to get the full multipage string?
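One possible way to get one string per document is sketched below. It assumes that pages of the same document share a filename stem and differ only by a trailing three-digit counter such as _000 or _001. The helper names document_name, ocr_documents and tesseract_ocr are hypothetical, and the OCR step is passed in as a function so the grouping logic can be tried without Tesseract installed.

```python
import re
from pathlib import Path


def document_name(path):
    """Strip an assumed trailing page counter such as "_000" from the stem."""
    return re.sub(r"_\d{3}$", "", Path(path).stem)


def ocr_documents(paths, ocr):
    """Return one row per document, with its pages' text joined in filename order."""
    texts = {}
    for path in sorted(paths):  # zero-padded counters sort correctly as strings
        texts.setdefault(document_name(path), []).append(ocr(path))
    return [
        {"document": doc, "raw_text": "\n".join(pages)}
        for doc, pages in texts.items()
    ]


def tesseract_ocr(path):
    """OCR one image file; imports are local so the grouping runs without them."""
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang="eng")
```

With pandas available, something like pd.DataFrame(ocr_documents(glob.glob(r"C:/Users/name/test/*.tif"), tesseract_ocr)) should then give one row per document instead of one row per file.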

Original question: I want to convert the single-page .tif files in my_folder to multipage .pdf files in pdf_folder. TIFFs that have no subsequent pages should also be converted, to single-page PDFs. Ultimately, I want a text PDF created by OCR-ing multiple image-based TIFF files.

I therefore infer which .tif files belong together from the filename pattern:

Drs_1_00109_1_ADS.tif
Drs_1_00099_1_ADS_000.tif
Drs_1_00099_1_ADS_001.tif
Drs_1_00099_1_ADS_002.tif
Drs_1_00186_1_ADS.tif
Drs_1_00192_1_ADS_000.tif
Drs_1_00192_1_ADS_001.tif
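As a rough illustration of how such groups could be inferred, the sketch below parses each filename stem with a regular expression. It assumes that the page counter is always exactly three digits at the end of the stem and that files without a counter are single-page documents. The names PAGE_SUFFIX and group_pages are made up for this example.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed pattern: an optional three-digit page counter ("_000", "_001", ...)
# at the end of the stem marks the pages of one document.
PAGE_SUFFIX = re.compile(r"(?P<doc>.+?)(?:_(?P<page>\d{3}))?")


def group_pages(paths):
    """Map each document name to its page files, ordered by page counter."""
    groups = defaultdict(list)
    for path in paths:
        match = PAGE_SUFFIX.fullmatch(Path(path).stem)
        # Files without a counter are treated as page 0 of their own document.
        page = int(match["page"]) if match["page"] else 0
        groups[match["doc"]].append((page, str(path)))
    return {doc: [p for _, p in sorted(pages)] for doc, pages in groups.items()}


names = [
    "Drs_1_00109_1_ADS.tif",
    "Drs_1_00099_1_ADS_000.tif",
    "Drs_1_00099_1_ADS_001.tif",
    "Drs_1_00099_1_ADS_002.tif",
    "Drs_1_00192_1_ADS_001.tif",
    "Drs_1_00192_1_ADS_000.tif",
]
for doc, pages in group_pages(names).items():
    print(doc, pages)
```

A single-page file whose stem happens to end in an underscore and three digits would be misread as a page of another document, so the pattern should be checked against the real folder contents.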

For example, Drs_1_00192_1_ADS_000.tif and Drs_1_00192_1_ADS_001.tif (two single-page images) should be converted into the two-page Drs_1_00192_1_ADS.pdf, which contains the text of both images. The code works for single-page PDF creation. How can I make it work for the multipage pattern in the filenames?

Thanks!
