Monday, 18 October 2021

read single page .tif files as multipage.tiff from filename

UPDATE: I found out it is unreasonable to create PDF files from OCRed files.

So it would be better to leave the files as they are, without conversion. I still have the problem that some images belong together as pages of one document, while others are single-page documents.

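import glob
import pandas as pd
import pytesseract
from PIL import Image

data = []
listOfPages = glob.glob(r"C:/Users/name/test/*.tif")
for entry in listOfPages:
    # Tesseract uses three-letter language codes: "eng", not "en"
    text = pytesseract.image_to_string(
            Image.open(entry), lang="eng"
        )
    data.append(text)
df0 = pd.DataFrame(data, columns=['raw_text'])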

This creates a pandas DataFrame in which each row holds the text of one single-page .tif file. How can I concatenate the .tif files that belong together (see the original question) in order to get the full multipage string?
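One possible way to build the multipage string, as a minimal sketch: it assumes the pages have already been grouped per document into a dict mapping a document name to its page files in page order. The names `default_ocr` and `ocr_documents` are hypothetical helpers, not part of pytesseract, and the OCR step is passed in as a parameter so it can be swapped out.

```python
def default_ocr(path):
    """OCR one image file with pytesseract (imported lazily, assumed installed)."""
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang="eng")


def ocr_documents(groups, ocr=default_ocr, separator="\n"):
    """Return {document name: text of all its pages joined in page order}.

    `groups` is assumed to map a document name to an ordered list of page files.
    """
    return {
        doc: separator.join(ocr(page) for page in pages)
        for doc, pages in groups.items()
    }
```

The result could then be turned into a DataFrame with one row per document, for example with pd.DataFrame(list(texts.items()), columns=['document', 'raw_text']), where texts is the dict returned by ocr_documents.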

Original question: I want to convert the single-page .tif files in my_folder to multipage .pdf files in pdf_folder. TIFFs that have no subsequent pages should also be converted to single-page PDFs. Ultimately, I want a text PDF created by OCR-ing multiple image-based TIFF files.

Therefore I infer the groups of .tif files that belong together from the filename pattern:

Drs_1_00109_1_ADS.tif
Drs_1_00099_1_ADS_000.tif
Drs_1_00099_1_ADS_001.tif
Drs_1_00099_1_ADS_002.tif
Drs_1_00186_1_ADS.tif
Drs_1_00192_1_ADS_000.tif
Drs_1_00192_1_ADS_001.tif
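The grouping implied by these filenames could be sketched as follows. It assumes that a multipage document is marked only by a trailing underscore plus three-digit page number (such as _000) before the extension, and that files without that suffix are single-page documents. `group_pages` and `PAGE_SUFFIX` are hypothetical names introduced for illustration.

```python
import re
from collections import defaultdict
from pathlib import Path

# Optional three-digit page suffix such as "_000" at the end of the file stem
# (an assumption based on the example filenames).
PAGE_SUFFIX = re.compile(r"_(\d{3})$")


def group_pages(paths):
    """Map each document name to its page files, ordered by page number."""
    groups = defaultdict(list)
    for path in paths:
        stem = Path(path).stem
        match = PAGE_SUFFIX.search(stem)
        if match:
            # Multipage document: strip the suffix and keep the page number.
            doc, page = stem[:match.start()], int(match.group(1))
        else:
            # No suffix: treat the file as a single-page document.
            doc, page = stem, 0
        groups[doc].append((page, str(path)))
    return {doc: [p for _, p in sorted(pages)] for doc, pages in groups.items()}
```

Calling group_pages(glob.glob(r"C:/Users/name/test/*.tif")) would then give one entry per document, with Drs_1_00192_1_ADS mapped to its two page files and Drs_1_00109_1_ADS mapped to a single file.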

For example, Drs_1_00192_1_ADS_000.tif and Drs_1_00192_1_ADS_001.tif (two single-page images) should be converted to the two-page Drs_1_00192_1_ADS.pdf containing the text of both images. The code works for single-page PDF creation. How can I make it work for the multipage pattern in the filenames?

Thanks!


