Reading single-page .tif files as a multipage .tiff, based on file names

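import glob

import pandas as pd
import pytesseract
from PIL import Image

data = []
listOfPages = glob.glob(r"C:/Users/name/test/*.tif")
for entry in listOfPages:
    # Tesseract uses three-letter language codes: "eng", not "en"
    text = pytesseract.image_to_string(Image.open(entry), lang="eng")
    data.append(text)
df0 = pd.DataFrame(data, columns=['raw_text'])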

1 answer

User

#1 · Posted 2024-04-23 22:10:24

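I can do this by globbing for all files ending in 000.tif, each of which may be the first page of a multi-page document, and then adding the files whose suffix increments from there (001, 002, and so on) until one is missing. The collected pages are then saved together as a single PDF.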

#!/usr/bin/env python3

import os
from PIL import Image
from glob import glob

# Iterate over all files ending in '000.tif' and find their friends (subsequent pages)
for filename in glob('*_000.tif'):
   # Work out stem of filename
   stem = filename.replace('_000.tif', '')
   print(f'DEBUG: stem={stem}')

   # Build list of images to be put in this PDF
   images = [Image.open(filename)]
   index = 1
   while True:
      this = f'{stem}_{index:03d}.tif'
      print(f'DEBUG: this={this}')
      if os.path.isfile(this):
         images.append(Image.open(this))
         index += 1
      else:
         break
   output = stem + '.pdf'
   print(f'DEBUG: Saving {len(images)} pages to {output}')
   images[0].save(output, save_all=True, append_images=images[1:])

Sample output

DEBUG: stem=Drs_1_00192_1_ADS
DEBUG: this=Drs_1_00192_1_ADS_001.tif
DEBUG: this=Drs_1_00192_1_ADS_002.tif
DEBUG: this=Drs_1_00192_1_ADS_003.tif
DEBUG: this=Drs_1_00192_1_ADS_004.tif
DEBUG: Saving 4 pages to Drs_1_00192_1_ADS.pdf
DEBUG: stem=Drs_1_00099_1_ADS
DEBUG: this=Drs_1_00099_1_ADS_001.tif
DEBUG: this=Drs_1_00099_1_ADS_002.tif
DEBUG: this=Drs_1_00099_1_ADS_003.tif
DEBUG: Saving 3 pages to Drs_1_00099_1_ADS.pdf
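
Since the question asks for a multipage .tiff rather than a PDF, the same page-collecting loop can probably be pointed at a TIFF output instead. The following is a minimal sketch, assuming Pillow's TIFF writer with `save_all` and `append_images`. The function name `combine_to_multipage_tiff` and the `_multipage.tiff` output suffix are hypothetical choices for illustration, not part of the original answer.

```python
import os
from glob import glob

from PIL import Image


def combine_to_multipage_tiff(directory="."):
    """Merge <stem>_000.tif, <stem>_001.tif, ... into <stem>_multipage.tiff.

    Returns a dict mapping each output path to the number of pages written.
    The function name and output suffix are illustrative assumptions.
    """
    suffix = "_000.tif"
    written = {}
    for first in sorted(glob(os.path.join(directory, "*" + suffix))):
        # Strip only the trailing '_000.tif' to get the document stem
        stem = first[: -len(suffix)]

        # Collect consecutive pages until the first missing index
        paths = [first]
        index = 1
        while os.path.isfile(f"{stem}_{index:03d}.tif"):
            paths.append(f"{stem}_{index:03d}.tif")
            index += 1

        # Output name chosen so it can never match the '*_000.tif' pattern
        output = stem + "_multipage.tiff"
        pages = [Image.open(path) for path in paths]
        try:
            pages[0].save(output, save_all=True, append_images=pages[1:])
        finally:
            for page in pages:
                page.close()
        written[output] = len(pages)
    return written
```

The result can be checked by opening an output file and reading its `n_frames` attribute, and each page can then be passed to OCR by stepping through the frames with `seek()`.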

Note that you could just as easily read the files with OpenCV by replacing:

image = Image.open(filename)

with

image = cv2.imread(filename)

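However, OpenCV cannot write PDFs as simply as PIL can, so I stuck with PIL. Moving between the two is easy as long as you remember that PIL uses RGB channel ordering while OpenCV uses BGR, so you can go from PIL to OpenCV with:

OpenCVImage = np.array(PILImage)[...,::-1]

and back from OpenCV to PIL with: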

PILImage = Image.fromarray(OpenCVImage[...,::-1])
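
The two one-liners above can be wrapped into small helpers to make the round trip explicit. This is a sketch under a few assumptions: the images have three colour channels (hence the `convert("RGB")` call, which also normalises greyscale or palette images), and the helper names `pil_to_bgr` and `bgr_to_pil` are hypothetical. Reversing the last axis yields a negative-stride view, so the sketch copies it into contiguous memory, which code expecting plain arrays tends to handle more reliably.

```python
import numpy as np
from PIL import Image


def pil_to_bgr(pil_image):
    """Return a contiguous BGR uint8 array built from a PIL image."""
    rgb = np.asarray(pil_image.convert("RGB"))
    # Reversing the last axis swaps R and B; the copy removes the negative stride
    return np.ascontiguousarray(rgb[..., ::-1])


def bgr_to_pil(bgr_array):
    """Return a PIL RGB image from a BGR array of shape (height, width, 3)."""
    return Image.fromarray(np.ascontiguousarray(bgr_array[..., ::-1]))


# Round trip on a tiny pure-red image
red = Image.new("RGB", (2, 2), (255, 0, 0))
bgr = pil_to_bgr(red)
print(bgr[0, 0])                          # red now sits in the last channel
print(bgr_to_pil(bgr).getpixel((0, 0)))   # back to the original RGB triple
```

An array returned by `pil_to_bgr` has the same layout as one loaded with `cv2.imread`, so it should be usable wherever such an array is expected.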
