Converting PDFs to text with Python and pytesseract

import pytesseract
from pdf2image import convert_from_path
import glob

pdfs = glob.glob(r"K:\pdf_files")

for pdf_path, dirs, files in pdfs:
    for file in files:
        convert_from_path(os.path.join(pdf_path, file), 500)

        for pageNum,imgBlob in enumerate(pages):
            text = pytesseract.image_to_string(imgBlob,lang='eng')

            with open(f'{pdf_path}.txt', 'a') as the_file:
                the_file.write(text)

2 Answers

User

Answer #1 · edited 2024-05-15 17:29:59

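As mentioned in the comments, what you need is os.walk, not glob.glob. os.walk gives you directory listings recursively: pdf_path is the parent directory currently being listed, dirs is the list of directories/folders in it, and files is the list of files in that folder.

Use os.path.join to build the full path from the parent folder and the file name.

Also, rather than repeatedly appending to the txt file, open it once, outside the page-to-text loop.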
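To see why the loop in the question cannot work, it helps to compare what the two calls return. The sketch below builds a throwaway directory tree (the folder and file names are made up for illustration) and shows that glob.glob returns a flat list of path strings, whereas os.walk yields one (parent, dirs, files) tuple per directory. Unpacking a path string into three names is what fails in the original loop.

```python
import glob
import os
import tempfile

# Build a small throwaway tree; all names here are hypothetical.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "reports"))
for rel in ("a.pdf", os.path.join("reports", "b.pdf")):
    open(os.path.join(root, rel), "w").close()

# Without wildcards, glob.glob returns the matching path itself,
# as a list of plain strings.
globbed = glob.glob(root)

# os.walk yields a (parent_dir, subdirs, files) tuple for every
# directory it visits, starting with the root.
walked = list(os.walk(root))

for parent, dirs, files in walked:
    print(parent, dirs, files)
```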

import os

import pytesseract
from pdf2image import convert_from_path

pdfs_dir = r"K:\pdf_files"

for pdf_path, dirs, files in os.walk(pdfs_dir):
    for file in files:
        if not file.lower().endswith('.pdf'):
            # skip non-pdf's
            continue
        
        file_path = os.path.join(pdf_path, file)
        pages = convert_from_path(file_path, 500)
        
        # change the file extension to .txt; splitext also handles upper-case
        # extensions such as .PDF, so the source PDF is never overwritten
        txt_path = os.path.splitext(file_path)[0] + '.txt'
        with open(txt_path, 'w') as the_file:  # write mode: the file is opened once per PDF
            for pageNum, imgBlob in enumerate(pages):
                text = pytesseract.image_to_string(imgBlob,lang='eng')
                the_file.write(text)
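
The point about write mode can be checked without running OCR at all. In the sketch below, fake_pages and save_text are hypothetical stand-ins (plain strings replace the page images and the pytesseract call). It only illustrates that opening the file once in 'w' mode gives the same output on every run, while 'a' mode duplicates the text when the script is run again.

```python
import os
import tempfile

def save_text(txt_path, page_texts, mode):
    """Write all page texts to txt_path, opening the file once in the given mode."""
    with open(txt_path, mode) as the_file:
        for text in page_texts:
            the_file.write(text)

fake_pages = ["page one\n", "page two\n"]  # stand-in for OCR output
out_dir = tempfile.mkdtemp()
w_path = os.path.join(out_dir, "write.txt")
a_path = os.path.join(out_dir, "append.txt")

# Simulate running the conversion script twice.
for _ in range(2):
    save_text(w_path, fake_pages, "w")
    save_text(a_path, fake_pages, "a")

with open(w_path) as f:
    w_content = f.read()
with open(a_path) as f:
    a_content = f.read()

print(len(w_content.splitlines()))  # line count after two runs in write mode
print(len(a_content.splitlines()))  # line count after two runs in append mode
```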

User

Answer #2 · edited 2024-05-15 17:29:59

I just solved this in a simpler way, by adding * to match all the subdirectories in the directory:

import pytesseract
from pdf2image import convert_from_path
import glob

pdfs = glob.glob(r"K:\pdf_files\*\*.pdf")

for pdf_path in pdfs:
    pages = convert_from_path(pdf_path, 500)

    for pageNum,imgBlob in enumerate(pages):
        text = pytesseract.image_to_string(imgBlob,lang='eng')

        with open(f'{pdf_path}.txt', 'a') as the_file:
            the_file.write(text)
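
Note that the pattern K:\pdf_files\*\*.pdf matches PDFs exactly one folder level below pdf_files. If deeper nesting matters, one option (not part of the original answer) is glob's ** wildcard together with recursive=True, which matches at any depth, including the top folder itself. A minimal sketch on a temporary tree with made-up names:

```python
import glob
import os
import tempfile

# Throwaway tree with PDFs at three different depths (hypothetical names).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "2023", "q1"))
for rel in ("top.pdf",
            os.path.join("2023", "mid.pdf"),
            os.path.join("2023", "q1", "deep.pdf")):
    open(os.path.join(root, rel), "w").close()

# A single * matches exactly one directory level.
one_level = glob.glob(os.path.join(root, "*", "*.pdf"))

# ** with recursive=True matches zero or more directory levels.
any_depth = glob.glob(os.path.join(root, "**", "*.pdf"), recursive=True)

one_level_names = sorted(os.path.basename(p) for p in one_level)
any_depth_names = sorted(os.path.basename(p) for p in any_depth)
print(one_level_names)
print(any_depth_names)
```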
