Keep selected pages from a PDF

Asked 2025-04-14 17:46

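I have a pandas DataFrame called pdf_summary. It is already sorted and has 50 unique rows, each representing a specific file name and page combination. How can I create one folder per file name and put a single PDF in each folder?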

pdf_path = "Documents/menu.pdf"

pdf_summary

            file_name             file_pages_to_keep
  1 - Monday, Wednesday                1,3
  2 - Monday                            1
  3 - Monday, Tuesday, Wednesday       1,2,3
...
  50 - Friday                           5

The result I want is 50 folders, each containing one PDF that holds only the listed pages extracted from menu.pdf.

"Documents/1 - Monday, Wednesday/1 - Monday, Wednesday.pdf" (PDF only has pages 1 and 3 from menu.pdf)
...

1 Answer

First, define a function that writes a new PDF containing only the pages you select:

import os
from PyPDF2 import PdfReader, PdfWriter

def extract_pages(input_pdf, output_pdf, pages):
    with open(input_pdf, "rb") as file:
        reader = PdfReader(file)
        writer = PdfWriter()
        for page_num in pages:
            writer.add_page(reader.pages[page_num - 1])  # pages are 1-based; reader.pages is 0-indexed
        with open(output_pdf, "wb") as output_file:
            writer.write(output_file)
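
The function above subtracts 1 from each page number without checking it. As a minimal sketch (the helper name to_zero_based is hypothetical, not part of PyPDF2), the page numbers could be validated before they are used as indices. This assumes the values in file_pages_to_keep are 1-based and that the page count is available, for example from len(reader.pages). Without such a check, a page number of 0 would become index -1, which Python sequences normally interpret as the last element.

```python
def to_zero_based(pages, total_pages):
    """Convert 1-based page numbers to 0-based indices, rejecting invalid ones."""
    indices = []
    for page_num in pages:
        # Anything outside 1..total_pages cannot refer to a real page.
        if not 1 <= page_num <= total_pages:
            raise ValueError(
                f"page {page_num} is outside the valid range 1..{total_pages}"
            )
        indices.append(page_num - 1)
    return indices


print(to_zero_based([1, 3], total_pages=5))  # prints [0, 2]
```

Inside extract_pages, the loop could then iterate over to_zero_based(pages, len(reader.pages)) and index reader.pages directly.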

Next, iterate over each row of your DataFrame. For each row, take the name of the PDF (from the file_name column) and the pages to write:

for index, row in pdf_summary.iterrows():
    # Create a folder named after file_name next to the source PDF,
    # so the output lands under "Documents/" as the question asks
    folder_name = row['file_name']
    folder_path = os.path.join(os.path.dirname(pdf_path), folder_name)
    os.makedirs(folder_path, exist_ok=True)

    # Extract pages to keep from the PDF
    file_pages_to_keep = [int(page) for page in row['file_pages_to_keep'].split(',')]
    output_pdf_path = os.path.join(folder_path, f"{folder_name}.pdf")

    # Create a new PDF with the specified pages
    extract_pages(pdf_path, output_pdf_path, file_pages_to_keep)
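
The per-row parsing and path building can also be pulled into small helpers. The sketch below uses only the standard library; the names parse_pages and output_pdf_path are hypothetical. It assumes a file_pages_to_keep cell is either a comma-separated string such as "1,3" or a bare integer such as 5 (a single-page cell may be stored as a number, in which case calling .split on it directly would fail). It also assumes the output should sit next to the source PDF, which matches the path shown in the question.

```python
from pathlib import Path


def parse_pages(value):
    """Turn '1,3', ' 1, 2 ,3 ' or a bare number like 5 into a list of ints."""
    # str() lets integer cells go through the same code path as strings.
    return [int(part) for part in str(value).split(",") if part.strip()]


def output_pdf_path(source_pdf, file_name):
    """Build '<source folder>/<file_name>/<file_name>.pdf'."""
    return Path(source_pdf).parent / file_name / f"{file_name}.pdf"


target = output_pdf_path("Documents/menu.pdf", "1 - Monday, Wednesday")
print(parse_pages("1,3"))   # prints [1, 3]
print(target.as_posix())    # prints Documents/1 - Monday, Wednesday/1 - Monday, Wednesday.pdf
```

In the loop, target.parent.mkdir(parents=True, exist_ok=True) would create the folder before extract_pages(pdf_path, target, parse_pages(row['file_pages_to_keep'])) is called.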

Answered 2025-04-14 by Python大师
