How do I use PyMuPDF to extract text from selected pages of a larger PDF?

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-23-c05917f260e7> in <module>()
      6 # print(selection)
      7 # text = doc.get_page_text(3, "text")
----> 8 text = selection.getText(s)
      9 text

AttributeError: 'NoneType' object has no attribute 'getText'

1 Answer

User

#1 · Posted 2024-04-26 07:48:13

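According to the documentation, select modifies doc in place and does not return anything. In Python, a function that does not explicitly return a value returns None. That is why you see this error: selection is None, so calling selection.getText(s) fails.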

However, Document provides a method called get_page_text that lets you get the text of a specific page (0-indexed). So, for your example, you could write:

import fitz
s = [1, 2] # pages 2 and 3
doc = fitz.open('linear_regression.pdf')
text_by_page = [doc.get_page_text(i) for i in s]

Now you have a list in which each item is the text of one of the requested pages. A simple way to turn it into a single string is:

text = ' '.join(text_by_page)

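This joins the two pages with a space between the last word of the first page and the first word of the last page (as if there were no page break at all).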
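If the page boundary should stay visible in the result, a different separator can be passed to join. The sketch below uses placeholder strings in place of real get_page_text output, so it runs without a PDF; the variable names joined_with_space and joined_with_break are only illustrative.

```python
# Placeholder strings standing in for the output of doc.get_page_text(i).
text_by_page = ["First page text.", "Second page text."]

# A single space merges the pages as if there were no page break.
joined_with_space = " ".join(text_by_page)

# A blank line keeps the page boundary visible in the combined string.
joined_with_break = "\n\n".join(text_by_page)

print(joined_with_space)
print(joined_with_break)
```

Which separator fits best depends on what happens to the text afterwards, for example a blank line when the result is read by a person, or a space when it is fed to a tokenizer.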
