How to extract data from doc/docx files with Python

Asked 2025-04-18 00:38

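I know there are similar questions out there, but I could not find one that solves my problem. I need a way to extract certain data from an MS-Word file and save it to an XML file. I looked into python-docx, but it seems to only support writing to Word documents, not reading them. To state my task (or how I plan to approach it) more precisely: I want to search the document for a keyword or phrase (the document contains tables) and then extract the text data from the table in which that keyword/phrase is found. Does anyone have any ideas?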

Tags: xml, data extraction, file format conversion, document processing, text analysis, table data, keyword search, ms-word

6 Answers

Here is a simpler library, which can also extract images.

pip install docx2txt

Then use the code below to read the docx file.

import docx2txt
text = docx2txt.process("file.docx")

Answered 2025-04-18 by Python大师


Extracting text from doc/docx files with Python

import os
import docx2txt
from win32com import client as wc  # pywin32; needs Windows with MS Word installed

def extract_text_from_docx(path):
    # docx2txt returns the whole document as a single string
    temp = docx2txt.process(path)
    # Drop empty lines and replace tabs with spaces
    text = [line.replace('\t', ' ') for line in temp.split('\n') if line]
    final_text = ' '.join(text)
    return final_text

def extract_text_from_doc(doc_path):
    # Convert the .doc to .docx through Word, then reuse the docx extractor
    joined_path = os.path.join(root_path, save_file_name)
    w = wc.Dispatch('Word.Application')
    doc = w.Documents.Open(doc_path)
    doc.SaveAs(joined_path, 16)  # 16 = wdFormatDocumentDefault (.docx)
    doc.Close()
    w.Quit()
    text = extract_text_from_docx(joined_path)
    return text

def extract_text(file_path, extension):
    text = ''
    if extension == '.docx':
        text = extract_text_from_docx(file_path)
    elif extension == '.doc':
        text = extract_text_from_doc(file_path)
    return text

file_path = '<path to doc/docx file>'
root_path = '<path where the doc was downloaded>'
save_file_name = "Final2_text_docx.docx"
extension = os.path.splitext(file_path)[1].lower()  # '.doc' or '.docx'
final_text = extract_text(file_path, extension)
print(final_text)

Answered 2025-04-18 by Python大师


To search within a document using python-docx:

# Import the module
from docx import *

# Open the .docx file
document = opendocx('A document.docx')

# Search returns true if found    
search(document,'your search string')

You can also use a function to get the document's text:

https://github.com/mikemaccana/python-docx/blob/master/docx.py#L910

# Import the module
from docx import *

# Open the .docx file
document = opendocx('A document.docx')
fullText=getdocumenttext(document)

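This uses https://github.com/mikemaccana/python-docx (the legacy API; opendocx, search, and getdocumenttext are not available in current python-docx releases).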
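
For comparison, current python-docx releases can read documents too, through a different API (Document, tables, rows, cells). The following is only a minimal sketch, assuming a current python-docx is installed; the function names tables_containing and search_docx_tables are made up for illustration. Note that Document(...).tables lists only top-level tables, not tables nested inside cells.

```python
def tables_containing(tables, keyword):
    """Return the cell texts (row by row) of every table that mentions keyword."""
    hits = []
    for table in tables:
        # Each table becomes a list of rows; each row is a list of cell strings
        rows = [[cell.text for cell in row.cells] for row in table.rows]
        if any(keyword in text for row in rows for text in row):
            hits.append(rows)
    return hits


def search_docx_tables(path, keyword):
    """Open a .docx with current python-docx and search its top-level tables."""
    # Imported here so tables_containing can be used without python-docx installed
    from docx import Document
    return tables_containing(Document(path).tables, keyword)
```

Something like search_docx_tables('A document.docx', 'your search string') would then return the matching tables as nested lists.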

Answered 2025-04-18 by Python大师


A docx file is really a zip archive containing the document's XML files. You can open the archive, read the document contents, and parse the data with ElementTree.

The advantage of this approach is that you do not need to install any extra Python libraries.

import zipfile
import xml.etree.ElementTree

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
TABLE = WORD_NAMESPACE + 'tbl'
ROW = WORD_NAMESPACE + 'tr'
CELL = WORD_NAMESPACE + 'tc'

with zipfile.ZipFile('<path to docx file>') as docx:
    tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))

for table in tree.iter(TABLE):
    for row in table.iter(ROW):
        for cell in row.iter(CELL):
            # print() works on Python 3; "or ''" guards against empty text nodes
            print(''.join(node.text or '' for node in cell.iter(TEXT)))

For more details and references, see my StackOverflow answer to "How to read the contents of a table in an MS-Word file using Python?"
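
To connect this to the original question (find the table that contains a keyword, then save its text as XML), the script above could be extended roughly as follows. This is only a sketch using the standard library: the function names and the tables/table/row/cell layout of the output XML are assumptions made for illustration, and tables nested inside cells are not handled separately.

```python
import zipfile
import xml.etree.ElementTree as ET

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def table_rows(table):
    """Return a table as a list of rows, each row a list of cell strings."""
    return [[''.join(t.text or '' for t in cell.iter(W + 't'))
             for cell in row.iter(W + 'tc')]
            for row in table.iter(W + 'tr')]


def find_tables(docx_path, keyword):
    """Return the rows of every table in which some cell contains keyword."""
    with zipfile.ZipFile(docx_path) as docx:
        tree = ET.fromstring(docx.read('word/document.xml'))
    matches = []
    for table in tree.iter(W + 'tbl'):
        rows = table_rows(table)
        if any(keyword in cell for row in rows for cell in row):
            matches.append(rows)
    return matches


def tables_to_xml(tables):
    """Build a <tables> element (hypothetical layout) from the extracted rows."""
    root = ET.Element('tables')
    for rows in tables:
        table_el = ET.SubElement(root, 'table')
        for row in rows:
            row_el = ET.SubElement(table_el, 'row')
            for text in row:
                ET.SubElement(row_el, 'cell').text = text
    return root


# Example usage (paths and keyword are placeholders):
# found = find_tables('<path to docx file>', 'your search string')
# ET.ElementTree(tables_to_xml(found)).write('out.xml', encoding='utf-8',
#                                            xml_declaration=True)
```

Joining all text nodes of a cell before searching matters because Word often splits a phrase across several runs, so a keyword may not appear whole in any single text node.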

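In response to a comment below: extracting images is not quite as straightforward. I created an empty docx file and inserted one image. I then opened the docx file as a zip archive (using 7zip) and looked at document.xml. All of the image information is stored as attributes in the XML, not as CDATA the way the text is. So you need to find the tag you are interested in and pull out the information you want.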

For example, you could add the following to the script above:

IMAGE = '{http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing}' + 'docPr'

for image in tree.iter(IMAGE):
    print(image.attrib)

The output is:

{'id': '1', 'name': 'Picture 1'}

I am no expert on the openxml format, but I hope this helps.

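I also noticed that the archive contains a folder named media, holding a file named image1.jpeg, which is a renamed copy of the image I embedded. You can look around inside the docx archive to see what else is in there.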
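
If the goal is simply to get the embedded image files out, copying the contents of that media folder may be enough. A minimal sketch, assuming the folder sits at word/media/ inside the archive (worth confirming by listing your own file); extract_media is a made-up helper name:

```python
import os
import zipfile


def extract_media(docx_path, out_dir):
    """Copy every file under word/media/ in the docx archive into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as docx:
        for name in docx.namelist():
            # Skip directory entries; keep only files inside word/media/
            if name.startswith('word/media/') and not name.endswith('/'):
                target = os.path.join(out_dir, os.path.basename(name))
                with open(target, 'wb') as f:
                    f.write(docx.read(name))
                saved.append(target)
    return saved
```

The function returns the paths it wrote, so the caller can see which images were found.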

Answered 2025-04-18 by Python大师


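It looks like pywin32 does the trick. You can iterate over all the tables in a document and over all the cells in each table. Getting at the data is slightly awkward, because the last two characters of each entry have to be stripped off, but otherwise the code takes about ten minutes to write. If anyone needs more detail, please say so in the comments.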
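
The answer gives no code, so here is a rough sketch of what that loop might look like. It assumes Windows with MS Word and pywin32 installed, and it assumes the two trailing characters are Word's end-of-cell marker ('\r\x07'); both function names are made up for illustration.

```python
def clean_cell_text(raw):
    """Strip the two-character end-of-cell marker from a cell's Range.Text."""
    # Assumption: the marker is '\r\x07'; text without it is returned unchanged
    return raw[:-2] if raw.endswith('\r\x07') else raw


def read_tables_with_word(doc_path):
    """Return {table_number: [(row, column, text), ...]} using Word via COM."""
    # Imported here so clean_cell_text can be used without pywin32 installed
    from win32com import client
    word = client.Dispatch('Word.Application')
    try:
        doc = word.Documents.Open(doc_path)  # Word expects an absolute path
        tables = {}
        for number, table in enumerate(doc.Tables, start=1):
            tables[number] = [
                (cell.RowIndex, cell.ColumnIndex, clean_cell_text(cell.Range.Text))
                for cell in table.Range.Cells
            ]
        doc.Close(False)  # close without saving changes
    finally:
        word.Quit()
    return tables
```

Iterating over table.Range.Cells rather than row by row is a deliberate choice in this sketch, since row access can fail on tables with vertically merged cells.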

Answered 2025-04-18 by Python大师

