How to extract formatted data from a docx file using Python - Q&A - Python中文网


Posted 2024-04-23 09:53:46



Example: I have a docx file whose content is very similar to the following:

Introduction
A. This is text
This is second text
1.1 more complex st
Yes it is
I. Now Roman

I want to store the output in a JSON data structure. For the content above, it should be:

Output

{'A': 'This is text', '1': 'This is second text', '1.1': 'more complex st', '2': 'Yes it is', 'I': 'Now Roman'}

My current code is:

from docx import Document

# Open the document and print the text of every paragraph
document = Document('myDoc.docx')

for para in document.paragraphs:
    print(para.text)

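The problem with this code is that the paragraph text does not include the paragraph number; it contains only the paragraph content. For example, for "A. This is text", para.text contains only "This is text", but I want "A. This is text".

Thanks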


2 answers

User

#1 · Edited 2024-04-23 09:53:46

First, use the add-in (https://github.com/thepankajsingh/extract-doc-add-ins) to convert the Doc/Word file to HTML. You can then parse the HTML to get the key-value pairs.
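
The parsing step might look roughly like the sketch below. It assumes the converted HTML represents the outline as nested <ol>/<ul> lists with <li> items, which may not match what the add-in actually produces; the class name ListItemCollector and the helper list_items are hypothetical. It uses only the standard library and returns the nesting depth and text of each list item, from which labels could then be derived.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect [depth, text] for every <li> inside nested <ol>/<ul> lists."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # current list nesting depth
        self.items = []     # one [depth, text] entry per <li>
        self._open = []     # stack of indexes into items for open <li> tags

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul"):
            self.depth += 1
        elif tag == "li":
            self.items.append([self.depth, ""])
            self._open.append(len(self.items) - 1)

    def handle_endtag(self, tag):
        if tag in ("ol", "ul"):
            self.depth -= 1
        elif tag == "li" and self._open:
            self._open.pop()

    def handle_data(self, data):
        # Attach text to the innermost open <li>, ignoring pure whitespace
        if self._open and data.strip():
            item = self.items[self._open[-1]]
            item[1] = (item[1] + " " + data.strip()).strip()


def list_items(html):
    """Return a list of (depth, text) tuples for all list items in html."""
    parser = ListItemCollector()
    parser.feed(html)
    return [(depth, text) for depth, text in parser.items]


# Hypothetical sample of converted HTML (an assumption for illustration)
sample = (
    "<ol><li>This is text"
    "<ol><li>This is second text</li></ol>"
    "</li><li>Now Roman</li></ol>"
)
for depth, text in list_items(sample):
    print(depth, text)
```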

User

#2 · Edited 2024-04-23 09:53:46

Use the python-docx module.

Read the data like this:

from docx import Document

# Open the document and print the text of every paragraph
document = Document('test.docx')

for para in document.paragraphs:
    print(para.text)
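
Note that, as the question points out, para.text does not include automatically generated list numbers: Word stores them as paragraph properties (a w:numPr element holding a list id and a nesting level) rather than as text. The following is a minimal sketch, using only the standard library, that reads those properties straight from word/document.xml inside the .docx archive. It assumes the numbering is set directly on each paragraph rather than inherited from a style; the function name paragraph_numbering is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraph_numbering(docx_file):
    """Return (num_id, level, text) for each paragraph.

    num_id and level are None for paragraphs without direct numbering.
    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    rows = []
    for p in root.iter(f"{{{W}}}p"):
        # Join all text runs of the paragraph
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        num_id = level = None
        num_pr = p.find("w:pPr/w:numPr", NS)
        if num_pr is not None:
            num_id_el = num_pr.find("w:numId", NS)
            ilvl_el = num_pr.find("w:ilvl", NS)
            if num_id_el is not None:
                num_id = num_id_el.get(f"{{{W}}}val")
            # A missing w:ilvl is treated as the top level (assumption)
            level = int(ilvl_el.get(f"{{{W}}}val", "0")) if ilvl_el is not None else 0
        rows.append((num_id, level, text))
    return rows
```

Calling paragraph_numbering('test.docx') gives one tuple per paragraph. The visible label format (A, 1, I and so on) is defined in word/numbering.xml and still has to be resolved from the list id and level.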

Once you have the data, you can build your dictionary.
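
As an illustration of that last step, here is a rough sketch that assumes each line already carries its label as literal text (for instance because it was typed by hand, or recovered in an earlier step). The function name build_outline, the regular expression, and the sample lines (including the "1." and "2." labels) are assumptions for illustration, not part of the original answer.

```python
import json
import re

# A label is either a dotted number ("1", "1.1", trailing dot optional)
# or an uppercase letter / Roman numeral followed by a dot ("A.", "IV.").
LABEL_RE = re.compile(
    r"^\s*(?:(\d+(?:\.\d+)*)\.?|([A-Z]|[IVXLCDM]+)\.)\s+(.+)$"
)


def build_outline(lines):
    """Map each label to its text; lines without a label are skipped."""
    outline = {}
    for line in lines:
        match = LABEL_RE.match(line)
        if match:
            label = match.group(1) or match.group(2)
            outline[label] = match.group(3).strip()
    return outline


# Hypothetical input: paragraph texts with their labels included
lines = [
    "Introduction",
    "A. This is text",
    "1. This is second text",
    "1.1 more complex st",
    "2. Yes it is",
    "I. Now Roman",
]
outline = build_outline(lines)
print(json.dumps(outline))
```

Because a later label overwrites an earlier one with the same key, a document that restarts its numbering would need a richer key (for example, one that includes the parent label).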
