How do I extract the text of DOCX hyperlinks in Python?

Published 2024-06-09 05:54:46


Building on this solution:

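from docx import Document
from docx.opc.constants import RELATIONSHIP_TYPE as RT

document = Document('test.docx')
rels = document.part.rels

def iter_hyperlink_rels(rels):
    # Yield the target URL of every hyperlink relationship in the document part
    for rel in rels:
        if rels[rel].reltype == RT.HYPERLINK:
            yield rels[rel]._target

# iter_hyperlink_rels is a generator, so materialize it before printing
print(list(iter_hyperlink_rels(rels)))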

I need to get both the URL and the text of each hyperlink (for example, mydomain.com for the URL and "Go to My Domain" for the text).


1 answer
User
#1 · Posted 2024-06-09 05:54:46

To answer my own question: I had to go through HTML to do it:

from bs4 import BeautifulSoup

# Read the HTML file exported from the Word document
with open('my_word_file.htm', 'r') as file:
    page = file.read()
soup = BeautifulSoup(page, 'lxml')

# Collect the visible text and the href of every <a> element
text_and_url = []
for link in soup.find_all('a'):
    text_and_url.append({'text': link.string, 'url': link.get('href')})

For converting the docx file to HTML:

how to convert .docx file to html using python?
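
As a possible alternative to the HTML round trip, the text and URL can also be read straight from the .docx package, since a .docx file is a zip archive of XML parts. The sketch below uses only the standard library and rests on some assumptions: the function name extract_hyperlinks is made up for illustration, only the main document body (word/document.xml) is inspected, and only w:hyperlink elements that carry an r:id are handled. Links stored as HYPERLINK field codes, or links in headers and footers, would not be found this way.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by WordprocessingML and the package relationships part
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)


def extract_hyperlinks(docx_file):
    """Return [{'text': ..., 'url': ...}] for hyperlinks in the main body.

    Hypothetical helper: docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        rels_root = ET.fromstring(archive.read("word/_rels/document.xml.rels"))
        doc_root = ET.fromstring(archive.read("word/document.xml"))

    # Map relationship id -> target URL, keeping hyperlink relationships only
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels_root.iter(f"{{{PKG_REL_NS}}}Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    }

    links = []
    for hyperlink in doc_root.iter(f"{{{W_NS}}}hyperlink"):
        r_id = hyperlink.get(f"{{{R_NS}}}id")
        if r_id not in targets:
            continue  # e.g. internal bookmark links, which have no r:id
        # The visible text may be split across several runs; join them all
        text = "".join(t.text or "" for t in hyperlink.iter(f"{{{W_NS}}}t"))
        links.append({"text": text, "url": targets[r_id]})
    return links
```

Calling extract_hyperlinks('test.docx') would then give a list of dictionaries in the same shape as text_and_url above, with no intermediate HTML file.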
