Find and prepend a path to every HTML link - Python

Asked 2025-04-16 13:09

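I have an HTML file downloaded from Wikipedia. I want to find every link on the page of the form /wiki/Absinthe and replace it with the same link prefixed by the current directory, for example /home/fergus/wikiget/wiki/Absinthe. So: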

<a href="/wiki/Absinthe">Absinthe</a>

becomes:

<a href="/home/fergus/wikiget/wiki/Absinthe">Absinthe</a>

This should happen throughout the whole document.

Any good ideas? I'm happy to use BeautifulSoup or regular expressions!

5 Answers

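You can pass a function to re.sub to do this:

import re

def match(m):
    # m.group(1) is the original /wiki/... path; prepend the local directory
    return '<a href="/home/fergus/wikiget' + m.group(1) + '">'

# Only match hrefs that start with /wiki/, leaving other links alone
r = re.compile(r'<a\shref="(/wiki/[^"]+)">')
r.sub(match, yourtext)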

Here is an example:

>>> s = '<a href="/wiki/Absinthe">Absinthe</a>'
>>> r.sub(match, s)
'<a href="/home/fergus/wikiget/wiki/Absinthe">Absinthe</a>'
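The pattern in this answer expects href to be the only attribute of the <a> tag. Links on a saved Wikipedia page typically carry other attributes as well (title, class), so a slightly looser sketch might anchor on the href attribute itself. This is only an illustration under that assumption: the names PREFIX, WIKI_HREF and prefix_wiki_links are hypothetical and not part of the original answer.

```python
import re

# Hypothetical prefix taken from the question; adjust to your own directory.
PREFIX = "/home/fergus/wikiget"

# Match href="/wiki/..." wherever it appears inside a tag, capturing the path.
WIKI_HREF = re.compile(r'href="(/wiki/[^"]*)"')


def prefix_wiki_links(html, prefix=PREFIX):
    """Return html with prefix prepended to every href that starts with /wiki/."""
    # A function is used as the replacement so that backslashes in the
    # prefix are never interpreted as regex escape sequences.
    return WIKI_HREF.sub(lambda m: 'href="' + prefix + m.group(1) + '"', html)


sample = (
    '<a href="/wiki/Absinthe" title="Absinthe">Absinthe</a> '
    '<a href="http://example.com/">external</a>'
)
print(prefix_wiki_links(sample))
```

Because a rewritten href no longer starts with /wiki/, running the function a second time on its own output should leave the text unchanged.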

Answered 2025-04-16 by Python大师


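If that is really all you need to do, you can edit the file in place with sed and its -i option (GNU sed syntax shown; the trailing g rewrites every link on a line, not just the first):

sed -i -e 's,href="/wiki/,href="/home/fergus/wikiget/wiki/,g' wiki-file.html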

However, here is a Python solution using the excellent lxml library, in case you need to do something more complex or have to cope with badly formed HTML:

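from lxml import etree

parser = etree.HTMLParser()

# Open in binary mode and let lxml handle the decoding
with open("wiki-file.html", "rb") as fp:
    tree = etree.parse(fp, parser)

# Prepend the local directory to every link that points under /wiki/
for e in tree.xpath("//a[@href]"):
    link = e.get("href")
    if link.startswith("/wiki/"):
        e.set("href", "/home/fergus/wikiget" + link)

# Or you can just specify the same filename to overwrite it.
# etree.tostring returns bytes by default, so write in binary mode.
with open("wiki-file-rewritten.html", "wb") as fp:
    fp.write(etree.tostring(tree, method="html"))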

Note that lxml is probably a better fit than BeautifulSoup for this kind of task these days, for the reasons given by BeautifulSoup's author.

Answered 2025-04-16 by Python大师


Here is a solution using the re module:

#!/usr/bin/env python
import re

# Prepend the local directory to every href that starts with /wiki/
with open('file.html') as src, open('output.html', 'w') as dst:
    dst.write(re.sub(r'href="/wiki/', 'href="/home/fergus/wikiget/wiki/', src.read()))

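And here is a solution that does not use re:

#!/usr/bin/env python
# Plain string replacement is enough because the text being matched is fixed
with open('file.html') as src, open('output.html', 'w') as dst:
    dst.write(src.read().replace('href="/wiki/', 'href="/home/fergus/wikiget/wiki/'))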
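The question asks for the current directory to be prepended, while the snippets here hard-code the path. A minimal sketch of how the prefix could be taken from the working directory instead, assuming POSIX-style paths and UTF-8 files; the function names prefix_wiki_links and rewrite_file are hypothetical and only for illustration:

```python
from pathlib import Path


def prefix_wiki_links(html, base_dir):
    """Prepend base_dir to every href that starts with /wiki/."""
    # str() lets callers pass a string or a Path; drop any trailing slash
    # so the result does not contain a doubled separator.
    base = str(base_dir).rstrip("/")
    return html.replace('href="/wiki/', 'href="' + base + '/wiki/')


def rewrite_file(src, dst, base_dir=None):
    """Read src, rewrite its /wiki/ links, and write the result to dst."""
    if base_dir is None:
        base_dir = Path.cwd()  # the directory the script is run from
    # UTF-8 is an assumption here; use the encoding your saved page has.
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(prefix_wiki_links(text, base_dir), encoding="utf-8")


print(prefix_wiki_links('<a href="/wiki/Absinthe">Absinthe</a>', "/home/fergus/wikiget"))
```

Calling rewrite_file('file.html', 'output.html') from /home/fergus/wikiget would then use that directory as the prefix without it appearing in the script.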

Answered 2025-04-16 by Python大师
