Replacing URLs with anchor tags using a Python regular expression

Published 2024-06-11 12:08:52


I have an HTML string:

I was surfing http://www.google.com, where I found my tweet, 
check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
<span>http://www.google.com</span>

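I want to convert it so that every bare URL is wrapped in an anchor tag, while the URL that is already inside an <a> tag stays unchanged.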

I tried this in a demo.

My Python code is:

import re

# Match an existing anchor element, or else a bare URL (group 1).
p = re.compile(r'<a\b[^>]*>.*?</a>|((ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?)', re.MULTILINE)
test_str = "I was surfing http://www.google.com, where I found my tweet, check it out <a href=\"http://tinyurl.com/blah\">http://tinyurl.com/blah</a>"

for item in re.finditer(p, test_str):
    print(item.group(0))

Output:

http://www.google.com,
<a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>

3 answers

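OK, I think I finally found what you are looking for. The basic idea is to try to match an optional <a href followed by a URL. If the <a href is present, the URL is already linked, so leave the match alone; if it is absent, wrap the URL in a link. Here is the code: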

import re
test_str = """I was surfing http://www.google.com, where I found my tweet, 
check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
<span>http://www.google.com</span>
"""
def repl_func(matchObj):
    href_tag, url = matchObj.groups()
    if href_tag:
        # Since it has an href tag, this isn't what we want to change,
        # so return the whole match.
        return matchObj.group(0)
    else:
        return '<a href="%s">%s</a>' % (url, url)

pattern = re.compile(
    r'((?:<a href[^>]+>)|(?:<a href="))?'
    r'((?:https?):(?:(?://)|(?:\\\\))+'
    r"(?:[\w\d:#@%/;$()~_?\+\-=\\\.&](?:#!)?)*)",
    flags=re.IGNORECASE)
result = re.sub(pattern, repl_func, test_str)
print(result)

Output:

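I was surfing <a href="http://www.google.com">http://www.google.com</a>, where I found my tweet, 
check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
<span><a href="http://www.google.com">http://www.google.com</a></span>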

The main idea comes from https://stackoverflow.com/a/3580700/5100564. I also borrowed from https://stackoverflow.com/a/6718696/5100564.
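One way to sanity-check the idea above is to wrap the substitution in a function and apply it twice: because already-linked URLs are returned unchanged, the second pass should not alter the text. This is a minimal sketch that reuses the pattern from the answer; the helper name linkify and the sample string are assumptions made for illustration.

```python
import re

# Same pattern as in the answer above: an optional "<a href...>" prefix
# (group 1) followed by the URL itself (group 2).
PATTERN = re.compile(
    r'((?:<a href[^>]+>)|(?:<a href="))?'
    r'((?:https?):(?:(?://)|(?:\\\\))+'
    r"(?:[\w\d:#@%/;$()~_?\+\-=\\\.&](?:#!)?)*)",
    flags=re.IGNORECASE)


def linkify(text):  # hypothetical helper name, not part of the original answer
    """Wrap bare URLs in anchor tags and leave already-linked URLs alone."""
    def repl(match):
        href_tag, url = match.groups()
        if href_tag:
            # The URL directly follows an <a href...> tag: keep the match as-is.
            return match.group(0)
        return '<a href="%s">%s</a>' % (url, url)
    return PATTERN.sub(repl, text)


# Illustrative sample input (an assumption, shorter than the question's string).
sample = ('see http://www.google.com and '
          '<a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>')
once = linkify(sample)
twice = linkify(once)
print(once)
print(twice == once)
```

If the second pass changed anything, the pattern would be wrapping links that are already wrapped, which is exactly what the optional first group is meant to prevent.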

You could make the regex more complex, but as mikus suggested, it seems easier to do the following:

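# p and test_str are the pattern and string from the question.
for item in re.finditer(p, test_str):
    result = item.group(0)
    # Skip matches that are already complete anchor elements.
    if "<a " not in result.lower():
        print(result)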
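The loop above only prints the bare URLs. A possible way to turn the same filter into an actual replacement is to pass a callback to re.sub: the pattern matches either a whole existing anchor or a bare URL, and the callback rewrites only the latter. This is a sketch under assumptions: the URL sub-pattern is deliberately simpler than the question's regex, and the name wrap_bare_urls is made up for illustration.

```python
import re

# Alternation: an entire existing anchor element, or a bare URL (group 1).
# The URL part is a simplified assumption, not the regex from the question.
LINK_OR_URL = re.compile(
    r'<a\b[^>]*>.*?</a>|((?:https?|ftp)://[^\s<>"\']+)',
    flags=re.IGNORECASE | re.DOTALL)


def wrap_bare_urls(text):  # hypothetical helper name
    def repl(match):
        url = match.group(1)
        if url is None:
            # The first alternative matched an existing <a>...</a>: keep it.
            return match.group(0)
        # Keep trailing punctuation (such as the comma after the first URL)
        # outside the generated link.
        stripped = url.rstrip('.,;:!?')
        tail = url[len(stripped):]
        return '<a href="%s">%s</a>%s' % (stripped, stripped, tail)
    return LINK_OR_URL.sub(repl, text)


test_str = ('I was surfing http://www.google.com, where I found my tweet, '
            'check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>\n'
            '<span>http://www.google.com</span>')
linked = wrap_bare_urls(test_str)
print(linked)
```

Matching the whole anchor first means the URL inside it is consumed before the bare-URL alternative can see it. For arbitrary HTML, an HTML parser is likely to be more robust than any regex.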

I hope this helps.

Code:

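import re

# A URL that is not directly preceded by <, " or >, i.e. not already inside a tag.
p = re.compile(r'''[^<">]((ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?)[^< ,"'>]''', re.MULTILINE)
test_str = "I was surfing http://www.google.com, where I found my tweet, check it out <a href=\"http://tinyurl.com/blah\">http://tinyurl.com/blah</a>"

end_result = test_str
seen = set()
for item in re.finditer(p, test_str):
    # group(0) includes the character before the URL; drop it if it is a space.
    result = item.group(0).replace(' ', '')
    print(result)
    if result not in seen:
        # Wrap each distinct URL once. Note that str.replace also rewrites the
        # same URL wherever else it occurs, including inside an existing anchor.
        seen.add(result)
        end_result = end_result.replace(result, '<a href="' + result + '">' + result + '</a>')

print(end_result)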

Output:

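http://www.google.com
I was surfing <a href="http://www.google.com">http://www.google.com</a>, where I found my tweet, check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>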
