检测字符串中的URL并用"<a href..."标签包裹

10 投票
2 回答
4682 浏览
提问于 2025-04-15 12:37

我想写一个看起来应该很简单的东西,但不知道为什么我就是搞不懂。

我想写一个Python函数,当给它一个字符串时,它会把这个字符串中的网址用HTML编码包裹起来。

unencoded_string = "This is a link - http://google.com"

def encode_string_with_links(unencoded_string):
    # some sort of regex magic occurs
    return encoded_string

print encoded_string

'This is a link - <a href="http://google.com">http://google.com</a>'

谢谢!

2 个回答

11

谷歌上找到的解决方案:

#---------- find_urls.py----------#
# Functions to identify and extract URLs and email addresses

import re

def fix_urls(text):
    pat_url = re.compile(  r'''
                     (?x)( # verbose identify URLs within text
         (http|ftp|gopher) # make sure we find a resource type
                       :// # ...needs to be followed by colon-slash-slash
            (\w+[:.]?){2,} # at least two domain groups, e.g. (gnosis.)(cx)
                      (/?| # could be just the domain name (maybe w/ slash)
                [^ \n\r"]+ # or stuff then space, newline, tab, quote
                    [\w/]) # resource name ends in alphanumeric or slash
         (?=[\s\.,>)'"\]]) # assert: followed by white or clause ending
                         ) # end of match group
                           ''')
    pat_email = re.compile(r'''
                    (?xm)  # verbose identify URLs in text (and multiline)
                 (?=^.{11} # Mail header matcher
         (?<!Message-ID:|  # rule out Message-ID's as best possible
             In-Reply-To)) # ...and also In-Reply-To
                    (.*?)( # must grab to email to allow prior lookbehind
        ([A-Za-z0-9-]+\.)? # maybe an initial part: DAVID.mertz@gnosis.cx
             [A-Za-z0-9-]+ # definitely some local user: MERTZ@gnosis.cx
                         @ # ...needs an at sign in the middle
              (\w+\.?){2,} # at least two domain groups, e.g. (gnosis.)(cx)
         (?=[\s\.,>)'"\]]) # assert: followed by white or clause ending
                         ) # end of match group
                           ''')

    for url in re.findall(pat_url, text):
       text = text.replace(url[0], '<a href="%(url)s">%(url)s</a>' % {"url" : url[0]})

    for email in re.findall(pat_email, text):
       text = text.replace(email[1], '<a href="mailto:%(email)s">%(email)s</a>' % {"email" : email[1]})

    return text

if __name__ == '__main__':
    print fix_urls("test http://google.com asdasdasd some more text")

编辑:已根据你的需求进行了调整

11

你需要的“正则表达式魔法”其实就是 sub(它用于替换):

def encode_string_with_links(unencoded_string):
  return URL_REGEX.sub(r'<a href="\1">\1</a>', unencoded_string)

URL_REGEX 可以是这样的:

URL_REGEX = re.compile(r'''((?:mailto:|ftp://|http://)[^ <>'"{}|\\^`[\]]*)''')

这个正则表达式对网址的要求比较宽松:它允许使用 mailto、http 和 ftp 等协议,之后基本上会一直匹配,直到遇到一个“不安全”的字符(除了百分号,因为你需要它来表示转义字符)。如果你需要的话,可以让它更严格一些。例如,你可以要求百分号后面必须跟一个有效的十六进制转义,或者只允许一个井号(用于片段),或者强制查询参数和片段的顺序。不过,这些基本的规则应该足够你入门了。

撰写回答