Python: getting a link from a string

{'Sender': 'Geometry Dash', 'Subject': 'Please activate your account.', 'body': b'<style type="text/css">\n#google_translate_element{\n float: right;\n padding:0 0 10px 10px;\n}\n/* twitter do\xc4\x9frulama linki fix */\n.bulletproof-btn-1 a {\n font-size: 20px!important;\n color: #fff!important;\n padding: 20px!important;\n line-height: 33px!important;\n text-decoration: none!important;\n}\n</style>\n<div id="google_translate_element"></div><script type="text/javascript">\nfunction googleTranslateElementInit() {\n new google.translate.TranslateElement({pageLanguage: \'en\', layout: google.translate.TranslateElement.InlineLayout.SIMPLE, autoDisplay: false, multilanguagePage: true}, \'google_translate_element\');\n}\n</script><script type="text/javascript" src="//translate.google.com/translate_a/element.js?cb=googleTranslateElementInit"></script>\n\r\n\r\n<html>\r\n<head>\r\n\t<title></title>\r\n</head>\r\n<body>\r\n<p>Thank you for registering a Geometry Dash account</p>\r\n\r\n<p>Your account information:<br />\r\nUsername:  SUKAFUTCUCK</p>\r\n\r\n<p>Please click the link below to activate your account:<br />\r\n<a href="http://www.boomlings.com/database/accounts/activate.php?uid=8722046&actcode=xlCReGjLdkWmINt1GY9e" target="_blank">Click\r\nHere</a></p>\r\n\r\n<p>Please contact support@robtopgames.com if you have any questions or\r\nneed assistance.</p>\r\n\r\n<p>If you did not send an account request using this email, then you\r\ncan safely disregard this message and nothing will happen.</p>\r\n\r\n<p>Regards,<br />\r\nRobTop Games</p>\r\n</body>\r\n</html>\r\n\r\n\r\n'}

3 Answers

User

#1 · Edited 2024-04-20 01:14:30

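An email can be in HTML or plain-text format. If it is HTML, use a library such as bs4 or pyquery.

If it is plain text, search for URLs with a regex. RFC 3986 (Appendix B) gives the following regex for splitting a URI into its components:

regex = ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

Reference: http://www.ietf.org/rfc/rfc3986.txt

Searching the string with the re module: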

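import re
# The RFC 3986 regex is anchored with ^ and parses a single URI, so it will
# not find URLs embedded in a longer text. For searching, use an unanchored
# pattern instead (a simple approximation, not a full URI grammar).
regex = r"https?://[^\s\"'<>]+"
urls = re.findall(regex, text)  # text is the email body as a str
print(urls)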

Using the pyquery module:

from pyquery import PyQuery as pq
q = pq(text)
a_list = q("a")
# .items() yields PyQuery objects, so .attr is available on each element
urls = [a.attr['href'] for a in a_list.items()]
print(urls)

Edit:

Instead of a generic URL pattern, we can use a pattern for the specific URL, for example https?:\/\/www\.boomlings\.com\/database\/accounts\/activate\.php\?uid=.*&actcode=.*

https://ideone.com/NFj90L
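
For what it's worth, here is a minimal sketch of the site-specific pattern applied to a body like the one in the question. The function name find_activation_url and the sample string are made up for illustration. The trailing .* is swapped for character classes that stop at the closing quote, since a greedy .* would run on to the end of the line.

```python
import re

# Site-specific pattern; named groups pull out the two query parameters.
# [^&"\s]+ and [^"\s&]+ stop at the next delimiter instead of running on like .*
ACTIVATION_RE = re.compile(
    r'https?://www\.boomlings\.com/database/accounts/activate\.php'
    r'\?uid=(?P<uid>[^&"\s]+)&(?:amp;)?actcode=(?P<actcode>[^"\s&]+)'
)


def find_activation_url(text):
    """Return (url, uid, actcode) for the first activation link, or None."""
    match = ACTIVATION_RE.search(text)
    if match is None:
        return None
    return match.group(0), match.group("uid"), match.group("actcode")


# Hypothetical sample standing in for the decoded email body.
sample = (
    '<p>Please click the link below to activate your account:<br />\n'
    '<a href="http://www.boomlings.com/database/accounts/activate.php'
    '?uid=8722046&actcode=xlCReGjLdkWmINt1GY9e" target="_blank">Click\nHere</a></p>'
)

print(find_activation_url(sample))
```

If the body is still bytes, as in the question, decode it first with body.decode("utf-8") before passing it in.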

User

#2 · Edited 2024-04-20 01:14:30

Assuming the dict from your description is now in a variable named d (it is a bit long to type out here):

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(d['body'], 'lxml')
>>> link = soup.find('a', target='_blank')
>>> link['href']
'http://www.boomlings.com/database/accounts/activate.php?uid=8722046&actcode=xlCReGjLdkWmINt1GY9e'

BeautifulSoup docs
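
If installing bs4 is not an option, roughly the same idea can be sketched with the standard library's html.parser. This is a swapped-in alternative, not what the answer above uses; LinkCollector and extract_links are hypothetical names, and the sample is a shortened stand-in for d['body'].

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(body):
    # The body in the question is bytes, so decode it before parsing.
    if isinstance(body, bytes):
        body = body.decode("utf-8", errors="replace")
    parser = LinkCollector()
    parser.feed(body)
    parser.close()
    return parser.links


# Shortened stand-in for d['body'].
sample = (
    b'<p>Please click the link below to activate your account:<br />\r\n'
    b'<a href="http://www.boomlings.com/database/accounts/activate.php'
    b'?uid=8722046&actcode=xlCReGjLdkWmINt1GY9e" target="_blank">Click\r\nHere</a></p>'
)

print(extract_links(sample))
```

Unlike soup.find('a', target='_blank'), this collects every link in the body, so you would still pick the one you want from the returned list.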

User

#3 · Edited 2024-04-20 01:14:30

You can use a regex for this:

import re
# Capture only the URL inside the href attribute
c = re.search(r'<a href="(.*?)"', yourDict["body"].decode("utf-8"))
print(c.group(1))

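However, it would be better to use a package like parsel, because HTML is better extracted with XPath than with regex (check this).

Edit

I used a regex because it is the shortest and quickest way and needs no extra package, but if your responses vary a lot, I recommend parsel. Example: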

from parsel import Selector
sel = Selector(text=yourDict["body"].decode("utf-8"))
url = sel.xpath('//a[@target="_blank"]/@href').extract_first()
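
To illustrate why a regex tied to the exact markup is fragile when the response changes, here is a small sketch. The example.com URL and both snippets are made up; the "tolerant" pattern is only one possible loosening, not a replacement for a real HTML parser.

```python
import re

# Strict: assumes href is the first attribute and uses double quotes.
STRICT_RE = re.compile(r'<a href="(.*?)"')
# Tolerant variant: allows other attributes before href and either quote style.
TOLERANT_RE = re.compile(
    r'''<a\b[^>]*?\bhref\s*=\s*["']([^"']*)["']''', re.IGNORECASE
)

# Hypothetical markup: same link, attributes in a different order.
original = '<a href="http://example.com/activate" target="_blank">Click Here</a>'
reordered = '<a target="_blank" href="http://example.com/activate">Click Here</a>'

for markup in (original, reordered):
    strict = STRICT_RE.search(markup)
    tolerant = TOLERANT_RE.search(markup)
    print(
        strict.group(1) if strict else None,
        tolerant.group(1) if tolerant else None,
    )
```

The strict pattern stops matching as soon as the attribute order changes, which is the kind of variation an XPath query like the one above is not affected by.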
