Finding a specific URL in a string

Published 2024-05-15 18:04:54


I have a string, for example:

[{'type': 'text/html', 'value': '<table> <tr><td> <a href="https://www.reddit.com/r/wallpapers/comments/6dhhhj/waving_bear/"> <img src="https://b.thumbs.redditmedia.com/v5CaHQ_S-m4L5MUfX2a6ViwZWe2yvft_VyG8Iol0CJs.jpg" alt="Waving bear" title="Waving bear" /> </a> </td><td> &#32; submitted by &#32; <a href="https://www.reddit.com/user/mexicanwave"> /u/mexicanwave </a> <br/> <span><a href="http://i.imgur.com/PMgfJSm.jpg">[link]</a></span> &#32; <span><a href="https://www.reddit.com/r/wallpapers/comments/6dhhhj/waving_bear/">[comments]</a></span> </td></tr></table>', 'base': 'https://www.reddit.com/r/wallpapers.rss', 'language': None}]

I want to extract the URL that contains imgur.com from this string.

What is the simplest way to do that?


3 Answers

Using an XML/HTML parser is the right way to handle XML/HTML documents and fragments:

from lxml import etree
from io import StringIO

data = [{'type': 'text/html', 'value': '<table> <tr><td> <a href="https://www.reddit.com/r/wallpapers/comments/6dhhhj/waving_bear/"> <img src="https://b.thumbs.redditmedia.com/v5CaHQ_S-m4L5MUfX2a6ViwZWe2yvft_VyG8Iol0CJs.jpg" alt="Waving bear" title="Waving bear" /> </a> </td><td> &#32; submitted by &#32; <a href="https://www.reddit.com/user/mexicanwave"> /u/mexicanwave </a> <br/> <span><a href="http://i.imgur.com/PMgfJSm.jpg">[link]</a></span> &#32; <span><a href="https://www.reddit.com/r/wallpapers/comments/6dhhhj/waving_bear/">[comments]</a></span> </td></tr></table>', 'base': 'https://www.reddit.com/r/wallpapers.rss', 'language': None}]

parser = etree.HTMLParser()  # creating parser instance
html_data = etree.parse(StringIO(data[0]['value']), parser)  # parser is fed with html data
# .get() avoids a KeyError for any <a> tag that has no href attribute
url = [a.get('href') for a in html_data.findall(".//a") if 'imgur.com' in a.get('href', '')]

print(url)

Output:

['http://i.imgur.com/PMgfJSm.jpg']

https://docs.python.org/3.6/library/xml.etree.elementtree.html
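If installing lxml is not an option, the same idea (parse the HTML, then filter the anchor hrefs) can be sketched with the standard library's html.parser module. This is only a minimal sketch and not part of the answer above: the names HrefCollector and find_links are made up for illustration, and the sample string is a shortened version of the markup in the question.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_links(html, needle):
    """Return every anchor href in `html` that contains the substring `needle`."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return [href for href in collector.hrefs if needle in href]


# Shortened sample based on the markup in the question
sample = (
    '<span><a href="http://i.imgur.com/PMgfJSm.jpg">[link]</a></span> '
    '<a href="https://www.reddit.com/user/mexicanwave">/u/mexicanwave</a>'
)
print(find_links(sample, "imgur.com"))  # ['http://i.imgur.com/PMgfJSm.jpg']
```

With the list from the question, the call would be find_links(data[0]['value'], 'imgur.com').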

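I suggest using BeautifulSoup, since you already have the HTML as a string. See the snippet below. Once you have all the anchor tags, you can search their hrefs for the substring "imgur.com" to get the specific link.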

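from bs4 import BeautifulSoup

html = your_list[0]['value']  # the HTML string is stored under the 'value' key
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
anchors = soup.find_all('a', href=True)  # every anchor tag that has an href
result = [a['href'] for a in anchors if 'imgur.com' in a['href']]
print(result)
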
myList = [{'type': 'text/html', 'value': '<table> <tr><td> <a href="https://www.reddit.com/r/wallpapers/comments/6dhhhj/waving_bear/"> <img src="https://b.thumbs.redditmedia.com/v5CaHQ_S-m4L5MUfX2a6ViwZWe2yvft_VyG8Iol0CJs.jpg" alt="Waving bear" title="Waving bear" /> </a> </td><td> &#32; submitted by &#32; <a href="https://www.reddit.com/user/mexicanwave"> /u/mexicanwave </a> <br/> <span><a href="http://i.imgur.com/PMgfJSm.jpg">[link]</a></span> &#32; <span><a href="https://www.reddit.com/r/wallpapers/comments/6dhhhj/waving_bear/">[comments]</a></span> </td></tr></table>', 'base': 'https://www.reddit.com/r/wallpapers.rss', 'language': None}]

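# Split on whitespace and print every token that mentions imgur.com
for msg in myList[0]['value'].split():
    if 'imgur.com' in msg:
        print(msg)

# Output (the token still carries the surrounding markup):
# href="http://i.imgur.com/PMgfJSm.jpg">[link]</a></span>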
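The whitespace split above prints a token that still includes href=" and the trailing markup. If a bare URL is needed without a parser, one possible sketch is to capture the quoted href value with a regular expression and compare the hostname with urllib.parse. The function name urls_on_host and the second link in the sample are assumptions made for illustration; a regex over HTML is fragile (it only handles double-quoted attributes here), so a real parser remains the safer choice.

```python
import re
from urllib.parse import urlparse

# Matches the value of any double-quoted href attribute
HREF_RE = re.compile(r'href="([^"]+)"')


def urls_on_host(html, host):
    """Return hrefs whose hostname is `host` or a subdomain of it (hypothetical helper)."""
    matches = []
    for url in HREF_RE.findall(html):
        hostname = urlparse(url).hostname or ""
        if hostname == host or hostname.endswith("." + host):
            matches.append(url)
    return matches


# The second link is invented to show that a plain substring test would also match it
sample = (
    '<span><a href="http://i.imgur.com/PMgfJSm.jpg">[link]</a></span> '
    '<a href="https://www.reddit.com/r/wallpapers/?ref=imgur.com">[comments]</a>'
)
print(urls_on_host(sample, "imgur.com"))  # ['http://i.imgur.com/PMgfJSm.jpg']
```

Checking the hostname rather than the whole string keeps out links that merely mention imgur.com in a path or query string.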
