How can I extract a link from an embedded link with Python?

<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>

3 Answers

User

Answer 1 · Edited 2024-04-26 04:03:15

import re

string = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'

# Capture everything between "href=https%3A%2F%2F" and "&width"
m = re.search(r'href=https%3A%2F%2F(.*)&width', string)
str2 = m.group(1)
# str.replace returns a new string, so keep the result
str2 = str2.replace('%2F', '/')
print(str2)

Output:

www.facebook.com/DoctorTaniya/posts/1906676949620646
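
The regex above drops the scheme and converts only %2F. As a minimal sketch of a variant that keeps the full URL and decodes every percent-escape, the standard library's urllib.parse.unquote can be applied to the captured value. The function name extract_href is hypothetical, and the sketch assumes the target URL sits in an href= query parameter, as in the question.

```python
import re
from urllib.parse import unquote


def extract_href(html):
    """Return the decoded href= value from the embed snippet, or None."""
    # Capture everything after "href=" up to the next "&" or closing quote.
    match = re.search(r'href=([^&"]+)', html)
    if match is None:
        return None
    # unquote turns %3A into ":" and %2F into "/", among other escapes.
    return unquote(match.group(1))


snippet = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500"></iframe>'
print(extract_href(snippet))
# prints https://www.facebook.com/DoctorTaniya/posts/1906676949620646
```

Returning None when nothing matches avoids the AttributeError that m.group(1) raises if the snippet has no href= parameter.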

User

Answer 2 · Edited 2024-04-26 04:03:15

Here is some useful information about using regex to find URLs in Python.

If every URL you need starts right after .php?href=, you could write a loop that stops when it finds ?href= and splits the string at that point.
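
A minimal sketch of the split idea, assuming the src URL has already been pulled out of the iframe tag. It uses str.partition rather than an explicit loop, which is a substitution on my part; the variable names are illustrative.

```python
from urllib.parse import unquote

src = "https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500"

# partition splits once at the first "?href="; found is "" when the marker is absent.
_, found, rest = src.partition("?href=")
# Keep only the part before the next "&", which starts the following parameter.
encoded = rest.partition("&")[0] if found else ""
# Decode percent-escapes such as %3A and %2F.
link = unquote(encoded)
print(link)
# prints https://www.facebook.com/DoctorTaniya/posts/1906676949620646
```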

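Alternatively, you could use $_GET[] and print the value (note that $_GET is a PHP superglobal, so it does not apply directly in Python). Here is another post you may want to read.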
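
A rough Python counterpart to reading a query parameter by name is urllib.parse.parse_qs. The sketch below assumes the src URL is already available as a string; it is one possible approach, not something the answer itself proposed.

```python
from urllib.parse import urlparse, parse_qs

src = "https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500"

# urlparse isolates the query string; parse_qs maps each parameter name
# to a list of values and percent-decodes them along the way.
query = parse_qs(urlparse(src).query)
link = query["href"][0]
width = query["width"][0]
print(link)
# prints https://www.facebook.com/DoctorTaniya/posts/1906676949620646
```

Because parse_qs decodes the values, no separate unquote call is needed here.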

User

Answer 3 · Edited 2024-04-26 04:03:15

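I think it would be better to use Beautiful Soup instead.

The text to parse is an iframe tag with a src attribute. You are trying to retrieve the URL that sits after href= and before &width inside that src attribute.

After that, you need to decode the URL back into plain text.

First, feed the text to Beautiful Soup and pull out the attribute: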

text = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'
soup = BeautifulSoup(text, "html.parser")

src_attribute = soup.find("iframe")["src"]

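Then you can either use a regex here or use .split() (rather hacky):

# Regex
link = re.findall(r'href=(.*?)&', src_attribute)[0]
# .split()
link = src_attribute.split("href=")[1].split("&")[0]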

Finally, you need to decode the URL with unquote (urllib.parse.unquote in Python 3; urllib2.unquote in Python 2):

link = urllib.parse.unquote(link)

You're done!

The resulting code is:

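import re
import urllib.parse  # Python 3 home of unquote (urllib2.unquote in Python 2)

from bs4 import BeautifulSoup

text = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = BeautifulSoup(text, "html.parser")

# Value of the iframe's src attribute (the plugin URL with its query string)
src_attribute = soup.find("iframe")["src"]

# Option 1: regex, capturing what lies between "href=" and the next "&"
link = re.findall(r'href=(.*?)&', src_attribute)[0]
# Option 2: .split(), which yields the same value
link = src_attribute.split("href=")[1].split("&")[0]

# Decode percent-escapes (%3A -> ":", %2F -> "/")
link = urllib.parse.unquote(link)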
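
If installing Beautiful Soup is not an option, the same steps can be sketched with the standard library's html.parser. This swaps in a different parser from the one used in this answer. The class name IframeHrefCollector and the second example.com iframe are hypothetical, added only to show what happens when an iframe has no href= parameter.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class IframeHrefCollector(HTMLParser):
    """Collect the decoded href= parameter of every iframe src."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        src = dict(attrs).get("src") or ""
        # parse_qs percent-decodes values; a missing href yields an empty list.
        self.links.extend(parse_qs(urlparse(src).query).get("href", []))


collector = IframeHrefCollector()
collector.feed(
    '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500"></iframe>'
    '<iframe src="https://example.com/embed"></iframe>'
)
print(collector.links)
# prints ['https://www.facebook.com/DoctorTaniya/posts/1906676949620646']
```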
