Python BeautifulSoup: extracting the HTML document inside an iframe

15 votes

1 answer

31897 views

Asked 2025-04-18 02:28

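I'm trying to learn Beautiful Soup and want to extract some HTML data from a few iframes, but so far I haven't had much success.

Parsing the iframe element itself with BS4 is no problem, but whatever I try, I can't get at the content embedded inside it.

For example, take the following iframe (this is what I see in Chrome Developer Tools):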

<iframe frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"
src="http://www.engineeringmaterials.com/boron/728x90.html" width="728" height="90">
#document <html>....</html></iframe>

Here, <html>...</html> is the content I want to extract.

However, when I use the following BS4 code:

iFrames = []  # quick bs4 example
for iframe in soup("iframe"):
    iFrames.append(iframe.extract())

I get:

<iframe frameborder="0" marginwidth="0" marginheight="0" scrolling="NO" src="http://www.engineeringmaterials.com/boron/728x90.html" width="728" height="90">

In other words, I get the iframe element, but not the <html>...</html> document inside it.

I also tried something like this:

iFrames = []  # quick bs4 example
iframexx = soup.find_all('iframe')
for iframe in iframexx:
    print(iframe.find_all('html'))

..but that doesn't seem to work either..

So my question is: how can I reliably extract these <html>...</html> document objects from iframe elements?

data extraction web scraping beautiful soup html parsing iframe bs4 document object model chrome developer tools

1 Answer

The browser loads the content of an iframe with a separate request. You need to do the same:

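from urllib.request import urlopen  # on Python 2: from urllib2 import urlopen
from bs4 import BeautifulSoup

for iframe in iframexx:
    response = urlopen(iframe.attrs['src'])
    iframe_soup = BeautifulSoup(response, 'html.parser')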

Remember: BeautifulSoup is not a browser; it won't fetch images, CSS, or JavaScript resources for you either.

Answered 2025-04-18 by Python大师
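
If it helps, here is a slightly fuller sketch of the same idea for Python 3. It assumes the iframe `src` may be relative or carry stray whitespace, so it strips the value and resolves it against the page URL with `urljoin`. The function names `iframe_urls` and `fetch_iframe_soups`, and the optional `fetch` parameter, are made up for illustration and are not part of BeautifulSoup.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def iframe_urls(page_html, page_url):
    """Return the absolute URL of every <iframe> with a non-empty src."""
    soup = BeautifulSoup(page_html, "html.parser")
    urls = []
    for iframe in soup.find_all("iframe"):
        # .get() avoids a KeyError for iframes that have no src attribute.
        src = (iframe.get("src") or "").strip()
        if src:
            # Resolve relative paths against the URL of the outer page.
            urls.append(urljoin(page_url, src))
    return urls


def fetch_iframe_soups(page_html, page_url, fetch=None):
    """Download each iframe document and parse it into its own soup.

    `fetch` is a hypothetical hook (url -> bytes) so the download step
    can be swapped out, for example in tests; it defaults to urlopen.
    """
    if fetch is None:
        def fetch(url):
            with urlopen(url) as response:
                return response.read()
    return [BeautifulSoup(fetch(url), "html.parser")
            for url in iframe_urls(page_html, page_url)]
```

You would call it as `fetch_iframe_soups(page_html, page_url)`, where `page_url` is the address the outer page was downloaded from. Each returned soup is a separate document, so you search it with `find_all` just like the outer page. If the iframe is filled in by JavaScript rather than by its `src`, this approach will probably not see the content.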
