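How to automatically fetch frame content when the browser does not support frames and the frames cannot be accessed directly

4 votes

1 answer

5620 views

数据工程师

asked 2025-04-18 13:57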

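I want to automatically download some PDF files from links like this one, in order to build a library of UN resolutions.

When I open the link with Beautiful Soup or mechanize, I get the message "Your browser does not support frames". I run into the same problem if I use the "Copy as cURL" feature in Chrome's developer tools.

The usual advice for "Your browser does not support frames" is to look at the source of each frame and load that frame directly. But when I do that, I get an error saying the page is not authorized.

How should I proceed? I suppose I could try tools like zombie or phantom, but I am not familiar with them, so I would prefer another approach.

web scraping, beautiful soup, mechanize, automatic download, PDF files, frame support, zombie, phantom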

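1 Answer

This is an interesting task. It can be done with requests and BeautifulSoup.

Several of the underlying requests to un.org and daccess-ods.un.org matter because they set the relevant cookies. That is why you need to keep a single requests.Session() and visit several URLs before requesting the PDF.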
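To illustrate why the session matters, here is a minimal sketch that runs without any network access. The cookie name, its value, and the `/some/document` path are made up for illustration. On the real site the cookie would be set by an earlier response rather than by hand.

```python
import requests

session = requests.Session()

# Stand-in for a cookie that an earlier response would normally set
# (hypothetical name and value, for illustration only).
session.cookies.set('example_token', 'abc123',
                    domain='daccess-ods.un.org', path='/')

# Prepare (but do not send) a later request to the same host.
request = requests.Request('GET', 'http://daccess-ods.un.org/some/document')
prepared = session.prepare_request(request)

# The session attaches its stored cookies to the outgoing request.
cookie_header = prepared.headers.get('Cookie')
print(cookie_header)
```

A plain `requests.get()` call starts with an empty cookie jar each time, which is presumably why loading a frame on its own ends in the authorization error.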

Here is the complete code:

import re
from urllib.parse import urljoin  # Python 3 location of urljoin

from bs4 import BeautifulSoup
import requests


BASE_URL = 'http://www.un.org/en/ga/search/'
URL = "http://www.un.org/en/ga/search/view_doc.asp?symbol=A/RES/68/278"
BASE_ACCESS_URL = 'http://daccess-ods.un.org'

# start a session so that cookies persist across requests
session = requests.Session()
response = session.get(URL, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})

# get frame links (the page is expected to contain exactly two frames)
soup = BeautifulSoup(response.text, 'html.parser')
frames = soup.find_all('frame')
header_link, document_link = [urljoin(BASE_URL, frame.get('src')) for frame in frames]

# get header
session.get(header_link, headers={'Referer': URL})

# get document html url from the meta refresh tag
response = session.get(document_link, headers={'Referer': URL})
soup = BeautifulSoup(response.text, 'html.parser')

content = soup.find('meta', content=re.compile('URL='))['content']
document_html_link = re.search('URL=(.*)', content).group(1)
document_html_link = urljoin(BASE_ACCESS_URL, document_html_link)

# follow html link and get the pdf link
response = session.get(document_html_link)
soup = BeautifulSoup(response.text, 'html.parser')

# get the real document link
content = soup.find('meta', content=re.compile('URL='))['content']
document_link = re.search('URL=(.*)', content).group(1)
document_link = urljoin(BASE_ACCESS_URL, document_link)
print(document_link)

# follow the frame link with login and password first; this sets the important cookie
auth_link = soup.find('frame', {'name': 'footer'})['src']
session.get(auth_link)

# download the file in 1024-byte chunks
with open('document.pdf', 'wb') as handle:
    response = session.get(document_link, stream=True)

    for block in response.iter_content(1024):
        if not block:
            break

        handle.write(block)

You should probably extract some of these blocks into functions, which would make the code more readable and easier to reuse.
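As one possible starting point, the meta refresh parsing appears twice in the script and could become a helper. This is only a sketch: the name `refresh_target` and the `/TMP/1234.html` path are made up for illustration. Unlike the original script, it also tolerates a lowercase `url=` and surrounding quotes.

```python
import re
from urllib.parse import urljoin

# Matches the "URL=..." part of a meta refresh content value,
# e.g. "0; URL=/some/path". Case-insensitive, which is an addition
# compared with the script's plain 'URL=' pattern.
REFRESH_URL = re.compile(r'URL\s*=\s*(.+)', re.IGNORECASE)


def refresh_target(content, base_url):
    """Return the absolute URL named in a meta refresh value, or None."""
    match = REFRESH_URL.search(content or '')
    if match is None:
        return None
    # Drop whitespace and any quotes wrapped around the target.
    target = match.group(1).strip().strip('\'"')
    return urljoin(base_url, target)


# Hypothetical content value, for illustration only.
example = refresh_target('0; URL=/TMP/1234.html', 'http://daccess-ods.un.org')
print(example)
```

In the script, both `re.search('URL=(.*)', content)` steps could then become `refresh_target(content, BASE_ACCESS_URL)`, with a check for `None` in case the page has no redirect.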

By the way, all of this could be done more easily with a real browser, using selenium or Ghost.py.

Hope that helps.

answered 2025-04-18 by Python大师
