How to handle redirects in a URL opener

数据工程师

Asked 2025-04-18 13:04

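Sorry for the beginner question. Is there an efficient URL opener class in Python that handles redirects? I am currently using a plain urllib.urlopen(), but it does not work well here. For example: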

http://thetechshowdown.com/Redirect4.php

For this URL, the class I am using does not follow the redirect to:

http://www.bhphotovideo.com/

Instead, it only shows:

"You are being automatically redirected to B&H.

Page stuck? Click here."

Thanks in advance.


3 Answers

A meta refresh redirect URL in HTML might look like any of the following.

A relative URL:

<meta http-equiv="refresh" content="0; url=legal_notices_en.htm#disclaimer">

Quotes nested inside the quotes:

<meta http-equiv="refresh" content="0; url='legal_notices_en.htm#disclaimer'">

Uppercase URL= inside the tag's content attribute:

<meta http-equiv="refresh" content="0; URL=legal_notices_en.htm#disclaimer">
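
The three variants above can all be reduced to the same absolute URL with a small helper. This is only a sketch: the function name refresh_target and the base URL http://example.com/docs/index.htm are made up for illustration, and it assumes the content attribute has already been extracted from the page.

```python
from urllib.parse import urljoin


def refresh_target(content, base_url):
    """Return the absolute URL named in a meta refresh content value, or None."""
    # Search a lowercased copy so "URL=" and "url=" both match,
    # but slice the original string so the URL itself keeps its case.
    pos = content.lower().find("url=")
    if pos == -1:
        return None  # a bare refresh such as "5" names no target
    # Strip surrounding whitespace and quotes in any combination.
    target = content[pos + len("url="):].strip(" \t\"'")
    return urljoin(base_url, target)


base = "http://example.com/docs/index.htm"  # hypothetical page address
for content in (
    "0; url=legal_notices_en.htm#disclaimer",
    "0; url='legal_notices_en.htm#disclaimer'",
    "0; URL=legal_notices_en.htm#disclaimer",
):
    print(refresh_target(content, base))
```

All three calls print http://example.com/docs/legal_notices_en.htm#disclaimer, because urljoin resolves the relative name against the directory of the base page.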

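Summary:

Use lxml.html to parse the HTML,
use lower() and split() to extract the URL part,
strip any surrounding quotes and spaces,
resolve the result to an absolute URL,
cache the results in a local file with shelve (useful if you have many URLs to test).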

Usage:

print(get_redirections('https://www.google.com'))

The returned result might look like this:

{'final': 'https://www.google.be/?gfe_rd=fd&ei=FDDASaSADFASd', 'history': [<Response [302]>]}

Code:

import shelve
from urllib.parse import urljoin

import lxml.html  # cssselect() below also needs the separate "cssselect" package
import requests


def get_redirections(initial_url, url_id=None):
    """Follow HTTP and meta refresh redirects, caching results with shelve."""
    if not url_id:
        url_id = initial_url
    # The context manager closes the shelf even if an unexpected error occurs.
    with shelve.open('tested_urls.log') as documents_checked:
        if url_id in documents_checked:
            print('cached')
            return documents_checked[url_id]

        print('not cached')
        history = []
        try:
            current_url = initial_url
            while True:
                # requests follows real 3xx redirects and records them in r.history
                r = requests.get(current_url)
                final = r.url
                history += r.history
                status = {'final': final, 'history': history}

                # Parse the raw bytes so lxml can detect the encoding itself.
                doc = lxml.html.fromstring(r.content)
                refresh = doc.cssselect('meta[http-equiv="refresh"]')
                if not refresh:
                    break
                refresh_content = refresh[0].attrib['content']

                # Find "url=" in a lowercased copy, but slice the original
                # string so a case-sensitive URL is not altered.
                pos = refresh_content.lower().find('url=')
                if pos == -1:
                    break  # plain refresh without a target URL
                target = refresh_content[pos + 4:].split(';')[0]
                # Drop surrounding quotes and whitespace in any combination.
                target = target.strip(' \t"\'')

                current_url = urljoin(final, target)
                history.append(current_url)

        except requests.exceptions.RequestException as e:
            status = {'final': str(e), 'history': [], 'error': e}

        documents_checked[url_id] = status
        return status


url = 'http://google.com'
print(get_redirections(url))

Answered 2025-04-18 by Python大师


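The problem is caused by soft redirects. urllib does not follow them because it does not recognize them as redirects: the server returns HTTP status 200 (OK), and the redirect then happens in the browser as a side effect of the page content.

The first page returns HTTP status 200, but contains the following: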

<meta http-equiv="refresh" content="1; url=http://fave.co/1idiTuz">

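This tells the browser to follow the link. The second resource returns HTTP status 301 or 302 (a real redirect) pointing to another resource, where a second soft redirect takes place, this time implemented in Javascript: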

<script type="text/javascript">
    setTimeout(function () {window.location.replace(\'http://bhphotovideo.com\');}, 2.75 * 1000);
</script>
<noscript>
    <meta http-equiv="refresh" content="2.75;http://bhphotovideo.com" />
</noscript>
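
The difference between a real redirect and a soft one can be reproduced locally without touching the site in question. The sketch below is an illustration only: the paths /hard, /soft and /final and the Handler class are invented for the demo, and it uses just the Python 3 standard library. It starts a throwaway server on 127.0.0.1 and shows that urllib follows the 302 but stays on the page that carries a meta refresh.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

SOFT_PAGE = (b'<html><head><meta http-equiv="refresh" '
             b'content="0; url=/final"></head></html>')


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/hard":
            # A real redirect: status 302 plus a Location header.
            self.send_response(302)
            self.send_header("Location", "/final")
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path == "/soft":
            # A soft redirect: status 200, the target is hidden in the HTML.
            self._send_page(SOFT_PAGE)
        else:
            self._send_page(b"final page")

    def _send_page(self, body):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(server, path):
    """Return (final path, status) after urllib has followed what it can."""
    base = "http://127.0.0.1:%d" % server.server_port
    opener = build_opener(ProxyHandler({}))  # ignore proxy settings locally
    with opener.open(base + path) as response:
        return response.geturl()[len(base):], response.status


server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
hard = fetch(server, "/hard")
soft = fetch(server, "/soft")
server.shutdown()
server.server_close()
print(hard)  # ('/final', 200): the 302 was followed
print(soft)  # ('/soft', 200): the meta refresh was not
```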

Unfortunately, you have to extract the URL to follow by hand. It is not hard, though. Here is the code:

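from contextlib import closing
from urllib.request import urlopen  # Python 3 location of urlopen

from lxml.html import parse


def follow(url):
    """Follow both true and soft redirects."""
    while True:
        with closing(urlopen(url)) as stream:
            # urlopen has already followed any real 301/302 redirects;
            # now look for a meta refresh tag in the page it landed on.
            refresh = parse(stream).xpath("//meta[@http-equiv = 'refresh']/@content")
            if refresh:
                # content looks like "1; url=http://..." or "2.75;http://..."
                url = refresh[0].split(";")[1].strip().replace("url=", "")
            else:
                return stream.geturl()


print(follow("http://thetechshowdown.com/Redirect4.php"))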

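Error handling is left to you :) Also note that this can loop forever if the target page contains a <meta> refresh tag as well. That is not the case here, but you can add a check to guard against it: stopping after n redirects, or checking whether a page redirects to itself, are both good options.

You may also need to install the lxml library.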
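
Both safeguards mentioned above (a limit of n redirects and a check for pages that redirect back to one already visited) can be sketched independently of how pages are fetched. The names follow_safely, next_url and fake_site are hypothetical; next_url stands for any function that returns the soft redirect target of a URL, or None when there is none.

```python
def follow_safely(url, next_url, max_hops=10):
    """Follow soft redirects until none is left.

    Raises RuntimeError on a redirect loop or after max_hops redirects.
    next_url is a callable returning the redirect target of a URL, or None.
    """
    seen = {url}
    for _ in range(max_hops):
        target = next_url(url)
        if target is None:
            return url  # no further redirect: this is the final page
        if target in seen:
            raise RuntimeError("redirect loop at %s" % target)
        seen.add(target)
        url = target
    raise RuntimeError("gave up after %d redirects" % max_hops)


# Stand-in for real page fetching, so the example runs without a network.
fake_site = {"/a": "/b", "/b": "/c", "/c": None, "/loop": "/loop"}
print(follow_safely("/a", fake_site.get))  # prints /c
```

In real use, next_url would fetch the page and parse its meta refresh tag; keeping it as a parameter makes the loop protection easy to test on its own.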

Answered 2025-04-18 by Python大师


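Use the requests module; it follows redirects automatically by default.

However, some pages redirect through JavaScript, and no HTTP module can follow that kind of redirect.

You can turn off JavaScript in your browser and then visit http://thetechshowdown.com/Redirect4.php to see whether it still redirects you to another page.

I checked this page: it has both a JavaScript redirect and an HTML redirect (a <meta> tag with the "refresh" parameter). Neither is a normal redirect sent by the server, so no module will follow them. You need to look at the page, find the URL in its code, and connect to that URL directly.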

import requests
import lxml.html  # cssselect() below also needs the separate "cssselect" package

# starting page

r = requests.get('http://thetechshowdown.com/Redirect4.php')

#print(r.url)
#print(r.history)
#print(r.text)

# first redirection

html = lxml.html.fromstring(r.text)

refresh = html.cssselect('meta[http-equiv="refresh"]')

if refresh:
    print('refresh:', refresh[0].attrib['content'])
    # the URL is everything from the first "http" in the content attribute
    x = refresh[0].attrib['content'].find('http')
    url = refresh[0].attrib['content'][x:]
    print('url:', url)

    # only fetch when a target was found, so url is always defined
    r = requests.get(url)

#print(r.text)

# second redirection

html = lxml.html.fromstring(r.text)

refresh = html.cssselect('meta[http-equiv="refresh"]')

if refresh:
    print('refresh:', refresh[0].attrib['content'])
    x = refresh[0].attrib['content'].find('http')
    url = refresh[0].attrib['content'][x:]
    print('url:', url)

    r = requests.get(url)

# final page

print(r.text)
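
The script above only reads the meta refresh tags. For the JavaScript redirect, one option is a regular expression that looks for assignments to window.location in the page text. This is a heuristic sketch, not a JavaScript interpreter: the names JS_REDIRECT and js_redirect_target are made up here, and scripts that build the URL dynamically will not be matched.

```python
import re

# Matches location.replace('...'), location.assign("..."), location = '...'
# and location.href = '...'. An optional backslash before the quote is
# tolerated because some pages emit escaped quotes.
JS_REDIRECT = re.compile(
    r"""location(?:\.href)?\s*(?:=|\.(?:replace|assign)\s*\()\s*\\?["']([^"'\\]+)"""
)


def js_redirect_target(page_text):
    """Return the first URL a script assigns to location, or None."""
    match = JS_REDIRECT.search(page_text)
    return match.group(1) if match else None


# The script block quoted in an earlier answer, used here as test input.
snippet = r"""
<script type="text/javascript">
    setTimeout(function () {window.location.replace(\'http://bhphotovideo.com\');}, 2.75 * 1000);
</script>
"""
print(js_redirect_target(snippet))  # prints http://bhphotovideo.com
```

The returned URL can then be passed to requests.get() in the same way as the meta refresh targets.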

Answered 2025-04-18 by Python大师
