Multi-request PycURL runs forever (infinite loop)

Posted 2024-05-15 17:02:04


I want to perform multiple requests with PycURL. The code is:

    m.add_handle(handle)
    requests.append((handle, response))

    # Perform multi-request.
    SELECT_TIMEOUT = 1.0
    num_handles = len(requests)
    while num_handles:
        ret = m.select(SELECT_TIMEOUT)
        if ret == -1: continue
        while 1:
            ret, num_handles = m.perform()
            print "In while loop of multicurl"
            if ret != pycurl.E_CALL_MULTI_PERFORM: break

The problem is that this loop runs for a very long time and never terminates. Can anyone tell me what it is doing and what the likely problem is?


2 answers

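Have you looked at the official PycURL example code? The script below implements the multi interface. I tried running it and was able to fetch 10,000 URLs in parallel in about 300 seconds. What exactly are you trying to achieve? Please correct me if I am wrong.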

#! /usr/bin/env python
# -*- coding: iso-8859-1 -*-
# vi:ts=4:et
# $Id: retriever-multi.py,v 1.29 2005/07/28 11:04:13 mfx Exp $

#
# Usage: python retriever-multi.py <file with URLs to fetch> [<# of
#          concurrent connections>]
#

import sys
import pycurl

# We should ignore SIGPIPE when using pycurl.NOSIGNAL - see
# the libcurl tutorial for more info.
try:
    import signal
    from signal import SIGPIPE, SIG_IGN
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)
except ImportError:
    pass


# Get args
num_conn = 10
try:
    if sys.argv[1] == "-":
        urls = sys.stdin.readlines()
    else:
        urls = open(sys.argv[1]).readlines()
    if len(sys.argv) >= 3:
        num_conn = int(sys.argv[2])
except:
    print "Usage: %s <file with URLs to fetch> [<# of concurrent connections>]" % sys.argv[0]
    raise SystemExit


# Make a queue with (url, filename) tuples
queue = []
for url in urls:
    url = url.strip()
    if not url or url[0] == "#":
        continue
    filename = "doc_%03d.dat" % (len(queue) + 1)
    queue.append((url, filename))


# Check args
assert queue, "no URLs given"
num_urls = len(queue)
num_conn = min(num_conn, num_urls)
assert 1 <= num_conn <= 10000, "invalid number of concurrent connections"
print "PycURL %s (compiled against 0x%x)" % (pycurl.version, pycurl.COMPILE_LIBCURL_VERSION_NUM)
print "  - Getting", num_urls, "URLs using", num_conn, "connections   -"


# Pre-allocate a list of curl objects
m = pycurl.CurlMulti()
m.handles = []
for i in range(num_conn):
    c = pycurl.Curl()
    c.fp = None
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.MAXREDIRS, 5)
    c.setopt(pycurl.CONNECTTIMEOUT, 30)
    c.setopt(pycurl.TIMEOUT, 300)
    c.setopt(pycurl.NOSIGNAL, 1)
    m.handles.append(c)


# Main loop
freelist = m.handles[:]
num_processed = 0
while num_processed < num_urls:
    # If there is an url to process and a free curl object, add to multi stack
    while queue and freelist:
        url, filename = queue.pop(0)
        c = freelist.pop()
        c.fp = open(filename, "wb")
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, c.fp)
        m.add_handle(c)
        # store some info
        c.filename = filename
        c.url = url
    # Run the internal curl state machine for the multi stack
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    # Check for curl objects which have terminated, and add them to the freelist
    while 1:
        num_q, ok_list, err_list = m.info_read()
        for c in ok_list:
            c.fp.close()
            c.fp = None
            m.remove_handle(c)
            print "Success:", c.filename, c.url, c.getinfo(pycurl.EFFECTIVE_URL)
            freelist.append(c)
        for c, errno, errmsg in err_list:
            c.fp.close()
            c.fp = None
            m.remove_handle(c)
            print "Failed: ", c.filename, c.url, errno, errmsg
            freelist.append(c)
        num_processed = num_processed + len(ok_list) + len(err_list)
        if num_q == 0:
            break
    # Currently no more I/O is pending, could do something in the meantime
    # (display a progress bar, etc.).
    # We just call select() to sleep until some more data is available.
    m.select(1.0)


# Cleanup
for c in m.handles:
    if c.fp is not None:
        c.fp.close()
        c.fp = None
    c.close()
m.close()
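
The main loop above runs m.perform() before it waits in m.select(). The sketch below isolates that ordering so it can be run without PycURL or a network. It is a simplification: it relies on the handle count returned by perform() rather than on the info_read() bookkeeping used above. `FakeMulti` and `drive` are hypothetical names invented for this illustration, and `E_CALL_MULTI_PERFORM = -1` is a stand-in for `pycurl.E_CALL_MULTI_PERFORM`. The fake only mimics the return shapes of perform() and select(), not real transfer behaviour. It is written for Python 3, unlike the Python 2 script above.

```python
# Stand-ins for the pycurl constants (assumptions for this sketch).
E_CALL_MULTI_PERFORM = -1
E_OK = 0  # "no further perform() call requested"


class FakeMulti:
    """Hypothetical stub that mimics the return shapes of CurlMulti.

    Each call to perform() finishes one pending transfer.
    """

    def __init__(self, pending):
        self.pending = pending
        self.perform_calls = 0
        self.select_calls = 0

    def perform(self):
        self.perform_calls += 1
        if self.pending > 0:
            self.pending -= 1
        return E_OK, self.pending

    def select(self, timeout):
        self.select_calls += 1
        return 1 if self.pending else 0


def drive(m, select_timeout=1.0):
    """Call perform() first, then wait in select() while transfers remain."""
    num_handles = 1  # placeholder so the loop body runs at least once
    while num_handles:
        # Let the multi object do all the work it can do right now.
        while True:
            ret, num_handles = m.perform()
            if ret != E_CALL_MULTI_PERFORM:
                break
        if num_handles:
            # Work is still pending: sleep until there is activity.
            m.select(select_timeout)
    return m.perform_calls, m.select_calls


fake = FakeMulti(pending=3)
calls = drive(fake)
print(calls)
```

With a real pycurl.CurlMulti object the same function shape should apply, but that has not been verified here.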

I think it is because your break only exits one of the while loops.

# Perform multi-request.
SELECT_TIMEOUT = 1.0
num_handles = len(requests)
while num_handles:                           #  while nr.1
    ret = m.select(SELECT_TIMEOUT)
    if ret == -1: continue
    while 1:                                 #  while nr.2
        ret, num_handles = m.perform()
        print "In while loop of multicurl"
        if ret != pycurl.E_CALL_MULTI_PERFORM: break
    '**'

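So what happens when you use 'break'? You break out of the current while loop (at the point of the break you are inside the second while loop). The program's next step is the line marked '**', because that is the last line of the outer loop body, and from there it jumps back to the first line inside 'while num_handles'. Three lines further on it reaches 'while 1' again, and so forth. That is how you get the infinite loop.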
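
The control flow described above can be checked with a small stand-alone sketch. `trace_nested_break` is a hypothetical helper that has nothing to do with PycURL; it only records the order in which the lines run. Unlike the loop in the question, it counts `remaining` down so that the demonstration stops.

```python
def trace_nested_break(outer_rounds):
    """Record the order of events when break is used in a nested loop."""
    events = []
    remaining = outer_rounds
    while remaining:                  # while nr.1
        events.append("outer")
        while True:                   # while nr.2
            events.append("inner")
            break                     # leaves while nr.2 only
        events.append("after inner")  # plays the role of the '**' line
        remaining -= 1
    return events


trace = trace_nested_break(2)
print(trace)
```

Each round of the outer loop enters the inner loop again, so a break in the inner loop alone never ends the outer one.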

So, to fix this:

[code block missing from the source]

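What happens here is that as soon as execution leaves the nested while loop, it also breaks out of the first loop. (Also, because of the while and the continue used earlier, it never reaches the line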
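
The code for this fix did not survive in the text above. Going only by the description, a sketch might look like the following, assuming the change is a single extra break placed right after the inner loop. `StubMulti` and `run_once` are hypothetical names used so the loop can run without PycURL, and `E_CALL_MULTI_PERFORM = -1` is a stand-in for `pycurl.E_CALL_MULTI_PERFORM`. It is written for Python 3.

```python
E_CALL_MULTI_PERFORM = -1  # stand-in for pycurl.E_CALL_MULTI_PERFORM


class StubMulti:
    """Hypothetical stand-in for pycurl.CurlMulti; no real transfers."""

    def __init__(self, pending):
        self.pending = pending
        self.perform_calls = 0

    def select(self, timeout):
        return 1  # pretend there is always activity

    def perform(self):
        self.perform_calls += 1
        return 0, self.pending  # never asks for another perform() call


def run_once(m, num_handles, select_timeout=1.0):
    while num_handles:                       # while nr.1
        ret = m.select(select_timeout)
        if ret == -1:
            continue
        while True:                          # while nr.2
            ret, num_handles = m.perform()
            if ret != E_CALL_MULTI_PERFORM:
                break                        # leaves while nr.2
        break                                # assumed fix: leaves while nr.1
    return num_handles


stub = StubMulti(pending=2)
left = run_once(stub, num_handles=2)
print(left, stub.perform_calls)
```

In this sketch the function returns after a single pass even though the stub still reports two handles, so any unfinished transfers would have to be driven some other way. The script in the first answer instead keeps looping until every URL has been processed.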
