Speeding up HTTP requests in Python, and 500 errors

Asked on 2025-04-17 20:00

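I have some code that fetches news results from this newspaper. To fetch news I need to supply a query and a date range (up to one year).

Results are paginated, with at most 10 articles per page. Since I couldn't find a way to increase the number of articles per page, I have to issue a request for every page and then extract each article's title, URL, and date. Each request-and-parse cycle takes roughly 30 seconds to 1 minute, which is extremely slow. Eventually it stops with a response code 500. I'm wondering whether there is a way to speed this up, or whether I can issue several requests at once. I just want to retrieve the details of the articles on all pages.

Here is the code: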

    import csv
    import re

    import requests
    from bs4 import BeautifulSoup

    URL = 'http://www.gulf-times.com/AdvanceSearchNews.aspx?Pageindex={index}&keywordtitle={query}&keywordbrief={query}&keywordbody={query}&category=&timeframe=&datefrom={datefrom}&dateTo={dateto}&isTimeFrame=0'


    def run(**params):
        # Append one (date, title, url) row per article to the CSV file.
        with open("EgyptDaybyDay.csv", "a", newline="", encoding="utf-8") as countryFile:
            w = csv.writer(countryFile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
            i = 1
            while True:
                params["index"] = str(i)
                response = requests.get(URL.format(**params))
                print(response.status_code)
                htmlFile = BeautifulSoup(response.content, "html.parser")
                articles = htmlFile.find_all("div", {"class": "newslist"})
                if not articles:
                    break  # a page without articles ends the loop

                for article in articles:
                    url = article.a['href']
                    title = article.img['alt']
                    dateline = article.find("div", {"class": "floatright"})
                    # The dateline contains a date such as 12-01-2010.
                    m = re.search(r"([0-9]{2}-[0-9]{2}-[0-9]{4})", dateline.string)
                    date = m.group(1)
                    w.writerow((date, title, url))
                i += 1


    run(query="Egypt", datefrom="12-01-2010", dateto="12-01-2011")

5 Answers

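You could try making all of the calls asynchronous.

Have a look at this link: http://pythonquirks.blogspot.in/2011/04/twisted-asynchronous-http-request.html

You can also use gevent instead of Twisted; I'm just pointing out that both options exist.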
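
As a rough illustration of the asynchronous idea, here is a sketch that uses only the standard library's asyncio (a substitution for Twisted or gevent, not what the linked post shows). It runs a blocking page fetch in worker threads and caps how many are in flight. `fetch_page`, `fetch_all`, and `max_in_flight` are hypothetical names; `fetch_page` only fakes the network delay, and in real use it would wrap the `requests.get(URL.format(**params))` call and the parsing from the question.

```python
import asyncio
import time


def fetch_page(index):
    """Hypothetical stand-in for requests.get + parsing; sleeps to fake latency."""
    time.sleep(0.1)
    return ["article %d-%d" % (index, n) for n in range(3)]


async def fetch_all(indexes, fetch=fetch_page, max_in_flight=5):
    """Fetch every page concurrently, at most max_in_flight at a time.

    Results are returned in the same order as indexes.
    """
    semaphore = asyncio.Semaphore(max_in_flight)

    async def fetch_one(index):
        async with semaphore:
            # asyncio.to_thread (Python 3.9+) keeps the blocking call
            # off the event loop.
            return await asyncio.to_thread(fetch, index)

    return await asyncio.gather(*(fetch_one(i) for i in indexes))


if __name__ == "__main__":
    start = time.perf_counter()
    pages = asyncio.run(fetch_all(range(1, 11)))
    print(len(pages), "pages in %.2fs" % (time.perf_counter() - start))
```

Because the page count is not known in advance here, one option would be to fetch pages in batches and stop at the first batch that contains an empty page.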

Answered on 2025-04-17 by Python大师

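The slowdown is most likely caused by the server, so running several HTTP requests at the same time is the best way to make the code finish sooner. There is very little you can do to make the server itself respond faster. The IBM site has a good tutorial on exactly this, available here: IBM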

Answered on 2025-04-17 by Python大师

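This is a good opportunity to try gevent.

You should put the request part (requests.get) into its own routine, so your application doesn't sit waiting on blocking I/O.

Then you can spawn several workers and set up queues to pass the requests and the articles around. See the example below: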

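    from gevent import monkey
    monkey.patch_all()  # patch blocking stdlib calls before other imports

    import gevent
    from gevent import sleep
    from gevent.queue import Queue

    MAX_REQUESTS = 10

    requests = Queue(MAX_REQUESTS)  # pending page numbers (bounded)
    articles = Queue()              # parsed results

    # Fake responses; pop() hands them out as 0, 1, 2, ...
    mock_responses = list(range(100))
    mock_responses.reverse()

    def request():
        print("worker started")
        while True:
            print("request %s" % requests.get())
            sleep(1)  # stands in for the network round trip

            try:
                articles.put('article response %s' % mock_responses.pop())
            except IndexError:
                # No responses left: tell the consumer loop to stop.
                articles.put(StopIteration)
                break

    def run():
        print("run")

        i = 1
        while True:
            requests.put(i)  # blocks while the queue is full
            i += 1

    if __name__ == '__main__':
        for worker in range(MAX_REQUESTS):
            gevent.spawn(request)

        gevent.spawn(run)
        # Iterating a gevent Queue ends when StopIteration is received.
        for article in articles:
            print("Got article: %s" % article)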

Answered on 2025-04-17 by Python大师
