How can I make simple, fast requests with Python's requests module?

3 Answers

User

Answer 1 · edited 2024-04-20 12:35:01

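Learning Python through projects like web scraping is a great idea. That is how I was introduced to Python myself. That said, there are three things you can do to speed up your scraping: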

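Change the HTML parser to a faster one. Of the parsers BeautifulSoup supports, "lxml" is the fastest, "html.parser" is slower, and "html5lib" is the slowest, so try switching to "lxml". (See https://www.crummy.com/software/BeautifulSoup/bs4/doc/)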

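Drop the extra loops and regular expressions, since they slow the script down. Just use BeautifulSoup's own tools, text and strip, and find the right tags (see my script below).
Since the bottleneck in web scraping is usually I/O, that is, waiting for data to arrive from the web pages, using async or multithreading will improve speed. In the script below I use multithreading. The goal is to pull data from several pages at the same time.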

So if we know the maximum number of pages, we can split the requests into ranges and pull them in batches :)
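The batching idea can be sketched as follows. This is a minimal illustration, not part of the original script: the helper name page_batches and the numbers 45 and 20 are assumptions chosen for the example.

```python
def page_batches(first_page, last_page, batch_size):
    """Yield inclusive (start, end) page ranges covering first_page..last_page."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(first_page, last_page + 1, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(start + batch_size - 1, last_page)


# Hypothetical example: 45 pages pulled 20 at a time.
for start, end in page_batches(1, 45, 20):
    print(f"pages {start}-{end}")
```

Each (start, end) pair could then be passed as start_page and end_page to the multi_get_data function in the script.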

Code example:

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

import requests
from bs4 import BeautifulSoup as bs

data = defaultdict(list)

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'}

def get_data(data, headers, page=1):

    # Get start time
    start_time = datetime.now()
    url = f'https://www.jobstreet.co.id/en/job-search/job-vacancy/{page}/?src=20&srcr=2000&ojs=6'
    r = requests.get(url, headers=headers)

    # If the requests is fine, proceed
    if r.ok:
        jobs = bs(r.content,'lxml').find('div',{'id':'job_listing_panel'})
        data['title'].extend([i.text.strip() for i in jobs.find_all('div',{'class':'position-title header-text'})])
        data['company'].extend([i.text.strip() for i in jobs.find_all('h3',{'class':'company-name'})])
        data['location'].extend([i['title'] for i in jobs.find_all('li',{'class':'job-location'})] )
        data['desc'].extend([i.text.strip() for i in jobs.find_all('ul',{'class':'list-unstyled hidden-xs '})])
    else:
        print('connection issues')
    print(f'Page: {page} | Time taken {datetime.now()-start_time}')
    return data


def multi_get_data(data,headers,start_page=1,end_page=20,workers=20):
    start_time = datetime.now()
    # Execute our get_data in multiple threads each having a different page number
    with ThreadPoolExecutor(max_workers=workers) as executor:
        [executor.submit(get_data, data=data,headers=headers,page=i) for i in range(start_page,end_page+1)]

    print(f'Page {start_page}-{end_page} | Time taken {datetime.now() - start_time}')
    return data


# Test page 10-15
k = multi_get_data(data,headers,start_page=10,end_page=15)

Result:

Explanation of the multi_get_data function:

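This function calls get_data in separate threads and passes each one the arguments it needs. Each thread gets a different page number to request. The maximum number of workers is set to 20, meaning 20 threads. You can increase or decrease that as needed.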

We created the variable data, a defaultdict whose values are lists. All threads populate this one variable. It can then be converted to JSON or a pandas DataFrame :)
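One caveat with a shared dict: each thread extends the columns one after another, so when threads interleave, row i of one column may end up belonging to a different page than row i of another. A cautious alternative, sketched below, has each worker return its own page's rows and merges them in the main thread. The function fetch_page and its fake data are stand-ins invented for this example; a real version would request and parse the page there.

```python
import json
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def fetch_page(page):
    """Stand-in for get_data: returns this page's rows instead of mutating shared state.

    The data here is fake (hypothetical); a real version would download and parse the page.
    """
    return {
        "title": [f"job {page}-{n}" for n in range(2)],
        "company": [f"company {page}-{n}" for n in range(2)],
    }


def collect(pages, workers=5):
    merged = defaultdict(list)
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map yields results in input order and re-raises worker errors.
        for page_rows in executor.map(fetch_page, pages):
            for column, values in page_rows.items():
                merged[column].extend(values)
    return merged


data = collect(range(10, 16))
# A defaultdict is a dict subclass, so it serializes to JSON directly.
print(json.dumps(data, indent=2)[:80])
```

Because merging happens in one thread and in page order, the columns stay aligned, and the result can still be handed to json.dumps or to a pandas DataFrame constructor.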

In the test run for pages 10-15, each of the six requests took under 2 seconds, yet the total was still under 2 seconds ;)

Enjoy web scraping.

Update: 22 December 2019

We can also gain some speed by using a session with a single headers update, so we don't have to set up a new connection for every call.
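A minimal sketch of the session idea, assuming the same URL pattern and User-Agent as the script above. The helper names make_session and get_page and the 10-second timeout are assumptions made for this illustration, not the author's original code.

```python
import requests

# Same User-Agent string as in the script above.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0"}


def make_session(headers):
    """Create one Session whose headers are set once and reused for every request."""
    session = requests.Session()
    session.headers.update(headers)
    return session


def get_page(session, page):
    """Fetch one listing page through the shared session, which pools connections."""
    url = f"https://www.jobstreet.co.id/en/job-search/job-vacancy/{page}/?src=20&srcr=2000&ojs=6"
    return session.get(url, timeout=10)


session = make_session(HEADERS)
# r = get_page(session, 10)  # uncomment to perform a real network request
```

The session keeps connections to the same host open between calls, so repeated requests skip the connection setup that a bare requests.get pays each time.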


User
Answer 2 · edited 2024-04-20 12:35:01

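The bottleneck is the server responding slowly to simple requests.
Try making the requests in parallel.
You can also use threads instead of asyncio. Here is an earlier question that explains how to run tasks in parallel in Python:
"Executing tasks in parallel in python"
Note that a well-configured server will still throttle your requests, or ban you if you scrape without permission.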
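As a rough sketch of parallel requests with asyncio, the example below runs a blocking fetch in worker threads and caps how many are in flight at once. The fetch function is a stand-in that only sleeps (an assumption for illustration; a real one would call something like requests.get), and asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio
import time


def fetch(page):
    """Stand-in for a blocking request such as requests.get (hypothetical)."""
    time.sleep(0.1)  # simulates waiting on the server
    return f"page {page}"


async def fetch_all(pages, max_concurrent=3):
    # The semaphore caps how many requests are in flight at once,
    # which keeps the load on the server modest.
    limit = asyncio.Semaphore(max_concurrent)

    async def fetch_one(page):
        async with limit:
            # Run the blocking call in a worker thread so the event loop stays free.
            return await asyncio.to_thread(fetch, page)

    # gather returns results in the same order as the input pages.
    return await asyncio.gather(*(fetch_one(p) for p in pages))


results = asyncio.run(fetch_all(range(1, 7)))
print(results)
```

Lowering max_concurrent trades speed for a lighter footprint on the server, which matters given the throttling and banning risk mentioned above.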

User
Answer 3 · edited 2024-04-20 12:35:01

Here is my suggestion: give your code a good structure, split it into functions, and write less code. Here is one example that uses requests:

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # A missing Content-Type header yields '', which fails the 'html' check
    content_type = resp.headers.get('Content-Type', '').lower()
    return resp.status_code == 200 and 'html' in content_type


def log_error(e):
    """
    It is always a good idea to log errors. 
    This function just prints them, but you can
    make it do anything.
    """
    print(e)

Debug your code to find where the time is going, identify those spots, and discuss them here. That will help you solve the problem.
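One simple way to find where the time goes is to wrap each stage in a timer. The sketch below is only an illustration: the timed helper, the timings dict, and the two fake stages are names and stand-ins invented for this example.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label):
    """Add the wall-clock time spent inside the with-block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - start


# Hypothetical stand-ins for the two stages of a scraper.
with timed("download"):
    time.sleep(0.05)    # e.g. requests.get(url)
with timed("parse"):
    sum(range(10_000))  # e.g. BeautifulSoup(html, 'lxml')

# Print the slowest stage first.
for label, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {seconds:.4f}s")
```

If the download stage dominates, parallel requests are the likely fix; if parsing dominates, a faster parser is the better place to look.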
