How can I optimize these requests?

def downloader(urls):
    for i in tqdm(urls):
        r = requests.get(i, headers=headers, stream=True)
        if r.status_code == 200:
            name, path = get_name_path(i)
            if check_dupe(name) == False:
                save_file(path, r)

folder_path = create_dir()
urls = generate_links()
downloader(urls)

2 answers

User

#1 · Edited 2024-04-24 11:38:05

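You can also use Ray, a Python library for parallel execution.

You can follow these steps. First, decide how many workers to create, for example 10:

import numpy as np
import ray

workers = 10

Then distribute the URLs into that many lists, one per worker. NumPy's np.array_split function does this:

# url_lists is the full list of URLs to download
distributed_urls = np.array_split(url_lists, workers)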
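If you would rather not pull in NumPy just for this step, the split can be sketched with the standard library. This is a minimal sketch and not part of the original answer: split_evenly and example_urls are hypothetical names, and the chunk sizes are intended to match np.array_split, where the first len(items) % n_chunks chunks get one extra item.

```python
from typing import List, Sequence


def split_evenly(items: Sequence[str], n_chunks: int) -> List[List[str]]:
    """Split items into n_chunks contiguous lists whose lengths differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


# Hypothetical URLs, for illustration only.
example_urls = [f"https://example.com/file{i}" for i in range(7)]
chunks = split_evenly(example_urls, 3)
print([len(c) for c in chunks])  # [3, 2, 2]
```

When there are fewer URLs than workers, some chunks come back empty, so a worker given an empty list simply has nothing to do.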

Initialize Ray:

ray.init(num_cpus=workers)

Define the Ray remote function:

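@ray.remote(max_calls=1)
def worker(urls_ls):
    # Each call runs downloader on one chunk of URLs in its own Ray worker
    downloader(urls=urls_ls)

# Launch one remote task per chunk and keep the returned object references
all_workers = []
for index in range(workers):
    all_workers.append(worker.remote(distributed_urls[index]))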

ray.get(all_workers)

This spreads the workload across 10 workers. You can use as many workers as your available resources allow.

You can find more details here: https://ray.io/
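If installing Ray is not an option, the same split-and-distribute pattern can be sketched with the standard library's concurrent.futures instead. This swaps Ray for a thread pool, which is usually adequate for network-bound downloads; it is not what the answer above proposes. download_chunk and run_in_threads are hypothetical names, and download_chunk is only a stand-in for the question's downloader so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def download_chunk(urls: List[str]) -> int:
    """Hypothetical stand-in for downloader(urls); returns how many URLs it handled."""
    # In real use, call the question's downloader(urls) here.
    return len(urls)


def run_in_threads(chunks: List[List[str]], workers: int) -> List[int]:
    """Hand each chunk to a pool of worker threads and collect the results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(download_chunk, chunks))


chunks = [["u1", "u2"], ["u3"], ["u4", "u5", "u6"]]
print(run_in_threads(chunks, workers=3))  # [2, 1, 3]
```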

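User

#2 · Edited 2024-04-24 11:38:05

An example of an asyncio/aiohttp solution

Split the URLs into groups of 100 and request each group asynchronously.

Be careful with this: because it sends requests very quickly, you may end up rate-limited and/or blocked. (Depending on the API, you may also have to throttle the request rate.)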

import asyncio
import itertools
from typing import Any
from typing import Generator
from typing import Iterable
from typing import Sequence
from typing import Tuple

import aiohttp

def grouper(iterable: Iterable[Any], n: int) -> Generator[Tuple[str, ...], None, None]:
    """split an iterable into n-size groups."""
    iterator = iter(iterable)
    while True:
        group = tuple(itertools.islice(iterator, n))
        if not group:
            return
        yield group

async def _do_requests(urls: Sequence[str]) -> None:
    """send get requests, one after another, to a single group of urls."""
    async with aiohttp.ClientSession() as session:
        for url in urls:
            async with session.get(url) as resp:
                # do stuff with your response
                if resp.status == 200:
                    ...

async def main(urls: Sequence[str]) -> int:
    """break urls up into groups of 100, sending
       get requests asynchronously to each group."""
    await asyncio.gather(*map(_do_requests, grouper(urls, n=100)))
    return 0

if __name__ == '__main__':
    urls = ['https://google.com'] * 1000

    exit_code = asyncio.run(main(urls))
    raise SystemExit(exit_code)
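One way to address the rate-limit caveat in this answer is to cap how many requests are in flight at once with asyncio.Semaphore. The following is a minimal sketch under stated assumptions: fake_fetch is a hypothetical stand-in for session.get(url) so the example runs without network access, and fetch_limited, fetch_all and max_concurrent are illustrative names. Note that a semaphore bounds concurrency, not requests per second, so an API with a strict per-second quota may need an additional delay.

```python
import asyncio
from typing import List, Sequence


async def fake_fetch(url: str) -> int:
    """Hypothetical stand-in for session.get(url); sleeps briefly and returns 200."""
    await asyncio.sleep(0.01)
    return 200


async def fetch_limited(url: str, limit: asyncio.Semaphore) -> int:
    # Only as many coroutines as the semaphore allows can be inside this block at once.
    async with limit:
        return await fake_fetch(url)


async def fetch_all(urls: Sequence[str], max_concurrent: int = 5) -> List[int]:
    """Fetch every URL while capping the number of in-flight requests."""
    limit = asyncio.Semaphore(max_concurrent)
    return await asyncio.gather(*(fetch_limited(u, limit) for u in urls))


statuses = asyncio.run(fetch_all(["https://example.com"] * 20))
print(len(statuses))  # 20
```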
