Follow each link on the next page and download all files

Posted 2024-03-28 19:28:58


I am new to Scrapy and Python. I can scrape the details from the URL, and now I want to follow each link and download all the files (.htm and .txt).

My code:

import scrapy

class legco(scrapy.Spider):
    name = "sec_gov"

    start_urls = ["https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany"]

    def parse(self, response):
        for link in response.xpath('//table[@summary="Results"]//td[@scope="row"]/a/@href').extract():
            absoluteLink = response.urljoin(link)
            yield scrapy.Request(url = absoluteLink, callback = self.parse_page)

    def parse_page(self, response):
        for links in response.xpath('//table[@summary="Results"]//a[@id="documentsbutton"]/@href').extract():
            targetLink = response.urljoin(links)
            yield {"links":targetLink}

I need to follow those links and download every file ending in .htm or .txt; what I tried for that step does not work.


Can anyone help me? Thanks in advance.
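For what it's worth, the `response.urljoin(...)` used in the spider above delegates to the standard library's `urllib.parse.urljoin`, so the link-resolution step can be checked on its own (the CIK value below is invented for illustration):

```python
from urllib.parse import urljoin

base = "https://www.sec.gov/cgi-bin/browse-edgar"
# A root-relative href from the results table replaces the base path:
absolute = urljoin(base, "/cgi-bin/browse-edgar?action=getcompany&CIK=0000320193")
print(absolute)
```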


1 Answer

Try the following to download the files to your Desktop, or to whatever location the script specifies:

import os
import scrapy

class legco(scrapy.Spider):
    name = "sec_gov"

    start_urls = ["https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany"]

    def parse(self, response):
        # Follow each company link in the results table
        for link in response.xpath('//table[@summary="Results"]//td[@scope="row"]/a/@href').extract():
            absoluteLink = response.urljoin(link)
            yield scrapy.Request(url=absoluteLink, callback=self.parse_links)

    def parse_links(self, response):
        # Follow each "Documents" button on the filings page
        for link in response.xpath('//table[@summary="Results"]//a[@id="documentsbutton"]/@href').extract():
            targetLink = response.urljoin(link)
            yield scrapy.Request(url=targetLink, callback=self.collecting_file_links)

    def collecting_file_links(self, response):
        # Keep only the .htm and .txt documents
        for link in response.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href').extract():
            if link.endswith((".htm", ".txt")):
                baseLink = response.urljoin(link)
                yield scrapy.Request(url=baseLink, callback=self.download_files)

    def download_files(self, response):
        # Name each file after the last segment of its URL
        filename = response.url.split('/')[-1]
        dirf = r"C:\Users\WCS\Desktop\Storage"
        os.makedirs(dirf, exist_ok=True)  # create the folder on first use
        with open(os.path.join(dirf, filename), 'wb') as f:
            f.write(response.body)
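The extension check inside collecting_file_links can be exercised in isolation. Note that str.endswith accepts a tuple of suffixes, which keeps the condition short (the helper name and sample hrefs below are mine, not from the answer):

```python
def keep_documents(hrefs):
    # Mirror the spider's filter: only .htm and .txt links survive
    return [h for h in hrefs if h.endswith((".htm", ".txt"))]

sample = ["/Archives/0001.txt", "/Archives/index.json", "/Archives/doc1.htm"]
print(keep_documents(sample))  # the .json entry is dropped
```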

To be clear: you need to set dirf = r"C:\Users\WCS\Desktop\Storage" explicitly, where C:\Users\WCS\Desktop (or wherever you prefer) is the location you want the files in. The script will create the Storage folder there automatically to hold the downloads.
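If the hard-coded Windows path is a concern, the same folder logic can be written portably. This is only a sketch; the helper name and the home-directory default are my own choices, not part of the answer:

```python
import os

def storage_dir(base=None):
    # Default to a "Storage" folder under the user's home directory;
    # pass base to pin it elsewhere (e.g. a Desktop path on Windows).
    if base is None:
        base = os.path.expanduser("~")
    path = os.path.join(base, "Storage")
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    return path
```

download_files could then call storage_dir() instead of referencing dirf directly.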
