The crawler is fetching relative links

Posted 2024-06-17 12:50:36


I have created a crawler with Scrapy. The crawler scrapes the website and extracts its links.

**Technologies used:** Python, Scrapy

**Error:** The crawler is returning relative URLs, and relative URLs cannot be crawled. I want the crawler to return only absolute URLs. Please help.

import scrapy
import os

class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    # this is equivalent to what you would set in settings.py file
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'file1.csv'
    }
    filePath = 'file1.csv'
    if os.path.exists(filePath):
        os.remove(filePath)
    else:
        print("Cannot delete the file as it does not exist")
    start_urls = ['https://www.jamoona.com/']

    def parse(self, response):
        titles = response.xpath("//a/@href").extract()
        for title in titles:
            yield {'title': title}

Tags: csv, import, url, settings, title, os
1 answer
User
#1 · Posted 2024-06-17 12:50:36

The answer is as follows:

import scrapy
import os

class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    # this is equivalent to what you would set in settings.py file
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'file1.csv'
    }
    filePath = 'file1.csv'
    if os.path.exists(filePath):
        os.remove(filePath)
    else:
        print("Cannot delete the file as it does not exist")
    start_urls = ['https://www.jamoona.com/']

    def parse(self, response):
        urls = response.xpath("//a/@href").extract()
        for url in urls:
            abs_url = response.urljoin(url)
            yield {'title': abs_url}
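The fix works because `response.urljoin` resolves each extracted href against the URL of the page it came from. The same resolution can be illustrated with the standard library's `urllib.parse.urljoin`, which Scrapy's `Response.urljoin` wraps (a minimal sketch; the example hrefs are hypothetical, not taken from the site):

```python
from urllib.parse import urljoin

base = 'https://www.jamoona.com/'

# Hrefs as they might appear in a page's <a> tags (hypothetical examples):
# root-relative, page-relative, and already absolute
hrefs = ['/collections/all', 'about-us', 'https://example.com/external']

# Relative hrefs are resolved against the base; absolute ones pass through unchanged
abs_urls = [urljoin(base, href) for href in hrefs]
print(abs_urls)
# → ['https://www.jamoona.com/collections/all',
#    'https://www.jamoona.com/about-us',
#    'https://example.com/external']
```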
