Scrapy LinkExtractor with query parameters

Posted 2024-04-26 21:20:06


I use Scrapy to crawl several websites, and I really like it. It is a very useful crawling library.

First, I use a LinkExtractor to extract all the URLs I want (the product URLs), and then for each product I insert all of its attributes into a table. My problem is that I want to filter the product URLs by their query parameters, but LinkExtractor does not seem to treat the query parameters as part of the URL at all.

This URL should match:

https://www.selfridges.com/US/en/cat/jimmy-choo-mavis-85-croc-embossed-leather-knee-high-boots_834-10132-J000123333/?previewAttribute=Dark+green

I noticed that every product on this site has a "previewAttribute" query parameter.
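Just to double-check the parameter itself, here is a small standalone snippet (not part of the spider, only the standard library) confirming that the example URL above really does carry previewAttribute in its query string:

from urllib.parse import parse_qs, urlsplit

# The example product URL from above.
url = (
    "https://www.selfridges.com/US/en/cat/"
    "jimmy-choo-mavis-85-croc-embossed-leather-knee-high-boots_834-10132-J000123333/"
    "?previewAttribute=Dark+green"
)

# Split off the query string and parse it into a dict of parameter -> values.
query = parse_qs(urlsplit(url).query)
print(query)                        # {'previewAttribute': ['Dark green']}
print("previewAttribute" in query)  # True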

Here is my code.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from ..items import SelfridgeLinkItem


class SelfridgeSpider(CrawlSpider):
    name = "selfridge_links"
    allowed_domains = ["selfridges.com"]
    start_urls = ["https://www.selfridges.com"]

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Host": "www.selfridges.com",
        "Cache-Control": "no-cache",
        "Pragma": "no-cache",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:69.0) Gecko/20100101 Firefox/69.0",
        "Cookie": 'utag_main=v_id:016ce747464700149ddb591faf3d0004c002100900bd0$_sn:3$_ss:0$_st:1569926248062$_pn:4%3Bexp-session$ses_id:1569923827386%3Bexp-session; COOKIE_NOTICE_SEEN=seen; _derived_epik=dj0yJnU9LW5EbHU4TFVSUXFCS01rZDFEUGtJa1M2THJlOHJ5a00mbj0tM2R3MG9JbkwtRzktUWVkZ2FNYnFnJm09MSZ0PUFBQUFBRjJUSlZZJnJtPTEmcnQ9QUFBQUFGMlRKVlk; _ga=GA1.2.426420390.1567248313; CoreID6=55737208680715672483137&ci=90262645; _fbp=fb.1.1567248314228.1273159588; SIGNUP_POPUP_SEEN=seen; utag_chan={"channel":"","channel_set":"","channel_converted":false,"awc":""}; SF_COUNTRY_LANG=GB_en; Apache=10.77.3.197.1569919112689966; JSESSIONID=0000m4gdqbi1vRmMyFBuvGcBFUM:17ehj5g7l; WC_PERSISTENT=mrPapd5p%2b6VvDp%2fOn7Hk4BcAA7A%3d%0a%3b2019%2d10%2d01+09%3a58%3a37%2e727%5f1569919112691%2d1318584%5f10052%5f1343329274%2c%2d1%2cGBP%5f10052; WC_SESSION_ESTABLISHED=true; WC_ACTIVEPOINTER=%2d1%2c10052; BIGipServer~S603887-RD2~Test_HTTPS_POOL=rd2o00000000000000000000ffff0a4d03f7o443; _gid=GA1.2.120332267.1569919115; 90262645_clogin=l=83012021569923819640&v=1&e=1569926237805; cmTPSet=Y; WC_AUTHENTICATION_1343329274=1343329274%2cRX3jWqPCrvA4puau5jeTkjjBzjM%3d; WC_USERACTIVITY_1343329274=1343329274%2c10052%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cLxfhSetm0s%2bz%2fPD7mFXwpwwbgz5G8oniY9hro7omuVmDiS5ewUlCKNpqMYwqIu5r%2bgqpAj386E7u%0aym7oLHsfobckYYXVTZ25MIawjQuQJRCn%2fY%2fZ7%2fLbqpXbKHjMpeONS5T21AnchyE%2fFn3f9Y%2f%2bW1L8%0a6GIQ7y%2fJiCYAu8WlqQQRU3SVyd2%2b558VFuhfnZlH0iPkdA%3d%3d; WC_AUTHENTICATION_1343329275=1343329275%2c%2f1kNKbFGbBSctf1m1llODBP5xnc%3d; WC_USERACTIVITY_1343329275=1343329275%2c10052%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cFom9radD%2fau8%2fFrxv5IyWn806MbIKmamxzJKyrlxIZfOW85rIzl0m6zk0vERSuIcAI%2fvo6NlOy53%0atoFXzQR0AJqmdFXNTRzm9cM%2biTVYWpxNvd51rzCyjD44f3LAqEwz7FvcNLbmKCHmNHQ%2fdl%2bMBcLQ%0aCNm9p7GW595jL1VdTt%2b%2fqAGrpxAiUMRTq4Ov%2bMmouMyXfA%3d%3d; AWSELB=85FF15BB10593ECE847219C9B214EEC5BBD393B7301D90E17B625C66620D7473C3FCE779E5EA1D351A2192C6C975C128815AC60F1118B8968E03001896493C045071A25E98; _gat_tealium=1; mmapi.store.p.0=%7B%22mmparams.d%22%3A%7B%7D%2C%22mmparams.p%22%3A%7B%22pd%22%3A%221601460434674%7C%5C%22986242731%7CNgAAAApVAwCBZSYHERJm8AABEgABQgBtBh3bAwB4Zw0iV0bXSFA7o0cALtdIAAAAAP%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FAAZEaXJlY3QBMBICAAAAAAAAAAAA%2BbICAPmyAgD5sgIABABoFAEAWp5U91EREgD%2F%2F%2F%2F%2FARESMBL%2F%2FxYAAAEAAAAAAUG3AgBEcQMAACEUAQBWrbZ7PRESAP%2F%2F%2F%2F8BERIwEv%2F%2FFgAAAQAAAAABhLYCAFhwAwAChLcAAAcAAACFtwAACAAAAF4dAQBG85myFjASAP%2F%2F%2F%2F8BMBIwEv%2F%2FBQAAAQAAAAABJs8CAKiTAwAChLcAAAQAAACFtwAABwAAACQLAQCUTkHzZDASAP%2F%2F%2F%2F8BMBIwEv%2F%2FBAAAAQAAAAABVZ8CAHdSAwAChbcAAAYAAACEtwAAAwAAAAMAtQsBAFjECwEANCUfAQCSAAAAAUU%3D%5C%22%22%2C%22bid%22%3A%221569925034300%7C%5C%22ldnvwcgeu08%5C%22%22%2C%22srv%22%3A%221601460434693%7C%5C%22ldnvwcgeu08%5C%22%22%2C%22uat%22%3A%221601460435912%7C%7B%5C%22Gender%5C%22%3A%5C%22Male%5C%22%7D%22%7D%2C%22CX-635%20-%20Mobile%20Search%22%3A%7B%7D%2C%22CXDV-130%20-%20Checkout%20Pre-select%20Next%20Day%20for%20Selfridges%20plus%20customers%22%3A%7B%7D%2C%22mmengine%22%3A%7B%22Integrations%22%3A%221569926234782%7C%7B%5C%22ibm%20analytics%5C%22%3A%7B%5C%22NEW-CX-416-remove-mini-bag-dropdown%5C%22%3A%7B%5C%22sessionDate%5C%22%3A1569924434780%7D%7D%7D%22%7D%2C%22Sticky%20Filters_Run-to-100%22%3A%7B%7D%2C%22NEW-CX-416-remove-mini-bag-dropdown%22%3A%7B%7D%2C%22CXDV-129%20-%20Select%20a%20size%20prompt%22%3A%7B%7D%2C%22CXDV-130-Checkout-Pre-select-Next%20Day-for-Selfridges-plus-customers%22%3A%7B%7D%7D; 
mmapi.store.s.0=%7B%22mmparams.d%22%3A%7B%7D%2C%22mmparams.p%22%3A%7B%7D%2C%22Sticky%20Filters_Run-to-100%22%3A%7B%22GoogleUniversalExperience%22%3A%220%7C%5C%22element1%3Astickyfilters%5C%22%22%2C%22pushIntegrationsEventTriggered%22%3A%220%7C%7B%5C%22isFirst%5C%22%3Atrue%7D%22%2C%22pushIntegrationsEventReceivedByGoogle%22%3A%220%7Ctrue%22%2C%22pushIntegrationsEventProcessedByGoogle%22%3A%220%7Ctrue%22%2C%22pushIntegrationsInitEventTriggered%22%3A%220%7Cfalse%22%7D%2C%22mmengine%22%3A%7B%22GoogleIntegrationCounter%22%3A%220%7C0%22%2C%22GoogleIntegrationSevars%22%3A%220%7C%5B%5D%22%2C%22GoogleIntegrationData%22%3A%220%7C%7B%7D%22%7D%2C%22CXDV-130-Checkout-Pre-select-Next%20Day-for-Selfridges-plus-customers%22%3A%7B%22pushIntegrationsInitEventTriggered%22%3A%220%7Cfalse%22%7D%2C%22NEW-CX-416-remove-mini-bag-dropdown%22%3A%7B%22GoogleUniversalExperience%22%3A%220%7C%5C%22element1%3Ahide-sbag%5C%22%22%7D%7D'
    }

    # This spider has one rule: extract all (unique and canonicalized) links, follow them and parse them using the parse_items method
    rules = [
        Rule(
            LinkExtractor(
                canonicalize=True,
                unique=True
            ),
            follow=True,
            callback="parse_items"
        )
    ]

    # Method which starts the requests by visiting all URLs specified in start_urls
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, dont_filter=True, headers=self.headers)

    # Method for parsing items
    def parse_items(self, response):
        items = []
        links = LinkExtractor(canonicalize=False, unique=True, allow=r"previewAttribute").extract_links(response)

        for link in links:
            is_allowed = False
            for allowed_domain in self.allowed_domains:
                if allowed_domain in link.url:
                    is_allowed = True

                # if "?previewAttribute=" not in link.url:
                #     is_allowed = False

            if is_allowed:
                item = SelfridgeLinkItem()
                item['url_from'] = response.url
                item['url_to'] = link.url
                items.append(item)
        return items
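In case it helps to see what I am aiming for: below is a minimal sketch (my own assumption, not something I have confirmed works) of the same idea using the Rule's process_links hook to keep only links whose query string contains previewAttribute, instead of relying on the allow regex. The spider name and the trivial parse_items callback are placeholders for illustration only.

from urllib.parse import parse_qs, urlsplit

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def keep_preview_links(links):
    # Keep only the extracted links whose query string has a previewAttribute parameter.
    return [
        link for link in links
        if "previewAttribute" in parse_qs(urlsplit(link.url).query)
    ]


class SelfridgePreviewSpider(CrawlSpider):  # hypothetical name, for illustration only
    name = "selfridge_preview_links"
    allowed_domains = ["selfridges.com"]
    start_urls = ["https://www.selfridges.com"]

    rules = [
        Rule(
            # canonicalize=False leaves the query string exactly as it appears on the page.
            LinkExtractor(canonicalize=False, unique=True),
            process_links=keep_preview_links,
            callback="parse_items",
            follow=True,
        )
    ]

    def parse_items(self, response):
        # Placeholder callback: just record where the matching link leads.
        yield {"url": response.url}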

