Spider reports finding a word that is not on the website

Posted 2024-04-28 15:07:00


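I am writing a Scrapy spider that should check whether specific strings occur in the content (text) of websites. I have many websites (a few thousand) and many strings to look for, which is why my code uses lists bound to variables. Some of the lists are imported from other Python files.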

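The problem is that the code seems to produce positive "hits" even though I cannot find the string when I check the URL manually with the browser's developer tools. The code and a sample of the results are below.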

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from list_loop import *
import re
 
word_to_find = 'pharmacy'
 
 
class TestSpider(CrawlSpider):
    name = 'test'
    # these are lists of a lot of domains imported from another
    # file called list_loop.py
    allowed_domains = strip_url
    start_urls = merch_url
 
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )
 
    def parse_item(self, response):
        # Here I clean up the parsed text so it does not include \n or whitespace.
        words = response.xpath("//a//text()").getall()
        cleaned_words = [word.strip() for word in words]
        cleaned_words = [word.lower() for word in cleaned_words if len(word) > 0]
 
        # Then I loop through the cleaned_words in order to find a match
        for single_word in cleaned_words:
            re.search(r'\b%s\b' % word_to_find, single_word)
            yield{
                'Matching': 'Found the word {} in {}'.format(word_to_find, response.url)
            }
        else:
            pass

The allowed_domains and start_urls lists contain alibaba.com and many other websites. After running the spider, I get output like this:

{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},

The same happens for many other websites whose content or HTML does not actually contain the word "pharmacy". Any idea what is going wrong here?


1 answer
Answer 1 · posted 2024-04-28 15:07:00

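I believe you are missing an if statement. In your code, the result of re.search is never checked, so the yield runs for every word whether or not there is a match: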

    for single_word in cleaned_words:
        re.search(r'\b%s\b' % word_to_find, single_word)
        yield{
            'Matching': 'Found the word {} in {}'.format(word_to_find, response.url)
        }
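
To see why every word produces a hit, here is a minimal standalone sketch without Scrapy. The function name `buggy_hits` and the sample link texts are made up for illustration. `re.search` returns a match object or `None`, but the loop never looks at that return value:

```python
import re

word_to_find = 'pharmacy'


def buggy_hits(cleaned_words):
    """Mimic the original loop: the result of re.search is thrown away."""
    for single_word in cleaned_words:
        re.search(r'\b%s\b' % word_to_find, single_word)  # result ignored
        yield single_word  # runs on every iteration, match or not


# None of these hypothetical link texts contains 'pharmacy' ...
sample = ['sign in', 'categories', 'help center']
print(re.search(r'\bpharmacy\b', 'sign in'))  # None: no match
print(len(list(buggy_hits(sample))))          # 3: still one "hit" per word
```

This would also explain the repeated identical lines in the output: one item is produced per link text on the page.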

I believe you want something like this:

    for single_word in cleaned_words:
        # Only yield when re.search actually returns a match object
        if re.search(r'\b%s\b' % word_to_find, single_word):
            yield {
                'Matching': 'Found the word {} in {}'.format(word_to_find, response.url)
            }
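
If one item per page is enough, rather than one per matching link text, a possible variation is to compile the pattern once and stop at the first match. This is only a sketch under assumptions: `page_contains` is a hypothetical helper name, and `re.escape` and `re.IGNORECASE` are additions that the original code does not use (the original lowercases the text instead).

```python
import re


def page_contains(texts, word):
    """Return True if `word` occurs as a whole word in any text fragment."""
    # re.escape guards against search strings that contain regex
    # metacharacters such as '.' (an assumption; the original pattern
    # interpolates the word unescaped).
    pattern = re.compile(r'\b%s\b' % re.escape(word), re.IGNORECASE)
    return any(pattern.search(text) for text in texts)


print(page_contains(['Sign in', 'Online Pharmacy deals'], 'pharmacy'))  # True
print(page_contains(['Sign in', 'pharmacyx'], 'pharmacy'))              # False
```

Inside `parse_item`, this could be called as `if page_contains(cleaned_words, word_to_find):` followed by a single `yield`, which would emit at most one item per URL.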
